How to detect the presence of a URL in a string

Question

30

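I have an input String, say Please go to http://stackoverflow.com. Many browsers/IDEs/applications detect the URL part of the String and automatically add an anchor tag <a href=""></a>. So it becomes Please go to <a href='http://stackoverflow.com'>http://stackoverflow.com</a>.

I need to do the same using Java.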

- Rakesh N

12 Answers

14

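This isn't Java-specific, but Jeff Atwood recently posted an article about the pitfalls you might run into when trying to locate and match URLs in arbitrary text:

The Problem With URLs

It gives a good regex that can be used together with the snippet of code you need in order to handle parentheses properly (more or less).

The regex: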

\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]

The paren cleanup:

if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
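
Since the question asks for Java, the regex and the paren cleanup above could be ported roughly as follows. This is only a sketch, not code from the blog post: the class and method names (AtwoodUrlFinder, findUrls) are made up for illustration, and like the regex above it matches http:// URLs only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the blog post.
final class AtwoodUrlFinder {
    // Same regex as above, with backslashes doubled for a Java string literal.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\(?\\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            String s = m.group();
            // Paren cleanup: drop one pair of enclosing parentheses.
            if (s.startsWith("(") && s.endsWith(")")) {
                s = s.substring(1, s.length() - 1);
            }
            urls.add(s);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("Please go to http://stackoverflow.com."));
        System.out.println(findUrls(
                "(http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))"));
    }
}
```

In the first call the trailing full stop is left out of the match, and in the second the enclosing parentheses are removed while the pair inside the URL is kept.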

- Mike B

1

The correct URL for Jeff Atwood's blog post is: The Problem With URLs. - Sonson123

5

You could do something like this (adjust the regex to suit your needs):

String originalString = "Please go to http://www.stackoverflow.com";
String newString = originalString.replaceAll("http://.+?(com|net|org)/{0,1}", "<a href=\"$0\">$0</a>");

- Jason Coco

2

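The following code makes these modifications to the "Atwood Approach":

Detects https in addition to http (adding other schemes is trivial)
Uses the CASE_INSENSITIVE flag, since HtTpS:// is valid
Strips matching sets of parentheses (they can be nested to any level). In addition, any remaining unmatched left parentheses are stripped, but trailing right parentheses are left intact (to respect Wikipedia-style URLs)
HTML-encodes the URL in the link text
Passes the target attribute in via a method parameter. Other attributes can be added as desired.
Does not use \b to identify a word break before matching a URL. A URL can begin with a left parenthesis or with http[s]://, with no other requirement.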

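Notes:

The code uses StringUtils from Apache Commons Lang
The call to HtmlUtil.encode() below goes to a utility which ultimately calls some Tomahawk code to HTML-encode the link text, but any similar utility will do.
See the method comment for usage in JSF or other environments where output is HTML-encoded by default.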
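
If Tomahawk is not available, a stand-in for HtmlUtil.encode() might look like the sketch below. It is an assumption about what such a utility has to do (escape the characters that are significant in HTML text and attribute values), not the Tomahawk implementation, and the class name SimpleHtml is made up. It returns an empty string for null, which is what the method comment further down expects.

```java
// Hypothetical stand-in for HtmlUtil.encode(); this is not the Tomahawk code.
final class SimpleHtml {
    static String encode(String raw) {
        if (raw == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a < b & \"c\""));
    }
}
```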

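This was written for our client's requirements, and we feel it represents a reasonable compromise between the characters allowed by the RFC and common usage. We offer it here in the hope that it will be useful to others.

A further extension could allow any Unicode character to be entered (i.e. not escaped as %XX, two hex digits) and still be hyperlinked. That would require accepting all Unicode letters plus a limited set of punctuation, then splitting on the "acceptable" delimiters (e.g. ., %, |, #, etc.), URL-encoding each part, and gluing them back together. For example, http://en.wikipedia.org/wiki/Björn_Andrésen (which the Stack Overflow generator does not detect) would be "http://en.wikipedia.org/wiki/Bj%C3%B6rn_Andr%C3%A9sen" in the href, but would contain Björn_Andrésen in the link text on the page.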
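
The href-encoding step of that extension can be sketched on its own. The snippet below assumes the URL has already been extracted, and it only percent-encodes the non-ASCII bytes of the UTF-8 form; matching Unicode letters and splitting on delimiters are not shown. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it only covers the href-encoding step described above.
final class HrefEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String percentEncodeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);   // ASCII characters are copied unchanged
            } else {
                // Every byte of a multi-byte UTF-8 sequence becomes %XX
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00f6 is o-umlaut and \u00e9 is e-acute, written as escapes so the
        // example does not depend on the source file encoding.
        String text = "http://en.wikipedia.org/wiki/Bj\u00f6rn_Andr\u00e9sen";
        System.out.println("<a href=\"" + percentEncodeNonAscii(text) + "\">" + text + "</a>");
    }
}
```

The link text keeps the readable form while the href carries the encoded one, as in the Björn_Andrésen example.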

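import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringUtils; // or org.apache.commons.lang.StringUtils with Commons Lang 2.x

// NOTES:   1) \w includes 0-9, a-z, A-Z, _
//          2) The leading '-' is the '-' character. It must go first in character class expression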
private static final String VALID_CHARS = "-\\w+&@#/%=~()|";
private static final String VALID_NON_TERMINAL = "?!:,.;";

// Notes on the expression:
//  1) Any number of leading '(' (left parenthesis) accepted.  Will be dealt with.  
//  2) s? ==> the s is optional so either [http, https] accepted as scheme
//  3) All valid chars accepted and then one or more
//  4) Case insensitive so that the scheme can be hTtPs (for example) if desired
private static final Pattern URI_FINDER_PATTERN = Pattern.compile("\\(*https?://["+ VALID_CHARS + VALID_NON_TERMINAL + "]*[" +VALID_CHARS + "]", Pattern.CASE_INSENSITIVE );

/**
 * <p>
 * Finds all "URL"s in the given _rawText, wraps them in 
 * HTML link tags and returns the result (with the rest of the text
 * html encoded).
 * </p>
 * <p>
 * We employ the procedure described at:
 * http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html
 * which is a <b>must-read</b>.
 * </p>
 * <p>
 * Basically, we allow any number of left parentheses (which will get stripped away)
 * followed by http:// or https://.  Then any number of permitted URL characters
 * (based on http://www.ietf.org/rfc/rfc1738.txt) followed by a single character
 * of that set (basically, those minus typical punctuation).  We remove all sets of 
 * matching left &amp; right parentheses which surround the URL.
 * </p>
 * <p>
 * This method *must* be called from a tag/component which will NOT
 * end up escaping the output.  For example:
 * <PRE>
 * <h:outputText ... escape="false" value="#{core:hyperlinkText(textThatMayHaveURLs, '_blank')}"/>
 * </pre>
 * </p>
 * <p>
 * Reason: we are adding <code>&lt;a href="..."&gt;</code> tags to the output *and*
 * encoding the rest of the string.  So, encoding the output will result in
 * double-encoding data which was already encoded - and encoding the <code>a href</code>
 * (which will render it useless).
 * </p>
 * 
 * @param   _rawText  - if <code>null</code>, returns <code>""</code> (empty string).
 * @param   _target   - if not <code>null</code> or <code>""</code>, adds a target attribute to the generated link, using _target as the attribute value.
 */
public static final String hyperlinkText( final String _rawText, final String _target ) {

    String returnValue = null;

    if ( !StringUtils.isBlank( _rawText ) ) {

        final Matcher matcher = URI_FINDER_PATTERN.matcher( _rawText );

        if ( matcher.find() ) {

            final int originalLength    =   _rawText.length();

            final String targetText = ( StringUtils.isBlank( _target ) ) ? "" :  " target=\"" + _target.trim() + "\"";
            final int targetLength      =   targetText.length();

            // Counted 15 characters aside from the target + 2 of the URL (max if the whole string is URL)
            // Rough guess, but should keep us from expanding the Builder too many times.
            final StringBuilder returnBuffer = new StringBuilder( originalLength * 2 + targetLength + 15 );

            int currentStart;
            int currentEnd;
            int lastEnd     = 0;

            String currentURL;

            do {
                currentStart = matcher.start();
                currentEnd = matcher.end();
                currentURL = matcher.group();

                // Adjust for URLs wrapped in ()'s ... move start/end markers
                //      and substring the _rawText for new URL value.
                while ( currentURL.startsWith( "(" ) && currentURL.endsWith( ")" ) ) {
                    currentStart = currentStart + 1;
                    currentEnd = currentEnd - 1;

                    currentURL = _rawText.substring( currentStart, currentEnd );
                }

                while ( currentURL.startsWith( "(" ) ) {
                    currentStart = currentStart + 1;

                    currentURL = _rawText.substring( currentStart, currentEnd );
                }

                // Text since last match
                returnBuffer.append( HtmlUtil.encode( _rawText.substring( lastEnd, currentStart ) ) );

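                // Wrap matched URL, HTML-encoding it for both the href attribute and the link text
                final String encodedURL = HtmlUtil.encode( currentURL );
                returnBuffer.append( "<a href=\"" + encodedURL + "\"" + targetText + ">" + encodedURL + "</a>" );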

                lastEnd = currentEnd;

            } while ( matcher.find() );

            if ( lastEnd < originalLength ) {
                returnBuffer.append( HtmlUtil.encode( _rawText.substring( lastEnd ) ) );
            }

            returnValue = returnBuffer.toString();
        }
    } 

    if ( returnValue == null ) {
        returnValue = HtmlUtil.encode( _rawText );
    }

    return returnValue;

}

- Jacob Zwiers

0

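import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns every http:// or https:// URL found in the text, in order of appearance.
public static List<String> extractURL(String text) {
    List<String> list = new ArrayList<>();
    Pattern pattern = Pattern.compile(
            "https?://[\\w.\\-/:#?=&;%~+]+",
            Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        list.add(matcher.group());
    }
    return list;
}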

- Bayram Binbir

0

I wrote a small library that does exactly this:

https://github.com/robinst/autolink-java

Some tricky examples and the links it detects:

http://example.com. → http://example.com
http://example.com, → http://example.com
(http://example.com) → http://example.com
(... (see http://example.com)) → http://example.com
https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda) → https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda)
http://üñîçøðé.com/ → http://üñîçøðé.com/

- robinst

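Are there any plans to support URLs that don't start with http/https, such as www.test.com? - mark

@mark Not currently, but I'd be happy to accept a pull request. It could be added as an optional feature. - robinst

@levacjeep Why do you say that? - robinst

@levacjeep Look, there are new commits and a new release now! - robinst

@robinst Haha, nice! I'll take a look :). - levacjeep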

0

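You're asking two separate questions.

What's the best way to identify URLs in a string? See this thread
How do you code the above solution in Java? The other answers that illustrate String.replaceAll usage have addressed this.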

- ykaganovich

0

PhiLho's answer can be improved: msg.replaceAll("(?:https?|ftps?)://[\\w/%.-][/\\??\\w=?\\w?/%.-]?[/\\?&\\w=?\\w?/%.-]*", "<a href='$0'>$0</a>");

- Sérgio Nunes

0

A primitive version:

String msg = "Please go to http://stackoverflow.com";
String withURL = msg.replaceAll("(?:https?|ftps?)://[\\w/%.-]+", "<a href='$0'>$0</a>");
System.out.println(withURL);

This needs refinement to match proper URLs, and particularly GET parameters (?foo=bar&x=25).
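
One way to cover GET parameters is simply to add ?, = and & to the allowed characters. The sketch below does only that; the class and method names are made up, and it is still a rough match: trailing punctuation such as a full stop after the URL ends up inside the link, and the & is not HTML-escaped in the href.

```java
// Hypothetical variation on the snippet above; names are illustrative.
final class QueryAwareLinker {
    static String linkify(String msg) {
        // Same pattern as above, with ?, = and & added to the character class.
        // The hyphen stays last so that it is taken literally.
        return msg.replaceAll("(?:https?|ftps?)://[\\w/%.?=&-]+", "<a href='$0'>$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("See http://example.com/page?foo=bar&x=25 now"));
    }
}
```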

- PhiLho

0

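I wrote my own URI/URL extractor and figured someone might find it useful, since in my opinion it is better than the other answers because:

It is stream based and can be used on large documents
It is extensible to handle all sorts of "Atwood Paren" problems through a strategy chain.

Since the code is somewhat long for a post (although it is only one Java file), I have put it in a gist on GitHub.

Here is the signature of one of the main methods, to show how it meets the points above: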

public static Iterator<ExtractedURI> extractURIs(
    final Reader reader,
    final Iterable<ToURIStrategy> strategies,
    String ... schemes);

There is a default strategy chain that handles most of the Atwood problems.

public static List<ToURIStrategy> DEFAULT_STRATEGY_CHAIN = ImmutableList.of(
    new RemoveSurroundsWithToURIStrategy("'"),
    new RemoveSurroundsWithToURIStrategy("\""),
    new RemoveSurroundsWithToURIStrategy("(", ")"),
    new RemoveEndsWithToURIStrategy("."),
    DEFAULT_STRATEGY,
    REMOVE_LAST_STRATEGY);

Enjoy!

- Adam Gent

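Your solution may be good, but it needs more explanation. I don't have a Reader or an array of schemes, and I don't know what they are used for. I just want to convert a String value. - tak3shi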

- Oscar Reyes · Accepted Answer

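Use java.net.URL for that!

Hey, why not use the Java core class "java.net.URL" and let it validate the URL?

While the following code violates the golden principle "use exceptions for exceptional conditions only", it makes no sense to me to reinvent the wheel for something that is already very mature on the Java platform.

Here's the code: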

import java.net.URL;
import java.net.MalformedURLException;

// Wraps every URL found in the input in an HTML anchor tag
public class URLInString {
    public static void main(String[] args) {
        String s = args[0];
        // separate input by spaces ( URLs don't have spaces )
        String [] parts = s.split("\\s+");

        // Attempt to convert each item into a URL.
        for (String item : parts) {
            try {
                URL url = new URL(item);
                // If it parses, wrap it in an anchor...
                System.out.print("<a href=\"" + url + "\">" + url + "</a> ");
            } catch (MalformedURLException e) {
                // ...otherwise it was not a URL, so print it unchanged.
                System.out.print(item + " ");
            }
        }

        System.out.println();
    }
}

Using the following input:

"Please go to http://stackoverflow.com and then mailto:oscarreyes@wordpress.com to download a file from    ftp://user:pass@someserver/someFile.txt"

Produces the following output (runs of whitespace collapse to a single space because the input is split on \s+):

Please go to <a href="http://stackoverflow.com">http://stackoverflow.com</a> and then <a href="mailto:oscarreyes@wordpress.com">mailto:oscarreyes@wordpress.com</a> to download a file from <a href="ftp://user:pass@someserver/someFile.txt">ftp://user:pass@someserver/someFile.txt</a>

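Of course, different protocols can be handled in different ways. You can get all the information through the getters of the URL class, for instance

 url.getProtocol();

Or the rest of the attributes: host, port, file, query, ref, and so on.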
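
As a small illustration of those getters, the sketch below prints the parts of a sample URL. The class name, the method name, and the example.com address are made up. It builds the URL with URI.create(...).toURL() instead of new URL(String), because the URL constructors are deprecated in recent JDK releases; the getters are the same either way.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class; the URL below is a made-up example.
final class UrlParts {
    static String describe(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            return url.getProtocol() + " | " + url.getHost() + " | " + url.getPort()
                    + " | " + url.getPath() + " | " + url.getQuery() + " | " + url.getRef();
        } catch (MalformedURLException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("Not a URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // prints: http | example.com | 8080 | /docs/index.html | name=x | top
        System.out.println(describe("http://example.com:8080/docs/index.html?name=x#top"));
    }
}
```

Note that getPort() returns -1 when the URL does not specify a port, and getQuery() and getRef() return null when those parts are absent.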

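This handles all protocols (at least all those the Java platform knows about), and as an extra benefit, if there is any kind of URL that Java does not currently recognize and that eventually gets incorporated into the URL class through a library update, you will get it transparently. For details, see: http://java.sun.com/javase/6/docs/api/java/net/URL.html.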