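I have an input string, for example Please go to http://stackoverflow.com. Many browsers, IDEs, and applications detect the URL part of such a string and automatically wrap it in an anchor tag <a href=""></a>, so it becomes Please go to <a href='http://stackoverflow.com'>http://stackoverflow.com</a>.
I need to do the same thing in Java.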
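Why not use the core Java class java.net.URL and let it validate the URL?
The following code does violate the golden rule "use exceptions only for exceptional conditions", but to me it makes no sense to reinvent the wheel for something that is already very mature on the Java platform.
Here is the code: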
import java.net.MalformedURLException;
import java.net.URL;

// Replaces URLs with HTML anchor tags
public class URLInString {
    public static void main(String[] args) {
        String s = args[0];
        // Separate the input by whitespace (URLs don't contain spaces)
        String[] parts = s.split("\\s+");

        // Attempt to convert each item into a URL
        for (String item : parts) {
            try {
                URL url = new URL(item);
                // If it parses, replace it with an anchor
                System.out.print("<a href=\"" + url + "\">" + url + "</a> ");
            } catch (MalformedURLException e) {
                // Not a URL: print the item unchanged
                System.out.print(item + " ");
            }
        }
        System.out.println();
    }
}
With the following input:
"Please go to http://stackoverflow.com and then mailto:oscarreyes@wordpress.com to download a file from ftp://user:pass@someserver/someFile.txt"
it produces the following output:
Please go to <a href="http://stackoverflow.com">http://stackoverflow.com</a> and then <a href="mailto:oscarreyes@wordpress.com">mailto:oscarreyes@wordpress.com</a> to download a file from <a href="ftp://user:pass@someserver/someFile.txt">ftp://user:pass@someserver/someFile.txt</a>
url.getProtocol();
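While it isn't Java specific, Jeff Atwood recently posted an article about the pitfalls you may run into when trying to locate and match URLs in arbitrary text:
It gives a good regex, to be used together with a snippet of code that handles parentheses properly (more or less).
The regex: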
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
The paren cleanup:
if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
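Since the cleanup snippet above is C#, here is a minimal Java sketch of the same idea, offered as an illustration rather than a port of Atwood's code. The class name AtwoodLinker and the helper escape are made up for this example. It also strips a lone leading parenthesis and HTML-escapes the output, which the quoted snippet does not do. Like the original, it can mis-handle a URL that is wrapped in parentheses and itself ends in a parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class AtwoodLinker {
    // The regex quoted above, with backslashes doubled for a Java string literal.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\(?\\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    static String link(String text) {
        Matcher m = URL_PATTERN.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            int start = m.start();
            int end = m.end();
            // Paren cleanup: drop a leading '(' and, if the match also
            // ends with ')', drop that closing parenthesis as well.
            if (text.charAt(start) == '(') {
                start++;
                if (text.charAt(end - 1) == ')') {
                    end--;
                }
            }
            String url = escape(text.substring(start, end));
            out.append(escape(text.substring(last, start)));
            out.append("<a href=\"").append(url).append("\">").append(url).append("</a>");
            last = end;
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    // Minimal HTML escaping for text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Calling AtwoodLinker.link("Go to (http://example.com) now") leaves both parentheses outside the generated anchor, and a trailing period after a URL stays outside the link as well.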
String originalString = "Please go to http://www.stackoverflow.com";
String newString = originalString.replaceAll("http://.+?(com|net|org)/{0,1}", "<a href=\"$0\">$0</a>");
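Notes:
This was written for our customer's needs, and we feel it represents a reasonable compromise between the characters the RFC allows and common usage. We hope it is useful to others.
A further extension could allow any Unicode character to be entered directly (i.e. not escaped as %XX, two hex digits) and still be hyperlinked. That would require accepting all Unicode letters plus a limited set of punctuation, splitting on the "acceptable" delimiters (e.g. ., %, |, #), URL-encoding each part, and gluing the parts back together. For example, http://en.wikipedia.org/wiki/Björn_Andrésen (which the Stack Overflow generator does not detect) would be "http://en.wikipedia.org/wiki/Bj%C3%B6rn_Andr%C3%A9sen" in the href, but would still read Björn_Andrésen in the link text on the page.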
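The encoding step of that extension could look roughly like the sketch below. It is a simplification, not the split-and-glue procedure described above: it leaves every ASCII character alone (so delimiters survive) and percent-encodes only the UTF-8 bytes of non-ASCII characters. The class and method names are hypothetical, and non-ASCII host names would need IDN handling, which is not covered here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class HrefEncoder {
    // Percent-encodes every non-ASCII code point as its UTF-8 bytes.
    static String encodeNonAscii(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            if (cp < 0x80) {
                // ASCII (including delimiters such as / ? # %) passes through.
                out.append((char) cp);
            } else {
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Encoded URL goes into the href, the readable form stays as link text.
    static String toAnchor(String url) {
        return "<a href=\"" + encodeNonAscii(url) + "\">" + url + "</a>";
    }
}
```

With this, the Wikipedia example keeps Björn_Andrésen as the visible text while the href carries the %C3%B6 and %C3%A9 escapes.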
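import java.util.regex.Matcher;
import java.util.regex.Pattern;
// StringUtils is presumably Apache Commons Lang; HtmlUtil is a project-specific HTML-escaping helper that is not shown here.

// NOTES: 1) \w includes 0-9, a-z, A-Z, _
//        2) The leading '-' is a literal '-'. It must go first in the character class expression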
private static final String VALID_CHARS = "-\\w+&@#/%=~()|";
private static final String VALID_NON_TERMINAL = "?!:,.;";
// Notes on the expression:
// 1) Any number of leading '(' (left parenthesis) accepted. Will be dealt with.
// 2) s? ==> the s is optional so either [http, https] accepted as scheme
// 3) Any number of valid chars (terminal or non-terminal) accepted, then one valid terminal char
// 4) Case insensitive so that the scheme can be hTtPs (for example) if desired
private static final Pattern URI_FINDER_PATTERN = Pattern.compile("\\(*https?://["+ VALID_CHARS + VALID_NON_TERMINAL + "]*[" +VALID_CHARS + "]", Pattern.CASE_INSENSITIVE );
/**
* <p>
* Finds all "URL"s in the given _rawText, wraps them in
* HTML link tags and returns the result (with the rest of the text
* html encoded).
* </p>
* <p>
* We employ the procedure described at:
* http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html
* which is a <b>must-read</b>.
* </p>
* <p>
* Basically, we allow any number of left parentheses (which will get stripped away)
* followed by http:// or https://. Then any number of permitted URL characters
* (based on http://www.ietf.org/rfc/rfc1738.txt) followed by a single character
* of that set (basically, those minus typical punctuation). We remove all sets of
* matching left & right parentheses which surround the URL.
*</p>
* <p>
* This method *must* be called from a tag/component which will NOT
* end up escaping the output. For example:
* <PRE>
* <h:outputText ... escape="false" value="#{core:hyperlinkText(textThatMayHaveURLs, '_blank')}"/>
* </pre>
* </p>
* <p>
* Reason: we are adding <code><a href="..."></code> tags to the output *and*
* encoding the rest of the string. So, encoding the output will result in
* double-encoding data which was already encoded - and encoding the <code>a href</code>
* (which will render it useless).
* </p>
* <p>
*
* @param _rawText - if <code>null</code>, returns <code>""</code> (empty string).
* @param _target - if not <code>null</code> or <code>""</code>, adds a target attribute to the generated link, using _target as the attribute value.
*/
public static final String hyperlinkText( final String _rawText, final String _target ) {
    String returnValue = null;

    if ( !StringUtils.isBlank( _rawText ) ) {
        final Matcher matcher = URI_FINDER_PATTERN.matcher( _rawText );

        if ( matcher.find() ) {
            final int originalLength = _rawText.length();
            final String targetText = ( StringUtils.isBlank( _target ) ) ? "" : " target=\"" + _target.trim() + "\"";
            final int targetLength = targetText.length();

            // Counted 15 characters aside from the target + 2 of the URL (max if the whole string is URL)
            // Rough guess, but should keep us from expanding the Builder too many times.
            final StringBuilder returnBuffer = new StringBuilder( originalLength * 2 + targetLength + 15 );

            int currentStart;
            int currentEnd;
            int lastEnd = 0;
            String currentURL;

            do {
                currentStart = matcher.start();
                currentEnd = matcher.end();
                currentURL = matcher.group();

                // Adjust for URLs wrapped in ()'s ... move start/end markers
                // and substring the _rawText for new URL value.
                while ( currentURL.startsWith( "(" ) && currentURL.endsWith( ")" ) ) {
                    currentStart = currentStart + 1;
                    currentEnd = currentEnd - 1;
                    currentURL = _rawText.substring( currentStart, currentEnd );
                }

                while ( currentURL.startsWith( "(" ) ) {
                    currentStart = currentStart + 1;
                    currentURL = _rawText.substring( currentStart, currentEnd );
                }

                // Text since last match
                returnBuffer.append( HtmlUtil.encode( _rawText.substring( lastEnd, currentStart ) ) );

                // Wrap matched URL
                returnBuffer.append( "<a href=\"" + currentURL + "\"" + targetText + ">" + currentURL + "</a>" );

                lastEnd = currentEnd;
            } while ( matcher.find() );

            if ( lastEnd < originalLength ) {
                returnBuffer.append( HtmlUtil.encode( _rawText.substring( lastEnd ) ) );
            }

            returnValue = returnBuffer.toString();
        }
    }

    if ( returnValue == null ) {
        returnValue = HtmlUtil.encode( _rawText );
    }

    return returnValue;
}
public static List<String> extractURL(String text) {
    List<String> list = new ArrayList<>();
    Pattern pattern = Pattern.compile(
            "(http://|https://){1}[\\w\\.\\-/:\\#\\?\\=\\&\\;\\%\\~\\+]+",
            Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        list.add(matcher.group());
    }
    return list;
}
https://github.com/robinst/autolink-java
Some tricky examples and the links it detects:
http://example.com. → http://example.com.
http://example.com, → http://example.com,
(http://example.com) → (http://example.com)
(... (see http://example.com)) → (... (see http://example.com))
https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda) → https://en.wikipedia.org/wiki/Link_(The_Legend_of_Zelda)
http://üñîçøðé.com/ → http://üñîçøðé.com/
You are asking two separate questions.
The other answers illustrating String.replaceAll usage have already addressed this.
msg.replaceAll("(?:https?|ftps?)://[\\w/%.-][/\\??\\w=?\\w?/%.-]?[/\\?&\\w=?\\w?/%.-]*", "$0");
A primitive version:
String msg = "Please go to http://stackoverflow.com";
String withURL = msg.replaceAll("(?:https?|ftps?)://[\\w/%.-]+", "<a href='$0'>$0</a>");
System.out.println(withURL);
This needs refinement to match proper URLs, and GET parameters in particular (?foo=bar&x=25).
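One possible refinement, sketched under the assumption that a slightly wider character class is acceptable: allow the query-string characters ?, &, =, # and a few others inside the URL, but require the final character to come from a narrower set so that sentence punctuation is left out. The class name QueryAwareLinker is made up for this example, and the pattern is a heuristic, not a full URL grammar. For strict HTML the & inside the href would still need escaping as &amp;, which is omitted here to keep the single replaceAll call.

```java
// Hypothetical helper class, not part of any library.
final class QueryAwareLinker {
    // First class: characters allowed inside the URL, including ? & = #.
    // Second class: characters allowed as the last one (no '.', '?', '&', ':').
    private static final String URL_REGEX =
        "(?:https?|ftps?)://[\\w/%.?&=#:~+-]*[\\w/%=#~+-]";

    static String link(String msg) {
        // $0 is the whole match, used as both href and link text.
        return msg.replaceAll(URL_REGEX, "<a href='$0'>$0</a>");
    }
}
```

For example, QueryAwareLinker.link("Search at http://example.com/find?foo=bar&x=25.") keeps the whole query string inside the link and leaves the sentence-ending period outside it.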
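I wrote my own URI/URL extractor and figured someone might find it useful, since in my opinion it is better than the other answers for the following reasons:
Since the code is somewhat long for a post (although it is a single Java file), I have put it in a GitHub gist.
Here is the signature of one of the main methods, to show how it meets the points above:
public static Iterator<ExtractedURI> extractURIs(
        final Reader reader,
        final Iterable<ToURIStrategy> strategies,
        String ... schemes);
There is a default strategy chain that handles most of the Atwood problems.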
public static List<ToURIStrategy> DEFAULT_STRATEGY_CHAIN = ImmutableList.of(
        new RemoveSurroundsWithToURIStrategy("'"),
        new RemoveSurroundsWithToURIStrategy("\""),
        new RemoveSurroundsWithToURIStrategy("(", ")"),
        new RemoveEndsWithToURIStrategy("."),
        DEFAULT_STRATEGY,
        REMOVE_LAST_STRATEGY);
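The gist itself is not reproduced here, so the following is only a guess at how such a chain might work, not the author's implementation. Every name in it (StrategyChainSketch, CleanupStrategy, removeSurrounds, removeEndsWith, AS_IS, toURI) is hypothetical. Each strategy either returns a cleaned-up candidate or declines; the first candidate that parses as a java.net.URI with an accepted scheme wins, which is why the more specific cleanups come before the as-is fallback.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the gist.
final class StrategyChainSketch {
    // A strategy returns a cleaned candidate, or empty if it does not apply.
    interface CleanupStrategy {
        Optional<String> apply(String token);
    }

    static CleanupStrategy removeSurrounds(String open, String close) {
        return token -> {
            boolean wrapped = token.length() >= open.length() + close.length()
                    && token.startsWith(open) && token.endsWith(close);
            if (!wrapped) {
                return Optional.empty();
            }
            return Optional.of(
                    token.substring(open.length(), token.length() - close.length()));
        };
    }

    static CleanupStrategy removeEndsWith(String suffix) {
        return token -> {
            if (!token.endsWith(suffix)) {
                return Optional.empty();
            }
            return Optional.of(token.substring(0, token.length() - suffix.length()));
        };
    }

    // Fallback: try the token unchanged.
    static final CleanupStrategy AS_IS = token -> Optional.of(token);

    // Tries each strategy in order; the first candidate that parses as a
    // URI with one of the accepted (lowercase) schemes is returned.
    static Optional<URI> toURI(String token, List<CleanupStrategy> chain, List<String> schemes) {
        for (CleanupStrategy strategy : chain) {
            Optional<String> candidate = strategy.apply(token);
            if (!candidate.isPresent()) {
                continue;
            }
            try {
                URI uri = new URI(candidate.get());
                String scheme = uri.getScheme();
                if (scheme != null && schemes.contains(scheme.toLowerCase(Locale.ROOT))) {
                    return Optional.of(uri);
                }
            } catch (URISyntaxException e) {
                // Not a valid URI under this strategy; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

Ordering matters in this design: a token such as http://example.com. also parses as a URI unchanged, so the strategy that strips the trailing period has to run before the fallback.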
Enjoy!