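I wrote a Ruby script that processes a large number of documents and uses the following regex to extract URIs from each document's string representation: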
#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
\/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}\/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)/xi
It works fine for 99.9% of the documents, but my script always hangs when it encounters the following token:
token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
I am using the standard Ruby regex match operator, token =~ URI_REGEX, and I get no exception or error message; the match simply never returns.
As a first workaround, I wrapped the regex evaluation in a Timeout::timeout block, but that degrades performance significantly.
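For reference, the timeout wrapper looks roughly like the sketch below. It is a minimal illustration rather than my exact code: the helper name match_with_timeout and the one-second default are placeholders I made up for this post.

```ruby
require 'timeout'

# Hypothetical helper (name and default limit are illustrative only).
# Returns the match index, or nil when there is no match or when the
# match does not finish within `seconds`.
def match_with_timeout(token, regex, seconds = 1)
  Timeout.timeout(seconds) { token =~ regex }
rescue Timeout::Error
  # Treat a timed-out match as "no URI found" and move on to the next token.
  nil
end
```

Every token goes through this helper instead of a bare token =~ URI_REGEX, so the timeout overhead is paid on every match, not just on the rare token that hangs.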
Does anyone have other ideas for solving this?