Ruby regex - extracting part of a URL

3

I have a URL like

https://endpoint/v1.0/album/id/photo/id/

where endpoint is a variable. I want to extract "/v1.0/album/id/photo/id/".

How can I extract everything after "endpoint" using a Ruby regex?


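Anyone who sees this post, please provide a regex solution. If you know how to solve it with a regex, please don't pass by. I am really weak at regex. The OP needs a regex solution. - Arup Rakshit
The pattern is complex, and it is well documented in the RFC. - the Tin Man
4 Answers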

5

Here we go:

2.0.0-p451 :001 > require 'uri'
 => true
2.0.0-p451 :002 > URI('https://endpoint/v1.0/album/id/photo/id/').path
 => "/v1.0/album/id/photo/id/"
2.0.0-p451 :003 >

Read this Basic Example.
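
If the part after the host can also carry a query string or a fragment, the same library can still do the work. A minimal sketch, assuming you want everything after the host back as a single string; `path_after_host` is a hypothetical helper name, not part of URI:

```ruby
require 'uri'

# Hypothetical helper: returns everything after the host (path, query,
# fragment), or nil when the string cannot be parsed as a URI.
def path_after_host(url)
  uri = URI.parse(url)
  rest = uri.path.to_s
  rest += '?' + uri.query if uri.query        # re-attach the query, if any
  rest += '#' + uri.fragment if uri.fragment  # re-attach the fragment, if any
  rest
rescue URI::InvalidURIError
  nil
end

path_after_host('https://endpoint/v1.0/album/id/photo/id/')
# => "/v1.0/album/id/photo/id/"
```

Returning nil on a parse error is only one possible choice; letting the exception propagate may suit your code better.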


1
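@Pavan Glad to hear that.. haha. You are very good at RoR, and I am not.. sad - Arup Rakshit
Thanks for the answer. Just curious how to do the same thing with a regex. - user1788294
@user1788294 Regex is not my strong suit. I'm really bad at it.. :( - Arup Rakshit
Can anyone help here? - user1788294
1
@user1788294 I recommend using this answer rather than writing your own regex. - Mark Thomas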

1
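A complete regex solution is what the URI library does behind the scenes. Trying to implement it yourself is basically a futile exercise...
Anyway, a simple regex will do, using named capture groups (?<name>) and the /x flag at the end to allow whitespace in the pattern.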
url = 'https://endpoint/v1.0/album/id/photo/id/'

re = /
              ^                    # beginning of string
  (?<scheme>  https?             ) # http or https
              :\/\/                # separator
  (?<domain>  [[a-zA-Z0-9]\.-]+? ) # many alnum, -'s or .'s
  (?<path>    \/.+               ) # forward slash on is the path
/x

res = url.match re
res[:path] if res

This pales in comparison to URI.
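
For the narrower question (everything after the host), a looser pattern is enough. A sketch, assuming the URL always starts with http:// or https:// and that the host is simply whatever precedes the first /, ? or #; `AFTER_HOST` and `rest_after_host` are made-up names:

```ruby
# Assumption: only http and https URLs are of interest.
# \A and \z anchor the whole string, not just a single line.
AFTER_HOST = %r{\Ahttps?://[^/?#]+(?<rest>/.*)\z}

# Hypothetical helper: returns the part after the host, or nil if the
# string does not look like an http(s) URL with a path.
def rest_after_host(url)
  m = AFTER_HOST.match(url)
  m && m[:rest]
end

rest_after_host('https://endpoint/v1.0/album/id/photo/id/')
# => "/v1.0/album/id/photo/id/"
```

This does not validate the host at all, so it is a convenience for trusted input rather than a URL parser.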


0
Here is a simple substitution-based solution:
domain = 'endpoint'
link = "https://#{domain}/v1.0/album/id/photo/id/"
path = link.gsub("https://#{domain}", '')
# => "/v1.0/album/id/photo/id/"

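You can adjust the domain by changing the "domain" variable. I used String#gsub to replace the first part of your link with an empty string (the pattern on the third line is really simple: it is just https://endpoint, passed as a plain string), which means the path is the only part of the string left.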
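
Because gsub replaces every occurrence, a prefix that happens to reappear later in the link would be removed as well. A hedged alternative, assuming the endpoint is held in a variable: anchor the pattern at the start of the string and escape the variable with Regexp.escape. `strip_endpoint` is a hypothetical helper name:

```ruby
# Hypothetical helper: removes a leading "http(s)://<endpoint>" only when
# it appears at the very start of the link.
def strip_endpoint(link, endpoint)
  # Regexp.escape keeps characters such as "." in the endpoint literal.
  prefix = %r{\Ahttps?://#{Regexp.escape(endpoint)}}
  link.sub(prefix, '')
end

strip_endpoint('https://endpoint/v1.0/album/id/photo/id/', 'endpoint')
# => "/v1.0/album/id/photo/id/"
```

If the link does not start with the given endpoint, the string is returned unchanged.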


0

The URI RFC documents the pattern used for parsing URLs:

Appendix B.  Parsing a URI Reference with a Regular Expression

   As the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking-down a
   well-formed URI reference into its components.



Berners-Lee, et al.         Standards Track                    [Page 50]
 
RFC 3986                   URI Generic Syntax               January 2005


      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).  We refer to the value matched for subexpression
   <n> as $<n>.  For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the five components as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

Based on that:

URL_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures
# => ["https:",
#     "https",
#     "//endpoint",
#     "endpoint",
#     "/v1.0/album/id/photo/id/",
#     nil,
#     nil,
#     nil,
#     nil]

'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures[4]
# => "/v1.0/album/id/photo/id/"
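
Counting capture indexes is easy to get wrong. A sketch of the same RFC pattern with named groups instead of numbers, with the wrapper groups made non-capturing and ^ swapped for \A so that only the start of the string matches; the group names follow the RFC's component names, while `RFC3986_NAMED` is an invented constant name:

```ruby
# Same structure as the RFC 3986 Appendix B pattern, but with named groups.
# Kept on one line on purpose: with the /x flag, "#" would start a comment.
RFC3986_NAMED = %r{\A(?:(?<scheme>[^:/?#]+):)?(?://(?<authority>[^/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?}

parts = RFC3986_NAMED.match('https://endpoint/v1.0/album/id/photo/id/')
parts[:authority] # => "endpoint"
parts[:path]      # => "/v1.0/album/id/photo/id/"
parts[:query]     # => nil
```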
