Getting parts of a URL (Regex)

Question

regex, language-agnostic, url

164

Given the following URL (single line):
http://test.example.com/dir/subdir/file.html

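How can I extract the following parts using regular expressions:

The subdomain (test)
The domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other part that you think would be useful)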

The regex should work correctly even if I enter the following URL:

http://example.example.com/example/example/example.html

- pek

1

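This isn't a direct answer, but most web libraries have a function that accomplishes this task. The function is often called something similar to "CrackUrl". If such a function exists, use it; it is almost guaranteed to be more reliable and more efficient than any hand-crafted code. - Konrad Rudolph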

9

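Please explain why this needs to be done with a regex. If it's homework, say so, because that is your constraint. Otherwise, a language-specific solution will be better than a regex. - Andy Lester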

1

The links to the first and last samples are broken. - the Tin Man

Here you can find how to extract the scheme, domain, TLD, port and query path: https://dev59.com/WWkw5IYBdhLWcg3wn7-O#31952097 - Paolo Rovelli

30 Answers

- MattWeiler · Answer 1

I needed a regex to parse the components of a URL in Java. This is what I'm using:

"^(?:(https?|ftp):/)?/?" +   // METHOD
"([^:/?#\\s]+)" +            // HOSTNAME
"(?::(\\d+))?" +             // PORT
"([^?#]+)?" +                // PATH (dots allowed, so /dir/subdir/file.html matches)
"(\\?[^#]*)?" +              // QUERY (dots allowed, e.g. ?v=1.2)
"(#[\\w\\-]+)?$"             // ID

Java code snippet:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class and method names are placeholders for wherever this snippet lives.
public class UrlParts
{
    private static final Pattern PATTERN = Pattern.compile(
            "^(?:(https?|ftp):/)?/?" +   // METHOD
            "([^:/?#\\s]+)" +            // HOSTNAME
            "(?::(\\d+))?" +             // PORT
            "([^?#]+)?" +                // PATH
            "(\\?[^#]*)?" +              // QUERY
            "(#[\\w\\-]+)?$"             // ID
    );

    // Prints every component and returns the hostname, or null if the URL does not match.
    static String hostnameOf(final String url)
    {
        final Matcher matcher = PATTERN.matcher(url);

        System.out.println("     URL: " + url);

        if (matcher.matches())
        {
            System.out.println("  Method: " + matcher.group(1));
            System.out.println("Hostname: " + matcher.group(2));
            System.out.println("    Port: " + matcher.group(3));
            System.out.println("    Path: " + matcher.group(4));
            System.out.println("   Query: " + matcher.group(5));
            System.out.println("      ID: " + matcher.group(6));

            return matcher.group(2);
        }

        System.out.println();
        System.out.println();
        return null;
    }
}
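For readers who want to try the same six-part split outside Java, here is a minimal Python sketch. It is not the answer's code: the group names and the `split_url` helper are made up for illustration, and the pattern is a simplified analogue that allows dots in the path and query.

```python
import re

# Hypothetical Python analogue of the six-part pattern above.
# Group names are illustrative; they do not come from the original answer.
URL_RE = re.compile(
    r"^(?:(?P<scheme>https?|ftp)://)?"  # scheme, optional
    r"(?P<host>[^:/?#\s]+)"             # hostname
    r"(?::(?P<port>\d+))?"              # port, optional
    r"(?P<path>/[^?#]*)?"               # path, optional
    r"(?:\?(?P<query>[^#]*))?"          # query string, optional
    r"(?:#(?P<fragment>.*))?$"          # fragment, optional
)


def split_url(url):
    """Return a dict of URL components, or None if the URL does not match."""
    match = URL_RE.match(url)
    return match.groupdict() if match else None


parts = split_url("http://test.example.com:8080/dir/subdir/file.html?x=1#top")
print(parts)
```

Components that are absent from the URL come back as `None`, which mirrors how `matcher.group(n)` returns `null` in the Java version.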

- mohan mu · Answer 2

//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

// Example call, using the URL from the question.
var parsedURL = UrlParser("http://test.example.com/dir/subdir/file.html");
console.log(parsedURL);

- Steve K · Answer 3

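The regex to do a full parse is quite horrendous. I've included named backreferences for legibility and broken each part onto its own line, but it still looks like this: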

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

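It needs to be this verbose because, apart from the protocol and the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, query string, and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this: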

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$

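In JavaScript engines that predate the ES2018 `(?<name>...)` named capture groups, you can't use named backreferences, so the regex becomes: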

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

In each match, the protocol is \1, the host is \2, the port is \3, the path is \4, the file is \5, the query string is \6, and the fragment is \7.
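Python's `re` module accepts the `(?P<name>...)` syntax used above, so the templated form can be tried directly. The sketch below is only an illustration: it substitutes the entity sub-pattern with `str.replace`, the names `HTMLENTITY`, `TEMPLATE`, and `URL_RE` are made up here, and the query-string line uses `{{htmlentity}}` without a trailing `;` because the entity pattern already ends in one.

```python
import re

# The entity sub-pattern from the answer, kept verbatim.
HTMLENTITY = (
    r"&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil"
    r"|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"
)

# The templated pattern, one URL component per line.
TEMPLATE = "".join([
    r"^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?",
    r"(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?",
    r"(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?",
    r"(?P<file>(?:{{htmlentity}}|[^?#])+)",
    r"(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?",
    r"(?:#(?P<fragment>.*))?$",
])

# Fill in the placeholder before compiling.
URL_RE = re.compile(TEMPLATE.replace("{{htmlentity}}", HTMLENTITY))

parts = URL_RE.match("http://test.example.com/dir/subdir/file.html").groupdict()
print(parts)

with_port = URL_RE.match("https://example.com:8080/a/b.txt?x=1#frag").groupdict()
print(with_port)
```

Two things are worth noticing in the output: the `path` group carries no leading or trailing slash, and the `host` group includes the port when one is present, because the `port` group is nested inside it.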

- Exo Flame · Answer 4

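Browsers and Node.js both ship a built-in URL class that appears to share the same signature. Check the documentation for your environment, though, since each has its own focus.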

https://nodejs.org/api/url.html#urlhost

https://developer.mozilla.org/en-US/docs/Web/API/URL

Here is how it can be used:

let url = new URL('https://test.example.com/cats?name=foofy')
url.protocol; // https:
url.hostname; // test.example.com
url.pathname; // /cats
url.search; // ?name=foofy

let params = url.searchParams
let name = params.get('name'); // 'foofy'; a string, or null if the parameter is absent, so parse accordingly

For more on the parameters, see https://developer.mozilla.org/zh-CN/docs/Web/API/URL/searchParams
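Other languages have a comparable built-in. As a point of comparison only, here is a short Python sketch using the standard library's `urllib.parse`; the variable names are arbitrary, and `posixpath.split` is used just to separate the directory from the file name asked about in the question.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

url = urlsplit("https://test.example.com/cats?name=foofy")
print(url.scheme)    # https
print(url.hostname)  # test.example.com
print(url.path)      # /cats
print(url.query)     # name=foofy (no leading "?")

# parse_qs maps each parameter name to a list of string values.
params = parse_qs(url.query)
name = params["name"][0]

# Splitting the question's URL into directory and file name.
question = urlsplit("http://test.example.com/dir/subdir/file.html")
directory, filename = posixpath.split(question.path)
print(directory, filename)  # /dir/subdir file.html
```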

- rodrigo · Answer 5

A regex to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(/^(http:\/\/[^\/]+)((?:\/[^\/]+)+(?=\/))?\/?(?:[^\/]+)?$/i).to_s

It can be useful for adding a relative path to this URL.
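To make the "add a relative path" use concrete, here is a rough Python sketch of the same idea. The pattern is a simplified variant rather than a port of the Ruby one, and `DIR_RE` and `base_dir` are hypothetical names. The standard library's `urljoin` is called alongside it as a cross-check.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper pattern: capture the origin, then the directory part
# of the path up to and including its last "/".
DIR_RE = re.compile(r"^(https?://[^/]+)((?:/[^/?#]+)*/)?", re.IGNORECASE)


def base_dir(url):
    """Return the URL with its final path segment (the file) removed."""
    match = DIR_RE.match(url)
    if match is None:
        return None
    return match.group(1) + (match.group(2) or "/")


url = "http://domain/dir1/dir2/somefile"
base = base_dir(url)
print(base)                          # http://domain/dir1/dir2/
print(base + "other.html")           # http://domain/dir1/dir2/other.html
print(urljoin(url, "other.html"))    # same result from the standard library
```

When a URL library is available, something like `urljoin` is probably the safer choice, since it also handles `..` segments and absolute references.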

- pek · Answer 6

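Tested with http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the catch: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all the URLs my program supports. Each object in the enumeration has a getRegexPattern method that returns the regex pattern to compare against a URL. If a particular regex pattern returns true, I know the URL is supported by my program. So each enumeration has its own regex, depending on where in the URL it should look.

hometoast's suggestion is great, but in my case I don't think it helps (unless I copy-paste the same regex into all the enumerations).

That is why I wanted the answer to give the regex for each situation separately. +1 for hometoast, though. ;)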
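The design described here can be sketched roughly as follows. Everything in it is hypothetical: the enum name, its members, and both patterns are invented for illustration, since the answer does not show the real ones.

```python
import re
from enum import Enum


class SupportedUrl(Enum):
    """Hypothetical enum: each member's value is its own regex pattern."""

    EXAMPLE_HTML = r"^https?://[^/]+\.example\.com/.*\.html$"
    ANY_FTP = r"^ftp://"

    def get_regex_pattern(self):
        # Counterpart of the getRegexPattern method described above.
        return re.compile(self.value)

    def supports(self, url):
        return self.get_regex_pattern().search(url) is not None


def find_supported(url):
    """Return the first enum member whose pattern matches, else None."""
    return next((kind for kind in SupportedUrl if kind.supports(url)), None)


print(find_supported("http://test.example.com/dir/subdir/file.html"))
print(find_supported("mailto:someone@example.com"))
```

Because every member owns a complete pattern, each one can look at a different part of the URL without sharing one large expression.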

- Brian Warshaw · Answer 7

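I know you're calling this language-agnostic, but can you tell us what you're using, so we know which regex capabilities you have?

If non-capturing groups are available, you can modify hometoast's expression so that the subexpressions you aren't interested in capturing are written like this: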

(?:SOMESTUFF)

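You'd still have to copy and paste the regex into multiple places (with slight modifications), but that makes sense: you're not just checking whether the subexpression exists, but whether it exists as part of a URL. Using the non-capturing modifier on subexpressions gives you what you need and nothing more, which, if I'm reading you correctly, is what you want.

One small note: hometoast's expression doesn't need brackets around the "s" in "https", since there is only one character in there. Quantifiers apply to the single character (or character class, or subexpression) directly preceding them. So: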

https?

would match "http" or "https" just fine.
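A small Python illustration of both points, using deliberately simple patterns that are assumptions of this sketch rather than hometoast's expression:

```python
import re

# All three parts captured.
capturing = re.compile(r"^(https?)://([^/]+)(/.*)?$")

# Same structure, but only the host is captured; the scheme and path
# are still required to be in the right place via (?:...).
host_only = re.compile(r"^(?:https?)://([^/]+)(?:/.*)?$")

url = "http://test.example.com/dir/subdir/file.html"
all_groups = capturing.match(url).groups()
host_groups = host_only.match(url).groups()
print(all_groups)   # three groups: scheme, host, path
print(host_groups)  # one group: the host

# "s?" quantifies just the preceding character, so no brackets are needed.
matches_both = [bool(re.fullmatch(r"https?", s)) for s in ("http", "https")]
print(matches_both)
```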

- Hritik Soni · Answer 8

The best answer suggested here didn't work for me because my URLs also contain a port. Modifying it to the following regex did work for me:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$
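As a quick check, the pattern can be run against the question's URL with a port added. This is a sketch in Python; the port number and variable names are arbitrary choices for the example.

```python
import re

# The pattern from this answer, unchanged.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)"
    r"([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$"
)

m = URL_RE.match("http://test.example.com:8080/dir/subdir/file.html")
scheme = m.group(2)     # http
host = m.group(3)       # test.example.com
port = m.group(4)       # :8080 (the colon is part of the group)
directory = m.group(5)  # /dir/subdir/
filename = m.group(7)   # file.html
print(scheme, host, port, directory, filename)

# Without a port, group 4 is simply None.
no_port = URL_RE.match("http://test.example.com/dir/subdir/file.html")
print(no_port.group(4))
```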

- Bilal Demir · Answer 9

I tried this regex for parsing the parts of a URL:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$

URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2

Matches:

Group 1.    0-7 https:/
Group 2.    0-5 https
Group 3.    8-22    www.google.com
Group 6.    22-50   /my/path/sample/asd-dsa/this
Group 7.    22-46   /my/path/sample/asd-dsa/
Group 8.    46-50   this
Group 9.    50-74   ?key1=value1&key2=value2
Group 10.   51-74   key1=value1&key2=value2
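The same group numbers cover most of what the question asks for. Below is a Python sketch applying this pattern to the question's URL. The subdomain/domain split is not part of the regex: it is a naive assumption that the subdomain is the first dot-separated label, which breaks for hosts like example.co.uk or hosts with no subdomain.

```python
import re

# The pattern from this answer, unchanged.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?"
    r"((\/?(?:[^\/\?#]+\/+)*)([^\?#]*))(\?([^#]*))?(#(.*))?$"
)

m = URL_RE.match("http://test.example.com/dir/subdir/file.html")

host = m.group(3)               # test.example.com
path_with_file = m.group(6)     # /dir/subdir/file.html
path_without_file = m.group(7)  # /dir/subdir/
filename = m.group(8)           # file.html
origin = m.group(2) + "://" + host  # URL without the path

# Naive split, assuming exactly one subdomain label in front of the domain.
subdomain, _, domain = host.partition(".")

print(subdomain, domain, path_without_file, filename, path_with_file, origin)
```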

- ylev · Answer 10

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

This produces the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be:
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

Enjoy..
Yosi Lev
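For comparison, here is the same four-group pattern in a short Python sketch (the `split_four` helper is a made-up name). It also shows one limitation worth knowing: a URL with neither a path nor a query string does not match at all, because the third group requires at least one `/` or `?`. In the Java version, `replaceAll` would then return the input unchanged.

```python
import re

# Same pattern as the Java string above, written as a Python raw string.
URL_RE = re.compile(r"(^http.?://)(.*?)([/\?]{1,})(.*)")


def split_four(url):
    """Return the four groups as a tuple, or None if the URL does not match."""
    match = URL_RE.match(url)
    return match.groups() if match else None


print(split_four("https://www.thomas-bayer.com/axis2/services/BLZService?wsdl"))
print(split_four("https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"))
print(split_four("https://www.thomas-bayer.com"))  # no "/" or "?" after the host
```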