Parsing a URL in a shell script

Question

33

I have a URL like this:

sftp://user@host.net/some/random/path

I want to extract the user, host and path from this string. Any part can be of arbitrary length.

- umpirsky

Which shell? What have you tried so far? - johnsyweb

Do you have to use a shell script? I'm assuming Bash. Could you use Python instead? - Flukey

1

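I am trying to write a custom Nautilus shell script that opens a new terminal SSH session from the current SFTP session in Nautilus on Ubuntu. The URL is in the $NAUTILUS_SCRIPT_CURRENT_URI global variable. But you are right, maybe I could use Python or PHP. - umpirsky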

1

I agree with the comments above: using Perl/Python/PHP would make this easier. (Posted after providing a Bash solution.) - Shirkrin

Part two of the question: http://stackoverflow.com/questions/6174906/set-current-working-directory-on-ssh - umpirsky

16 answers

32

Shirkrin's code, refined (password and port parsing added) and working in /bin/sh:

# extract the protocol (with its trailing "://"), if any
proto="$(echo "$DATABASE_URL" | grep '://' | sed -e 's,^\(.*://\).*,\1,g')"
# remove the protocol (a no-op when $proto is empty)
url="${DATABASE_URL#"$proto"}"

# extract the user and password (if any)
userpass="$(echo "$url" | grep @ | cut -d@ -f1)"
pass="$(echo "$userpass" | grep : | cut -d: -f2)"
if [ -n "$pass" ]; then
    user="$(echo "$userpass" | grep : | cut -d: -f1)"
else
    user="$userpass"
fi

# extract the host and port (if any)
hostport="$(echo "${url#"$userpass@"}" | cut -d/ -f1)"
port="$(echo "$hostport" | grep : | cut -d: -f2)"
if [ -n "$port" ]; then
    host="$(echo "$hostport" | grep : | cut -d: -f1)"
else
    host="$hostport"
fi

# extract the path (if any)
path="$(echo "$url" | grep / | cut -d/ -f2-)"

Posted because I needed this, so I wrote it (based on @Shirkin's answer, obviously), and I figured someone else might appreciate it.

- pjz

I created a helper function based on this answer for setting environment variables: https://gist.github.com/maersu/2e050f6399e11348804bf162a301fb82. - maersu

17

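In principle this is the same solution as Adam Ryczkowski's in this thread, but with an improved regular expression based on RFC 3986 (with some changes) and a few bugs fixed (for example, userinfo may contain the "_" character). It also understands relative URIs (e.g. to extract the query or fragment).

#!/bin/bash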

# Following regex is based on https://www.rfc-editor.org/rfc/rfc3986#appendix-B with
# additional sub-expressions to split authority into userinfo, host and port
#
readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
#                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑ ↑        ↑  ↑        ↑ ↑
#                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       | 11 rpath |  13 query | 15 fragment
#                    1 scheme:     |  |5 userinfo@             8 :…           10 path    12 ?…       14 #…
#                                  |  4 authority
#                                  3 //…

parse_scheme () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
}

parse_authority () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
}

parse_user () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
}

parse_host () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
}

parse_port () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
}

parse_path () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
}

parse_rpath () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
}

parse_query () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
}

parse_fragment () {
    [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
}

- Patryk Obara

2

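Doesn't this run the regex once for every part of the URL you try to parse? Adam's approach may not have the perfect regex, but it only matches the pattern once. - Auspex

Sure, if you want several values from the URI (as in the original question), then pulling the exact strings out of the BASH_REMATCH array is appropriate (if you care more about speed than readability), as @adam-ryczkowski did. - Patryk Obara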
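
The single-match approach discussed in these comments might look roughly like the sketch below. It assumes Bash and reuses the URI_REGEX from the answer above; the function name parse_uri and the URI_* variable names are hypothetical, chosen only for this illustration.

```shell
#!/bin/bash

# Regex copied from the answer above (RFC 3986 appendix B, authority split up).
URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'

# parse_uri (hypothetical helper): run the regex once and copy the groups
# of interest into global variables. Returns non-zero if nothing matched.
parse_uri () {
    [[ "$1" =~ $URI_REGEX ]] || return 1
    URI_SCHEME=${BASH_REMATCH[2]}
    URI_USER=${BASH_REMATCH[6]}
    URI_HOST=${BASH_REMATCH[7]}
    URI_PORT=${BASH_REMATCH[9]}
    URI_PATH=${BASH_REMATCH[10]}
}

if parse_uri "sftp://user@host.net/some/random/path"; then
    echo "scheme=$URI_SCHEME user=$URI_USER host=$URI_HOST port=$URI_PORT path=$URI_PATH"
fi
```

The trade-off is the one named in the comments: one match per URL instead of one per field, at the cost of a helper that sets several globals.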

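Thanks. In any case, I used the regex in an application running in a Docker container, where I didn't want to modify someone else's Docker image just to get sed... - Auspex

"The syntax and semantics of URIs vary from scheme to scheme, as described by the defining specification for each scheme. Implementations may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. For example, because the "http" scheme makes use of an authority component, has a default port of "80", and defines an empty path to be equivalent to "/"..." So, to my mind, the path in your regex should be optional (see https://www.rfc-editor.org/rfc/rfc3986#section-6.2.3). - Максим Шатов

You probably want "$*" rather than "$@" in these functions. - glenn jackman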

7

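Using Python (the best tool for this job, in my opinion):

#!/usr/bin/env python3

import os
from urllib.parse import urlparse

# Nautilus exports the current location in this environment variable
uri = os.environ['NAUTILUS_SCRIPT_CURRENT_URI']
result = urlparse(uri)
# netloc is "user@host" for the URL in the question
user, host = result.netloc.split('@')
path = result.path
print('user=', user)
print('host=', host)
print('path=', path)

Further reading:

os.environ (operating system interface)
urllib.parse.urlparse() (URL parsing)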

- johnsyweb

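@umpiresky: Glad to hear it. I have updated my answer with some links in case you need to customise it. - johnsyweb

@umpiresky: That was not part of the question! In that case you could use [...]ssh result.netloc to avoid splitting user from host and then joining them again... (and drop the print calls). - johnsyweb

@Johnsyweb Sure, I will do that. I am just wondering how to set the current directory when ssh-ing.. - umpirsky

That is a new question. Don't forget to specify whether you want to set the CWD on the local host or the remote host. - johnsyweb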

4

Very old, but not an acceptable answer. The question clearly says "in a shell script". A Python solution is not an answer, any more than a Java solution would be. - Javier Palacios

5

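You can use Bash string manipulation. It is easy to learn, so try it if you struggle with regular expressions. Since the value comes from NAUTILUS_SCRIPT_CURRENT_URI, I assume the URI may include a port, so I kept that optional as well.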

#!/bin/bash

#You can also use environment variable $NAUTILUS_SCRIPT_CURRENT_URI
X="sftp://user@host.net/some/random/path"

tmp=${X#*//};usr=${tmp%@*}
tmp=${X#*@};host=${tmp%%/*};[[ ${X#*://} == *":"* ]] && host=${host%:*}
tmp=${X#*//};path=${tmp#*/}
proto=${X%:*}
[[ ${X#*://} == *":"* ]] && tmp=${X##*:} && port=${tmp%%/*}

echo "Protocol:"$proto" User:"$usr" Host:"$host" Port:"$port" Path:"$path

- Abdullah Al Farooq

1

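The proto expression needs the longest match rather than the shortest, so ${X%%:*} (double percent sign) should be used. Otherwise, for valid (but confusing) input such as ssh://user@some.domain:1234/some/path, the second colon is matched instead of the first, and the protocol is reported as ssh://user@some.domain. - Ti Strga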
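
The difference between the two operators can be seen in a short sketch, using the URL from this comment; the variable names short and long are only for illustration.

```shell
#!/bin/bash

X="ssh://user@some.domain:1234/some/path"

# % removes the shortest suffix matching ":*", i.e. from the LAST colon on.
short=${X%:*}
# %% removes the longest suffix matching ":*", i.e. from the FIRST colon on.
long=${X%%:*}

echo "with %:  $short"   # ssh://user@some.domain
echo "with %%: $long"    # ssh
```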

5

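I don't have enough reputation to comment, but I made a small modification to @patryk-obara's answer.

RFC 3986 § 6.2.3 (Scheme-Based Normalization) treats the following two URIs as equivalent:

http://example.com
http://example.com/

I found that his regex fails to match URLs like http://example.com, even though the two are equivalent, while http://example.com/ (with the trailing slash) does match.

I inserted 11 characters, changing / to (/|$). The regex now matches either / or the end of the string, so http://example.com matches.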

readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'
#                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑↑    ↑        ↑  ↑        ↑ ↑
#                    ||            |  |||            |         | |            ||    |        |  |        | |
#                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       ||    12 rpath |  14 query | 16 fragment
#                    1 scheme:     |  |5 userinfo@             8 :...         ||             13 ?...     15 #...
#                                  |  4 authority                             |11 / or end-of-string
#                                  3  //...                                   10 path
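
A quick way to check the modified regex against both forms is sketched below. It assumes Bash; show_host_path is a hypothetical helper written for this check, and the group numbers (7 for host, 10 for path) follow the diagram above.

```shell
#!/bin/bash

# Regex copied from this answer; groups 7 (host) and 10 (path) are used below.
URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?((/|$)([^?#]*))(\?([^#]*))?(#(.*))?$'

# show_host_path (hypothetical helper): print host and path for one URL.
show_host_path () {
    if [[ "$1" =~ $URI_REGEX ]]; then
        echo "host=${BASH_REMATCH[7]} path=${BASH_REMATCH[10]}"
    else
        echo "no match"
    fi
}

show_host_path "http://example.com"
show_host_path "http://example.com/"
```

Both calls should report the same host; only the path differs (empty versus "/").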

- scott

4

I needed to do the same, so I was curious whether it could be done in a single line; this is what I came up with:

#!/bin/bash

parse_url() {
  eval $(echo "$1" | sed -e "s#^\(\(.*\)://\)\?\(\([^:@]*\)\(:\(.*\)\)\?@\)\?\([^/?]*\)\(/\(.*\)\)\?#${PREFIX:-URL_}SCHEME='\2' ${PREFIX:-URL_}USER='\4' ${PREFIX:-URL_}PASSWORD='\6' ${PREFIX:-URL_}HOST='\7' ${PREFIX:-URL_}PATH='\9'#")
}

URL=${1:-"http://user:pass@example.com/path/somewhere"}
PREFIX="URL_" parse_url "$URL"
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST/$URL_PATH"

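How it works:

A crazy sed regex captures all the parts of the URL, all of which are optional (except the host name)
Using those capture groups, sed outputs the variable names (such as URL_SCHEME or URL_USER) together with the values of the relevant parts
eval executes that output, which defines those variables and makes them available in the script
Optionally, PREFIX can be passed to control the names of the output variables

Note: be careful when using this on arbitrary input, as the code is vulnerable to script injection.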
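
One way to avoid the injection risk is to never pass the URL text through eval. The sketch below is not the answer's method: it swaps sed and eval for Bash's =~ operator and printf -v, which assigns to a variable by name. The function name parse_url_noeval and its regex are assumptions made for this illustration; the regex also captures a numeric port, and leaves any query string or fragment inside the path.

```shell
#!/bin/bash

# parse_url_noeval (hypothetical helper): split a URL with Bash's =~ and
# assign the parts with printf -v, so the URL text is never run as code.
parse_url_noeval () {
    local prefix=${PREFIX:-URL_}
    local re='^(([^:/?#]+)://)?(([^:@/]*)(:([^@/]*))?@)?([^:/?#]*)(:([0-9]+))?(/(.*))?$'

    # The prefix becomes part of a variable name, so allow identifier characters only.
    [[ $prefix =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || return 2
    [[ $1 =~ $re ]] || return 1

    printf -v "${prefix}SCHEME"   '%s' "${BASH_REMATCH[2]}"
    printf -v "${prefix}USER"     '%s' "${BASH_REMATCH[4]}"
    printf -v "${prefix}PASSWORD" '%s' "${BASH_REMATCH[6]}"
    printf -v "${prefix}HOST"     '%s' "${BASH_REMATCH[7]}"
    printf -v "${prefix}PORT"     '%s' "${BASH_REMATCH[9]}"
    printf -v "${prefix}PATH"     '%s' "${BASH_REMATCH[11]}"
}

PREFIX="URL_" parse_url_noeval "http://user:pass@example.com:8080/path/somewhere"
echo "$URL_SCHEME://$URL_USER:$URL_PASSWORD@$URL_HOST:$URL_PORT/$URL_PATH"
```

Because the captured text is only ever passed as data to printf, a path such as ';echo pwned;' ends up stored verbatim in URL_PATH rather than executed.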

- Stam

Unfortunately, the port part is not supported. - ñull

eval "$(sed -e "s#^((.*)://)?(([^:@]*)(:(.*))?@)?([^/?]*)?(:([0-9]*))(/(.*))?#${PREFIX:-URL_}HOST='\7' ${PREFIX:-URL_}PORT='\9'#" <<< "$URL")" The problem is that sed (https://www.gnu.org/software/sed/manual/sed.html#index-Backreferences_002c-in-regular-expressions) only supports back-references 1 through 9. - undefined

4

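If you really want to do it in shell, you can do something as simple as the following using awk. This requires knowing how many fields you will actually be passed (for example, sometimes there is no password and other times there is).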

#!/bin/bash

# Split on runs of "/", "@" and ":" and print the pieces in order
FIELDS=($(echo "sftp://user@host.net/some/random/path" \
  | awk '{n = split($0, arr, /[\/@:]+/); for (i = 1; i <= n; i++) { print arr[i] }}'))
# Bash arrays are zero-based
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')

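If you don't have awk but do have grep, and you can require that each field has at least two characters and is in a reasonably predictable format, you can do: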

#!/bin/bash

FIELDS=($(echo "sftp://user@host.net/some/random/path" \
   | grep -o "[a-z0-9.-][a-z0-9.-]*" | tr '\n' ' '))
# Bash arrays are zero-based
proto=${FIELDS[0]}
user=${FIELDS[1]}
host=${FIELDS[2]}
path=$(echo ${FIELDS[@]:3} | sed 's/ /\//g')

- relistan

4

Here is my take on this, loosely based on some of the existing answers, but it can also handle GitHub SSH clone URLs:

#!/bin/bash

PROJECT_URL="git@github.com:heremaps/here-aaa-java-sdk.git"

# Extract the protocol (includes trailing "://").
PARSED_PROTO="$(echo $PROJECT_URL | sed -nr 's,^(.*://).*,\1,p')"

# Remove the protocol from the URL.
PARSED_URL="$(echo ${PROJECT_URL/$PARSED_PROTO/})"

# Extract the user (includes trailing "@").
PARSED_USER="$(echo $PARSED_URL | sed -nr 's,^(.*@).*,\1,p')"

# Remove the user from the URL.
PARSED_URL="$(echo ${PARSED_URL/$PARSED_USER/})"

# Extract the port (includes leading ":").
PARSED_PORT="$(echo $PARSED_URL | sed -nr 's,.*(:[0-9]+).*,\1,p')"

# Remove the port from the URL.
PARSED_URL="$(echo ${PARSED_URL/$PARSED_PORT/})"

# Extract the path (includes leading "/" or ":").
PARSED_PATH="$(echo $PARSED_URL | sed -nr 's,[^/:]*([/:].*),\1,p')"

# Remove the path from the URL.
PARSED_HOST="$(echo ${PARSED_URL/$PARSED_PATH/})"

echo "proto: $PARSED_PROTO"
echo "user: $PARSED_USER"
echo "host: $PARSED_HOST"
echo "port: $PARSED_PORT"
echo "path: $PARSED_PATH"

This gives

proto:
user: git@
host: github.com
port:
path: :heremaps/here-aaa-java-sdk.git

and for PROJECT_URL="ssh://sschuberth@git.eclipse.org:29418/jgit/jgit" you will get

proto: ssh://
user: sschuberth@
host: git.eclipse.org
port: :29418
path: /jgit/jgit

- sschuberth

3

If you have access to Bash >= 3.0, you can also do this in pure Bash, thanks to the regex match operator =~:

pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:^@]+)(:([[:digit:]]+))?$'
if [[ "http://us@cos.com:3142" =~ $pattern ]]; then
        proto=${BASH_REMATCH[2]}
        user=${BASH_REMATCH[4]}
        host=${BASH_REMATCH[5]}
        port=${BASH_REMATCH[7]}
fi

It should be faster and less resource-hungry than all the previous examples, because no external process is spawned.

- Adam Ryczkowski

1

Unfortunately, this treats the path segment as part of the host name. - Yan Foto
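
A possible adjustment for the path issue is sketched below. It is not part of the answer: the pattern is Adam Ryczkowski's with two assumed changes, namely "/" excluded from the host character class and an optional trailing path group (group 8) added.

```shell
#!/bin/bash

# Adam Ryczkowski's pattern with two changes (both assumptions of this sketch):
# "/" is excluded from the host class, and an optional path group is appended.
pattern='^(([[:alnum:]]+)://)?(([[:alnum:]]+)@)?([^:/@]+)(:([[:digit:]]+))?(/.*)?$'

if [[ "sftp://user@host.net/some/random/path" =~ $pattern ]]; then
    proto=${BASH_REMATCH[2]}
    user=${BASH_REMATCH[4]}
    host=${BASH_REMATCH[5]}
    port=${BASH_REMATCH[7]}
    path=${BASH_REMATCH[8]}
    echo "proto=$proto user=$user host=$host port=$port path=$path"
fi
```

Like the original, this keeps the user part restricted to alphanumeric characters, so user names containing "_" or "." would still need a wider character class.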

- Shirkrin · Accepted Answer

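[Edited 2019] This answer is not meant to be a catch-all, works-for-everything solution. It was intended as a simple alternative to the Python-based version, and it ended up with more features than the original.

It answered the basic question in a Bash-only way, and I then modified it several times to include a handful of requests from commenters. I think that adding even more complexity at this point would make it unmaintainable. I know that not everything is straightforward (checking for a valid port, for example, requires comparing hostport and host), but I would rather not add even more complexity.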
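
One reading of the hostport/host comparison mentioned above is sketched here. It is an illustration rather than the answer's own code: split_hostport is a hypothetical helper, it uses only POSIX parameter expansion, and it does not handle bracketed IPv6 literals.

```shell
#!/bin/sh

# split_hostport (hypothetical helper): sets $host and $port from "host[:port]".
split_hostport () {
    hostport=$1
    host=${hostport%%:*}            # everything before the first ":"
    if [ "$hostport" != "$host" ]; then
        port=${hostport#*:}         # a ":" was present, so take what follows it
    else
        port=""                     # no ":" at all, so there is no port
    fi
}

split_hostport "host.net:2222"; echo "host=$host port=$port"
split_hostport "host.net";      echo "host=$host port=$port"
```

If the two strings are equal, nothing was stripped, so the input carried no port.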

[Original answer]

Assuming your URL is passed as the first parameter to the script:

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
# remove the protocol
url="$(echo ${1/$proto/})"
# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"
# extract the host and port
hostport="$(echo ${url/$user@/} | cut -d/ -f1)"
# by request host without port    
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"

echo "url: $url"
echo "  proto: $proto"
echo "  user: $user"
echo "  host: $host"
echo "  port: $port"
echo "  path: $path"

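I must admit this is not the cleanest solution, but it doesn't rely on another scripting language such as Perl or Python. (Providing a solution using one of them would produce cleaner results ;)

Using your example, the results are: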

url: user@host.net/some/random/path
  proto: sftp://
  user: user
  host: host.net
  port:
  path: some/random/path

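This also works for URLs without a protocol, username or path. In that case the respective variable will contain an empty string. [Edit] If your Bash version can't cope with the substitutions (${1/$proto/}), try this: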

#!/bin/bash

# extract the protocol
proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"

# remove the protocol -- updated
url=$(echo $1 | sed -e s,$proto,,g)

# extract the user (if any)
user="$(echo $url | grep @ | cut -d@ -f1)"

# extract the host and port -- updated
hostport=$(echo $url | sed -e s,$user@,,g | cut -d/ -f1)

# by request host without port
host="$(echo $hostport | sed -e 's,:.*,,g')"
# by request - try to extract the port
port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"

# extract the path (if any)
path="$(echo $url | grep / | cut -d/ -f2-)"