How to parse XML in Bash?

Question

How to parse XML in Bash?

169

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt

- asdfasdfasdf

2

http://unix.stackexchange.com/questions/83385/parse-xml-to-get-node-value-in-bash-script || http://superuser.com/questions/369996/scripting-what-is-the-easiest-to-extact-a-value-in-a-tag-of-a-xml-file - Ciro Santilli OurBigBook.com

echo '<html><head><title>Example</title></head></html>' | yq -p xml '.html.head.title' outputs Example. See also: yq and some examples. - jpseng

17 Answers

76

Command-line tools that can be called from shell scripts include:

4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet
xpath - command-line wrapper around Perl's XPath library

sudo apt-get install libxml-xpath-perl

Xidel - works with URLs as well as files, and also handles JSON.

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.

- Nat

2

Where can I download 'xpath' or '4xpath' from? - Opher

4

Yes, a second vote/request: where can these tools be downloaded, or do you mean one has to write a wrapper manually? I'd rather not waste time doing that unless necessary. - David

4

Run sudo apt-get install libxml-xpath-perl to install it. - Andrew Wagner

XPath is great! Using it is as simple as running xpath -e 'xpath/expression/here' $filename, and adding -q shows only the output, so you can pipe it elsewhere or save it to a variable. - phyatt

The link for 4xpath is broken. - sean

72

You can do that very easily using only Bash. You only have to add this function:

rdom () { local IFS=\> ; read -d \< E C ;}

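Now you can use rdom like read, but for HTML documents.

When called, rdom assigns the element to variable E and the content to variable C.

For example, to do what you wanted to do: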

while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
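A self-contained way to try the loop above, offered as a sketch only: the sample document is made up, the get_title helper name is hypothetical, and read -r plus the quoted "$C" are small hardening choices that are not part of the original answer.

```shell
#!/usr/bin/env bash
# rdom as in the answer, with -r added so backslashes are kept literally.
rdom () { local IFS=\> ; read -r -d \< E C ;}

# get_title (hypothetical helper): print the content of the first
# <title> element found on stdin; return 1 if there is none.
get_title () {
    local E C
    while rdom; do
        if [[ $E = title ]]; then
            printf '%s\n' "$C"
            return 0
        fi
    done
    return 1
}

# Made-up sample document, fed through a here-string.
sample='<html><head><title>Example Page</title></head><body><p>Hi</p></body></html>'
get_title <<< "$sample"
```

Returning from a function instead of calling exit means the snippet can be pasted into a larger script without ending it.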

- Yuzem

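Could you elaborate on this? I'd bet it's perfectly clear to you, and this could be a great answer if I knew what you were doing there. Can you break it down a little more, possibly generating some sample output? - Alex Gray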

2

Credit to the original author: this one-liner is so elegant and amazing. - maverick

2

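Great hack, but I had to use double quotes, as in echo "$C", to prevent shell expansion and to get line endings interpreted correctly (depending on the encoding). - user311174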

16

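Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you are short on time, but it can never be called a good solution. - peterh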

26

You can use the xpath utility. It is installed with the Perl XML-XPath package.

Usage:

/usr/bin/xpath [filename] query

or XMLStarlet. To install it on openSUSE use:

sudo zypper install xmlstarlet

or try cnf xml on other platforms.

- Grisha

6

Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers). - Bruno von Paris

1

On many systems, the xpath that comes preinstalled is unsuitable for use as a component in scripts. See e.g. https://dev59.com/FmUp5IYBdhLWcg3wLVOn for details. - tripleee

2

On Ubuntu/Debian, run apt-get install xmlstarlet. - rubo77

17

This is sufficient...

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt

- teknopaul

3

On Debian, run apt-get install libxml-xpath-perl. - tres.14159

8

Check out XML2 from http://www.ofb.net/~egnor/xml2/, which converts XML to a line-oriented format.

- simon04

1

Very useful tool. The link is dead (see https://web.archive.org/web/20160312110413/https://dan.egnor.name/xml2/), but there is a working, frozen clone on GitHub: https://github.com/clone/xml2. - Joshua Goldberg

6

Another command-line tool is my new Xidel. Unlike the already mentioned xpath/xmlstarlet, it also supports XPath 2 and XQuery.

The title can be read like this:

xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt

It also has a cool feature to export multiple variables to bash. For example:

eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )

sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.

- BeniBela

5

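Starting from chad's answer, here is a complete working solution for parsing XML, with proper handling of comments, using just 2 little functions (more than 2, but you can merge them all). I'm not saying chad's approach didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.

The purpose of this answer is to give ready-to-use bash functions to anyone needing to parse XML without perl, python or other complex tools. As for me, I cannot install cpan or perl modules on the old production OS I'm working on, and python isn't available.

First, a definition of the XML terms used in this post: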

<!-- comment... -->
<tag attribute="value">content...</tag>

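EDIT: updated functions, with handling of:

Websphere xml (xmi and xmlns attributes)
must have a compatible terminal with 256 colors
24 shades of grey
compatibility added for IBM AIX bash 3.2.16(1)

The functions: first is xml_read_dom, which xml_read calls repeatedly: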

xml_read_dom() {
# https://dev59.com/8HNA5IYBdhLWcg3wmfO5
local ENTITY IFS=\>
if $ITSACOMMENT; then
  read -d \< COMMENTS
  COMMENTS="$(rtrim "${COMMENTS}")"
  return 0
else
  read -d \< ENTITY CONTENT
  CR=$?
  [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
  TAG_NAME=${ENTITY%%[[:space:]]*}
  [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
  TAG_NAME=${TAG_NAME%%:*}
  ATTRIBUTES=${ENTITY#*[[:space:]]}
  ATTRIBUTES="${ATTRIBUTES//xmi:/}"
  ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi

# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}

and the second one:

xml_read() {
# https://dev59.com/8HNA5IYBdhLWcg3wmfO5
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]}  -c = NOCOLOR${END}
${nn[2]}  -d = Debug${END}
${nn[2]}  -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]}  -p = FORCE PRINT (when no attributes given)${END}
${nn[2]}  -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]}  (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
  case ${_OPT} in
    c) PROSTPROCESS="${DECOLORIZE}" ;;
    d) local Debug=true ;;
    l) LIGHT=true; XAPPLIED_COLOR=END ;;
    p) FORCE_PRINT=true ;;
    x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
    a) XATTRIBUTE="${OPTARG}" ;;
    *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
  esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true

[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true

while xml_read_dom; do
  # (( CR != 0 )) && break
  (( PIPESTATUS[1] != 0 )) && break

  if $ITSACOMMENT; then
    # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
    # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
    # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
    if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
    elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
    fi
    $Debug && echo2 "${N}${COMMENTS}${END}"
  elif test "${TAG_NAME}"; then
    if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
      if $GETCONTENT; then
        CONTENT="$(trim "${CONTENT}")"
        test ${CONTENT} && echo "${CONTENT}"
      else
        # eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
        eval local $ATTRIBUTES
        $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
        if test "${attributes}"; then
          if $MULTIPLE_ATTR; then
            # we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags so then we print the matching tags found
            ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
            for attribute in ${attributes}; do
              ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
              if eval test "\"\$${attribute}\""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                  eval ${print} "%s%s\ " "\${attribute2print}" "\${${XAPPLIED_COLOR}}\"\$(\$XCOMMAND \$${attribute})\"\${END}" && eval unset ${attribute}
                else
                  eval ${print} "%s%s\ " "\${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
                fi
              fi
            done
            # this trick prints a CR only if attributes have been printed during the loop:
            $TAGPRINTED && ${print} "\n" && TAGPRINTED=false
          else
            if eval test "\"\$${attributes}\""; then
              if $XAPPLY; then
                eval echo "\${g}\$(\$XCOMMAND \$${attributes})" && eval unset ${attributes}
              else
                eval echo "\$${attributes}" && eval unset ${attributes}
              fi
            fi
          fi
        else
          echo eval $ATTRIBUTES >>$TMP
        fi
      fi
    fi
  fi
  unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
  $FORCE_PRINT && ! $LIGHT && cat $TMP
  # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
  $FORCE_PRINT && $LIGHT && sed -r 's/[^\"]*([\"][^\"]*[\"][,]?)[^\"]*/\1 /g' $TMP
  . $TMP
  rm -f $TMP
fi
unset ITSACOMMENT
}

and lastly, the rtrim, trim and echo2 (to stderr) functions:

rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}"   # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }

Colorization:

Before you start, you will need some neat dynamic color variables to be defined, and exported too:

set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
  M=$(${print} '\033[1;35m')
  m=$(${print} '\033[0;35m')
  END=$(${print} '\033[0m')
;;
*)
  m=$(tput setaf 5)
  M=$(tput setaf 13)
  # END=$(tput sgr0)          # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
  END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} \"\\033\[38\;5\;$((232 + i))m\")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'

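How to load all that stuff:

If you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash), do that.

If not, just copy/paste everything on the command line.

How does it work: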

xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
  -c = NOCOLOR
  -d = Debug
  -l = LIGHT (no "attribute=" printed)
  -p = FORCE PRINT (when no attributes given)
  -x = apply a command on an attribute and print the result instead of the former value, in green color
  (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

xml_read server.xml title content     # print content between <title></title>
xml_read server.xml Connector port    # print all port values from Connector tags
xml_read server.xml any port          # print all port values from any tags

With Debug mode (-d), comments and parsed attributes are printed to stderr.

- scavenger

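I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0? - khmarbaise

Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ... - khmarbaise

Sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also, the updated functions handle your errors ;) - scavenger

It would be really great if this were a shell script, for people like me who need to run it from a service. - JPM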

4

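yq can be used for XML parsing (required version for the examples below: >= 4.30.5).

It is a lightweight and portable command-line YAML processor that can also deal with XML. The syntax is similar to jq.

Input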

<root>
  <myel name="Foo" />
  <myel name="Bar">
    <mysubel>stairway to heaven</mysubel>
  </myel>
</root>

Usage example 1

yq --input-format xml '.root.myel.0.+@name' $FILE

Foo

Usage example 2

yq has a nice builtin feature to make XML easily grep-able

yq --input-format xml --output-format props $FILE

root.myel.0.+@name = Foo
root.myel.1.+@name = Bar
root.myel.1.mysubel = stairway to heaven

Usage example 3

yq can also convert an XML input into JSON or YAML

yq --input-format xml --output-format json $FILE

{
  "root": {
    "myel": [
      {
        "+@name": "Foo"
      },
      {
        "+@name": "Bar",
        "mysubel": "stairway to heaven"
      }
    ]
  }
}

yq --input-format xml $FILE (YAML is the default format)

root:
  myel:
    - +@name: Foo
    - +@name: Bar
      mysubel: stairway to heaven

- jpseng

4

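I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.

My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of XML)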

- mirod

- chad · Accepted Answer

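This really is just an explanation of Yuzem's answer, but I didn't feel that this much editing should be done to someone else's post, and comments don't allow formatting, so...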

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variable names:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

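Okay, so this defines a function called read_dom. The first line makes IFS (the internal field separator) local to this function and changes it to ">". That means that when read splits the data it has read, it splits on ">" instead of on space, tab or newline. The next line says to read input from stdin and, instead of stopping at a newline, to stop at a "<" character (-d is the delimiter flag). What was read is then split using IFS and assigned to the variables ENTITY and CONTENT. So take the following: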

<tag>value</tag>

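The first call to `read_dom` gets an empty string (since the "<" is the first character). That gets split by IFS into just '', since there isn't a '>' character. `read` then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by IFS into the two fields 'tag' and 'value'. `read` then assigns the variables like: `ENTITY=tag` and `CONTENT=value`. The third call gets the string '/tag>'. That gets split by IFS into the two fields '/tag' and ''. `read` then assigns the variables like: `ENTITY=/tag` and `CONTENT=`. Because that read runs into the end of the file before finding another "<", it is also the call that returns a non-zero status, so a while loop driven by `read_dom` stops there; any further call returns non-zero as well.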
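To watch these calls happen, here is a small sketch that feeds the one-line document to read_dom and prints what each successful call assigned. The trace_dom helper name is hypothetical. It relies on the fact that Bash's read returns non-zero when it reaches end of file before finding the delimiter, while still assigning whatever it managed to read.

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# trace_dom (hypothetical helper): show ENTITY and CONTENT for every
# call to read_dom that returned a zero status.
trace_dom () {
    local n=0
    while read_dom; do
        n=$((n + 1))
        printf 'call %d: ENTITY=[%s] CONTENT=[%s]\n' "$n" "$ENTITY" "$CONTENT"
    done
    # The variables still hold what the last, failed read assigned.
    printf 'after loop: ENTITY=[%s]\n' "$ENTITY"
}

# No trailing newline, so the final CONTENT is really empty.
printf '<tag>value</tag>' | trace_dom
```

This prints the empty first read, then the tag/value pair, and finally shows that /tag was read but never reached the loop body. If the closing tag matters to you, one option is to process the variables once more after the loop ends.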

Now his while loop, cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

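The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks whether the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity, the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).

Now given the following (similar to what you get from listing a bucket on S3) for input.xml: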

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

You should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents =>

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.
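If the keys are needed later in the script rather than just printed, one option is to collect them into an array. This is a sketch under a few assumptions: the list_keys function and keys array are illustrative names, the input comes from a redirection (not a pipe) so the loop runs in the current shell, and the second key in the sample file is made up.

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# list_keys (illustrative name): fill the global array `keys` with the
# content of every <Key> element in the file given as $1.
list_keys () {
    keys=()
    while read_dom; do
        if [[ $ENTITY = "Key" ]]; then
            keys+=("$CONTENT")
        fi
    done < "$1"
}

# Small made-up listing written to a temporary file for the demo.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<ListBucketResult>
  <Contents><Key>item-apple-iso@2x.png</Key></Contents>
  <Contents><Key>item-pear.png</Key></Contents>
</ListBucketResult>
EOF
list_keys "$tmp"
rm -f "$tmp"

printf '%d keys\n' "${#keys[@]}"
printf '%s\n' "${keys[@]}"
```

Had the file been piped into the loop instead, the array would be filled in a subshell and come back empty, which is why the redirection matters here.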

EDIT: If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:

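read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    RET=$?              # save read's status before the next command overwrites it
    IFS=$ORIGINAL_IFS
    return $RET         # so that a while loop still stops at end of file
}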

Otherwise, any line splitting you do later in the script will be messed up.
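Another way to leave the global IFS alone, offered as a sketch rather than as part of the original answer: Bash lets a variable assignment prefix a single command, and for read that assignment only lasts for the duration of the command. Since read is then the last command in the function, its exit status is also what the function returns.

```shell
#!/usr/bin/env bash
read_dom () {
    # IFS is changed for this one read only; the caller's IFS is untouched,
    # and the function returns read's own exit status.
    IFS=\> read -r -d \< ENTITY CONTENT
}

# Demo on a made-up snippet, run in the current shell via a here-string.
IFS=$' \t\n'
while read_dom; do
    if [[ -n $ENTITY ]]; then
        printf '%s => %s\n' "$ENTITY" "$CONTENT"
    fi
done <<< '<a>1</a><b>2</b>'

# IFS still has its usual value here, so later word splitting is unaffected.
```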

EDIT 2: To split out attribute name/value pairs, you can augment read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
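One quirk of the split above: when a tag has no attributes, ENTITY contains no space, so ${ENTITY#* } matches nothing and ATTRIBUTES ends up holding the tag name. The variant below is a sketch of one way to handle that, and it also drops the trailing slash of self-closing tags. It still assumes the tag name and its attributes are separated by a plain space; the sample markup is made up.

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    if [[ $ENTITY == *" "* ]]; then
        ATTRIBUTES=${ENTITY#* }
    else
        ATTRIBUTES=""            # no space in ENTITY means no attributes
    fi
    ATTRIBUTES=${ATTRIBUTES%/}   # self-closing tag such as <myel name="Foo" />
    TAG_NAME=${TAG_NAME%/}       # self-closing tag without attributes, e.g. <br/>
    return $ret
}

# Demo: print the tag name and attribute string of each element seen.
while read_dom; do
    if [[ -n $TAG_NAME ]]; then
        printf 'tag=[%s] attrs=[%s]\n' "$TAG_NAME" "$ATTRIBUTES"
    fi
done <<< '<root><myel name="Foo" /><br/></root>'
```

Closing tags keep their leading slash in TAG_NAME, as in the original function, so the comparisons used elsewhere in this answer behave the same.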

Then write your function to parse and get the data you want, like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}

Then, while you read_dom, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789
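Note that eval local $ATTRIBUTES runs whatever text the document contains, so it is only reasonable on input you trust. As a hedged alternative, the sketch below parses the attribute string with a regular expression into an associative array instead. It assumes Bash 4 or later and double-quoted attribute values; the parse_attributes function and the ATTR array are hypothetical names.

```shell
#!/usr/bin/env bash
declare -A ATTR

# parse_attributes (hypothetical helper): fill the global associative array
# ATTR from a string such as: size="1789" type="unknown"
# Only name="value" pairs with double quotes are recognized.
parse_attributes () {
    local rest=$1
    local re='^[[:space:]]*([A-Za-z_:][-A-Za-z0-9_:.]*)[[:space:]]*=[[:space:]]*"([^"]*)"(.*)$'
    ATTR=()
    while [[ $rest =~ $re ]]; do
        ATTR[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
        rest=${BASH_REMATCH[3]}   # continue with whatever follows this pair
    done
}

parse_attributes 'size="1789" type="unknown"'
echo "foo size is: ${ATTR[size]}"
```

Inside parse_dom, a call like parse_attributes "$ATTRIBUTES" followed by ${ATTR[size]} could then stand in for the eval line and $size.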

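EDIT 3: Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like: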

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.