Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
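This is really just an explanation of Yuzem's answer, but I didn't feel that someone else's answer should be edited this heavily, and comments don't allow formatting, so...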
rdom () { local IFS=\> ; read -d \< E C ;}
Let's call it "read_dom" instead of "rdom", space it out a bit, and use longer variable names:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}
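OK, so this defines a function called read_dom. The first line makes IFS (the input field separator) local to the function and changes it to >. That means that when data is read, it is split on > instead of on space, tab, or newline. The next line says to read input from stdin and, instead of stopping at a newline, to stop at a < character (-d is the delimiter flag). What was read is then split using IFS and assigned to the variables ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom gets an empty string, because < is the first character, so both variables end up empty. The second call gets tag>value, which IFS splits into ENTITY=tag and CONTENT=value. The third call gets /tag>, which gives ENTITY=/tag and nothing meaningful in CONTENT. Yuzem's while loop, cleaned up to match the names above, becomes: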
while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
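To watch the splitting happen, here is a small sketch that prints every ENTITY/CONTENT pair for the one-line sample. The trace_dom helper is a name made up for this illustration, and -r is added to read so that backslashes are left alone; neither is part of Yuzem's original.

```shell
#!/usr/bin/env bash
# Same function as in the answer, plus -r so read does not interpret backslashes.
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# trace_dom is a hypothetical helper for illustration only:
# it prints each ENTITY/CONTENT pair found on stdin, in brackets.
trace_dom () {
    while read_dom; do
        printf '[%s] [%s]\n' "$ENTITY" "$CONTENT"
    done
}

trace_dom <<< '<tag>value</tag>'
# prints:
# [] []
# [tag] [value]
```

Only two pairs appear: the closing tag is consumed by a call that reaches end of input, and that call's non-zero status ends the loop.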
Now, given the following input.xml (similar to what you get from listing a bucket on S3):
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>item-apple-iso@2x.png</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
you should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
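Notice that the listing has no line for the final /ListBucketResult. The last read reaches end of file without finding another <, so it returns a non-zero status and the loop stops, even though ENTITY has already been filled in. If that last tag matters, one possible tweak (my assumption, not part of the original answer) is to keep looping while ENTITY is non-empty. list_entities is a hypothetical helper name.

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# Hypothetical helper: print every entity, including the one that is
# read by the final call that hits end of input.
list_entities () {
    # read fails at EOF but still assigns what it has read so far,
    # so also run the body when ENTITY is non-empty.
    while read_dom || [[ -n $ENTITY ]]; do
        printf '%s\n' "$ENTITY"
    done
}

list_entities <<< '<a><b>text</b></a>'
# prints an empty line, then: a, b, /b, /a (one per line)
```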
So if we wrote a while loop like Yuzem's:
while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo "$CONTENT"
    fi
done < input.xml
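The same loop can be wrapped in a small function that takes the tag name as an argument. This is only a sketch: tag_contents is a hypothetical name, and the inline sample is a shortened copy of the listing shown earlier.

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -r -d \< ENTITY CONTENT
}

# Hypothetical helper: print the content of every element whose
# tag name equals $1, reading the markup from stdin.
tag_contents () {
    local wanted=$1
    while read_dom; do
        if [[ $ENTITY == "$wanted" ]]; then
            printf '%s\n' "$CONTENT"
        fi
    done
}

tag_contents Key <<'EOF'
<Contents>
<Key>item-apple-iso@2x.png</Key>
<Size>1785</Size>
</Contents>
EOF
# prints: item-apple-iso@2x.png
```

The comparison is against the whole ENTITY, so a tag that carries attributes, such as <Key id="1">, would not match this version.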
If for some reason local IFS=\> does not work for you and you set it globally instead, you should reset it at the end of the function:
read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}
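Another way to avoid touching the global IFS at all is to put the assignment directly in front of read, which bash applies to that one command only. A minimal sketch, again with -r added as my own choice:

```shell
#!/usr/bin/env bash
# IFS=\> as a prefix applies to this single read command only,
# so there is nothing to save or restore afterwards.
read_dom () {
    IFS=\> read -r -d \< ENTITY CONTENT
}

saved_ifs=$IFS
read_dom <<< 'Name>sth-items<'
printf '%s => %s\n' "$ENTITY" "$CONTENT"    # Name => sth-items
[[ $IFS == "$saved_ifs" ]] && echo "IFS unchanged"
```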
To split out the tag name and the attribute name/value pairs, you can extend read_dom() like this:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local ret=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $ret
}
Then write a function that parses out the data you want, for example:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then, whenever you call read_dom, call parse_dom:
while read_dom; do
parse_dom
done
Then, given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
you should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
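One caveat with eval local $ATTRIBUTES is that it executes whatever the attribute text contains, so a value such as "$(some command)" in an untrusted file would be run by the shell. As a hedged alternative (a different technique, not what the answer uses), the pairs can be pulled out with a regular expression into an associative array. This sketch assumes bash 4 or later and double-quoted attribute values only; parse_attributes and ATTR are hypothetical names.

```shell
#!/usr/bin/env bash
# Requires bash >= 4 for associative arrays.
declare -A ATTR

# Hypothetical helper: fill ATTR with name="value" pairs found in $1.
# Nothing is eval'ed, so attribute values are never executed.
parse_attributes () {
    local rest=$1
    local re='([A-Za-z_:][A-Za-z0-9_.:-]*)="([^"]*)"'
    ATTR=()
    while [[ $rest =~ $re ]]; do
        ATTR["${BASH_REMATCH[1]}"]="${BASH_REMATCH[2]}"
        # Drop everything up to and including the pair just matched.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

parse_attributes 'size="1789" type="unknown"'
echo "foo size is: ${ATTR[size]}"    # foo size is: 1789
```

Inside parse_dom, the call would be parse_attributes "$ATTRIBUTES", followed by reads of ${ATTR[size]} or ${ATTR[type]} instead of $size or $type.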
EDIT 3: another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local RET=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $RET
}
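I don't see any reason why that shouldn't work.
IFS=\< read ... sets IFS only for the read call. (Note that I am in no way endorsing the practice of using read to parse XML; I believe doing so is fraught with peril and should be avoided.) - William Pursell
Command-line tools that can be called from shell scripts include: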
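xpath - a command-line wrapper around Perl's XPath library
sudo apt-get install libxml-xpath-perl
Xidel - works with URLs as well as files, and also handles JSON
I also use xmllint and xsltproc with small XSL transform scripts to process XML from the command line or in shell scripts.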
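As a rough illustration of the xmllint route, the sketch below extracts a title with an XPath string() expression. It assumes an xmllint build that supports the --xpath option and a document without a default namespace (real XHTML declares one, in which case /html/head/title matches nothing). xml_title is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of /html/head/title from file $1.
# Assumes xmllint (from libxml2) with --xpath support is installed.
xml_title () {
    xmllint --xpath 'string(/html/head/title)' "$1"
}

if command -v xmllint >/dev/null 2>&1; then
    sample=$(mktemp)
    printf '%s\n' '<html><head><title>Example</title></head><body/></html>' > "$sample"
    xml_title "$sample"
    rm -f "$sample"
else
    echo "xmllint is not installed; skipping the demo" >&2
fi
```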
I use the command xpath -e 'xpath/expression/here' $filename and add -q to show only the output, so that it can be piped elsewhere or saved to a variable. - phyatt
rdom () { local IFS=\> ; read -d \< E C ;}
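Now you can use rdom like read, but for HTML documents.
When called, rdom assigns the element to the variable E and the content to the variable C.
For example, to do what you wanted to do: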
while rdom; do
if [[ $E = title ]]; then
echo $C
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
You can use the xpath utility:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on openSUSE, use:
sudo zypper install xmlstarlet
or try cnf xml on other platforms.
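For comparison, a hedged XMLStarlet sketch of the same title extraction. It assumes the binary is installed under the name xmlstarlet (some distributions ship it as xml) and that the document has no default namespace. title_with_xmlstarlet is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of /html/head/title from file $1.
# sel = select, -t = template, -v = value of the XPath, -n = trailing newline.
title_with_xmlstarlet () {
    xmlstarlet sel -t -v '/html/head/title' -n "$1"
}

if command -v xmlstarlet >/dev/null 2>&1; then
    sample=$(mktemp)
    printf '%s\n' '<html><head><title>Example</title></head><body/></html>' > "$sample"
    title_with_xmlstarlet "$sample"
    rm -f "$sample"
else
    echo "xmlstarlet is not installed; skipping the demo" >&2
fi
```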
xpath is not suitable for use as a component in scripts. See e.g. https://dev59.com/FmUp5IYBdhLWcg3wLVOn for details. - tripleee
Use the apt-get install xmlstarlet command. - rubo77
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
apt-get install libxml-xpath-perl - tres.14159
Check out XML2 from http://www.ofb.net/~egnor/xml2/, which converts XML to a line-oriented format.
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
It also has a cool feature to export multiple variables to bash. For example:
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, which handle input like the above. The first one:
xml_read_dom() {
# https://dev59.com/8HNA5IYBdhLWcg3wmfO5
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
The second one:
xml_read() {
# https://dev59.com/8HNA5IYBdhLWcg3wmfO5
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s\ " "\${attribute2print}" "\${${XAPPLIED_COLOR}}\"\$(\$XCOMMAND \$${attribute})\"\${END}" && eval unset ${attribute}
else
eval ${print} "%s%s\ " "\${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a newline only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "\${g}\$(\$XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^\"]*([\"][^\"]*[\"][,]?)[^\"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
Before you start, you need to define some pretty color variables and export them:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} \"\\033\[38\;5\;$((232 + i))m\")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'
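If you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash), you can use that approach.
If not, just copy/paste everything on the command line.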
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
In debug mode (-d), comments and parsed attributes are printed to stderr.
./read_xml.sh: line 22: (-1): substring expression < 0? - khmarbaise
[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ... - khmarbaise
yq can be used for XML parsing (required version for the examples below: >= 4.30.5).
It is a lightweight and portable command-line YAML processor that can also handle XML. The syntax is similar to jq.
Input
<root>
<myel name="Foo" />
<myel name="Bar">
<mysubel>stairway to heaven</mysubel>
</myel>
</root>
Example 1
yq --input-format xml '.root.myel.0.+@name' $FILE
Foo
Example 2
yq has a nice built-in feature to make XML easily grep-able:
yq --input-format xml --output-format props $FILE
root.myel.0.+@name = Foo
root.myel.1.+@name = Bar
root.myel.1.mysubel = stairway to heaven
Example 3
yq can also convert XML input into JSON or YAML:
yq --input-format xml --output-format json $FILE
{
"root": {
"myel": [
{
"+@name": "Foo"
},
{
"+@name": "Bar",
"mysubel": "stairway to heaven"
}
]
}
}
yq --input-format xml $FILE (YAML is the default format)
root:
myel:
- +@name: Foo
- +@name: Bar
mysubel: stairway to heaven
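I am not aware of any pure shell XML parsing tool, so you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool, xml_grep, with which you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of XML).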
echo '<html><head><title>Example</title></body></html>' | yq -p xml '.html.head.title' outputs Example. See: yq, some examples. - jpseng