Finding elements by attribute with lxml

Question

66

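I need to parse an XML file to extract some data. I only need some of the elements, namely those with a certain attribute. Here is a sample document: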

<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>

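Here I would like to get only the articles whose type is "news". What is the most efficient and elegant way to do this with lxml?

I tried the find method, but the result is not very nice: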

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

- Jérôme Pigeot

2 Answers

19

For reference, you can get the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

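# find() returns the first <articles> child of the root
articles = root.find("articles")
# The predicate [@type='news'] keeps only the news articles
article_list = articles.findall("article[@type='news']/content")
for a in article_list:
    print(a.text)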

- Kjir

How does this work if the attribute has a namespace? For example, what if the type attribute in the sample above were imx:type, where imx='https://some.namespace.imx'? - Alex Raj Kaliamoorthy

In that case you can pass findall a namespaces argument containing a dict of prefix/namespace mappings, e.g. articles.findall("article[@imx:type='news']/content", namespaces=root.nsmap), or you can build the dict manually, e.g. namespaces={"imx": "https://some.namespace.imx"}. - L0tad
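
To make the namespaced case concrete, here is a minimal sketch. It assumes the sample document declares the imx prefix on the root element and that every article carries imx:type instead of type; the element texts ("first", "second", "third") are made up so the matches can be told apart.

```python
from lxml import etree

# Assumed namespace URI, taken from the comment above
IMX_NS = "https://some.namespace.imx"

# Hypothetical variant of the sample document with a namespaced attribute
root = etree.fromstring("""
<root xmlns:imx="https://some.namespace.imx">
    <articles>
        <article imx:type="news"><content>first</content></article>
        <article imx:type="info"><content>second</content></article>
        <article imx:type="news"><content>third</content></article>
    </articles>
</root>
""")

# Prefix-to-URI mapping built by hand; the prefix only has to match
# the one used in the path expression below
namespaces = {"imx": IMX_NS}

articles = root.find("articles")
contents = articles.findall("article[@imx:type='news']/content",
                            namespaces=namespaces)
texts = [c.text for c in contents]
print(texts)

# The same lookup through full XPath, which also takes a namespaces mapping
xpath_texts = root.xpath("//article[@imx:type='news']/content/text()",
                         namespaces=namespaces)
print(xpath_texts)
```

Building the mapping by hand keeps the query independent of whichever prefix the document happens to use for that URI.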


- Devin Jeanpierre · Accepted Answer

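You can use XPath, e.g. root.xpath("//article[@type='news']"). This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over it to do whatever you need, or pass it wherever it is needed.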
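
As a rough sketch of that iteration, the snippet below loops over the matched elements and collects each article's type and content text. The element texts and the summaries list are illustrative assumptions, not part of the original sample.

```python
from lxml import etree

# Sample document from the question, with distinct texts (an assumption)
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# xpath() returns a plain Python list of matching elements
news_articles = root.xpath("//article[@type='news']")

summaries = []
for article in news_articles:
    # get() reads an attribute; findtext() returns the text of the first
    # matching child element
    summaries.append((article.get("type"), article.findtext("content")))

print(summaries)
```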

To get just the text content, you can extend the XPath like so:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

Running this prints ['some text', 'some text']. Or, if you want just the content elements, the expression would be "//article[@type='news']/content", and so on.
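
A small sketch of the element-returning variant, assuming the same document shape with made-up texts. Selecting elements rather than text() can be handy when you still need to navigate from the match, for example back to the parent article.

```python
from lxml import etree

# Sample document from the question, with distinct texts (an assumption)
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# Without the trailing /text() step the result is a list of elements
content_elements = root.xpath("//article[@type='news']/content")

tags = [el.tag for el in content_elements]
texts = [el.text for el in content_elements]
# getparent() is lxml-specific and walks back up to the enclosing <article>
parent_types = [el.getparent().get("type") for el in content_elements]

print(tags, texts, parent_types)
```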