BeautifulSoup extraction

Question

3

I am running into a small "problem" when using BeautifulSoup together with the re module.

Specifically, here is the code:

import re

from bs4 import BeautifulSoup

string = """
<div id="my_id">
    <ul>
        <li>something</li>
        <li class="color12">something</li>
        <li class="color45">something else</li>
    </ul>
</div>
"""
soup = BeautifulSoup(string)
li = soup.find_all('li', {'class': re.compile('color(\d+)')} )
for ele in li:
    print(ele['class'])  # prints the colorXXXX class, but I would like to get only the XXXX part

But I only want to extract the number after "color". Is that possible, or do I have to use something like the following:

match = re.search(r'color(\d+)', str(ele['class']))
if match:
    print(match.group(1))

Thanks for your help :).

- mosqui

1 Answer

- Martijn Pieters · Accepted Answer

3

You need to re-apply the regular expression. Just store it in a variable and reuse it:

colorpattern = re.compile(r'color(\d+)')

li = soup.find_all('li', {'class': colorpattern})
for ele in li:
    # bs4 returns 'class' as a list of names, so join it into one string before searching
    print(colorpattern.search(' '.join(ele['class'])).group(1))
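
For reference, here is a self-contained sketch of the same idea for Python 3 with bs4. It assumes the `html.parser` backend, and the names `HTML`, `COLOR_PATTERN`, and `color_numbers` are hypothetical, chosen only for this illustration. Because bs4 exposes the multi-valued `class` attribute as a list, the sketch checks each class name separately rather than searching the attribute directly.

```python
import re

from bs4 import BeautifulSoup

HTML = """
<div id="my_id">
    <ul>
        <li>something</li>
        <li class="color12">something</li>
        <li class="color45">something else</li>
    </ul>
</div>
"""

# Compile once; the same pattern filters the tags and extracts the digits.
COLOR_PATTERN = re.compile(r'color(\d+)')


def color_numbers(html):
    """Return the digits that follow 'color' in each matching <li> class."""
    soup = BeautifulSoup(html, 'html.parser')
    numbers = []
    for tag in soup.find_all('li', class_=COLOR_PATTERN):
        # 'class' is a list of class names in bs4, so test each one.
        for name in tag['class']:
            match = COLOR_PATTERN.search(name)
            if match:
                numbers.append(match.group(1))
    return numbers


print(color_numbers(HTML))
```

Checking each class name individually also copes with elements that carry several classes, such as `class="item color12"`.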

- Martijn Pieters

If the regular expression contains backslashes, use r''. - jfs
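
A small illustration of why the raw-string prefix matters. The `\b` token is an example chosen here, not something from the thread: in a normal string literal Python converts `\b` to a backspace character before `re` ever sees it, whereas `\d` happens to survive only because Python does not treat it as a string escape. The variable names are hypothetical.

```python
import re

# In a normal literal, '\b' becomes the single backspace character (0x08).
plain = '\bcolor(\\d+)'
# In a raw literal the backslash is kept, so re sees the word-boundary token.
raw = r'\bcolor(\d+)'

text = 'class color12'

plain_match = re.search(plain, text)  # looks for a literal backspace
raw_match = re.search(raw, text)      # word boundary, then 'color' and digits

print(plain_match)
print(raw_match.group(1))
```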

@J.F.Sebastian: I should always watch for that when I copy and paste from the original post... added. - Martijn Pieters

Why does storing the compiled pattern in a variable make a difference? - Reorx

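@Reorx: It saves us from specifying the pattern twice. Because BS has already matched the tags for us, we can be sure that .search() will succeed inside the loop, so there is no need to test for None there. - Martijn Pieters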

@MartijnPieters Sorry, I misread the question as asking why find_all only returns one result. Silly of me! Anyway, thanks for your answer :) - Reorx