Python regular expressions for Beautiful Soup

Question

5

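I'm using Beautiful Soup to pull out specific div tags, and it seems that I can't use simple string matching.

The page has some tags of the form

<div class="comment form new"...>

which I want to ignore, and also some tags of the form

<div class="comment comment-xxxx...">

where the x's stand for an integer of arbitrary length and the ellipsis stands for an arbitrary number of other space-separated values (which I don't care about). I can't figure out the correct regex, especially since I've never used Python's re module.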

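Using

soup.find_all(class_="comment")

finds all of the tags whose class begins with the word "comment". I've tried using

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and many other variations, but I think I'm missing something obvious about how regular expressions or match() work. Can anyone help me out?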

- user1890572

1

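First, are you using BS3 or BS4? One has findAll, the other has find_all, and neither has findall... - abarnert

Sorry, BS4. I didn't paste directly from my code; I'll edit. - user1890572

Damn, because I have an answer for BS3... but BS4 doesn't seem to like spaces in the class, or maybe I just don't know BS4 well enough yet. I can match 'comment', but not 'comment comment'. I'll look into it further. - abarnert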

1 Answer


- abarnert · Accepted Answer

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Note that, unlike the equivalent in BS3, the result is not this:

['comment form new', 'comment comment-xxxx...']

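That's why your regexes don't match: BS4 exposes class as a list of separate values, not as one space-separated string.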
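To make the list-of-values behaviour concrete, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical stand-in for the real page (the number 12345 and the extra class "odd" are made up), and the hand-written filter is just one way to work with the token lists, not something the answer itself prescribes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class values are placeholders, not the real page.
SAMPLE_HTML = """
<div class="comment form new">reply form</div>
<div class="comment comment-12345 odd">first comment</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# BS4 treats class as a multi-valued attribute: each div yields a list of tokens.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)

# Filter by hand on the individual tokens, with no regex involved.
wanted = [
    div for div in soup.find_all("div")
    if any(token.startswith("comment-") for token in div.get("class", []))
]
wanted_texts = [div.get_text() for div in wanted]
print(wanted_texts)
```

Because every token is a separate string, a test such as startswith("comment-") only ever sees one class value at a time.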

But you can match on a single class value, for example:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]
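The same call can be tried end to end on a small document. This is only a sketch: SAMPLE_HTML and its class values are invented for illustration, and "html.parser" is chosen here simply because it ships with Python.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="comment form new">reply form</div>
<div class="comment comment-12345 odd">first comment</div>
<div class="comment comment-678">second comment</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only divs that have a class value containing "comment-".
comment_divs = soup.find_all("div", class_=re.compile("comment-"))
texts = [div.get_text() for div in comment_divs]
print(texts)
```

The "comment form new" div is skipped because none of its class values contains "comment-".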

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you can use 'comment-\d+'.
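The difference between searching and matching can be checked with the re module alone. The token list below is hypothetical, and the fullmatch variant at the end is an extra illustration for cases where the whole class value must have the comment-NNN shape; it is not part of the answer above.

```python
import re

# Hypothetical class values, as they would appear one token at a time.
tokens = ["comment", "comment-12345", "comment-of-another-kind", "new-comment-7"]

loose = re.compile(r"comment-")
numeric = re.compile(r"comment-\d+")

# search() looks anywhere in the string, so a trailing ".*" adds nothing.
searched = [t for t in tokens if loose.search(t)]

# match() only looks at the start of the string.
matched = [t for t in tokens if loose.match(t)]

# Requiring digits after the hyphen rejects "comment-of-another-kind".
numeric_hits = [t for t in tokens if numeric.search(t)]

# fullmatch() requires the entire token to have the comment-NNN shape.
strict = [t for t in tokens if re.fullmatch(r"comment-\d+", t)]

print(searched)
print(matched)
print(numeric_hits)
print(strict)
```

With search semantics, "new-comment-7" still passes the 'comment-\d+' pattern because the match may start mid-string; anchoring the pattern (for example '^comment-\d+$') would exclude it.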