How can I find comment tags <!--...--> using BeautifulSoup?

19

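I tried soup.find('!--'), but it doesn't seem to work. Thanks in advance.

Edit: Thanks for the tip on how to find all comments. I have a follow-up question: how do I search for one specific comment?

For example, I have the following comment tag: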

<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

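I only want this part: <i>Wednesday 110518</i>. "110518" is the date in YYMMDD format, and I want to use it as my search target. However, I don't know how to find something inside a specific comment tag.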

2 Answers

24
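You can find all the comments in a document with the findAll method. See this example, which shows how to do exactly what you are trying to do: Removing elements
In short, you want this: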
from bs4 import Comment  # BeautifulSoup 4; on BeautifulSoup 3: from BeautifulSoup import Comment
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
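
As a minimal sketch of the approach above, assuming BeautifulSoup 4 (the `bs4` package, where `find_all` and `string=` are the current spellings of `findAll` and `text=`) and a made-up sample document, the following shows why a tag-name lookup fails and how the comments can be collected instead:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample document, for illustration only.
html = """
<html><body>
<!-- first comment -->
<p>Visible text</p>
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Comments are text nodes (Comment is a string type), not tags,
# so looking one up by tag name finds nothing.
tag_lookup = soup.find("!--")

# Select the text nodes whose type is Comment.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for comment in comments:
    # Each Comment behaves like a str holding the text between <!-- and -->.
    print(repr(str(comment)))
```

Note that the markup inside a comment is not parsed; it arrives as one plain string, which is why the follow-up question needs a second step.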

Edit: If you are trying to search inside the comments, you can try the following:

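import re
from bs4 import Comment  # BeautifulSoup 4

comments = soup.findAll(text=lambda text:isinstance(text, Comment))
for comment in comments:
  # re.search rather than re.match: the <i> tag is not at the start of the comment
  m = re.search(r'<i>([^<]*)</i>', comment)
  if m:
    print(m.group(1))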

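How do I search for a specific comment? I am trying to search an HTML file for the following: <!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> --> Note the 110518, which is just the date as yymmdd. How do I search only the information inside that comment tag, specifically what is inside <i></i>? - 1stsage
@1stsage Perhaps you want to add that requirement to your question. - Steven Rumbalski
1stsage, I have updated my post for your specific case. Next time, please make sure your question covers what you are trying to do. - yan
@1stsage Regarding searching the comment contents: if it is valid HTML, you could parse it as well. Or you could use string methods, or even a regular expression. For such a small block of text and such simple requirements, I would go with a regular expression (something like r'\<i\>(.*?)\</i\>'). - Steven Rumbalski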
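
The regular-expression route suggested in the last comment might look roughly like the sketch below. It uses only the standard library; the helper name `find_italic_for_date` and the sample comment strings are hypothetical, standing in for the text of each comment found by BeautifulSoup.

```python
import re

# Matches the text between <i> and </i> (non-greedy).
ITALIC_RE = re.compile(r"<i>(.*?)</i>")

def find_italic_for_date(comment_texts, yymmdd):
    """Return the <i>...</i> contents of the comments that mention the date."""
    results = []
    for text in comment_texts:
        # search() scans the whole string; match() would only try position 0.
        match = ITALIC_RE.search(text)
        if match and yymmdd in match.group(1):
            results.append(match.group(1))
    return results

# Hypothetical comment bodies, e.g. str(c) for each Comment node.
comment_texts = [
    ' <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> ',
    ' <span class="titlefont"> <i>Thursday 110519</i>(05:00PM)<br /></span> ',
    " an unrelated comment ",
]

print(find_italic_for_date(comment_texts, "110518"))  # prints ['Wednesday 110518']
```

Filtering on the captured group rather than on the whole comment keeps a date that appears elsewhere in the comment from producing a false match.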

0
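Pyparsing lets you search for HTML comments using the built-in htmlComment expression, and attach parse-time callbacks to validate and extract the individual data fields inside the comment: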
from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar

# have pyparsing define tag start/end expressions for the 
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")

# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class':'titlefont'}))

# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd

# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case 
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)


# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->

not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->

another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and 
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()

Output:

Wednesday 110518
Wednesday
110518

Thursday 110521
Thursday
110521

The latest versions of pyparsing now include withClass, to simplify that ugly withAttribute code. - PaulMcG
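
A minimal sketch of what that simplification could look like, assuming a pyparsing version that provides `withClass` under its camelCase name; the sample input is made up for illustration:

```python
from pyparsing import makeHTMLTags, withClass

span, spanEnd = makeHTMLTags("span")

# Same intent as withAttribute(**{'class': 'titlefont'}):
# reject any <span> whose class attribute is not "titlefont".
span.addParseAction(withClass("titlefont"))

# Hypothetical input, for illustration only.
sample = '<span class="titlefont">keep</span> <span class="bodyfont">skip</span>'

matches = span.searchString(sample)
for tag in matches:
    # Attributes of the matched opening tag are available by name.
    print(tag["class"])
```

The dedicated helper exists because `class` is a Python keyword and cannot be passed to withAttribute as a plain keyword argument, which is what forces the `**{...}` spelling in the answer above.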
