How to loop over an HTML table dataset in Python

Question


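First-time poster here, trying to pick up some Python skills; please be kind to me :-)

While I'm not a complete stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly comes down to the fact that I lack most, if not all, basic understanding of common "design patterns" and such.

That said, here is the problem: part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one laid out below.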

<table>
    <tr>
        <td class="date">2011-01-01</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr>
        <td class="date">2011-01-02</td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
    <tr class="item">
        <td class="headline">Headline</td>
        <td class="link"><a href="#">Link</a></td>
    </tr>
</table>

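The main issue is that I can't get my head around how to do three things: 1) keep track of the current date (tr->td class="date") while looping over the rows that follow it; 2) loop over the items in the subsequent rows (tr class="item"->td class="headline" and tr class="item"->td class="link"); 3) store the processed data in an array.

Additionally, all data will be inserted into a database, where each entry must contain the following information:
- date
- headline
- link

Note that CRUD-ing the database is not part of the problem; I only mention it to better illustrate what I'm trying to accomplish :-)

Now, there are many ways to skin this cat. So while solutions to the problem at hand are very welcome, I'd be extremely grateful if someone would elaborate on the actual logic and strategy you would use to "attack" this kind of problem :-)

Last but not least, sorry for such a noobish question.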

- Mattias

2 Answers


You can use Element Tree, which is included with Python: http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()     # Returns the top-level "table" element
print(root.tag)           # "table"
for eachTableRow in root:
    # Iterating over root yields each <tr> element in document order
    # (the older root.getchildren() was removed in Python 3.9).
    # So we're going to loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

That should be enough to get you started on parsing.
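
For illustration, the two `pass` branches in the skeleton above could be filled in roughly as follows. This is only a sketch, not part of the original answer: it assumes the markup is well-formed XML (Element Tree will reject ordinary tag-soup HTML), it parses an inline copy of the question's sample instead of 'page.xhtml', and the names SAMPLE and parse_rows, as well as the distinct headline texts and href values, are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table from the question. The headline texts and
# href values are hypothetical, changed so the rows are distinguishable.
SAMPLE = """
<table>
    <tr><td class="date">2011-01-01</td></tr>
    <tr class="item">
        <td class="headline">Headline A</td>
        <td class="link"><a href="#a">Link</a></td>
    </tr>
    <tr><td class="date">2011-01-02</td></tr>
    <tr class="item">
        <td class="headline">Headline B</td>
        <td class="link"><a href="#b">Link</a></td>
    </tr>
</table>
"""


def parse_rows(xhtml):
    """Return a list of (date, headline, href) tuples, one per item row."""
    root = ET.fromstring(xhtml.strip())  # root is the <table> element
    records = []
    current_date = None
    for row in root:  # iterating an Element yields its direct children
        if row.get("class") == "item":
            # Item row: pair it with the most recently seen date.
            headline = row.find("td[@class='headline']").text
            href = row.find("td[@class='link']/a").get("href")
            records.append((current_date, headline, href))
        else:
            # Any other row may carry a new date.
            date_cell = row.find("td[@class='date']")
            if date_cell is not None:
                current_date = date_cell.text
    return records


records = parse_rows(SAMPLE)
for record in records:
    print(record)
```

The only state carried between rows is current_date, which is what keeps each item attached to the date row that precedes it.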

- user407896

Hi, thanks for your input. I'm currently using beautifulsoup for the scraping, but I might look into Element Tree sometime soon. Cheers! :-) - Mattias

- Hugh Bothwell · Accepted Answer

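The basic problem is that this table is marked up for appearance, not for semantic structure. Properly done, each date and its related items would share a parent element. Unfortunately they don't, so we'll have to make do.

The basic strategy is to iterate over every row in the table:

- If the first table cell has the class "date", read the date value and update last_seen_date.
- Otherwise, extract the headline and link, then save (last_seen_date, headline, link) to the database.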

# BeautifulSoup 4 (the bs4 package); the older `import BeautifulSoup`
# refers to version 3, which only runs on Python 2.
from bs4 import BeautifulSoup

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:     # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:                   # get new date
        last_seen_date = daterow.text
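
Once the loop has run, items is a flat list of (date, headline, link) tuples, which maps directly onto one database row per entry. If you would rather have the "shared parent" structure the markup lacks, the tuples can be regrouped by date afterwards. The following is a minimal sketch using only the standard library; sample_items and group_by_date are hypothetical names, and the sample tuples are made up to mirror the shape of items.

```python
from collections import defaultdict

# Hypothetical data in the same (date, headline, link) shape as `items`.
sample_items = [
    ("2011-01-01", "Headline 1", "#1"),
    ("2011-01-01", "Headline 2", "#2"),
    ("2011-01-02", "Headline 3", "#3"),
]


def group_by_date(rows):
    """Map each date to the list of entries that appeared under it."""
    grouped = defaultdict(list)
    for date, headline, link in rows:
        grouped[date].append({"headline": headline, "link": link})
    return dict(grouped)  # plain dict; keeps first-seen date order


by_date = group_by_date(sample_items)
for date, entries in by_date.items():
    print(date, entries)
```

Keeping the flat list for the database and deriving the grouped view only when needed avoids storing the same data in two shapes.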