BeautifulSoup HTML table parsing

Question


python beautifulsoup html-table html-parsing mechanize

18

I am trying to parse information (an HTML table) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

Currently I am using BeautifulSoup, and my code looks like this:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table")

rows = table.findAll('tr')[3]

cols = rows.findAll('td')

roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)

print entry

The problem is with the start and end columns. They are just printed as "None".

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

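I know they are stored in the cols list, but it seems the extra link tag in those cells is confusing the parsing. The original HTML looks like this: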

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any help or suggestions would be greatly appreciated. Thanks in advance.

- Stephen Tanner

You don't have to use Beautiful Soup. You can use the python3 htmlparser: https://github.com/schmijos/html-table-parser-python3/blob/master/html_table_parser/parser.py - schmijos
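To illustrate the idea behind this comment, here is a minimal sketch built only on the standard library's `html.parser.HTMLParser`. It is not the code from the linked repository; the class name `TableCellParser` and the trimmed sample row are made up for illustration. Because `handle_data` also receives text nested inside `<a>`, the link cells come out as ordinary strings.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Trimmed copy of the row from the question (attributes shortened).
sample = """
<table>
<tr>
<td headers="road-type">Rt. 613N (Giles County)</td>
<td headers="start"><a href="#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="condition">Moderate</td>
</tr>
</table>
"""

parser = TableCellParser()
parser.feed(sample)
parser.close()
print(parser.rows)
```

This sketch ignores `<th>` cells and nested tables; a real page would likely need both handled.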

2 Answers

2

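I tried to reproduce your error, but the source HTML page has changed.

As for the error itself, I ran into a similar problem when trying to reproduce this example with the provided URL changed to a Wikipedia table.

I fixed it by using BeautifulSoup4: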

from bs4 import BeautifulSoup

and changing .string to .get_text():

start = cols[1].get_text()

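I could not test this against your example (as I said above, I could not reproduce the error), but I think it may be useful to anyone looking for a solution to this problem.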
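A self-contained way to try this out, assuming Python 3 with the `beautifulsoup4` package installed, is to parse a copy of the row from the question instead of the live page. The `href` values are shortened here, and the `mixed` cell at the end is invented to show a case where `.string` is `None` in BeautifulSoup 4 as well (a tag with more than one child), while `.get_text()` still returns the combined text.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Row copied from the question, with the href values shortened.
row_html = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr></table>
"""

soup = BeautifulSoup(row_html, "html.parser")
cols = soup.find("tr").find_all("td")

# get_text() gathers the text of all descendants, so the <a> wrapper is irrelevant.
entry = tuple(col.get_text(strip=True) for col in cols)
print(entry)

# Invented cell with two children: a link followed by plain text.
mixed = BeautifulSoup("<td><a href='#'>Big Stony Ck Rd</a> (closed)</td>", "html.parser").td
print(mixed.string)      # None, because the tag has more than one child
print(mixed.get_text())  # Big Stony Ck Rd (closed)
```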

- evinhas


- Antony Hatchkins · Accepted Answer

start = cols[1].find('a').string

or, more simply,

start = cols[1].a.string

or, better,

start = str(cols[1].find(text=True))

and

entry = [str(col.find(text=True)) for col in cols]
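The same per-cell lookup can be applied to every row of a table. The following is only a sketch, assuming Python 3 and BeautifulSoup 4 (where `findAll` is spelled `find_all` and the `text=` argument is spelled `string=`); the small table is a stand-in, and its second data row is invented to show an empty cell.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in table; the second data row is invented for illustration.
html = """
<table>
<tr><th>Road</th><th>Start</th><th>Condition</th></tr>
<tr><td>Rt. 613N (Giles County)</td><td><a href="#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td><td>Moderate</td></tr>
<tr><td>Rt. 1 (Example County)</td><td><a href="#viewmap">Example Rd</a></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

entries = []
for row in soup.find("table").find_all("tr"):
    cols = row.find_all("td")
    if not cols:  # header rows contain only <th> cells
        continue
    # find(string=True) returns the first text node at any depth,
    # or None for an empty cell, which is mapped to "" here.
    entries.append(tuple(str(col.find(string=True) or "") for col in cols))

for entry in entries:
    print(entry)
```

Note that `find(string=True)` returns only the first text node of a cell; if a cell mixes a link with additional text, `get_text()` is the safer choice.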