Parsing an HTML table with BS4

Question

Parsing an HTML table with BS4

python-2.7 html-parsing web-scraping beautifulsoup

5

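I have been trying different ways to scrape data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=), but none of them seems to work. I tried working with the given indexes but couldn't get that to work either. I think I've tried too many approaches at this point, so if anyone can point me in the right direction I would really appreciate it.

I want to extract all the information and export it to a .csv file, but for now I'm just trying to print the name and position to get started.

Here is my code: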

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')

for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)

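This is the error I get: line 14, name = col[1].string IndexError: list index out of range.

--UPDATE-- OK, I've made a little progress. It now runs from start to finish, but it requires knowing how many rows are in the table. How do I get it to just iterate through to the end? Updated code: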

import urllib2
from bs4 import BeautifulSoup
import re

url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')


for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)

- ISuckAtLife

3 Answers

2

I can't run your script because of firewall restrictions, but I believe the problem is on this line:

col = row.findAll('tr')

row is a tr tag, and you're asking BeautifulSoup to find all tr tags inside that tr tag. What you probably meant to do is:

col = row.findAll('td')

Also, since the actual text isn't directly inside the td but is buried in nested div and a tags, it may be more useful to use the getText method instead of .string.

name = col[1].getText()
position = col[3].getText()

- Kevin
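
To illustrate the difference between .string and getText mentioned in this answer, here is a minimal sketch for Python 3. The HTML string is invented for illustration and is not the real page markup, so which of the two cases applies to the actual site is an assumption. In bs4, .string follows a chain of single children but returns None as soon as a tag has more than one child, while get_text() (the newer spelling of getText()) always joins all nested text.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first cell wraps its text in a single div > a chain,
# the second cell has several children (a tag, a space, a span).
html = (
    "<table><tr>"
    "<td><div><a href='#'>Player One</a></div></td>"
    "<td><a href='#'>WR</a> <span>(starter)</span></td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
single, mixed = soup.find_all("td")

# .string works through a chain of single children...
print(single.string)
# ...but is None when the tag has more than one child.
print(mixed.string)
# get_text() concatenates every nested string in both cases.
print(single.get_text(), "|", mixed.get_text())
```

Because get_text() never returns None, it is the safer choice before calling "|".join() on the results.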

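Ah, that makes a lot of sense. Thanks! OK, I made the changes you suggested and it printed most of the results on the page, so that's progress. It starts with Adrian Dingle rather than the first name in the column, but after that it prints the full list, including the | and the position. Then it returned this error: File "nfltest.py", line 14, in <module> name = col[1].getText() IndexError: list index out of range. I tried adjusting the indexes again but can't seem to get rid of the error. Is it just me, or is this table formatted strangely? - ISuckAtLife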
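
The IndexError reported in this comment is what one would expect when some tr rows contain fewer td cells than the index being read, for example a header row built from th cells or a short trailing row. A minimal sketch of guarding against that, written for Python 3 and run against a small inline HTML string instead of the live site. The markup, the player names, and the helper name extract_players are all invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: a header row of <th> cells, two data rows,
# and a short trailing row that would break col[3].
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Name</th><th>College</th><th>POS</th></tr>
  <tr><td>1999</td><td><div><a href="#">Player One</a></div></td><td>State</td><td>WR</td></tr>
  <tr><td>1999</td><td><div><a href="#">Player Two</a></div></td><td>Tech</td><td>QB</td></tr>
  <tr><td colspan="4">footer</td></tr>
</table>
"""


def extract_players(html):
    """Return (name, position) tuples, skipping rows that are too short."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    players = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows have no <td> at all and the footer has only one,
        # so both are skipped instead of raising IndexError.
        if len(cells) < 4:
            continue
        name = cells[1].get_text(strip=True)
        position = cells[3].get_text(strip=True)
        players.append((name, position))
    return players


for player in extract_players(SAMPLE_HTML):
    print("|".join(player))
```

With a length check like this the loop can run over every row, so there is no need to know the row count in advance or to hard-code a slice end such as 250.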

0

A simple way to parse a table by column:

def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))

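import requests
from bs4 import BeautifulSoup

# url is the page from the question; headers is an assumed placeholder,
# since the original answer did not define either name.
url = 'http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college='
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text, 'html.parser')
all_tables = bs.find_all('table')
table_to_list(all_tables[0])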

- sankalp
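
The last line of table_to_list relies on zip to turn rows into columns. A minimal sketch of just that step, in Python 3, using hypothetical header and row values in place of scraped cells:

```python
# Hypothetical values standing in for the text that table_to_list
# collects from the <th> and <td> cells.
all_heads = ["Name", "POS", "Height"]
data = [
    ["Player One", "WR", "72"],
    ["Player Two", "QB", "75"],
]

# zip(*data) transposes the rows into columns; passing all_heads first
# puts each header at the front of its column tuple.
columns = list(zip(all_heads, *data))
for column in columns:
    print(column)

# zip stops at its shortest argument, so one short row truncates
# every column to the length of that row.
data.append(["footer"])
truncated = list(zip(all_heads, *data))
print(truncated)
```

The truncation in the second half is worth keeping in mind for tables that end with a short summary or footer row: filtering such rows out before the zip call avoids silently losing columns.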

In new code, avoid the old findAll() syntax and use find_all() instead. For more information, take a minute to check the docs. - HedgeHog

- ISuckAtLife · Accepted Answer

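I figured it out after only about 8 hours. Learning is fun. Thanks for your help, Kevin! The code now also writes the scraped data to a CSV file. Next up is taking that data and filtering for certain positions...

Here is my code: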

import urllib2
from bs4 import BeautifulSoup
import csv

url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')

page = urllib2.urlopen(url).read()

soup = BeautifulSoup(page)
table = soup.find('table')

f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])
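# index of the last row; slicing [1:x] skips the header row and stops
# before the last row, which apparently lacks the expected cells
x = (len(table.findAll('tr')) - 1)
# loop over the data rows only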
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone, )
    f.writerow(player)
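
The slicing and CSV-writing steps of the code above can be sketched in Python 3 without any network access. The rows here are hypothetical lists of cell text, the helper name write_players is invented, and an in-memory io.StringIO buffer stands in for the output file so the result can be inspected directly.

```python
import csv
import io

HEADER = ["Name", "Position"]


def write_players(rows, out):
    """Write name and position for each data row to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    # rows[1:-1] selects the same rows as rows[1:x] with x = len(rows) - 1:
    # everything except the first (header) and last (footer) row.
    for cells in rows[1:-1]:
        writer.writerow([cells[1], cells[3]])


# Made-up cell text: header row, two data rows, short trailing row.
rows = [
    ["Year", "Name", "College", "POS"],
    ["2000", "Player One", "State", "WR"],
    ["2000", "Player Two", "Tech", "QB"],
    ["footer"],
]

buffer = io.StringIO()
write_players(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file in Python 3, opening it with `with open("2000scrape.csv", "w", newline="") as f:` and passing f to write_players closes the file automatically and avoids blank lines between rows on Windows.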