Read each row of an HTML table into a Python list

Question

3

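I am trying to scrape a web page with Python and need to extract data from its HTML tables. I am using BeautifulSoup for the scraping. The page contains many tables, and each table has many rows. I want each row to be stored under its own name, and if a row contains columns, I want them split into separate values.

My code is as follows: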

page = get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')
result = defaultdict(list)
tables = html.find_all('table')
for table in tables:
    for row in table.find_all('tr')[0:15]:
        try:
            # stuck here
            pass
        except ValueError:
            continue  # blank/empty row

I would appreciate some guidance on this.

- lakshmen

4 Answers

1

Please check the following code and let me know if you run into any problems.

import requests
from bs4 import BeautifulSoup
import pprint
page = requests.get("https://www.4dpredict.com/mysingaporetoto.p3.html")
html = BeautifulSoup(page.content, 'html.parser')

tables = html.find_all('table')
table_data = dict()
for table_id, table in enumerate(tables):
    print('[!] Scraping Table -', table_id + 1)
    table_data['table_{}'.format(table_id+1)] = dict()
    table_info = table_data['table_{}'.format(table_id+1)]
    for row_id, row in enumerate(table.find_all('tr')):
        col = []
        for val in row.find_all('td'):
            val = val.text
            val = val.replace('\n', '').strip()
            if val:
                col.append(val)
        table_info['row_{}'.format(row_id+1)] = col
    pprint.pprint(table_info)
    print('+-+' * 20)

pprint.pprint(table_data)

Sample output

[!] Scraping Table - 1
{'row_1': ['SINGAPORE TOTO2018-08-23 (Thu) 3399'],
 'row_10': ['Group 2', '$', '-'],
 'row_11': ['Group 3', '$1,614', '124'],
 'row_12': ['Group 4', '$344', '318'],
 'row_13': ['Group 5', '$50', '6,876'],
 'row_14': ['Group 6', '$25', '9,092'],
 'row_15': ['Group 7', '$10', '117,080'],
 'row_16': ['SHOW ANALYSISEVEN : ODD, 2 : 5SUM :138, AVERAGE :23 MIN :02, MAX '
            ':41, DIFF :39',
            'EVEN : ODD, 2 : 5',
            'SUM :138, AVERAGE :23',
            'MIN :02, MAX :41, DIFF :39'],
 'row_17': ['EVEN : ODD, 2 : 5'],
 'row_18': ['SUM :138, AVERAGE :23'],
 'row_19': ['MIN :02, MAX :41, DIFF :39'],
 'row_2': ['WINNING NUMBERS'],
 'row_3': ['02', '03', '23', '30', '39', '41'],
 'row_4': ['ADDITIONAL'],
 'row_5': ['19'],
 'row_6': ['Prize: $2,499,788'],
 'row_7': ['WINNING SHARES'],
 'row_8': ['Group', 'Share Amt', 'Winners'],
 'row_9': ['Group 1', '$1,249,894', '2']}
+-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-++-+

- Vijay Anand Pandian

0

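I would suggest dropping BeautifulSoup (beautiful as it is) in favor of pandas, which uses BeautifulSoup or lxml on the back end. What you describe is a standard operation in pandas; you only need to read the documentation.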
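A minimal sketch of what that could look like, assuming pandas and one of its HTML parser back ends (lxml, or BeautifulSoup with html5lib) are installed. `SAMPLE_HTML` and `tables_to_rows` are hypothetical names introduced here for illustration; the inline HTML stands in for the downloaded page so the example runs offline.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page; for the real site this
# would be the text returned by requests.get(url).
SAMPLE_HTML = """
<table>
  <tr><th>Group</th><th>Share Amt</th><th>Winners</th></tr>
  <tr><td>Group 1</td><td>$1,249,894</td><td>2</td></tr>
  <tr><td>Group 3</td><td>$1,614</td><td>124</td></tr>
</table>
"""


def tables_to_rows(html):
    """Return one list of rows per <table>; each row is a list of cell strings."""
    # read_html returns one DataFrame per table found in the document.
    frames = pd.read_html(StringIO(html))
    # Cast to str so numeric-looking cells come back as text, like the page shows.
    return [frame.astype(str).values.tolist() for frame in frames]


rows_per_table = tables_to_rows(SAMPLE_HTML)
for row in rows_per_table[0]:
    print(row)
```

Here pandas treats the leading row of `<th>` cells as the column header, so only the data rows end up in the lists. Whether this works as cleanly on the real page depends on how its tables are nested.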

- Igor Rivin

0

I would suggest using the requests.get() method instead of get().
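To make the suggestion concrete, here is a short sketch, assuming the requests library is installed. It shows that the two spellings refer to the same function, so the choice is about readability rather than behavior. `fetch_html` and its `timeout` default are hypothetical and not part of the question's code.

```python
import requests
from requests import get

# Both names are bound to the same function object, so get(url) and
# requests.get(url) behave identically once the import is in place.
print(get is requests.get)


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise on 4xx/5xx responses instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling a bare `get(...)` without `from requests import get` raises a `NameError`, which is the only case where the two forms differ.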

- Suresh

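Could you enhance your answer with some code, so that it is clear which line answers the OP's question? - Giulio Caccin

The OP appears to be using the requests library; he may simply have imported get from it, as in from requests import get. I still cannot find any connection between the question and your one-line comment. - SIM

Thanks for the suggestion, SIM. I am new to both Python and Stack Overflow, but I will work hard to learn and to solve problems. - Suresh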

- SIM · Accepted Answer

If I have understood your requirement correctly, the following script should do the job:

import requests
from bs4 import BeautifulSoup

url = 'https://www.4dpredict.com/mysingaporetoto.p3.html'

res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')
# Number every <tr> across all tables, starting at 1, and collect its <td> texts.
for num, row in enumerate(soup.select("table tr"), start=1):
    data = [f'{num}'] + [item.get_text(strip=True) for item in row.select("td")]
    print(data)

Partial output:

['1', 'SINGAPORE TOTO2018-08-23 (Thu) 3399']
['2', 'WINNING NUMBERS']
['3', '02', '03', '23', '30', '39', '41']
['4', 'ADDITIONAL']
['5', '19']
['6', 'Prize:$2,499,788']
['7', 'WINNING SHARES']
['8', 'Group', 'Share Amt', 'Winners']
['9', 'Group 1', '$1,249,894', '2']
['10', 'Group 2', '$', '-']
['11', 'Group 3', '$1,614', '124']
['12', 'Group 4', '$344', '318']
['13', 'Group 5', '$50', '6,876']
['14', 'Group 6', '$25', '9,092']
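The question also asks for each row to have its own name. One way to get there is to key every row by its table and row index, as in the sketch below. The key format `table_<i>_row_<j>`, the `rows_by_name` helper and the inline `SAMPLE_HTML` are assumptions made for illustration. The sample replaces the live page so the snippet runs without network access, and it uses the built-in `html.parser` so lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td>WINNING NUMBERS</td></tr>
  <tr><td>02</td><td>03</td><td>23</td></tr>
  <tr><td></td></tr>
</table>
<table>
  <tr><td>Group 1</td><td>$1,249,894</td><td>2</td></tr>
</table>
"""


def rows_by_name(html):
    """Map 'table_<i>_row_<j>' to the list of non-empty cell texts in that row."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for table_id, table in enumerate(soup.find_all('table'), start=1):
        for row_id, row in enumerate(table.find_all('tr'), start=1):
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            cells = [cell for cell in cells if cell]  # drop empty cells
            if cells:  # skip rows that turn out to be blank
                result[f'table_{table_id}_row_{row_id}'] = cells
    return result


named_rows = rows_by_name(SAMPLE_HTML)
for name, cells in named_rows.items():
    print(name, cells)
```

Note that `find_all('tr')` searches recursively, so on a page with nested tables the inner rows would appear under both the outer and the inner table. Passing `recursive=False` at the appropriate level is one possible way to avoid that, depending on the page's markup.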