Parsing an HTML table in Python with BeautifulSoup


HTML page structure:

<table>
    <tbody>
        <tr>
           <th>Timestamp</th>
           <th>Call</th>
           <th>MHz</th>
           <th>SNR</th>
           <th>Drift</th>
           <th>Grid</th>
           <th>Pwr</th>
           <th>Reporter</th>
           <th>RGrid</th>
           <th>km</th> 
           <th>az</th>
        </tr>
        <tr>
           <td align="right">&nbsp;2019-12-10 14:02&nbsp;</td>
           <td align="left">&nbsp;DL1DUZ&nbsp;</td>
           <td align="right">&nbsp;10.140271&nbsp;</td>
           <td align="right">&nbsp;-26&nbsp;</td>
           <td align="right">&nbsp;0&nbsp;</td>
           <td align="left">&nbsp;JO61tb&nbsp;</td>
           <td align="right">&nbsp;0.2&nbsp;</td>
           <td align="left">&nbsp;F4DWV&nbsp;</td>
           <td align="left">&nbsp;IN98bc&nbsp;</td>
           <td align="right">&nbsp;1162&nbsp;</td>
           <td align="right">&nbsp;260&nbsp;</td>
        </tr>
        <tr>
           <td align="right">&nbsp;2019-10-10 14:02&nbsp;</td>
           <td align="left">&nbsp;DL23UH&nbsp;</td>
           <td align="right">&nbsp;11.0021&nbsp;</td>
           <td align="right">&nbsp;-20&nbsp;</td>
           <td align="right">&nbsp;0&nbsp;</td>
           <td align="left">&nbsp;JO61tb&nbsp;</td>
           <td align="right">&nbsp;0.2&nbsp;</td>
           <td align="left">&nbsp;F4DWV&nbsp;</td>
           <td align="left">&nbsp;IN98bc&nbsp;</td>
           <td align="right">&nbsp;1162&nbsp;</td>
           <td align="right">&nbsp;260&nbsp;</td>
        </tr>
    </tbody>
</table>

And so on, with more tr/td rows of the same shape...

from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'
session = requests.Session()
request = session.get(base_url)
val_th = []
val_td = []

if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]
    my_tr_td = tr[1]
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
        heading = r_th.text
        val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)
    with open('output.csv', 'w') as f:
        a_pen = csv.writer(f)
        a_pen.writerow(val_th)
        a_pen.writerow(val_td)

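1) I only get one row written. How do I make sure every row on the page ends up in the CSV?
2) There are many <tr> tags on the page.
3) my_tr_td = tr[1] gives me a single row, but writing my_tr_td = tr[1:50] is wrong. How do I write the data from all the <tr>/<td> rows to the CSV file?

Thanks in advance for your help.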

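What kind of output do you want? - Jack Fleeting
I understand what you mean, but what do you expect that file to contain? - Jack Fleeting
The output file is output.csv. The header columns use <th> tags and the rows use <td> tags. - Outlaw
Also, each page has many <tr><td> tags. - Outlaw
If so, could you edit your example to show what two sets of tr/th and tr/td look like? - Jack Fleeting
No problem, I have edited it. The first <tr> tag always contains <th> tags, and the second and later <tr> tags always contain <td> tags. The URL is http://wsprnet.org/drupal/wsprnet/spots. - Outlaw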
1 Answer

Let's try it this way:

import lxml.html
import csv
import requests

url = "http://wsprnet.org/drupal/wsprnet/spots"
res = requests.get(url)

doc = lxml.html.fromstring(res.text)

cols = []
# First, extract the column headers from the top row. The first header is
# read directly; each remaining one is reached through getnext() on the
# preceding <th>.
cols.append(doc.xpath('//table/tr/node()/text()')[0])
for item in doc.xpath('//table/tr/th'):
    nxt = item.getnext()
    if nxt is not None:
        cols.append(nxt.text)

# Now for the actual data.
inf = []
for item in doc.xpath('//table//tr//td'):
    # strip() also removes the non-breaking spaces (\xa0) that surround each
    # value; `or ''` guards against cells with no text.
    inf.append((item.text or '').strip())

# Split the flat list of cells into rows of len(cols) cells each.
rows = [inf[x:x + len(cols)] for x in range(0, len(inf), len(cols))]

# Finally, write to file:
with open("output.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    writer.writerows(rows)
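
Since the question itself uses BeautifulSoup, the same "all rows" idea can also be sketched with bs4 instead of lxml. This is only a minimal sketch, not the answer's method: the names table_rows, write_rows, and SAMPLE_HTML are made up for illustration, and the sample is a shortened, three-column stand-in for the real page. The key change from the question's code is looping over every <tr> rather than indexing tr[0] and tr[1].

```python
import csv

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the fetched page (3 columns, 2 data rows).
SAMPLE_HTML = """
<table>
  <tr><th>Timestamp</th><th>Call</th><th>MHz</th></tr>
  <tr>
    <td align="right">&nbsp;2019-12-10 14:02&nbsp;</td>
    <td align="left">&nbsp;DL1DUZ&nbsp;</td>
    <td align="right">&nbsp;10.140271&nbsp;</td>
  </tr>
  <tr>
    <td align="right">&nbsp;2019-10-10 14:02&nbsp;</td>
    <td align="left">&nbsp;DL23UH&nbsp;</td>
    <td align="right">&nbsp;11.0021&nbsp;</td>
  </tr>
</table>
"""


def table_rows(html):
    """Return the first table in `html` as a list of rows (lists of strings)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    # Loop over every <tr>; the header row holds <th>, the others hold <td>.
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        # str.strip() also removes the non-breaking spaces produced by &nbsp;.
        rows.append([cell.get_text().strip() for cell in cells])
    return rows


def write_rows(rows, fileobj):
    """Write all rows (header first) to an already opened file object."""
    csv.writer(fileobj).writerows(rows)
```

With the question's variables, usage would look roughly like opening output.csv with newline='' and calling write_rows(table_rows(request.text), f). This assumes the first table on the page is the one of interest, which may not hold for every page.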

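Thanks, Jack. The code works, but the header appears twice in the file, like this: ['Timestamp','Call','MHz','SNR','Drift','Grid','Pwr','Reporter','RGrid','km','az'],Call,MHz,SNR,Drift,Grid,Pwr,Reporter,RGrid,km,az - Outlaw
@Outlaw - You're right! It was a typo... I changed the cols.append() statement in the edited version. Please check again. - Jack Fleeting
Sorry to bother you, Jack. I need to pass new page content into your function, but it doesn't work. How can I do that? First comes btn_elem_upd = driver.find_element_by_id('edit-submit').click() new_source = driver.page_source and then your function. I need to pass "new_source" into your function. url = "http://wsprnet.org/drupal/wsprnet/spots" res = requests.get(url) doc = lxml.html.fromstring(res.text) How do I do that? - Outlaw
@Outlaw - That is a different question, and per SO policy (and because it's a good idea) you should post it as a new question. Since it looks like a Selenium function, you should also add the Selenium tag to the new question. Then I'll be happy to take a look (and possibly others will too...) - Jack Fleeting
I'll create a new question. Thanks =) - Outlaw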
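
On the follow-up raised in the comments above: lxml.html.fromstring accepts any HTML string, so one possible approach is to move the parsing into a function that takes the HTML as an argument. The sketch below is an assumption-laden illustration rather than the answerer's code: parse_spots and SAMPLE are hypothetical names, the sample markup is invented, and the header extraction is simplified to "all <th> cells".

```python
import lxml.html


def parse_spots(html):
    """Hypothetical helper: return (headers, rows) parsed from an HTML string."""
    doc = lxml.html.fromstring(html)
    # Header cells: every <th> inside a table row.
    headers = [th.text_content().strip() for th in doc.xpath("//table//tr/th")]
    rows = []
    # Data rows: every <tr> that has at least one <td> child.
    for tr in doc.xpath("//table//tr[td]"):
        # strip() also removes the non-breaking spaces produced by &nbsp;.
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return headers, rows


# Invented sample standing in for a page source string.
SAMPLE = (
    "<html><body><table>"
    "<tr><th>Call</th><th>MHz</th></tr>"
    "<tr><td>&nbsp;DL1DUZ&nbsp;</td><td>&nbsp;10.140271&nbsp;</td></tr>"
    "</table></body></html>"
)
headers, rows = parse_spots(SAMPLE)
```

Because the function only needs a string, the same call should work with either source: parse_spots(res.text) for requests, or parse_spots(driver.page_source) for Selenium after the button click.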
