BeautifulSoup - Ignore Unicode errors and print only the text

Question

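I'm doing some web scraping, pulling text out of tables. Unicode errors come up a lot, and when I encode with UTF-8 my results are peppered with b' and '\xc2\xa0'. Is there a way to avoid encoding altogether and get just the text out of the tables?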

Traceback (most recent call last):
  File "c:\...\...\...", line 15, in <module>
    print(rows)
  File "C:\...\...\...\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2612' in position 3: character maps to <undefined>

When I use replace, I get a TypeError:

TypeError: a bytes-like object is required, not 'str'

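I also tried iterating and printing only the items that could be converted to a string, with and without str(), but it still throws the Unicode error.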

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'

import re

import requests
from urllib.request import urlopen


from bs4 import BeautifulSoup

page = urlopen(test).read()
soup = BeautifulSoup(page, 'lxml')

tables = soup.findAll('table')

for table in tables:
  for row in table.findAll('tr'):
    for cel in row.findAll('td'):
      if str(cel.getText().encode('utf-8').strip()) != "b'\\xc2\\xa0'":
        print(str(cel.getText().encode('utf-8').strip()))
        #print(str(cel.getText().encode('utf-8').strip().replace('\\xc2\\xa0', '').replace('b\'', '')))

Actual results:

b'\xe2\x98\x92'
b'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'\xe2\x98\x90'
b'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'Washington'

b'\xc2\xa0'

b'91-1144442'

b'(State or other jurisdiction of\nincorporation or organization)'
...
...

Expected results:

'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'Washington'

'91-1144442'

'(State or other jurisdiction of\nincorporation or organization)'

...
...

- kenneh

Don't encode! Process the text as text, not as byte strings, and print the text strings. - Mark Tolonen

@MarkTolonen If I don't encode, I get the "'charmap' codec can't encode character" error. - kenneh

If I remove encode(), I get the correct text on Linux. If you are on Windows, the problem may be the console/terminal/cmd.exe, because it doesn't use UTF-8. - furas

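Use Python 3.6 or later. That version and later use the Unicode APIs to write directly to the terminal. Earlier versions encode to the terminal encoding, which limits the code points that can be printed. If you can't switch, use an IDE that supports UTF-8. - Mark Tolonen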
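If neither upgrading nor switching terminals is an option, one possible workaround (not suggested in the comments above, and only a sketch that assumes losing the unencodable characters is acceptable) is to replace whatever the output encoding cannot represent before printing. The helper name `printable_for` is hypothetical; the default `cp1252` is taken from the traceback in the question.

```python
import sys


def printable_for(text, encoding="cp1252"):
    """Return text with characters the encoding cannot represent replaced by '?'.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    sample = "\u2612 QUARTERLY REPORT"
    # Fall back to UTF-8 if stdout reports no encoding.
    output_encoding = sys.stdout.encoding or "utf-8"
    print(printable_for(sample, output_encoding))
```

On Python 3.7 and later, calling `sys.stdout.reconfigure(errors="replace")` once at startup is another option when stdout is a regular text stream, and it avoids wrapping every print call.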

1 Answer

- Martin Evans · Accepted Answer

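BeautifulSoup is already handling the UTF-8 HTML correctly and giving you text strings; by calling encode on them you are converting those strings to bytes.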
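To make the distinction concrete, here is a small standard-library sketch. It does not call BeautifulSoup; the two sample characters are the ones that show up in the question's output. It shows what encoding does to them, why stripping the encoded bytes leaves the non-breaking space behind, and why replace then rejects a str argument.

```python
# Text as returned by get_text(): a Python str.
checkbox = "\u2612"   # the ballot-box character from the filing
nbsp = "\xa0"         # non-breaking space, common in HTML table cells

# Encoding turns str into bytes; printing bytes shows the b'...' form.
encoded = checkbox.encode("utf-8")
print(encoded)                 # b'\xe2\x98\x92'
print(nbsp.encode("utf-8"))    # b'\xc2\xa0'

# str.strip() treats the non-breaking space as whitespace...
print(repr(nbsp.strip()))              # ''
# ...but bytes.strip() only removes ASCII whitespace, so the bytes survive.
print(nbsp.encode("utf-8").strip())    # b'\xc2\xa0'

# bytes.replace() needs bytes arguments; passing str raises TypeError.
try:
    encoded.replace("\\xc2\\xa0", "")
    replace_error = None
except TypeError as exc:
    replace_error = str(exc)
print(replace_error)
```

Staying in str throughout avoids all three problems, which is what the code in this answer does.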

The following should output what you need:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

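# Walk every table, row, and cell; get_text(strip=True) returns text (str)
# with surrounding whitespace, including non-breaking spaces, removed.
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        for cel in row.find_all('td'):
            text = cel.get_text(strip=True)

            if text:   # skip blank cells
                print(text)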

The HTML tables could be stored as a list of lists as follows:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

rows = []

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # One list of cell texts per table row
        values = [cel.get_text(strip=True) for cel in row.find_all('td')]
        rows.append(values)

print(rows)
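Rows built this way can include empty strings for spacer cells and empty lists for rows that have no td elements. One way to tidy them, shown here as a sketch with made-up sample data and hypothetical helpers `clean_rows` and `rows_to_csv`, is to drop the empty cells and rows and serialize the result as CSV. Writing that text to a UTF-8 file instead of printing it sidesteps the console encoding entirely.

```python
import csv
import io


def clean_rows(rows):
    """Drop empty cells, then drop rows that end up with no cells.

    Hypothetical helper; `rows` is a list of lists of str.
    """
    cleaned = []
    for row in rows:
        cells = [cell for cell in row if cell]
        if cells:
            cleaned.append(cells)
    return cleaned


def rows_to_csv(rows):
    """Serialize rows to CSV text using the csv module's default dialect."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample data shaped like the scraped rows.
sample = [
    ["\u2612", "", "QUARTERLY REPORT"],
    [],
    ["", ""],
    ["Washington", "", "91-1144442"],
]
tidy = clean_rows(sample)
csv_text = rows_to_csv(tidy)
```

To save the result, something like `open("tables.csv", "w", encoding="utf-8", newline="")` followed by a write of `csv_text` should work; the filename is an assumption. Note that dropping empty cells shifts the remaining cells left, so skip that step if column positions matter.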

Tested using:

Python 3.7.3，BS4 4.7.1
Python 2.7.16，BS4 4.7.1
