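I'm doing some web scraping, extracting the text from tables. I keep getting Unicode errors, and when I encode with UTF-8 my results are littered with b' prefixes and '\xc2\xa0'. Is there a way to avoid encoding altogether and just get the text from the table?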
Traceback (most recent call last):
  File "c:\...\...\...", line 15, in <module>
    print(rows)
  File "C:\...\...\...\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2612' in position 3: character maps to <undefined>
When I use replace, I get a TypeError:
TypeError: a bytes-like object is required, not 'str'
This happens whether or not I use str(). I have also tried iterating and printing only the items that can be converted to a string, but I still get the Unicode error.

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
import re
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen(test).read()
soup = BeautifulSoup(page, 'lxml')
tables = soup.findAll('table')
for table in tables:
    for row in table.findAll('tr'):
        for cel in row.findAll('td'):
            if str(cel.getText().encode('utf-8').strip()) != "b'\\xc2\\xa0'":
                print(str(cel.getText().encode('utf-8').strip()))
                #print(str(cel.getText().encode('utf-8').strip().replace('\\xc2\\xa0', '').replace('b\'', '')))
Actual results:
b'\xe2\x98\x92'
b'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'
b'\xe2\x98\x90'
b'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'
b'Washington'
b'\xc2\xa0'
b'91-1144442'
b'(State or other jurisdiction of\nincorporation or organization)'
...
...
Expected results:
'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'
'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'
'Washington'
'91-1144442'
'(State or other jurisdiction of\nincorporation or organization)'
...
...
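
One possible way to get from the actual results to the expected ones is to keep the cell text as str and never call encode(). The following is only a sketch under assumptions: clean_cell, clean_cells, BOX_CHARS and the sample list are hypothetical names and data made up for illustration, and the sample stands in for the values that cel.getText() would return. It relies on the fact that str.strip() already treats the non-breaking space U+00A0 as whitespace.

```python
# Hypothetical helper names; only the standard library is used here.
BOX_CHARS = {"\u2610", "\u2612"}  # the ballot-box characters seen in the output


def clean_cell(text):
    """Return the cell text as a str, or None if nothing useful is left."""
    # U+00A0 (encoded in UTF-8 as b'\xc2\xa0') counts as whitespace for
    # str.strip(), so no replace() call is needed. Calling bytes.replace()
    # with str arguments is what raises "a bytes-like object is required".
    text = text.strip()
    if not text or text in BOX_CHARS:
        return None
    return text


def clean_cells(texts):
    """Clean every cell text and drop the ones that came back as None."""
    cleaned = (clean_cell(t) for t in texts)
    return [c for c in cleaned if c is not None]


# Made-up stand-in for: [cel.getText() for cel in row.findAll('td')]
sample = ["\u2612", "QUARTERLY REPORT", "\xa0", " Washington\xa0", "91-1144442"]
print(clean_cells(sample))
# ['QUARTERLY REPORT', 'Washington', '91-1144442']
```

In the loop from the question this would presumably mean passing cel.getText() to clean_cell and printing the result only when it is not None. Whether the box characters should be dropped is a guess based on the expected results listed above.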
If I remove encode(), then on Linux I get the correct text. If you are on Windows, you may have a problem with the console/terminal/cmd.exe, because it doesn't use UTF-8. - furas
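
To illustrate the console point, here is a rough sketch of why print() fails with cp1252 and two ways it might be worked around. The helper names encodable and safe_for_console are hypothetical, and the assumption is that the script runs on Python 3.7 or later, where sys.stdout.reconfigure() is available on the standard text stream.

```python
import sys


def encodable(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+2612 is the character named in the traceback.
print(encodable("\u2612", "cp1252"))  # False
print(encodable("\u2612", "utf-8"))   # True

# Option 1 (Python 3.7+): ask stdout to encode output as UTF-8.
# The hasattr check guards against stdout objects without reconfigure().
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")


# Option 2: keep the console encoding and substitute unencodable characters.
def safe_for_console(text, encoding="cp1252"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)
```

With option 2, safe_for_console("\u2612 Washington") returns "? Washington", so the text stays a str and the output has no b' prefix. Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another route to the same effect as option 1; whether the characters then display correctly still depends on the console and its font.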