How do I encode/decode this BeautifulSoup string in Python so that it outputs non-standard Latin characters?

Question


python utf-8 beautifulsoup character-encoding

3

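I'm scraping a web page with Beautiful Soup, and the output contains non-standard Latin characters that are displayed as hex escapes.

I'm scraping https://www.archchinese.com, which contains pinyin words that use non-standard Latin characters (for example ǎ, ā). I've been trying to loop through a series of links that contain pinyin, using BeautifulSoup's .string attribute together with UTF-8 encoding to print the words. Hex escapes appear where the non-standard characters should be: the word "好" comes out as "h\xc7\x8eo". I'm sure I'm doing something wrong with the encoding, but I don't know how to fix it. I tried decoding with UTF-8 first, but got an error saying the element has no decode method. Printing the string without encoding gives me an error about the character being undefined, which I assumed was because it needs to be encoded into something first.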

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re

url = "https://www.archchinese.com/"

driver = webdriver.Chrome() #Set selenium up for opening page with Chrome.
driver.implicitly_wait(30)
driver.get(url)

driver.find_element_by_id('dictSearch').send_keys('好') # This character is hǎo.

python_button = driver.find_element_by_id('dictSearchBtn')
python_button.click() # Look for submit button and click it.

soup=BeautifulSoup(driver.page_source, 'lxml')

div = soup.find(id='charDef') # Find div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print (a.string.encode('utf-8')) # Loop through all links with pinyin and attempt to encode.

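Actual result: b'h\xc7\x8eo' b'h\xc3\xa0o'

Expected result: hǎo hào

Edit: The problem appears to be related to a UnicodeEncodeError on Windows. I tried installing win-unicode-console, but it didn't help. Thanks to snakecharmerb for the information.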

- ep84

2 Answers

1

Apply the encoding when you call BeautifulSoup, not afterwards.

soup=BeautifulSoup(driver.page_source.encode('utf-8'), 'lxml')

div = soup.find(id='charDef') # Find div with the target links.

for a in div.find_all('a', attrs={'class': 'arch-pinyin-font'}):
    print (a.string)

- nandu kk


- snakecharmerb · Accepted Answer

2

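You don't need to encode the value when printing it; the print function handles that automatically. Right now you are printing the representation of the bytes that make up the encoded value, rather than the string itself.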

>>> s = 'hǎo'
>>> print(s)
hǎo

>>> print(s.encode('utf-8'))
b'h\xc7\x8eo'

- snakecharmerb
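
To make the str/bytes distinction in the answer above concrete, here is a minimal sketch using only the standard library. The variable names are illustrative and not taken from the question's script; the literal "hǎo" stands in for the value that a.string would hold.

```python
word = "hǎo"                    # str: three Unicode code points
encoded = word.encode("utf-8")  # bytes: the UTF-8 encoding of that str

print(type(word).__name__, len(word))        # str 3
print(type(encoded).__name__, len(encoded))  # bytes 4 ('ǎ' takes two bytes)

# print() on a bytes object shows its repr, which is where the
# \xc7\x8e escapes in the question come from.
print(repr(encoded))  # b'h\xc7\x8eo'

# Decoding the bytes gives back the original text.
decoded = encoded.decode("utf-8")
print(decoded == word)  # True

# In Python 3 a str has no decode method, which matches the
# "no decode function" error described in the question.
print(hasattr(word, "decode"))  # False
```

Since a.string is already text, encoding it before printing only swaps the characters for their byte representation.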

Printing without encoding (print(a)) gives the same result as print(a.string) without encoding:
Traceback (most recent call last):
  File "hanziscrape.py", line 22, in <module>
    print (a)
  File "C:\Users\root\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u01ce' in position 177: character maps to <undefined>
- ep84

1

Good old Windows. This answer may help: https://dev59.com/KXVD5IYBdhLWcg3wWaVh#32176732. - snakecharmerb

Yes, I had already run into this before and installed win-unicode-console via pip. I tried again and got:
C:\Users\root> pip install win-unicode-console
Requirement already satisfied: win-unicode-console in c:\users\root\appdata\local\programs\python\python37\lib\site-packages (0.5)
- ep84

1

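I don't have a Windows machine to hand, so I can't offer much more help. I'd suggest editing your question to state clearly that you are getting a UnicodeEncodeError when printing to the Windows console, and to describe the steps you have already taken to fix it. - snakecharmerb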

1

It turns out I was using the Git console on Windows, and that was exactly the problem. Your suggestion solved it perfectly. - ep84
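
For readers who hit the same UnicodeEncodeError, the following sketch reproduces the cp1252 failure from the traceback in the comments and shows two possible ways around it. It is not the fix the asker ended up using, which the thread does not spell out. The helpers render_for_stream and force_utf8_stdout are hypothetical names introduced here; the sketch assumes Python 3.7 or later, where text streams have a reconfigure method.

```python
import io
import sys


def render_for_stream(text, encoding, errors="backslashreplace"):
    """Hypothetical helper: show how text survives a given output encoding."""
    return text.encode(encoding, errors=errors).decode(encoding)


def force_utf8_stdout():
    """Hypothetical helper: switch stdout to UTF-8 if the stream allows it."""
    if hasattr(sys.stdout, "reconfigure"):  # available on Python 3.7+
        sys.stdout.reconfigure(encoding="utf-8")


# cp1252 has no mapping for U+01CE, so strict encoding fails the same
# way print() did in the traceback.
try:
    "hǎo".encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# With a lenient error handler the unsupported character is escaped
# instead of raising.
print(render_for_stream("hǎo", "cp1252"))  # h\u01ceo

# A stream that is explicitly UTF-8 accepts the character as-is.
buffer = io.BytesIO()
utf8_out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
print("hǎo", file=utf8_out)
utf8_out.flush()
print(buffer.getvalue())  # b'h\xc7\x8eo\n'
```

Calling force_utf8_stdout() once at the start of a script makes later print calls emit UTF-8. Whether the characters then display correctly still depends on the terminal in use, as the Git console case in this thread suggests.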