How do I correctly extract special characters from a web page using BeautifulSoup?

Question

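I am trying to extract all the text from a web page, given its URL, using BeautifulSoup. I tried running the code found here: https://www.researchgate.net/post/how_to_scrape_text_from_webpage_using_beautifulsoup_python. Everything works except for special characters such as "é" or "à". I have tried many modifications but cannot get it to work. Here is my code: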

from bs4 import BeautifulSoup
import requests
import re
import codecs

html = requests.get(yourWebsiteURL).content

unicode_str = html.decode('utf8')
encoded_str = unicode_str.encode("ascii",'ignore')
news_soup = BeautifulSoup(encoded_str, "html.parser")
a_text = news_soup.find_all('p')

y=[re.sub(r'<.+?>',r'',str(a)) for a in a_text]

file = codecs.open("textOutput.txt", "wb", encoding='utf-8')
file.write(str(y))
file.close()

However, I am sure the problem lies in how I am using bs4, because I have never had this problem when writing to a file.

- user8502474

By the way, use [a.text for a in a_text] to get the text between the p tags. You don't need a regular expression. - Keyur Potdar

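The advice given on that page is rather silly. The person asking the question there doesn't seem to understand what Unicode text is, and the answer you used handles non-ASCII text in a rather crude way. - Martijn Pieters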

@KeyurPotdar: better still: [a.get_text() for a in a_text], which lets you specify options for how the individual parts are joined. - Martijn Pieters
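
To illustrate the two comments above, here is a minimal sketch of get_text() with a separator. The inline HTML string is a hypothetical stand-in for the downloaded page, not something from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
html = "<p>Caf\u00e9 <b>cr\u00e8me</b></p><p>d\u00e9j\u00e0 vu</p>"

soup = BeautifulSoup(html, "html.parser")

# get_text() collects every text node under the tag; the first argument is the
# separator placed between nodes, and strip=True trims whitespace around each node.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

print(paragraphs)
```

No regular expression is needed to remove the nested <b> tag, and the accented characters survive because nothing was encoded to ASCII.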

1 Answer

- Bruno Ryckaert · Answer 1

encoded_str = unicode_str.encode("ascii",'ignore')

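This line encodes your text as ASCII. ASCII does not contain special characters such as é or à, and the 'ignore' error handler silently drops every character that cannot be encoded. I'm not sure why you decode from UTF-8, which contains these characters, and then encode to ASCII, which does not.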
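
A minimal sketch of the effect and one possible fix follows. It assumes the page is served as UTF-8; html_bytes is a hypothetical stand-in for requests.get(yourWebsiteURL).content, and the output is written to the system temp directory purely for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for requests.get(yourWebsiteURL).content: raw UTF-8 bytes.
html_bytes = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><p>Caf\u00e9 \u00e0 la cr\u00e8me</p></body></html>"
).encode("utf-8")

# What the question's code does: every non-ASCII character is silently dropped.
lossy = html_bytes.decode("utf-8").encode("ascii", "ignore")

# Possible fix: skip the ASCII step and hand the raw bytes to BeautifulSoup,
# which works out the encoding itself (here from the meta charset declaration).
soup = BeautifulSoup(html_bytes, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Write text (not str(list)) to a file opened in text mode as UTF-8.
out_path = os.path.join(tempfile.gettempdir(), "textOutput.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write("\n".join(paragraphs))

with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

If a page declares no encoding or declares the wrong one, BeautifulSoup also accepts a from_encoding argument to override detection.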