Fastest way to remove \n, \, \t, \xa0, â\x80\x93 characters in Python

Question

3

I am using BeautifulSoup to parse HTML data, extract the text from every 'p' tag, and combine it into one string. This is how I do it:

source = BeautifulSoup(response.text, "html.parser")

content = ""

for section in source.findAll('p'):
    content += section.get_text()

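However, after the conversion, characters such as those listed above are scattered throughout the string. I have tried several ways to remove all of them from the string, for example: unicodedata.normalize('NFKC', text)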

content = u" ".join(content.split())

text.strip(), text.rstrip()

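Is there a library that can remove these characters from a string? Some of these methods solve part of the problem, but most of the characters are still there.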
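For illustration, here is a minimal sketch of what the attempts above do when combined. The sample string is made up for this example (it is not taken from the pastebin). It suggests why only part of the problem goes away: the whitespace-like characters are handled, but the â\x80\x93 sequence is not whitespace and survives.

```python
import unicodedata

# Hypothetical sample string, invented for this sketch.
sample = "First\xa0paragraph.\n\tSecond â\x80\x93 paragraph.  "

# NFKC normalization maps the no-break space U+00A0 to a plain space.
normalized = unicodedata.normalize("NFKC", sample)

# split() with no arguments splits on any run of whitespace, so joining
# with a single space also drops \n and \t and trims both ends.
collapsed = " ".join(normalized.split())

# The â\x80\x93 sequence is still present at this point.
print(repr(collapsed))
```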

Edit: here is a sample string: https://pastebin.com/2DGECKXa

- MythKhan

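Can you provide some examples of your data?

@PacketLoss Here is an example: https://pastebin.com/2DGECKXa

Doesn't content = content.strip() give you the result you want?

@PacketLoss Not every page I download has the same format as this one. It works for some pages and not for others. I need a method that removes these characters across all of them.

@MythKhan, please consider accepting an answer.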

2 Answers

0

See whether this works:

from simplified_scrapy.simplified_doc import SimplifiedDoc

doc = SimplifiedDoc(response.text)
content = ""
for section in doc.ps:
    content += section.text
    # content += section.unescape()
print(content)

- dabingsou

- ProteinGuy · Accepted Answer

You can write a function that uses the .replace method to do this.

unwanted_chars = ['\n', '\t', '\r', '\xa0', 'â\x80\x93']  # Edit this to include every character you want to remove

def clean_up_text(text, unwanted_chars=unwanted_chars):
    # Remove each unwanted character (or character sequence) in turn
    for char in unwanted_chars:
        text = text.replace(char, '')
    return text

Then you can apply the clean_up_text function to remove all the unwanted characters.

new_text = clean_up_text(old_text)
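
Since the question asks for the fastest approach, here is a possible variant, offered as a sketch rather than a benchmarked result. It uses str.translate to delete all the single characters in one pass and keeps str.replace only for multi-character sequences. The names DELETE_TABLE, UNWANTED_SEQUENCES, and clean_up_text_fast are made up for this example, and whether it is actually faster is worth timing on your own data.

```python
# Single characters to delete; str.translate removes them in one pass.
DELETE_TABLE = str.maketrans("", "", "\n\t\r\xa0")

# Multi-character sequences cannot go into the table, so they still
# need str.replace. (Hypothetical list; extend as needed.)
UNWANTED_SEQUENCES = ["â\x80\x93"]

def clean_up_text_fast(text):
    # Remove the multi-character sequences first, then the single characters.
    for seq in UNWANTED_SEQUENCES:
        text = text.replace(seq, "")
    return text.translate(DELETE_TABLE)

print(repr(clean_up_text_fast("a\n\tb\xa0c â\x80\x93 d\r")))
```

One thing to keep in mind: â\x80\x93 is what the UTF-8 bytes of an en dash look like when they are decoded as Latin-1, so "â\x80\x93".encode("latin-1").decode("utf-8") gives "\u2013". If that is what happened to your pages, fixing the decoding step (for example by checking which encoding response.text used) may be preferable to deleting the sequence afterwards.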