How do I remove special characters from a string in Python 3?

Question

6

I would like to convert

from this

&lt;b&gt;&lt;i&gt;&lt;u&gt;Charming boutique selling trendy casual &amp;amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp;amp; jewelry.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;

to this

Charming boutique selling trendy casual dressy apparel for women, some plus sized items, swimwear, shoes jewelry.

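I'm confused about how to remove not only the special characters but also the letters that sit between them. Can anyone suggest a way to do this?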

- Jay P.

2 Answers

4

Try the following:

import re

string = '&lt;b&gt;&lt;i&gt;&lt;u&gt;Charming boutique selling trendy casual &amp;amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp;amp; jewelry.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;'

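# Remove the escaped opening and closing tags, e.g. &lt;b&gt; and &lt;/b&gt;
string = re.sub(r'&lt;/?[a-z]+&gt;', '', string)
# Turn the doubly escaped ampersand back into a literal &
string = string.replace('&amp;amp;', '&')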

print(string)  # prints 'Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.'

The string you want to change looks like HTML that has been escaped multiple times, so my solution only applies to that case.

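I used a regex to replace the escaped tags with the empty string, and replaced the escaped ampersand (&amp;amp;) with the actual character &.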
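If the number of escaping rounds is not known in advance, one possible generalization is to unescape until the string stops changing and only then strip the tags. This is a minimal sketch, not part of the original answer: the helper name strip_escaped_html and the max_rounds cap are made up for illustration, and unescaping to a fixed point may over-decode text that was meant to contain literal entities.

```python
import re
from html import unescape


def strip_escaped_html(text: str, max_rounds: int = 5) -> str:
    """Hypothetical helper: undo repeated HTML escaping, then drop tags."""
    for _ in range(max_rounds):
        decoded = unescape(text)
        if decoded == text:
            # Nothing left to decode, so stop early.
            break
        text = decoded
    # Remove simple opening and closing tags such as <b> or </b>.
    return re.sub(r"</?[a-zA-Z][^>]*>", "", text)


s = "&lt;b&gt;&lt;i&gt;&lt;u&gt;Charming boutique selling trendy casual &amp;amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp;amp; jewelry.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;"

print(strip_escaped_html(s))
```

A regex is fine for flat markup like this; for nested or malformed HTML a real parser is usually the safer choice.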

Hope this is what you were looking for. Let me know if you run into any problems.

- CoffeeTableEspresso

Content provided by Stack Overflow.

- Andrej Kesely · Accepted Answer

You can use the html module and BeautifulSoup to get the text without the escaped tags:

s = "&lt;b&gt;&lt;i&gt;&lt;u&gt;Charming boutique selling trendy casual &amp;amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp;amp; jewelry.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;"

from bs4 import BeautifulSoup
from html import unescape

soup = BeautifulSoup(unescape(s), 'lxml')
print(soup.text)

Prints:

Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.
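Note that the output requested in the question also drops the ampersands. As a rough sketch under two assumptions, the same idea can be done with the standard library only (html.parser instead of BeautifulSoup and lxml), followed by removing each & and collapsing the leftover whitespace. The names TextCollector and plain_text are hypothetical, not from the answer above.

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that keeps text nodes and ignores tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities in text
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(escaped: str) -> str:
    parser = TextCollector()
    # The first unescape turns &lt;b&gt; into <b>; the parser decodes the rest.
    parser.feed(unescape(escaped))
    parser.close()
    text = "".join(parser.parts)
    # Replace each ampersand with a space, then collapse runs of whitespace.
    return " ".join(text.replace("&", " ").split())


s = "&lt;b&gt;&lt;i&gt;&lt;u&gt;Charming boutique selling trendy casual &amp;amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp;amp; jewelry.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;"

print(plain_text(s))
```

Replacing & with a space before collapsing whitespace avoids gluing neighboring words together when an ampersand has no spaces around it.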