Scraping with Beautiful Soup while preserving &nbsp; entities

Question

python web-scraping beautifulsoup html-parsing html-entities

10

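I want to scrape a table from a web page and keep the &nbsp; entities intact so that I can republish it as HTML later. However, BeautifulSoup seems to convert them to spaces. For example: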

from bs4 import BeautifulSoup

html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table')[0]
row = table.find_all('tr')[0]
cell = row.find_all('td')[0]

print(cell)

Observed result:

<td> hello </td>

Desired result:

<td>&nbsp;hello&nbsp;</td>

- Holy Mackerel

1 Answer

- alecxe · Accepted Answer

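In bs4, the convertEntities argument to the BeautifulSoup constructor is no longer supported. HTML entities are always converted to the corresponding Unicode characters on input (see the docs).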
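As a small illustration of that conversion (a sketch assuming Python 3, bs4 installed, and the standard-library "html.parser" backend), the parsed cell text holds the Unicode character U+00A0 (non-breaking space) rather than an ordinary ASCII space, so the information is still there and only its serialization has changed:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>&nbsp;hello&nbsp;</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

text = cell.get_text()

# repr() makes the non-breaking spaces visible instead of printing them as blanks.
print(repr(text))

# The first character is U+00A0, not the ASCII space U+0020.
print(text[0] == "\u00a0")
print(text[0] == " ")
```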

According to the docs, you need to use an output formatter instead, for example:

print(soup.find_all('td')[0].prettify(formatter="html"))
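Note that prettify() also adds newlines and indentation, which may not be wanted when republishing the markup. A possible alternative, sketched here assuming Python 3 and the "html.parser" backend, is to pass the same formatter to decode(), which serializes the tag without reformatting it:

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# decode() returns the tag as a string; formatter="html" writes characters
# that have a named HTML entity (such as U+00A0) back out as that entity.
compact = cell.decode(formatter="html")
print(compact)

# prettify() accepts the same formatter but puts each element on its own line.
pretty = cell.prettify(formatter="html")
print(pretty)
```

The same formatter argument can be passed when serializing the whole table or document, so the entities are restored only at output time while the parsed tree keeps plain Unicode.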
