How do I remove "\n\r\n" from BeautifulSoup output?

3
I have the following code:
from bs4 import BeautifulSoup
import requests
import re

page = open('doc1.html','rb').read()
soup = BeautifulSoup(page,'lxml')
# print(soup.prettify())

# eng = soup.find_all(string = re.compile("righteou"))
# print(eng)

# heb = soup.findAll('p',{'dir':'RTL'})
# print(heb)
list=[]
all_tr =soup.findAll('tr')
for td in all_tr:
    all_td = soup.findAll('td')
    d={
    'hob':all_td[0].text.strip(),
    'english':all_td[1].text.strip()

        }
    list.append(d)
print(list)

My output looks like this:
[{'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n      
              the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n       
             We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּיתְּנָה הַתּוֹרָה עַל הַר סִינַי דַּוְקָא,', 'english': '\n\r\n                    We need to understand\r\n                    \r\n                        the idea that the Torah was given specifically on Mount\r\n                        Sinai,\r\n                    '}, {'hob': 'עִנְיָן שֶׁנִּ...................................................................................................................................................................................................................................................

I want to remove the newlines and tabs from the output so that my file is cleaner. How can I do that?


1
Can you print what all_td, or just all_td[1], looks like? - Dmitriy Kisil
It looks like this: <tr> <td width="367" valign="top"> <p dir="RTL"> Moses received the Torah at Mount Sinai, </p> </td> <td width="367" valign="top"> <p> ........a lot of code here </p> </td> </tr> - user10086707
1 Answer

1

Split the text into words and join them back together with a single space.

'english':" ".join(all_td[1].text.split())

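This removes every "\n" and "\r", strips the leading and trailing whitespace, and collapses each run of spaces or tabs between words into a single space.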
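To see the effect in isolation, here is a minimal sketch that applies the same split-and-join idiom to the string from the question's output. The helper name collapse_whitespace and the variables raw and cleaned are made up for illustration and are not part of the original answer.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, "\n", "\r") and never yields empty strings.
    return " ".join(text.split())


# The 'english' value copied from the output shown in the question.
raw = (
    "\n\r\n                    We need to understand\r\n"
    "                    \r\n"
    "                        the idea that the Torah was given specifically on Mount\r\n"
    "                        Sinai,\r\n                    "
)

cleaned = collapse_whitespace(raw)
print(cleaned)
# We need to understand the idea that the Torah was given specifically on Mount Sinai,
```

Unlike .strip(), which only trims the two ends, this also cleans up the line breaks and indentation in the middle of the text.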

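Separately, the loop in the question calls soup.findAll('td') on the whole document for every row, so all_td[0] and all_td[1] are always the first two cells of the page. That is why every entry in the output is identical. Below is a sketch that searches inside each row instead. The sample HTML, the variable names, and the use of 'html.parser' (instead of 'lxml') are assumptions made so the example is self-contained; they are not taken from the question's doc1.html.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the <tr>/<td> structure
# described in the comments above. "\r\n" and "\t" are real control
# characters here because this is a normal (non-raw) string.
html = """
<table>
  <tr>
    <td><p dir="RTL">hebrew one</p></td>
    <td><p>\r\n      first   row\r\n      text\r\n</p></td>
  </tr>
  <tr>
    <td><p dir="RTL">hebrew two</p></td>
    <td><p>second\trow</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Search within this row only, not within the whole document.
    cells = tr.find_all("td")
    if len(cells) < 2:
        continue  # skip rows that do not have both columns
    rows.append({
        "hob": " ".join(cells[0].get_text().split()),
        "english": " ".join(cells[1].get_text().split()),
    })

print(rows)
```

The result list is named rows rather than list so that it does not shadow Python's built-in list type.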
