I ran into the same problem when scraping pages with Scrapy, and I solved it in two ways. The first uses substitution in a loop: response.xpath() returns a list-like object, while string methods such as replace() (and re.sub()) operate on a single string, so I use a for loop to take each item out of the list as a string, remove the '\n' and '\t' from it, and append the result to a new list.
import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
a = re.compile(r'(\t)+')  # one or more consecutive tabs
b = re.compile(r'(\n)+')  # one or more consecutive newlines
text = []
for n in test_string:
    n = a.sub('', n)
    n = b.sub('', n)
    text.append(n)
print(text)
# Drop the items that are now empty strings.
while '' in text:
    text.remove('')
print(text)
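The three steps above (substitute tabs, substitute newlines, drop the empties) can also be condensed into a single pass. A minimal sketch, using a shortened version of the same test data:

```python
import re

test_string = ["\n\t\t", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic."]

# Collapse any run of tabs and newlines in one pattern.
whitespace = re.compile(r'[\t\n]+')
stripped = (whitespace.sub('', s) for s in test_string)
# Keep only the non-empty results.
text = [s for s in stripped if s]
print(text)  # ['Do you like shopping?', 'Yes, I\u2019m a shopaholic.']
```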
The second method uses map() and strip(). The map() function operates on the list directly, so no explicit loop is needed. 'unicode' was used in Python 2; in Python 3 it changed to 'str', as shown below:
text = list(map(str.strip, test_string))
print(text)
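One difference from the first method: map(str.strip, ...) leaves the now-empty strings in the list. A small sketch of how filter(None, ...) can drop them, since filtering with None keeps only truthy values:

```python
test_string = ["\n\t\t", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic."]

# strip() with no arguments removes leading/trailing whitespace, including \n and \t,
# but the all-whitespace items become '' and stay in the list.
text = list(map(str.strip, test_string))
print(text)  # ['', 'Do you like shopping?', '', 'Yes, I\u2019m a shopaholic.']

# filter(None, ...) discards the falsy empty strings.
text = list(filter(None, map(str.strip, test_string)))
print(text)  # ['Do you like shopping?', 'Yes, I\u2019m a shopaholic.']
```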
Note that strip() only removes \n, \t, and \r from the beginning and end of a string, not from the middle, so it is not the same as removing every occurrence.

strip() only considers the leading and trailing characters of the string, so if you want to get rid of anything in the middle you need another approach. If that is your problem, then import re and re.sub('[\r\n\t]', '', 'Hel\nlo\r!') can help you. - Quentin Pradet

Have a look at ItemLoaders: http://doc.scrapy.org/en/latest/topics/loaders.html - they let you manage the input and output of your Item. - Granitosaurus
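An ItemLoader's input processors (e.g. MapCompose) apply a chain of functions to every extracted value, which is exactly the kind of cleanup done above. As a rough, dependency-free sketch of that idea in plain Python (this is an illustration, not Scrapy's actual implementation):

```python
def map_compose(*functions):
    """Return a processor that applies each function to every value in turn,
    dropping values for which any function returns None (roughly the idea
    behind Scrapy's MapCompose)."""
    def process(values):
        for fn in functions:
            next_values = []
            for v in values:
                result = fn(v)
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return process

# Strip whitespace, then turn empty strings into None so they are dropped.
clean = map_compose(str.strip, lambda s: s or None)
print(clean(["\n\t\t", "Do you like shopping?", "\n"]))
# ['Do you like shopping?']
```

With the real library, the same chain would be attached to a field via the loader's input processor, so every response.xpath() result is cleaned automatically as it is loaded.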