Python crawl of a web page still contains characters such as \r \n \t

Question

4

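I am trying to crawl the page http://www.dmoz.org/Computers/Programming/Languages/Python/Books using Scrapy 0.20.2.

I have managed to get the information I need and sort it...

However, my results still contain \r, \t and \n. For example, here is one JSON record: {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]},

The data is correct, but I do not want \t, \r and \n in the result.

My spider code is as follows: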

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from dirbot.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
   ]

   def parse(self, response):
       sel = Selector(response)
       sites = sel.xpath('//ul[@class="directory-url"]/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.xpath('a/text()').extract()
           item['link'] = site.xpath('a/@href').extract()
           item['desc'] = site.xpath('text()').extract()
           items.append(item)
       return items

- Marco Dinatsoli

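\r and \n are end-of-line (EOL) characters, and \t is a tab. The most common way to remove them is rstrip(). - e h

@emh Please give an example. Should I use it in my item class? - Marco Dinatsoli

@emh When I try site.xpath('a/text()').extract().rstrip(), the result is empty. - Marco Dinatsoli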

2

You can use something like this: item['desc'] = map(unicode.strip, site.xpath('a/text()').extract()) - paul trmbrth

1

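As Paul said, there are several ways to do this. With rstrip, you need to tell Python what you want to strip. Something like .rstrip('\r\n\t') will tell it to strip EOL and tab characters. This may help: https://dev59.com/lGgv5IYBdhLWcg3wSe62 - e h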
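To make the comments above concrete, here is a minimal sketch of how rstrip() and strip() behave on text like the asker's. It assumes Python 3 (str rather than Python 2's unicode), and the sample strings are stand-ins for what extract() returns, not real crawl output. Two points are worth seeing: extract() returns a list, so the strip call has to be applied to each element, and an explicit character set such as '\r\n\t' does not include the space character.

```python
# Stand-in for one text node from the page (an assumption, not real crawl output).
sample = " \r\n\t\t\t\r\n - Some description.\r\n \r\n "

# rstrip() with an explicit character set removes only those characters, and only
# from the right end. This string ends with a space, so nothing is removed.
right_only = sample.rstrip("\r\n\t")

# strip() with no arguments removes all leading and trailing whitespace,
# including spaces, \r, \n and \t.
both_ends = sample.strip()

# extract() returns a list of strings, so strip each element individually.
extracted = ["\r\n\t\t\t\r\n ", sample]
cleaned = [text.strip() for text in extracted]

print(repr(right_only))
print(repr(both_ends))
print(cleaned)
```

A list has no rstrip() method of its own, which is why calling it directly on the result of extract() cannot work.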

3 Answers

0

Here is another way to do this (I used your JSON data):

>>> data = {"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]}

>>> clean_data = ''.join(data['desc'])

>>> print clean_data.strip(' \r\n\t')

Output:

- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.

Instead of:

['\r\n\t\t\t\r\n ', ' \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n ']

- e h

0

Assuming you want to remove every \r, \n and \t (not just those at the edges) while still keeping the JSON structure, you could try the following:

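def normalize_whitespace(json):
    # Strings: collapse every whitespace run to one space and trim the ends.
    if isinstance(json, str):  # use basestring in Python 2 to cover unicode
        return ' '.join(json.split())

    if isinstance(json, dict):
        it = json.items() # iteritems in Python 2
    elif isinstance(json, list):
        it = enumerate(json)
    else:
        # Numbers, booleans and None have no whitespace to normalize.
        return json

    # Containers are updated in place, one value at a time.
    for k, v in it:
        json[k] = normalize_whitespace(v)

    return json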

Usage:

>>> normalize_whitespace({"desc": ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n \r\n "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python"]})
{'title': ['Data Structures and Algorithms with Object-Oriented Design Patterns in Python'], 'link': ['http://www.brpreiss.com/books/opus7/html/book.html'], 'desc': ['', '- The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns. A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.']}

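As https://dev59.com/lGgv5IYBdhLWcg3wSe62#10711166 points out, the split-join approach may suit this case better than a regular-expression substitution, because it combines stripping with whitespace normalization.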
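A small comparison of the two approaches mentioned above, as a sketch under assumptions: the pattern \s+ is one plausible regular expression for the job (the linked answer may use a different one), and the raw string is made up for illustration.

```python
import re

# Made-up sample with whitespace at both ends and in the middle.
raw = " \r\n\t\t\t\r\n - First line.\r\nSecond line.\r\n \r\n "

# Regex substitution collapses each whitespace run to a single space, but the
# runs at the two ends become one leading and one trailing space.
via_regex = re.sub(r"\s+", " ", raw)

# split() with no arguments splits on any whitespace run and yields no empty
# strings at the ends, so join() gives a result that is already trimmed.
via_split_join = " ".join(raw.split())

print(repr(via_regex))
print(repr(via_split_join))
```

With the regex version, an extra strip() call is still needed to get the same result as split-join.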

- JAB

- user15163 · Accepted Answer

I used:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//ul/li')
    items = []
    for site in sites:
        item = DmozItem()
        item['title'] = map(unicode.strip,site.xpath('a/text()').extract())
        item['link'] = map(unicode.strip, site.xpath('a/@href').extract())
        item['desc'] = map(unicode.strip, site.xpath('text()').extract())
        items.append(item)
    print "hello"
    return items

And it works fine. I am not sure what it is exactly, but I am still reading up on unicode.strip. Hope this helps.
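For readers wondering what map(unicode.strip, ...) does, here is a rough Python 3 equivalent. It assumes str in place of Python 2's unicode type, and the extracted list is a stand-in for real extract() output.

```python
# Stand-in for the list returned by site.xpath('text()').extract().
extracted = ["\r\n\t\t\t\r\n ", " \r\n\t\t\t\r\n - Some description.\r\n "]

# str.strip is the method looked up on the class, so str.strip(s) is the same
# as s.strip(). map() calls it once per element. In Python 3 map() returns a
# lazy iterator, so list() is needed to get a list back (Python 2 returns a list).
cleaned = list(map(str.strip, extracted))

# An equivalent list comprehension.
same = [s.strip() for s in extracted]

print(cleaned)
```

Whitespace-only fragments become empty strings rather than disappearing, so the list keeps its original length.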