Scrapy - Remove HTML tags in list output

Question

10

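I'm trying to write a small script that extracts Steam game tags and stores them in a CSV file. The problem I'm running into is that I don't know how to remove the HTML tags from the output. Here is my code: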

from __future__ import absolute_import
import scrapy
from Example.items import SteamItem
from scrapy.selector import HtmlXPathSelector


class SteamSpider(scrapy.Spider):
    name = 'steamspider'
    allowed_domains = ['https://store.steampowered.com/app']
    start_urls = ["https://store.steampowered.com/app/578080/PLAYERUNKNOWNS_BATTLEGROUNDS/",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        tags = hxs.xpath('//*[@id="game_highlights"]/div[1]/div/div[4]/div/div[2]')
        for sel in tags:
            item = SteamItem()
            item['gametags'] = sel.xpath('.//a/text()').extract()
            item['gametitle'] = sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()
            yield item

My item class:

class SteamItem(scrapy.Item):
    #defining item fields
    url = scrapy.Field()
    gametitle = scrapy.Field()
    gametags = scrapy.Field()

My output looks like this:

[u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tSurvival\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tShooter\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tMultiplayer\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tPvP\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tThird-Person Shooter\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tFPS\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tAction\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tBattle Royale\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tOnline Co-Op\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tTactical\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tCo-op\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tEarly Access\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tFirst-Person\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tViolent\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tStrategy\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tThird Person\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tCompetitive\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tTeam-Based\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tDifficult\t\t\t\t\t\t\t\t\t\t\t\t',
 u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tSimulation\t\t\t\t\t\t\t\t\t\t\t\t'],

My goal is to remove all of the "u'\r\n\t.....\t" tags.

Do you have any good ideas?

Thanks!

- r_user

7 Answers

3

Simply use remove_tags:

from scrapy.utils.markup import remove_tags
ToRemove = remove_tags(YourOutPut)
print(ToRemove)

This will solve your problem.

- JB.py

1

Not recommended anymore: warning: <string>:1: ScrapyDeprecationWarning: Module scrapy.utils.markup is deprecated. Please import from w3lib.html instead. - Gaket

1

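To get the title and the tags, you can try the script below. To get rid of the tabs and spaces, call .strip() on the result of .extract_first(), and on each string returned by .extract().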

import scrapy

class SteamSpider(scrapy.Spider):
    name = 'steamspider'
    start_urls = ["https://store.steampowered.com/app/578080/PLAYERUNKNOWNS_BATTLEGROUNDS/",]

    def parse(self, response):
        title = response.xpath("//*[@class='apphub_AppName']/text()").extract_first().strip()
        tag_name = [item.strip() for item in response.xpath('//*[contains(@class,"popular_tags")]/*[@class="app_tag"]/text()').extract()]
        yield {"title":title,"tagname":tag_name}

- robots.txt

1

In new code, .get() can be used as a more Pythonic alternative to .extract_first(). - Gallaecio
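
Since the question's end goal is a CSV file, a cleaned tag list like the one yielded above still has to be flattened into a single cell. The following is only a sketch using the standard library: the `item` dict is a hypothetical stand-in for what the spider yields, and the `item_to_row` helper and the `|` separator are assumptions made for illustration, not part of Scrapy.

```python
import csv
import io

# Hypothetical item, shaped like the dict yielded by the spider above.
item = {
    'title': 'Example Game',
    'tagname': ['Survival', 'Shooter', 'Multiplayer'],
}


def item_to_row(item, separator='|'):
    """Flatten one item into a CSV row, joining the tag list into one cell."""
    return [item['title'], separator.join(item['tagname'])]


# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['title', 'tags'])
writer.writerow(item_to_row(item))

csv_text = buffer.getvalue()
print(csv_text)
```

A separator other than the comma keeps the tags in one column without forcing the csv module to quote the cell.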

0

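The first thing to understand is that what you want to remove are not "HTML tags" but just whitespace: mostly tabs, plus a few newlines. You may want to retitle your question to express that better.

As for removing the whitespace, the HTML library you are using may provide a function for it.

If not, or for the more general case of this problem, Python strings have a strip method (and some related methods) that returns the string with all leading and trailing whitespace removed. So you can do something like this: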

item['field'] = sel.xpath('...').extract().strip()

More information is available in the Python manual: https://docs.python.org/2/library/string.html#string.strip

- Cheetah

1

.extract() returns a list; you can't apply strip() to a list. - Sagun Shrestha
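
The point made in this comment can be checked without Scrapy. Below is a minimal sketch in plain Python; the `extracted` list is a hand-copied sample of the question's output standing in for what `.extract()` returns, not a live selector result.

```python
# Sample values copied from the question's output. They stand in for the list
# that .extract() returns (an assumption for illustration; no Scrapy needed).
extracted = [
    '\r\n\t\t\t\t\t\t\t\t\t\t\t\tSurvival\t\t\t\t\t\t\t\t\t\t\t\t',
    '\r\n\t\t\t\t\t\t\t\t\t\t\t\tThird-Person Shooter\t\t\t\t\t\t\t\t\t\t\t\t',
]

# Calling .strip() on the list itself fails: lists have no such method.
try:
    extracted.strip()
    list_strip_error = None
except AttributeError as exc:
    list_strip_error = exc

# Stripping each string individually removes leading and trailing whitespace.
cleaned = [value.strip() for value in extracted]

print(type(list_strip_error).__name__)
print(cleaned)
```

The list comprehension is the same fix the later answers apply directly to the selector output.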

0

item['gametags'] = sel.xpath('.//a/text()').extract()
item['gametitle'] = sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()

Strip your values as you extract them:

item['gametags'] = [val.strip() for val in sel.xpath('.//a/text()').extract()]

The same applies to your second extractor :)

- ThunderMind

0

You can use the strip method. Since you are using extract(), which ultimately returns a list, you can try this:

item['gametags'] = list(map(str.strip, sel.xpath('.//a/text()').extract()))
item['gametitle'] = list(map(str.strip, sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()))

You can also follow this blog post on scraping Steam data.

- Sagun Shrestha

0

Using strip() is one way. However, if you want to do this entirely in XPath, take a look at the normalize-space function. In your case, just change the value extraction to:

item['gametags'] = [a.xpath('normalize-space(.)').extract_first() for a in sel.xpath('.//a')]
item['gametitle'] = sel.xpath('normalize-space(//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3])').extract_first()

- Tomáš Linhart
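
To make the difference between strip() and normalize-space concrete, here is a rough pure-Python approximation of what normalize-space does. This is a sketch, not the XPath implementation: the `normalize_space` helper is a hypothetical name, and Python's `str.split()` treats more characters as whitespace than XPath does (XPath only counts space, tab, carriage return and line feed).

```python
def normalize_space(text):
    """Approximate XPath normalize-space(): trim both ends and collapse
    every internal run of whitespace into a single space."""
    # str.split() with no arguments splits on runs of whitespace and
    # drops the empty pieces at the start and end.
    return ' '.join(text.split())


# Made-up sample with whitespace inside the value as well as around it.
raw = '\r\n\t\t\tThird-Person \t Shooter\t\t\t'

stripped = raw.strip()             # only the ends are cleaned
normalized = normalize_space(raw)  # internal whitespace is collapsed too

print(repr(stripped))
print(repr(normalized))
```

For the tags in the question both approaches give the same result, because the whitespace only appears at the ends of each value.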

Page content provided by Stack Overflow.

- Len Lin · Accepted Answer

Since you are using the Scrapy framework, you can use w3lib, a library that is installed along with Scrapy.

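import w3lib.html

# raw_html is a placeholder for the markup string you want to clean
raw_html = '<a href="#">Survival</a>'
output = w3lib.html.remove_tags(raw_html)
print(output)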

scrapy.utils.markup was deprecated in 2019; use w3lib instead.

See https://w3lib.readthedocs.io/en/latest/index.html for more information.