Scrapy cannot crawl pages on subdomains that contain underscores

Question

3

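I am trying to crawl pages on a subdomain that contains an underscore, for example: https://taxi-3-extreme-rush_1.en.softonic.com

I checked the specification and found that subdomains can contain underscores. I tried using link.encode('idna'), but it still does not work.

I get this error: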

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1297, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib64/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 60, in download_request
    return agent.download_request(request)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 285, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1596, in request
    endpoint = self._getEndpoint(parsedURI)
  File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1580, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/usr/lib64/python2.7/site-packages/twisted/web/client.py", line 1456, in endpointForURI
    uri.port)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/contextfactory.py", line 59, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
  File "/usr/lib64/python2.7/site-packages/twisted/internet/_sslverify.py", line 1201, in __init__
    self._hostnameBytes = _idnaBytes(hostname)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/_sslverify.py", line 87, in _idnaBytes
    return idna.encode(text)
  File "/usr/lib/python2.7/site-packages/idna/core.py", line 355, in encode
    result.append(alabel(label))
  File "/usr/lib/python2.7/site-packages/idna/core.py", line 276, in alabel
    check_label(label)
  File "/usr/lib/python2.7/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
InvalidCodepoint: Codepoint U+005F at position 20 of u'taxi-3-extreme-rush_1' not allowed
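
The error message names a specific hostname label and a 1-based position within it. As a rough diagnostic, here is a minimal sketch that finds underscore-containing labels in a URL before the request is issued. It is written for Python 3 and uses only the standard library; the helper name `find_underscore_labels` is hypothetical and is not part of Scrapy, Twisted, or idna.

```python
from urllib.parse import urlsplit


def find_underscore_labels(url):
    """Return (label, position) pairs for hostname labels containing '_'.

    Positions are 1-based, matching the numbering used in the idna error.
    Only the first underscore in each label is reported.
    """
    hostname = urlsplit(url).hostname or ""
    hits = []
    for label in hostname.split("."):
        index = label.find("_")
        if index != -1:
            hits.append((label, index + 1))
    return hits


print(find_underscore_labels("https://taxi-3-extreme-rush_1.en.softonic.com"))
# prints [('taxi-3-extreme-rush_1', 20)]
```

The reported label and position agree with the `InvalidCodepoint` message in the traceback above.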

- Verz1Lka

3 Answers

1

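This appears to be a Twisted issue.

There is a workaround, described here:

Quote: Looking at the Twisted code, it uses the idna library if it is available. If I uninstall idna and make the same request again, it succeeds.

idna is installed along with pip install twisted[tls] or pip install treq.

I tried uninstalling idna with pip uninstall idna, and the request does indeed go through.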
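
Because the behaviour described in the quote depends on whether idna can be imported, it may help to check the active environment before and after uninstalling. A minimal sketch using only the standard library; the function name `idna_is_installed` is hypothetical.

```python
import importlib.util


def idna_is_installed():
    """Return True if the third-party `idna` package can be imported."""
    return importlib.util.find_spec("idna") is not None


if idna_is_installed():
    print("idna is importable in this environment")
else:
    print("idna is not importable in this environment")
```

Run it with the same interpreter (or virtualenv) that runs the spider, since another environment may have a different set of packages installed.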

- Granitosaurus

0

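I tried using Selenium, and it resolves the URL correctly. I can verify this because if I disable the spider's middleware (which contains my Selenium code), the same error is thrown.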

raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 3 of 'xyx_abc' not allowed

- hydradon

- nanvel · Accepted Answer

A workaround:

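import idna

# Monkeypatch idna so that U+005F (underscore) is accepted as a valid code point.
# This changes idna's behaviour for the whole process, so run it before any request is made.
idna.idnadata.codepoint_classes['PVALID'] = tuple(
    sorted(list(idna.idnadata.codepoint_classes['PVALID']) + [0x5f0000005f])
)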
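
If the patch might be imported more than once, or idna might be absent, the same workaround can be wrapped in a guarded function. This is only a sketch under those assumptions: the function name `allow_underscore_in_idna` is hypothetical, and the patched value is copied unchanged from the workaround above.

```python
def allow_underscore_in_idna():
    """Apply the PVALID monkeypatch once; return False if idna is missing."""
    try:
        import idna
    except ImportError:
        # Nothing to patch when idna is not installed.
        return False

    entry = 0x5f0000005f  # value taken from the workaround above
    pvalid = idna.idnadata.codepoint_classes['PVALID']
    if entry not in pvalid:
        # Rebuild the tuple in sorted order, as the original workaround does.
        idna.idnadata.codepoint_classes['PVALID'] = tuple(
            sorted(list(pvalid) + [entry])
        )
    return True


allow_underscore_in_idna()
```

The function has to run before the first request is made. Calling it from a module that the Scrapy project imports at startup is one option, though that placement is an assumption and not something the answer specifies.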