Getting the proxy IP address with a Scrapy crawler

Question

python proxy web-scraping scrapy web-crawler

5

I am using Tor to crawl web pages. I started the Tor and Polipo services and added the following downloader middleware:

class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy (Polipo's local address, with the URL scheme)
        request.meta['proxy'] = "http://127.0.0.1:8123"

Now, how can I make sure that Scrapy uses a different IP address for its requests?

- cyn0

3 Answers

8

The quickest option is to use the Scrapy shell and check that meta contains the proxy key.

From the project root:

$ scrapy shell http://google.com
>>> request.meta
{'handle_httpstatus_all': True, 'redirect_ttl': 20, 'download_timeout': 180, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 0.4804518222808838, 'download_slot': 'google.com'}
>>> response.meta
{'download_timeout': 180, 'handle_httpstatus_all': True, 'redirect_ttl': 18, 'redirect_times': 2, 'redirect_urls': ['http://google.com', 'http://www.google.com/'], 'depth': 0, 'proxy': 'http://127.0.0.1:8123', 'download_latency': 1.5814828872680664, 'download_slot': 'google.com'}

This lets you check that the middleware is configured correctly and that the request went through the proxy.
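The same check can be scripted instead of read by eye. The following is a minimal sketch, not part of the original answer: `proxy_from_meta` is a hypothetical helper name, and it only assumes that the proxy is stored under the `proxy` key of the meta dict, as in the shell output above.

```python
from urllib.parse import urlsplit


def proxy_from_meta(meta):
    """Return (host, port) of the proxy recorded in a meta dict, or None."""
    proxy = meta.get("proxy")
    if not proxy:
        return None
    # urlsplit only fills in hostname/port when a scheme (or a leading //) is present
    parts = urlsplit(proxy if "://" in proxy else "//" + proxy)
    return parts.hostname, parts.port


# Sample dict modeled on the shell output above
meta = {"download_timeout": 180, "proxy": "http://127.0.0.1:8123"}
print(proxy_from_meta(meta))  # ('127.0.0.1', 8123)
```

Inside a spider callback you could pass `response.meta` to the helper and log a warning when it returns None.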

- alecxe

0

It can also be accessed like this:

    def parse_response(self, response):
        print(response.ip_address)

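For more information, see: https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=ip_address#scrapy.http.Response.ip_address (the documentation describes this attribute as the IP address of the server from which the response originated).

In the Scrapy shell, it looks like this: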

$ scrapy shell www.wikipedia.org
>>> response.ip_address
IPv4Address('153.140.27.68')

- fschn

- bosnjak · Accepted Answer

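You can yield a first request to check your public IP address, then compare it with the IP you see when you visit http://checkip.dyndns.org/ without Tor/VPN. If they differ, Scrapy is clearly using a different IP.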

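from scrapy import Request

# Both methods go inside your spider class.
def start_requests(self):
    # Ask the IP-echo page first; check_ip prints the address it reports
    yield Request('http://checkip.dyndns.org/', callback=self.check_ip)
    # yield other requests from start_urls here if needed

def check_ip(self, response):
    # Pull the first dotted-quad IPv4 address out of the page body
    pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
    print("My public IP is: " + pub_ip)

    # yield other requests here if needed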
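
The comparison step described above can also be automated. This is only a sketch under stated assumptions: `extract_ip` and `is_proxied` are hypothetical helper names, the sample body string is made up (it only imitates an IP-echo page), and `direct_ip` stands for the address you noted when visiting the page without Tor/VPN.

```python
import re

# Same dotted-quad pattern as in the answer, written as a compiled raw string
IP_PATTERN = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def extract_ip(body_text):
    """Return the first IPv4-looking string in body_text, or None."""
    match = IP_PATTERN.search(body_text)
    return match.group(0) if match else None


def is_proxied(body_text, direct_ip):
    """True if the page reports an IP that differs from the direct one."""
    seen_ip = extract_ip(body_text)
    return seen_ip is not None and seen_ip != direct_ip


# Made-up sample body and addresses (documentation-only IP ranges)
sample = "Current IP Address: 203.0.113.7"
print(extract_ip(sample))                   # 203.0.113.7
print(is_proxied(sample, "198.51.100.23"))  # True
```

In `check_ip` you could call `is_proxied(response.text, direct_ip)` and stop the crawl if it returns False, rather than comparing the printed address by hand.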