How do I correctly make HTTPS requests through a proxy server such as luminati.io?

Question

27

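This is the API provided by luminati.io, a premium proxy provider. However, it returns bytes rather than a dictionary, so the response has to be converted to a dictionary before the ip and port can be extracted:

Every request uses a new peer proxy, because the IP rotates on each request.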

#!/usr/bin/env python
import csv
import json
import sys
import time

import requests

print('If you get error "ImportError: No module named \'six\'" '
      'install six:\n$ sudo pip install six')

PROXY_URL = 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005'

# Python 2 reaches urllib through six; Python 3 has urllib.request built in.
if sys.version_info[0] == 2:
    from six.moves.urllib import request
else:
    import urllib.request as request

opener = request.build_opener(request.ProxyHandler({'http': PROXY_URL}))
proxy_details = opener.open('http://lumtest.com/myip.json').read()
# read() returns bytes, so decode before parsing the JSON.
proxy_dictionary = json.loads(proxy_details.decode('utf-8'))

print(proxy_dictionary)
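
The bytes-to-dictionary step can be checked without touching the network. A minimal sketch, assuming a byte string shaped like the lumtest.com response quoted later in this question; `sample_response` is a hard-coded stand-in for illustration, not a live reply:

```python
import json

# Hard-coded stand-in for opener.open(...).read(); the values are copied
# from the sample output quoted in the question.
sample_response = (
    b'{"ip": "84.22.151.191", "country": "RU", '
    b'"asn": {"asnum": 57129, "org_name": "Optibit LLC"}}'
)

# read() yields bytes; decode to text before handing it to the JSON parser.
proxy_dictionary = json.loads(sample_response.decode("utf-8"))

print(type(proxy_dictionary).__name__)
print(proxy_dictionary["ip"])
print(sorted(proxy_dictionary))
```

Listing the top-level keys is a quick way to see which fields the service actually returns before building a proxy address from them.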

Next, I plan to use the ip and port with the requests module to connect to the target website:

headers = {'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0'}

if __name__ == "__main__":

    search_keyword = input("Enter the search keyword: ")
    page_number =  int(input("Enter total number of pages: "))

    for i in range(1,page_number+1):
        time.sleep(10)

        link = 'https://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'
        proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])
        print(proxy)
        req = requests.get(link,headers=headers,proxies={"https":proxy})

My problem is that the call to requests fails with an error. When I changed proxies={"https":proxy} to proxies={"http":proxy}, it went through once, but other than that the proxy fails to connect.

print_dictionary = {'ip': '84.22.151.191', 'country': 'RU', 'asn': {'asnum': 57129, 'org_name': 'Optibit LLC'}, 'geo': {'city': 'Krasnoyarsk', 'region': 'KYA', 'postal_code': '660000', 'latitude': 56.0097, 'longitude': 92.7917, 'tz': 'Asia/Krasnoyarsk'}}

The image below shows the details of the peer proxy:

print(proxy) yields 84.22.151.191:57129, which is passed to the requests.get method.

The error I receive:

(Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000282DDD592B0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',)))

I tested removing the proxies={"https":proxy} argument from the requests call, and the scraping works fine. So either the proxy is at fault or I am accessing it the wrong way.

- Pherdindy

Hmm, I can't test it; I get "urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>". What does the "@127.0.3.1:20005" at the end of the proxy mean? Are you trying to set up a proxy locally? - CristiFati

Are you using your ISP's ASN as the proxy port? You mention the port, but only in a comment. Also, shouldn't the proxy value include the protocol as well, e.g. http://84.22.151.191:57129? - CristiFati

@CristiFati The @127.0.3.1:20005 is what my application uses to connect to the Luminati Proxy Manager, which then returns a peer proxy, 84.22.151.191:57129, that I use to connect to and scrape the site of interest. Since I defined proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"]), proxies={"https":proxy} is proxies={"https":84.22.151.191:57129}. Do you mean it has to be proxies={"https":"https://84.22.151.191:57129"}? - Pherdindy

Note that you won't be able to connect with 'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005', because I changed the details that give access to the service with my username and password. - Pherdindy

I also tried the format proxies={"https":"https://84.22.151.191:57129"}, but got the same error. - Pherdindy
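
CristiFati's question about the ASN can be checked against the sample response itself: it has no port field, and `asn.asnum` is the autonomous system number of the peer's network. A small sketch using values from the question; the `find_port_keys` helper is made up here for illustration:

```python
# Subset of the sample response quoted in the question.
sample = {
    'ip': '84.22.151.191',
    'country': 'RU',
    'asn': {'asnum': 57129, 'org_name': 'Optibit LLC'},
}

def find_port_keys(data):
    """Return every key, at any nesting level, whose name contains 'port'."""
    found = []
    for key, value in data.items():
        if 'port' in key.lower():
            found.append(key)
        if isinstance(value, dict):
            found.extend(find_port_keys(value))
    return found

port_keys = find_port_keys(sample)
print(port_keys)

# What the question's code builds: ip + ':' + ASN, not ip + ':' + port.
proxy = sample['ip'] + ':' + str(sample['asn']['asnum'])
print(proxy)
```

If that reading is right, 84.22.151.191:57129 is not an address anything is listening on, which would be consistent with the "actively refused" error.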

2 Answers

0

A bit late, but this is what worked for me.

proxies = {'http': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005', 'https': 'http://lum-customer-hl_1247574f-zone-static:lnheclanmc@127.0.3.1:20005'}
            
req = requests.get(link,headers=headers,proxies=proxies)

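With the proxies defined this way, I was able to reach the link and get a response. I believe Luminati needs its own proxy credentials in order to rotate IPs and access the link.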
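
The same mapping can be built from its parts so the credentials are not repeated. A minimal sketch; `build_proxies` is a hypothetical helper, and the username, password, host, and port are the placeholder values from this thread, not working credentials:

```python
from urllib.parse import quote

def build_proxies(username, password, host, port):
    """Return a requests-style proxies mapping that sends both schemes through one proxy."""
    # Percent-encode credentials so characters such as '@' or ':' cannot break the URL.
    auth = quote(username, safe='') + ':' + quote(password, safe='')
    proxy_url = 'http://{}@{}:{}'.format(auth, host, port)
    # The 'https' key picks the proxy for https:// targets; the proxy URL itself stays http://.
    return {'http': proxy_url, 'https': proxy_url}

proxies = build_proxies('lum-customer-hl_1247574f-zone-static', 'lnheclanmc', '127.0.3.1', 20005)
print(proxies['https'])
```

The result is passed exactly as in the answer above: requests.get(link, headers=headers, proxies=proxies).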

- Sachin Nair

- Nazim Kerimbekov · Accepted Answer

When changing proxies={"https":proxy} to proxies={"http":proxy}, you also need to make sure your link is http rather than https, so try replacing:

link = 'https://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'

with:

link = 'http://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'

Your overall code should look like this:

headers = {'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0'}

if __name__ == "__main__":

    search_keyword = input("Enter the search keyword: ")
    page_number =  int(input("Enter total number of pages: "))

    for i in range(1,page_number+1):
        time.sleep(10)

        link = 'http://www.experiment.com.ph/catalog/?_keyori=ss&ajax=true&from=input&page='+str(i)+'&q='+str(search_keyword)+'&spm=a2o4l.home.search.go.239e6ef06RRqVD'
        proxy = proxy_dictionary["ip"] + ':' + str(proxy_dictionary["asn"]["asnum"])
        print(proxy)
        req = requests.get(link,headers=headers,proxies={"http":proxy})

Hope this helps!
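
The reason the scheme has to match is that requests picks a proxy by the scheme of the target URL. The sketch below is a simplified stand-in for that lookup, not the library's own code; `pick_proxy` is a hypothetical name, and the real lookup also considers per-host and "all" entries:

```python
from urllib.parse import urlparse

def pick_proxy(url, proxies):
    """Simplified lookup: return the proxy registered for the URL's scheme, or None."""
    scheme = urlparse(url).scheme
    return proxies.get(scheme)

http_only = {'http': 'http://84.22.151.191:57129'}

for_http = pick_proxy('http://www.experiment.com.ph/catalog/', http_only)
for_https = pick_proxy('https://www.experiment.com.ph/catalog/', http_only)

print(for_http)
print(for_https)
```

A result of None means no proxy entry applies, so an https:// link paired with only an "http" entry would be fetched without that proxy. That may be why the request in the question went through once after the key was changed.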