我正在尝试构建一个使用Tor代理的多线程爬虫: 我使用以下方法来建立Tor连接:
from stem import Signal
from stem.control import Controller
controller = Controller.from_port(port=9151)
def connectTor():
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket
def renew_tor():
global request_headers
request_headers = {
"Accept-Language": "en-US,en;q=0.5",
"User-Agent": random.choice(BROWSERS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Referer": "http://thewebsite2.com",
"Connection": "close"
}
controller.authenticate()
controller.signal(Signal.NEWNYM)
以下是URL获取器:
def get_soup(url):
while True:
try:
connectTor()
r = requests.Session()
response = r.get(url, headers=request_headers)
the_page = response.content.decode('utf-8',errors='ignore')
the_soup = BeautifulSoup(the_page, 'html.parser')
if "captcha" in the_page.lower():
print("flag condition matched while url: ", url)
#print(the_page)
renew_tor()
else:
return the_soup
break
except Exception as e:
print ("Error while URL :", url, str(e))
我正在创建一个多线程抓取任务:
with futures.ThreadPoolExecutor(200) as executor:
for url in zurls:
future = executor.submit(fetchjob,url)
然后我遇到了以下错误,但是在使用多进程时我没有看到这个错误:
Socket connection failed (Socket error: 0x01: General SOCKS server failure)
我希望您能提供任何建议,以避免出现袜子错误,并改善爬取方法的性能,使其支持多线程。