我认为你的问题不是机器人检测。你无法仅使用
requests
从该页面获取结果,因为它在幕后进行XHR请求。因此,您必须使用Selenium、Splash等工具,但对于此情况似乎不可行。
然而,如果您在页面上进行一些研究,可以找到在幕后请求以显示结果的URL。我进行了这方面的研究,并发现了这个页面(
https://ms-mt--api-web.spain.advgo.net/search),它返回一个json,因此在解析方面会更容易。使用Chrome开发工具,我得到了curl请求并将其映射到Python requests,获得了这段代码:
import json
import requests
headers = {
'authority': 'ms-mt--api-web.spain.advgo.net',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/json, text/plain, */*',
'x-adevinta-channel': 'web-desktop',
'x-schibsted-tenant': 'coches',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'content-type': 'application/json;charset=UTF-8',
'origin': 'https://www.coches.net',
'sec-fetch-site': 'cross-site',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.coches.net/',
'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}
data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'
while True:
response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
print(response)
if not response["items"]:
break
data_dict = json.loads(data)
data_dict["pagination"]["page"] = data_dict["pagination"]["page"]+1
data = json.dumps(data_dict)
可能有很多无用的头部和主体信息,您可以编写并测试代码以改进它。