Is there a version of Selenium WebDriver that is not detectable?

42
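I'm running the Chrome driver with Selenium on an Ubuntu server behind a residential proxy network, yet my Selenium is getting detected. Is there a way to make the Chrome driver and Selenium 100% undetectable? I've been trying for a long time, including: trying different versions of Chrome; adding several flags and removing some words from the Chrome driver file; using a proxy (residential ones too) in incognito mode; loading profiles; random mouse movements; randomizing everything. I'm looking for a true version of Selenium that is 100% undetectable, if one exists, or another way of automating that isn't detected by bot trackers. This is part of the browser startup: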
import random

from pyvirtualdisplay import Display  # assumption: Display comes from pyvirtualdisplay
from selenium import webdriver

# useragents_desktop (list of user agent strings) and pluginfile (path to the
# proxy extension) are defined elsewhere in the script.

sx = random.randint(1000, 1500)
sn = random.randint(3000, 4500)

display = Display(visible=0, size=(sx, sn))
display.start()

uag = random.choice(useragents_desktop)

chrome_options = webdriver.ChromeOptions()

# this is to prevent ip leaking
# All prefs go into one dict: a second add_experimental_option("prefs", ...)
# call would replace the first one instead of merging with it.
preferences = {
    "webrtc.ip_handling_policy": "disable_non_proxied_udp",
    "webrtc.multiple_routes_enabled": False,
    "webrtc.nonproxied_udp_enabled": False,
    "profile.managed_default_content_settings.images": 2,
}
chrome_options.add_experimental_option("prefs", preferences)

chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-impl-side-painting")
chrome_options.add_argument("--disable-setuid-sandbox")
chrome_options.add_argument("--disable-seccomp-filter-sandbox")
chrome_options.add_argument("--disable-breakpad")
chrome_options.add_argument("--disable-client-side-phishing-detection")
chrome_options.add_argument("--disable-cast")
chrome_options.add_argument("--disable-cast-streaming-hw-encoding")
chrome_options.add_argument("--disable-cloud-import")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-session-crashed-bubble")
chrome_options.add_argument("--disable-ipv6")
chrome_options.add_argument("--allow-http-screen-capture")
chrome_options.add_argument("--start-maximized")

wsize = "--window-size=" + str(sx - 10) + "," + str(sn - 10)
chrome_options.add_argument(wsize)

chrome_options.add_argument("--blink-settings=imagesEnabled=true")
chrome_options.add_argument("--user-agent=" + uag)
chrome_options.add_extension(pluginfile)  # this is for the residential proxy
# options= replaces the deprecated chrome_options= keyword
driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", options=chrome_options)

2
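Have you looked at this question? Bot detection is different for every website. It might work through JavaScript or check something on the server, and some automation tools also set a proper user agent string. - Bob
I've done 99% of that, and a lot of other things too, but nothing worked... - Grman
Have you tried overriding the userAgent? Are you using a headless browser? - Adi Ohana
Short answer: no, it's not possible. - Corey Goldberg
A Java Robot might be a solution (https://docs.oracle.com/javase/7/docs/api/java/awt/Robot.html). I don't think that would be detectable on the server side, but I really don't know. Using Selenium will be a moving target, because the companies that detect it keep changing their detection tactics. You might consider being a good internet citizen and not automating sites that don't want it... or asking for permission first. - pcalkins
3 Answers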
3个回答

61
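In fact, detection of a Selenium-driven WebDriver does not depend on any specific Selenium, Chrome, or ChromeDriver version. The websites themselves can inspect the network traffic and identify whether the browser client, i.e. the web browser, is WebDriver-controlled. However, some generic approaches to avoid getting detected while web scraping are as follows: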

@Antoine Vastel, in his blog post Detecting Chrome Headless, describes several approaches that distinguish the Chrome browser from the headless Chrome browser.

  • User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With Chrome version 59 it has the following value:

    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36
    
    • A check for the presence of Chrome headless can be done through:

      if (/HeadlessChrome/.test(window.navigator.userAgent)) {
          console.log("Chrome headless detected");
      }
      
  • Plugins: navigator.plugins returns an array of plugins present in the browser. Typically, on Chrome we find default plugins, such as Chrome PDF viewer or Google Native Client. On the opposite, in headless mode, the array returned contains no plugin.

    • A check for the presence of Plugins can be done through:

      if(navigator.plugins.length == 0) {
          console.log("It may be Chrome headless");
      }
      
  • Languages: In Chrome two Javascript attributes enable to obtain languages used by the user: navigator.language and navigator.languages. The first one is the language of the browser UI, while the second one is an array of string representing the user’s preferred languages. However, in headless mode, navigator.languages returns an empty string.

    • A check for the presence of Languages can be done through:

      if(navigator.languages == "") {
           console.log("Chrome headless detected");
      }
      
  • WebGL: WebGL is an API to perform 3D rendering in an HTML canvas. With this API, it is possible to query for the vendor of the graphic driver as well as the renderer of the graphic driver. With a vanilla Chrome and Linux, we can obtain the following values for renderer and vendor: Google SwiftShader and Google Inc.. In headless mode, we can obtain Mesa OffScreen, which is the technology used for rendering without using any sort of window system and Brian Paul, which is the program that started the open source Mesa graphics library.

    • A check for the presence of WebGL can be done through:

      var canvas = document.createElement('canvas');
      var gl = canvas.getContext('webgl');
      
      var debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
      var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
      var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
      
      if(vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
          console.log("Chrome headless detected");
      }
      
    • Not all Chrome headless will have the same values for vendor and renderer. Others keep values that could also be found on non headless version. However, Mesa Offscreen and Brian Paul indicates the presence of the headless version.

  • Browser features: Modernizr library enables to test if a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.

    • A check for the presence of hairline feature can be done through:

      if(!Modernizr["hairline"]) {
          console.log("It may be Chrome headless");
      }
      
  • Missing image: The last on our list also seems to be the most robust, comes from the dimension of the image used by Chrome in case an image cannot be loaded. In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser, but are different from zero. In a headless Chrome, the image has a width and an height equal to zero.

    • A check for the presence of Missing image can be done through:

      var body = document.getElementsByTagName("body")[0];
      var image = document.createElement("img");
      image.src = "http://iloveponeydotcom32188.jg";
      image.setAttribute("id", "fakeimage");
      body.appendChild(image);
      image.onerror = function(){
          if(image.width == 0 && image.height == 0) {
              console.log("Chrome headless detected");
          }
      }
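
As a rough way to see which of the checks quoted above your own automated session would trip, the same heuristics can be run from Python through execute_script. This is only a sketch under assumptions: the names FINGERPRINT_JS, collect_fingerprint and headless_signals are made up for illustration, the driver argument is assumed to be any object with Selenium's execute_script(script) method, and the Modernizr and missing-image checks are left out because they need an extra library or an asynchronous callback.

```python
# Hypothetical helper: evaluate the headless heuristics quoted above
# against a fingerprint read from the browser.

FINGERPRINT_JS = """
var gl = document.createElement('canvas').getContext('webgl');
var dbg = gl ? gl.getExtension('WEBGL_debug_renderer_info') : null;
return {
    userAgent: navigator.userAgent,
    pluginCount: navigator.plugins.length,
    languages: Array.from(navigator.languages || []),
    webglVendor: dbg ? gl.getParameter(dbg.UNMASKED_VENDOR_WEBGL) : null,
    webglRenderer: dbg ? gl.getParameter(dbg.UNMASKED_RENDERER_WEBGL) : null
};
"""


def collect_fingerprint(driver):
    """Run the collection script; `driver` only needs execute_script()."""
    return driver.execute_script(FINGERPRINT_JS)


def headless_signals(fp):
    """Return the names of the quoted checks that this fingerprint trips."""
    signals = []
    if "HeadlessChrome" in (fp.get("userAgent") or ""):
        signals.append("user_agent")
    if fp.get("pluginCount", 0) == 0:
        signals.append("plugins")
    if not fp.get("languages"):
        signals.append("languages")
    if (fp.get("webglVendor") == "Brian Paul"
            and fp.get("webglRenderer") == "Mesa OffScreen"):
        signals.append("webgl")
    return signals
```

With a real session this would be called as headless_signals(collect_fingerprint(driver)). An empty list only means that none of these particular checks fired, not that the session is undetectable.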
      

Reference

You can find a couple of similar discussions in:


tl; dr


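Hi, thanks for the information. I have done most of that, and more, but I'm still being detected. Is there a link to a script with all of this implemented that I could use as a guide? - Grman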
2
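Selenium identifies itself: https://w3c.github.io/webdriver/#dom-navigatorautomationinformation-webdriver - Corey Goldberg
Hi, sorry for the "newbie" question: what is the point of changing the UserAgent if the bot keeps using the same IP address? In that case, wouldn't changing the UserAgent make the Selenium bot look more suspicious and easier for the website to block? - Upchanges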

14

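Why not try undetected-chromedriver?

Optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

Tested up to the current Chrome beta versions; it also works on Brave Browser and many other Chromium-based browsers. Python 3.6++.

You can install it with: pip install undetected-chromedriver

There are some important things you should be aware of: due to the inner workings of the module, browsing needs to be done programmatically (i.e. using .get(url)). Do not use the GUI to navigate, because navigating with the keyboard and mouse may get you detected! New tabs: same story. If you really need multiple tabs, open the tab with the blank page (hint: the url is data:, including the comma, and yes, the driver accepts it) and do your thing as usual. If you follow these "rules" (which are actually the default behaviour), you will have a great time for now.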

In [1]: import undetected_chromedriver as uc
In [2]: driver = uc.Chrome()
In [3]: driver.execute_script('return navigator.webdriver')
Out[3]: True  # Detectable
In [4]: driver.get('https://distilnetworks.com') # starts magic
In [5]: driver.execute_script('return navigator.webdriver')
Out[5]: None  # Undetectable!
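
The multi-tab advice above could look roughly like the following. This is a hedged sketch, not the package's documented method: open_in_new_tab and BLANK_URL are hypothetical names, and the driver is assumed to expose Selenium 4's switch_to.new_window("tab"), get(url) and current_window_handle. Whether this sequence actually avoids detection on a given site is not something the sketch can guarantee.

```python
# Hypothetical helper sketching the multi-tab advice above.
# Assumes a Selenium 4 style driver (switch_to.new_window, get).

BLANK_URL = "data:,"  # the blank page mentioned above, comma included


def open_in_new_tab(driver, url):
    """Open a new tab on the blank page, then navigate with .get(url)."""
    driver.switch_to.new_window("tab")  # opens a tab and switches to it
    driver.get(BLANK_URL)               # start from the blank page
    driver.get(url)                     # navigate programmatically, not via the GUI
    return driver.current_window_handle
```

The returned handle can be passed to driver.switch_to.window(handle) later to come back to this tab.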

1
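The multi-tab issue is logged here: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/27 - Lam Yip Ming
I had been manually patching chromedriver to keep it from being detected, but then citi.com stopped letting me log in. This got me logging in to citi again, and it is easier than patching it myself. - poleguy
Doesn't work with the Google Patents download button :( (429) - Petar Ulev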
1
How does this package compare to selenium-stealth? - Andreas L.

-1

How about:

import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("C:\\Users\\DusEck\\Desktop\\chromedriver.exe")
username = "username"  # data_user
password = "password"  # data_pass
driver.get("https://www.depop.com/login/")  # get URL
driver.find_element(By.XPATH, '/html/body/div[1]/div/div[3]/div[2]/button[2]').click()  # Accept cookies

n = 1  # Splitter: number of characters sent per send_keys call
# Split each credential into chunks of n characters
split_char = [username[index: index + n] for index in range(0, len(username), n)]
split_char_pw = [password[index: index + n] for index in range(0, len(password), n)]

# Type the username with a random pause before each chunk
for user_letter in split_char:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element(By.ID, "username").send_keys(user_letter)

# Type the password the same way
for pw_letter in split_char_pw:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element(By.ID, "password").send_keys(pw_letter)

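Elaborating a little on the rationale behind your solution would make it easier to understand. - Andreas L.
Your code may solve detection based on typing speed, but it does not address how to make the driver undetectable for sites like Cloudflare. - kaliiiiiiiii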
