Is there a version of Selenium WebDriver that is not detectable?

Question


google-chrome, selenium-webdriver, selenium-chromedriver, webdriver

42

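I'm running ChromeDriver under Selenium on an Ubuntu server behind a residential proxy network, yet my Selenium is being detected. Is there a way to make ChromeDriver and Selenium 100% undetectable? I've been trying for a long time, including: trying different versions of Chrome, adding several flags and removing some words from the ChromeDriver file, using proxies (residential ones too) in incognito mode, loading profiles, random mouse movements, and randomizing everything. I'm looking for a true version of Selenium that is 100% undetectable, if one exists, or another way of automating that bot trackers can't detect. This is part of how the browser is started: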

import random

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# useragents_desktop (a list of user-agent strings) and pluginfile (path to the
# proxy extension) are assumed to be defined earlier in the script.

sx = random.randint(1000, 1500)
sn = random.randint(3000, 4500)

# Virtual display for the headless server
display = Display(visible=0, size=(sx, sn))
display.start()

# Pick a random desktop user agent
uag = random.choice(useragents_desktop)

chrome_options = webdriver.ChromeOptions()

# this is to prevent ip leaking
preferences = {
    "webrtc.ip_handling_policy": "disable_non_proxied_udp",
    "webrtc.multiple_routes_enabled": False,
    "webrtc.nonproxied_udp_enabled": False,
}
chrome_options.add_experimental_option("prefs", preferences)
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-impl-side-painting")
chrome_options.add_argument("--disable-setuid-sandbox")
chrome_options.add_argument("--disable-seccomp-filter-sandbox")
chrome_options.add_argument("--disable-breakpad")
chrome_options.add_argument("--disable-client-side-phishing-detection")
chrome_options.add_argument("--disable-cast")
chrome_options.add_argument("--disable-cast-streaming-hw-encoding")
chrome_options.add_argument("--disable-cloud-import")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-session-crashed-bubble")
chrome_options.add_argument("--disable-ipv6")
chrome_options.add_argument("--allow-http-screen-capture")
chrome_options.add_argument("--start-maximized")

wsize = "--window-size=" + str(sx - 10) + "," + str(sn - 10)
chrome_options.add_argument(wsize)

# Note: setting "prefs" a second time replaces the WebRTC prefs set above
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

chrome_options.add_argument("blink-settings=imagesEnabled=true")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("user-agent=" + uag)
chrome_options.add_extension(pluginfile)  # this is for the residential proxy
# Selenium 4 style: the driver path goes in a Service object
driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"), options=chrome_options)

- Grman

2

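Have you looked at this question? Every website detects bots differently. The detection may run in JavaScript or check something on the server, and some automation tools also set a suitable user-agent string. - Bob

I've done 99% of that, plus a lot of other things, but nothing works... - Grman

Have you tried overriding the userAgent? Are you using a headless browser? - Adi Ohana

Short answer: no, it's not possible. - Corey Goldberg

Possible duplicate of: Can a website detect when you are using Selenium with chromedriver? - Corey Goldberg

A Java Robot might be a solution (https://docs.oracle.com/javase/7/docs/api/java/awt/Robot.html). I think that would not be detectable server-side, but I really don't know. Selenium will be a moving target, because the companies that detect it keep changing how they do so. You might consider being a good internet citizen and not automating sites that don't want it... or asking for permission first. - pcalkins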

3 Answers

14

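Why not try undetected-chromedriver?

An optimized Selenium ChromeDriver patch that does not trigger anti-bot services such as Distil Networks / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

Tested up to the current Chrome beta versions; it also works with Brave and many other Chromium-based browsers. Python 3.6++.

You can install it with: pip install undetected-chromedriver

There are some important things you should be aware of: because of the module's inner workings, browsing must be done programmatically (i.e. using .get(url)). Never use the GUI to navigate, since navigating with the keyboard and mouse can be detected! New tabs: same story. If you really need multiple tabs, open the tab on a blank page (hint: the URL is data:, including the comma, and yes, the driver accepts it) and carry on as usual. If you follow these "rules" (which are really the default behavior), you will have a great time.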

In [1]: import undetected_chromedriver as uc
In [2]: driver = uc.Chrome()
In [3]: driver.execute_script('return navigator.webdriver')
Out[3]: True  # Detectable
In [4]: driver.get('https://distilnetworks.com') # starts magic
In [5]: driver.execute_script('return navigator.webdriver')  # returns None: Undetectable!
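
The "new tab" rule above can be sketched as a small helper. This is a minimal sketch and not part of undetected-chromedriver: `open_in_new_tab` is a hypothetical name, and it relies only on the standard Selenium driver members `execute_script`, `window_handles`, `switch_to.window` and `get`.

```python
def open_in_new_tab(driver, url):
    """Open a blank `data:,` tab, switch to it, then navigate with .get().

    `driver` is assumed to be any Selenium-compatible WebDriver instance,
    for example the `uc.Chrome()` object created above.
    """
    # Remember which tabs exist before opening the new one.
    handles_before = set(driver.window_handles)
    # Open the new tab on the blank data:, page, not on the target URL.
    driver.execute_script("window.open('data:,', '_blank');")
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    # Switch to the new tab and navigate programmatically.
    driver.switch_to.window(new_handles[0])
    driver.get(url)
    return new_handles[0]
```

A call would look like `open_in_new_tab(driver, "https://example.com")`; the returned handle can be passed to `driver.switch_to.window` later to come back to that tab.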

- hans

1

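The multi-tab issue is tracked here: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/27 - Lam Yip Ming

I had been patching chromedriver by hand to keep it from being detected, but then citi.com stopped letting me log in. This got me logged in to Citi again, and it is easier than patching it myself. - poleguy

Doesn't work with the Google Patents download button :( (429) - Petar Ulev

1

How does this package compare to selenium-stealth? - Andreas L.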

-1

How about:

import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service("C:\\Users\\DusEck\\Desktop\\chromedriver.exe"))
username = "username"  # data_user
password = "password"  # data_pass
driver.get("https://www.depop.com/login/")  # get URL
driver.find_element(By.XPATH, "/html/body/div[1]/div/div[3]/div[2]/button[2]").click()  # Accept cookies

n = 1  # Splitter: number of characters sent at a time
# Split the credentials into chunks of n characters
split_char = [username[index: index + n] for index in range(0, len(username), n)]
split_char_pw = [password[index: index + n] for index in range(0, len(password), n)]

# Type the user name with a random pause before each chunk
for user_letter in split_char:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element(By.ID, "username").send_keys(user_letter)

# Type the password the same way
for pw_letter in split_char_pw:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element(By.ID, "password").send_keys(pw_letter)
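
The two typing loops above follow the same pattern, so they could be folded into one helper. A minimal sketch: `type_like_human` is a hypothetical name, the default delays simply reuse the 0.1 to 0.8 second range from the answer, and the `sleep` parameter exists only so the pauses can be stubbed out.

```python
import random
import time


def type_like_human(element, text, min_delay=0.1, max_delay=0.8, sleep=time.sleep):
    """Send `text` to `element` one character at a time with random pauses.

    `element` is assumed to expose a Selenium-style send_keys() method.
    """
    for char in text:
        # Wait a random interval before each keystroke.
        sleep(random.uniform(min_delay, max_delay))
        element.send_keys(char)
```

With a located field this would be called as `type_like_human(username_field, username)`. It only varies typing speed; it does nothing about other detection signals.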

- user17453888

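Elaborating a little on the rationale behind your solution would make it easier to understand. - Andreas L.

Your code may solve detection based on typing speed, but it does not address how to make the driver undetectable to sites such as Cloudflare. - kaliiiiiiiii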


- undetected Selenium · Accepted Answer

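In fact, detection of a Selenium-driven WebDriver does not depend on any specific Selenium, Chrome or ChromeDriver version. The websites themselves can inspect the network traffic and identify whether the browser client, i.e. the web browser, is WebDriver-controlled. However, some generic approaches to avoid detection while web scraping are as follows:

The most common attribute websites use to identify a script/program is the monitor size, so it is recommended not to use the conventional viewport.
If you need to send multiple requests to a website, keep changing the user agent on each request. There is a detailed discussion in Way to change Google Chrome user agent in Selenium?
To simulate human-like behavior you may need to slow the script down even beyond WebDriverWait and expected_conditions by adding time.sleep(secs). There is a detailed discussion in How to sleep webdriver in python for milliseconds.

@Antoine Vastel, in his blog post Detecting Chrome Headless, describes several ways to distinguish a regular Chrome browser from headless Chrome.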
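
The three suggestions above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `randomized_chrome_args` and `human_pause` are hypothetical helper names, the size and delay ranges are arbitrary illustrative values, and the user-agent pool is something you supply yourself. `--window-size` and `--user-agent` are regular Chrome command-line switches.

```python
import random
import time


def randomized_chrome_args(user_agents, rng=random):
    """Return Chrome arguments with a random window size and user agent."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    # Illustrative ranges only; pick values that suit your own setup.
    width = rng.randint(1000, 1600)
    height = rng.randint(700, 1000)
    user_agent = rng.choice(user_agents)
    return [f"--window-size={width},{height}", f"--user-agent={user_agent}"]


def human_pause(min_seconds=0.5, max_seconds=2.5, rng=random, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Each returned argument can be passed to `ChromeOptions.add_argument` before the driver is created, and `human_pause()` can be called between page actions in addition to any explicit waits.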

User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With Chrome version 59 it has the following value:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36

A check for the presence of Chrome headless can be done through:

if (/HeadlessChrome/.test(window.navigator.userAgent)) {
    console.log("Chrome headless detected");
}
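
The same test can also be run on the server against the User-Agent request header instead of in page JavaScript. A minimal sketch, assuming the header value is already available as a string; `is_headless_user_agent` is a hypothetical name and not something from the blog post.

```python
import re

# Same pattern as the JavaScript check above.
HEADLESS_UA_PATTERN = re.compile(r"HeadlessChrome")


def is_headless_user_agent(user_agent):
    """Return True if the user-agent string advertises headless Chrome."""
    return bool(HEADLESS_UA_PATTERN.search(user_agent or ""))
```

This also shows why the check is weak on its own: overriding the user agent, as the question already does, is enough to pass it.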

Plugins: navigator.plugins returns an array of plugins present in the browser. Typically, on Chrome we find default plugins, such as Chrome PDF viewer or Google Native Client. On the opposite, in headless mode, the array returned contains no plugin.
- A check for the presence of Plugins can be done through:
```
if(navigator.plugins.length == 0) {
    console.log("It may be Chrome headless");
}
```
Languages: In Chrome two Javascript attributes enable to obtain languages used by the user: navigator.language and navigator.languages. The first one is the language of the browser UI, while the second one is an array of string representing the user’s preferred languages. However, in headless mode, navigator.languages returns an empty string.
- A check for the presence of Languages can be done through:
```
if(navigator.languages == "") {
     console.log("Chrome headless detected");
}
```
WebGL: WebGL is an API to perform 3D rendering in an HTML canvas. With this API, it is possible to query for the vendor of the graphic driver as well as the renderer of the graphic driver. With a vanilla Chrome and Linux, we can obtain the following values for renderer and vendor: Google SwiftShader and Google Inc.. In headless mode, we can obtain Mesa OffScreen, which is the technology used for rendering without using any sort of window system and Brian Paul, which is the program that started the open source Mesa graphics library.
- A check for the presence of WebGL can be done through:
```
var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');

var debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

if(vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
    console.log("Chrome headless detected");
}
```
- Not all Chrome headless will have the same values for vendor and renderer. Others keep values that could also be found on non headless version. However, Mesa Offscreen and Brian Paul indicates the presence of the headless version.
Browser features: Modernizr library enables to test if a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.
- A check for the presence of hairline feature can be done through:
```
if(!Modernizr["hairline"]) {
    console.log("It may be Chrome headless");
}
```
Missing image: The last on our list also seems to be the most robust, comes from the dimension of the image used by Chrome in case an image cannot be loaded. In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser, but are different from zero. In a headless Chrome, the image has a width and an height equal to zero.
- A check for the presence of Missing image can be done through:
```
var body = document.getElementsByTagName("body")[0];
var image = document.createElement("img");
image.src = "http://iloveponeydotcom32188.jg";
image.setAttribute("id", "fakeimage");
body.appendChild(image);
image.onerror = function(){
    if(image.width == 0 && image.height == 0) {
        console.log("Chrome headless detected");
    }
}
```
```
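
Taken together, the checks above amount to a list of signals that a site could evaluate once the values have been collected in the browser. The sketch below is illustrative only: the `fingerprint` dictionary and its key names are hypothetical, and a real detection service would combine many more signals than these six.

```python
def headless_signals(fingerprint):
    """Return the names of the headless checks that a fingerprint trips.

    `fingerprint` is a hypothetical dict of values collected client-side;
    keys that are missing are skipped instead of being counted as hits.
    """
    signals = []
    if "HeadlessChrome" in fingerprint.get("user_agent", ""):
        signals.append("user_agent")
    if fingerprint.get("plugins_length") == 0:
        signals.append("plugins")
    languages = fingerprint.get("languages")
    if languages is not None and len(languages) == 0:
        signals.append("languages")
    if (fingerprint.get("webgl_vendor") == "Brian Paul"
            and fingerprint.get("webgl_renderer") == "Mesa OffScreen"):
        signals.append("webgl")
    if fingerprint.get("hairline") is False:
        signals.append("hairline")
    if (fingerprint.get("broken_image_width") == 0
            and fingerprint.get("broken_image_height") == 0):
        signals.append("missing_image")
    return signals
```

An empty list means none of these particular checks fired; it does not mean the browser is undetectable.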

References

You can find a couple of similar discussions in:

Is there a version of Selenium WebDriver that is not detectable?
