How can I make a Selenium script undetectable with GeckoDriver and Firefox using Python?

Question

python selenium firefox geckodriver selenium-firefoxdriver

26

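Is there a way to make a Selenium script undetectable in Python when using geckodriver?

I am using Selenium for scraping. Are there any protections I need to put in place so that websites do not detect Selenium?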

- user12285770

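Are you being detected? Have you been blocked by the web page? - Sin Han Jinn

@HjSin Yes, the website blocked me, and after a minute or two it keeps serving me CAPTCHAs. - user12285770

@HjSin I think the website is detecting me. - user12285770

Are there any protections I can use so that the website cannot detect Selenium? - user12285770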

1

If a website blocks you, that means it does not allow bots. - Tek Nath Acharya


5 Answers

14

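The fact that Selenium-driven Firefox / GeckoDriver gets detected does not depend on any specific GeckoDriver or Firefox version. The website itself can inspect network traffic and identify the browser client as a WebDriver-controlled web browser.

According to the latest editor's draft of the WebDriver Interface and the WebDriver - W3C Living Document, the webdriver-active flag, which is initially set to false, is set to true when the user agent is under remote control, i.e. when it is controlled through Selenium.

The NavigatorAutomationInformation interface should not be exposed on WorkerNavigator. The webdriver attribute "Returns true if webdriver-active flag is set, false otherwise." Further, navigator.webdriver "Defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example so that alternate code paths can be triggered during automation."

So the bottom line is: Selenium identifies itself.

However, a few generic approaches can help avoid detection while web scraping:

A website can identify your script/program through your monitor size, so it is recommended not to use a conventional viewport.
If you need to send multiple requests to a website, keep changing the User Agent on each request. There is a detailed discussion on how to change the Google Chrome user agent in Selenium.
To simulate human-like behavior, you may need to slow down script execution even beyond WebDriverWait and expected_conditions by adding time.sleep(secs). There is a detailed discussion on how to sleep WebDriver in Python for milliseconds.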
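The three generic approaches above (unusual viewport, changing user agent, slower pacing) could be sketched roughly as follows. This is only a minimal illustration, not code from the answer: the `VIEWPORTS` and `USER_AGENTS` pools and the helper names `pick_session_settings`, `human_pause` and `build_driver` are hypothetical, and the sketch assumes Selenium 4 with geckodriver available on the PATH.

```python
import random
import time

# Hypothetical pools; the values are illustrative assumptions only.
VIEWPORTS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]


def pick_session_settings(rng=random):
    """Choose a viewport and a user agent for one browser session."""
    width, height = rng.choice(VIEWPORTS)
    return {"width": width, "height": height, "user_agent": rng.choice(USER_AGENTS)}


def human_pause(low=1.0, high=3.5, rng=random, sleep=time.sleep):
    """Sleep for a random interval between low and high seconds; return the delay."""
    delay = rng.uniform(low, high)
    sleep(delay)
    return delay


def build_driver(settings):
    """Start Firefox with the chosen settings (needs selenium and geckodriver)."""
    # Imported lazily so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Firefox preference that overrides the User-Agent string it reports.
    options.set_preference("general.useragent.override", settings["user_agent"])
    driver = webdriver.Firefox(options=options)
    driver.set_window_size(settings["width"], settings["height"])
    return driver
```

A script would typically call `build_driver(pick_session_settings())` once per session and `human_pause()` between page actions. None of this changes navigator.webdriver, so it addresses only the secondary signals mentioned above.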

- undetected Selenium

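I'm not sure I understand this. Where is this flag exposed? In the HTTP request? Is it part of the user agent string? Can it be changed? - d-b

@d-b A website can run client-side JavaScript that evaluates variables and exposes browser settings. It is not a problem for legitimate user activity, but it runs for every visitor. - Joel Wigton

@undetectedSelenium, could you please help me with this question? https://dev59.com/77z4oIgBc1ULPQZFqzga?noredirect=1#comment127859410_72375645 - S Mev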

1

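As per the current WebDriver W3C Editor's Draft specification:

"The webdriver-active flag is set to true when the user agent is under remote control. It is initially false."

Hence, the readonly boolean attribute webdriver returns true if the webdriver-active flag is set, and false otherwise.

The specification further clarifies:

"navigator.webdriver defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example so that alternate code paths can be triggered during automation."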
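To see the attribute described above from the automation side, a script can ask the page for the same value that a site's own JavaScript would read. The following is a small sketch under stated assumptions: the function names `is_flagged_as_automated` and `check_live_browser` are hypothetical, and the second function assumes Selenium 4 with geckodriver on the PATH.

```python
def is_flagged_as_automated(driver):
    """Return what page scripts see when they read navigator.webdriver."""
    # execute_script runs JavaScript in the current page and returns its result.
    return bool(driver.execute_script("return navigator.webdriver"))


def check_live_browser(url):
    """Open url in Firefox and report the flag (needs selenium and geckodriver)."""
    from selenium import webdriver  # lazy import; only needed for a real browser

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return is_flagged_as_automated(driver)
    finally:
        driver.quit()
```

Per the specification text quoted above, a conforming WebDriver-controlled browser is expected to report true here.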

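There have been countless discussions requesting a feature to disable navigator.webdriver == true, and @whimboo concluded in his comment:

That is because the WebDriver spec defines that property on the Navigator object, which has to be set to true when tests are running with webdriver enabled:

https://w3c.github.io/webdriver/#interface

Implementations have to be conformant to this requirement. As such, we will not provide a way to circumvent it.

Conclusion

From the discussion above, it can be concluded that:

Selenium identifies itself

and there is no way to conceal the fact that the browser is WebDriver-driven.

Recommendation

However, some users have suggested approaches that can conceal the fact that Mozilla Firefox is WebDriver-controlled, through the use of Firefox Profiles and Proxies, as follows:

Selenium 4 compatible Python code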

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

# Existing Firefox profile to reuse (the path is machine-specific)
profile_path = r'C:\Users\Admin\AppData\Roaming\Mozilla\Firefox\Profiles\s8543x41.default-release'
options = Options()
# A profile directory is passed as a command-line argument, not as a preference
options.add_argument('-profile')
options.add_argument(profile_path)
# Manual proxy configuration pointing at a local SOCKS proxy
options.set_preference('network.proxy.type', 1)
options.set_preference('network.proxy.socks', '127.0.0.1')
options.set_preference('network.proxy.socks_port', 9050)
options.set_preference('network.proxy.socks_remote_dns', False)
service = Service(executable_path='C:\\BrowserDrivers\\geckodriver.exe')
driver = Firefox(service=service, options=options)
driver.get("https://www.google.com")
driver.quit()

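Other Alternatives

It has been observed that on some specific OS variants, a few different settings/configurations can bypass bot detection:

Selenium 4 compatible code block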

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# These switches are Chrome/ChromeDriver options; Firefox Options has no add_experimental_option
options = Options()
# Set excludeSwitches once; a second call with the same key would overwrite the first
options.add_experimental_option('excludeSwitches', ['enable-automation', 'enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
# Chrome needs ChromeDriver rather than GeckoDriver; adjust the path to your installation
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)

Potential Solution

A potential solution would be to use the Tor browser, as follows:

Selenium 4 compatible Python code

import subprocess

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

# Start the Tor daemon; the list form copes with the spaces in the path
tor_process = subprocess.Popen([r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Tor\tor.exe'])
profile_path = r'C:\Users\username\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default'
firefox_options = Options()
# A profile directory is passed as a command-line argument, not as a preference
firefox_options.add_argument('-profile')
firefox_options.add_argument(profile_path)
# Send traffic through the local Tor SOCKS proxy
firefox_options.set_preference('network.proxy.type', 1)
firefox_options.set_preference('network.proxy.socks', '127.0.0.1')
firefox_options.set_preference('network.proxy.socks_port', 9050)
firefox_options.set_preference('network.proxy.socks_remote_dns', False)
# Use the Firefox binary bundled with Tor Browser
firefox_options.binary_location = r'C:\Users\username\Desktop\Tor Browser\Browser\firefox.exe'
service = Service(executable_path='C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=service, options=firefox_options)
driver.get("https://www.tiktok.com/")

- undetected Selenium

2

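Thanks for the attention, but I tried all three approaches on Linux with Selenium 4.1.3, using tiktok.com as the example, and none of them worked. (Also, it would be better to edit your existing answer than to post a new one.) With the first approach ("Recommendation"), TikTok still detected Selenium. With the second approach ("Other Alternatives"), I got AttributeError: 'Options' object has no attribute 'add_experimental_option'; is there a different version of Selenium that supports this? With the third approach ("Potential Solution"), I found that TikTok simply returns "Access Denied" unconditionally for Tor. - Kodiologist

@Kodiologist I just wanted to keep the _tor_ example simple by using plain Firefox; otherwise, using _Firefox Nightly_ to evade detection works perfectly. - undetected Selenium

But how do you use the Tor browser with Firefox Nightly? I would think that just downloading Firefox Nightly and replacing Tor Browser's Firefox executable with the new one would not work. - Kodiologist

I guess I misunderstood how the configuration works. In any case, unfortunately I still get the "Access Denied" error. If it helps, this is the exact code I used: https://paste.rs/inZ.py By the way, I think the popen line is a no-op. Thanks for your continued support. - Kodiologist

add_experimental_option worked for me today on Linux for the website I am scraping. Thanks! - Sebapi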


0

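As stated in the answers above, navigator.webdriver returning true while under automation is per the spec. chromedriver has the option --disable-blink-features=AutomationControlled to disable it, but Mozilla has declined to add an equivalent option. Before Firefox 88, it could be disabled via dom.webdriver.enabled, but this is no longer a supported preference. useAutomationExtension has been posted elsewhere, but it also appears to apply only to Chrome.

You can override the value of navigator.webdriver by modifying responses with selenium-wire, as described in this answer. For example, by injecting the following script: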

Object.defineProperty(navigator, "webdriver", { get: () => false });
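One way the injection could look in Python is sketched below. It is only an illustration: the helper names `inject_spoof_script` and `response_interceptor` are hypothetical, the sketch assumes that selenium-wire passes a request and a response object to a function assigned to `driver.response_interceptor` and that `response.body` is available as uncompressed bytes, and a site's Content-Security-Policy may still block an inline script.

```python
import re

# The script from the answer, wrapped in a <script> element.
SPOOF_SCRIPT = (
    b'<script>Object.defineProperty(navigator, "webdriver", '
    b"{ get: () => false });</script>"
)

# Matches <head> or <head ...attributes>, but not <header>.
HEAD_TAG = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_spoof_script(html):
    """Insert SPOOF_SCRIPT right after the first <head> tag, if there is one."""
    return HEAD_TAG.sub(lambda match: match.group(0) + SPOOF_SCRIPT, html, count=1)


def response_interceptor(request, response):
    """Rewrite HTML responses; assumed to be called by selenium-wire per response."""
    content_type = response.headers.get("Content-Type", "")
    if "text/html" not in content_type:
        return
    response.body = inject_spoof_script(response.body)
    # Keep the declared length in sync with the modified body.
    if "Content-Length" in response.headers:
        del response.headers["Content-Length"]
        response.headers["Content-Length"] = str(len(response.body))
```

With a driver created from `seleniumwire.webdriver`, the hook would be registered with `driver.response_interceptor = response_interceptor` before calling `driver.get(...)`.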

However, this is not enough to replicate what undetected-chromedriver does, and there is currently no Firefox version of it.

- user12638282

-2

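It may sound simple, but if you look at how websites detect Selenium (or bots), they do it by tracking movement. So if your program behaves more like a person browsing the site, you can get fewer CAPTCHAs: for example, add cursor movement, page scrolling, and other browsing-like behavior between your actions. Try adding some other actions and some delays between any two actions. This will make your bot slower, and it may go undetected.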
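As a rough sketch of the idea above, the snippet below scrolls a page in uneven steps with random pauses in between. The function names `plan_scroll_steps` and `scroll_like_a_reader` and the step and delay ranges are hypothetical choices for illustration, not values from the answer, and there is no guarantee that any particular site is satisfied by this.

```python
import random
import time


def plan_scroll_steps(total_pixels, rng=random, min_step=120, max_step=480):
    """Split a scroll distance into uneven steps that add up to total_pixels."""
    steps = []
    remaining = total_pixels
    while remaining > 0:
        step = min(remaining, rng.randint(min_step, max_step))
        steps.append(step)
        remaining -= step
    return steps


def scroll_like_a_reader(driver, total_pixels, rng=random, sleep=time.sleep):
    """Scroll down in irregular steps, pausing briefly after each one."""
    for step in plan_scroll_steps(total_pixels, rng):
        # arguments[0] is the step size passed in from Python.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        sleep(rng.uniform(0.3, 1.2))
```

A call such as `scroll_like_a_reader(driver, 2000)` could be placed between two clicks or page loads in an existing script.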

Thanks.

- JAbr

1

Do you have a working example, or is this just a guess? - Kodiologist

I have experienced it first-hand! - JAbr


- CST · Accepted Answer

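There are different ways to avoid a website detecting the use of Selenium.

When using Selenium, the value of navigator.webdriver is set to true by default. This variable is present in both Chrome and Firefox. To avoid detection, this variable should be set to "undefined".
A proxy server can also be used to avoid detection.
Some websites are able to use the state of your browser to determine whether Selenium is in use. You can set Selenium to use a custom browser profile to avoid this.

The code below uses all three approaches.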

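from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Selenium 3 style API: the firefox_profile and desired_capabilities arguments used
# below are deprecated in Selenium 4 and no longer accepted by newer 4.x releases
profile = webdriver.FirefoxProfile('C:\\Users\\You\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\something.default-release')

# Placeholder proxy address; replace with a real proxy
PROXY_HOST = "12.12.12.123"
PROXY_PORT = "1234"
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", PROXY_HOST)
profile.set_preference("network.proxy.http_port", int(PROXY_PORT))
profile.set_preference("dom.webdriver.enabled", False)
profile.set_preference('useAutomationExtension', False)
profile.update_preferences()
desired = DesiredCapabilities.FIREFOX

driver = webdriver.Firefox(firefox_profile=profile, desired_capabilities=desired)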

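After running the code, you will be able to manually check that the browser run by Selenium has your Firefox history and extensions. You can also type "navigator.webdriver" in the devtools console to check that it is undefined.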