Python parallel execution with Selenium

Question

python selenium parallel-processing concurrent.futures

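I am confused about parallel execution in Python using Selenium. There seem to be a few ways to go about it, but some of them look outdated.

There is a Python module called python-wd-parallel that seems to offer some functionality for this, but it dates from 2013. Is it still useful? I also found this example.
There is also concurrent.futures, which seems much newer but is not as easy to implement. Does anyone have a working example of parallel execution with Selenium?
Threads and executors could also get the job done, but I feel this would be slower, because it does not use all the cores and still runs serially.

What is the latest way to do parallel execution with Selenium?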

- Ke.

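Regarding item 1: there are many companies offering parallel-testing solutions, and Saucelabs is one of them. The Selenium Grid page lists more of them. Selenium Grid is also an alternative for parallelism that is not pure Python. - imbr

Just for completeness, those companies are listed as Selenium Level Sponsors. - imbr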

2 Answers

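Python Parallel Wd seems to be dead (the last commit was 9 years ago). It also implements an obsolete protocol for Selenium. Finally, the code is proprietary to Saucelabs.

It is usually best to use SeleniumBase, a Python test framework built on Selenium and pytest. It is very complete and supports performance improvements, parallel threads, and more. If that does not fit your case, keep reading.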

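Selenium performance boost (concurrent.futures)

Short answer

Both threads and processes can considerably speed up your Selenium code.

Short examples are given below. The Selenium work is done by the selenium_title function, which returns the page title. The examples do not deal with exceptions raised during each thread/process execution; for that, see Long answer - Handling exceptions.

Pool of thread workers with concurrent.futures.ThreadPoolExecutor.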

from selenium import webdriver  
from concurrent import futures

def selenium_title(url):
  wdriver = webdriver.Chrome() # chrome webdriver
  try:
    wdriver.get(url)
    return wdriver.title
  finally:
    wdriver.quit() # always close the browser, even if get() fails

links = ["https://www.amazon.com", "https://www.google.com"]

with futures.ThreadPoolExecutor() as executor: # default/optimized number of threads
  titles = list(executor.map(selenium_title, links))

Pool of process workers with concurrent.futures.ProcessPoolExecutor. Just replace ThreadPoolExecutor with ProcessPoolExecutor in the code above. Both derive from the Executor base class. You must also protect the main module, as shown below.

if __name__ == '__main__':
  with futures.ProcessPoolExecutor() as executor: # default/optimized number of processes
    titles = list(executor.map(selenium_title, links))

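Long answer

Why do threads work despite the Python GIL?

Python threads are limited by the GIL and are context-switched rather than run truly in parallel, yet a performance gain still appears because of how Selenium is implemented. Selenium works by sending commands as HTTP requests (POST, GET) to the browser driver server. As you may already know, I/O-bound tasks such as HTTP requests release the GIL, hence the performance gain.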
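The GIL argument can be checked without a browser. The sketch below is an illustration only: `fake_io_task` is a hypothetical stand-in that uses `time.sleep` in place of the blocking HTTP round trip to the driver server (sleeping releases the GIL in the same way), and the 0.2 second delay is an arbitrary assumption.

```python
import time
from concurrent import futures

def fake_io_task(delay):
    # Hypothetical stand-in for a Selenium command: block on "I/O" for `delay` seconds.
    time.sleep(delay)
    return delay

def timed(fn):
    # Run fn() and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

delays = [0.2] * 4  # four simulated page loads

def run_sequential():
    return [fake_io_task(d) for d in delays]

def run_threaded():
    with futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(fake_io_task, delays))

sequential_result, sequential_seconds = timed(run_sequential)
threaded_result, threaded_seconds = timed(run_threaded)

print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential_seconds, threaded_seconds))
```

The threaded run should take roughly as long as a single task, because the waits overlap even though only one thread executes Python bytecode at a time.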

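Handling exceptions

We can make small modifications to the example above to handle exceptions raised in the spawned threads. Instead of executor.map we use executor.submit, which returns the title wrapped in a Future instance.

To access the returned title we can call future_titles[index].result(), where index ranges over len(links), or simply use a for loop as below.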

with futures.ThreadPoolExecutor() as executor:
  future_titles = [ executor.submit(selenium_title, link) for link in links ]
  for future_title, link in zip(future_titles, links):
    try:
      title = future_title.result() # can use `timeout` to wait max seconds for each thread
    except Exception as exc: # this thread might have had an exception
      print('url {0} generated an exception: {1}'.format(link, exc))

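Note that besides iterating over future_titles we also iterate over links, so if a thread raises an exception we know which url (link) was responsible. The futures.Future class is handy because it gives you control over the result received from each thread, such as whether it completed correctly or raised an exception. More details are in the concurrent.futures documentation.

Also worth mentioning: futures.as_completed is the better choice if you do not care about the order in which the threads return their items. Since the syntax for handling exceptions with it is a bit ugly, I omitted it here.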
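For reference, here is a minimal sketch of the as_completed variant. It maps each Future back to its url so that failures can still be attributed. `fetch_title` is a hypothetical stand-in for `selenium_title` so the snippet runs without a browser, and the example.com urls are placeholders.

```python
from concurrent import futures

def fetch_title(url):
    # Hypothetical stand-in for selenium_title: fails for urls containing "bad".
    if "bad" in url:
        raise ValueError("could not load " + url)
    return "title of " + url

urls = [
    "https://example.com/a",
    "https://example.com/bad",
    "https://example.com/b",
]

titles = {}  # url -> title for successful loads
errors = {}  # url -> exception for failed loads

with futures.ThreadPoolExecutor() as executor:
    # Keep a Future -> url mapping, since as_completed yields in completion order.
    future_to_url = {executor.submit(fetch_title, url): url for url in urls}
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            titles[url] = future.result()
        except Exception as exc:
            errors[url] = exc

for url, exc in errors.items():
    print("url {0} generated an exception: {1}".format(url, exc))
```

Results arrive as soon as each thread finishes, so a slow page no longer delays the handling of fast ones.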

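Performance gain and threads

First, why I have been using threads to speed up my Selenium code:

For I/O-bound tasks, my experience is that there is almost no difference between using a pool of processes and a pool of threads. Others have reached a similar conclusion: Python threads and processes perform about the same on I/O-bound tasks.

We also know that processes use their own memory space, which means more memory consumption. Processes are also slightly slower to spawn than threads.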

- imbr

- bluesummers · Accepted Answer

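Use joblib's Parallel module for this; it is a great library for parallel execution.

Let's say we have a list of URLs called urls and we want to take a screenshot of each one in parallel.

First, let's import the necessary libraries.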

from selenium import webdriver
from joblib import Parallel, delayed

Now let's define a function that takes a screenshot as base64

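def take_screenshot(url):
    # PhantomJS support was removed in Selenium 4, so this needs an older Selenium release
    phantom = webdriver.PhantomJS('/path/to/phantomjs')
    phantom.get(url)
    screenshot = phantom.get_screenshot_as_base64()
    phantom.quit()  # quit() also ends the phantomjs process; close() only closes the window

    return screenshot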

Now, to execute it in parallel, all you need to do is

screenshots = Parallel(n_jobs=-1)(delayed(take_screenshot)(url) for url in urls)

When this line finishes running, screenshots will hold the data from every run.

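Notes on Parallel

Parallel(n_jobs=-1) means use all available resources (all CPU cores)
delayed(function)(input) is joblib's way of creating the input for the function you want to run in parallel

More information can be found in the joblib documentation.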
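Here is a browser-free sketch of the two pieces described above, assuming joblib is installed. `slow_square` is a hypothetical stand-in for `take_screenshot`, and `n_jobs=2` with `prefer="threads"` are choices made for this illustration rather than part of the answer.

```python
import time
from joblib import Parallel, delayed

def slow_square(x):
    # Hypothetical stand-in for take_screenshot: wait briefly, then return a value.
    time.sleep(0.1)
    return x * x

numbers = [1, 2, 3, 4]

# delayed(f)(args) does not call f; it only packages the call as (function, args, kwargs).
task = delayed(slow_square)(3)

# Parallel consumes a generator of such packaged calls and returns results in input order.
squares = Parallel(n_jobs=2, prefer="threads")(delayed(slow_square)(n) for n in numbers)

print(squares)  # [1, 4, 9, 16]
```

`prefer="threads"` is a soft hint asking joblib for its thread-based backend, which can suit I/O-bound work such as driving a browser. Leaving it out uses joblib's default process-based backend.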