How do I download a file with a remote Selenium WebDriver?

22

I am running some tests with a remote Selenium WebDriver. At some point, however, I need to download a file and check its contents.

I create the remote WebDriver as follows (in Python):

PROXY = ...

prefs = {
    "profile.default_content_settings.popups":0,
    "download.prompt_for_download": "false",
    "download.default_directory": os.getcwd(),
}
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_experimental_option("prefs", prefs)

webdriver.DesiredCapabilities.CHROME['proxy'] = {
  "httpProxy":PROXY,
  "ftpProxy":PROXY,
  "sslProxy":PROXY,
  "noProxy":None,
  "proxyType":"MANUAL",
  "class":"org.openqa.selenium.Proxy",
  "autodetect":False
}
driver = webdriver.Remote(
        command_executor='http://aaa.bbb.ccc:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME)

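With a "normal" WebDriver I can download the file to the local machine without any problem, and my test code can then verify the contents of the downloaded file (which may vary depending on the test parameters). This is not a test of the download itself; I need a way to verify the contents of the generated file...

But how do I do this with a remote WebDriver? I have not found anything helpful anywhere...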


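What problem are you running into? Are there any error logs? If your browser runs on a remote host (because of the node setup), you may need to check the write permissions on the browser's default download directory. You can also set the default download directory through browser.download.dir in the Firefox profile and download.default_directory in the Chrome options. - ekostadinov
@ekostadinov: please see the updated question; I added the full options, including the download directory option... - Alex
3
You still have not answered the question of what problem you are facing. - Bill Hileman
1
I need to get the file to a location that the test script can access... - Alex
1
I think you need a shared drive to store the downloaded files. - Buaban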
7 Answers

18

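The Selenium API does not provide a way to retrieve a file downloaded on the remote machine.

Depending on the browser, however, it is still possible using Selenium alone.

With Chrome, the downloaded files can be listed by navigating to chrome://downloads/ and retrieved through an <input type="file"> injected into the page: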

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import os, time, base64


def get_downloaded_files(driver):

  if not driver.current_url.startswith("chrome://downloads"):
    driver.get("chrome://downloads/")

  return driver.execute_script( \
    "return downloads.Manager.get().items_   "
    "  .filter(e => e.state === 'COMPLETE')  "
    "  .map(e => e.filePath || e.file_path); " )


def get_file_content(driver, path):

  elem = driver.execute_script( \
    "var input = window.document.createElement('INPUT'); "
    "input.setAttribute('type', 'file'); "
    "input.hidden = true; "
    "input.onchange = function (e) { e.stopPropagation() }; "
    "return window.document.documentElement.appendChild(input); " )

  elem._execute('sendKeysToElement', {'value': [ path ], 'text': path})

  result = driver.execute_async_script( \
    "var input = arguments[0], callback = arguments[1]; "
    "var reader = new FileReader(); "
    "reader.onload = function (ev) { callback(reader.result) }; "
    "reader.onerror = function (ex) { callback(ex.message) }; "
    "reader.readAsDataURL(input.files[0]); "
    "input.remove(); "
    , elem)

  if not result.startswith('data:') :
    raise Exception("Failed to get file content: %s" % result)

  return base64.b64decode(result[result.find('base64,') + 7:])



capabilities_chrome = { \
    'browserName': 'chrome',
    # 'proxy': { \
     # 'proxyType': 'manual',
     # 'sslProxy': '50.59.162.78:8088',
     # 'httpProxy': '50.59.162.78:8088'
    # },
    'goog:chromeOptions': { \
      'args': [
      ],
      'prefs': { \
        # 'download.default_directory': "",
        # 'download.directory_upgrade': True,
        'download.prompt_for_download': False,
        'plugins.always_open_pdf_externally': True,
        'safebrowsing_for_trusted_sources_enabled': False
      }
    }
  }

driver = webdriver.Chrome(desired_capabilities=capabilities_chrome)
#driver = webdriver.Remote('http://127.0.0.1:5555/wd/hub', capabilities_chrome)

# download a pdf file
driver.get("https://www.mozilla.org/en-US/foundation/documents")
driver.find_element_by_css_selector("[href$='.pdf']").click()

# list all the completed remote files (waits for at least one)
files = WebDriverWait(driver, 20, 1).until(get_downloaded_files)

# get the content of the first file remotely
content = get_file_content(driver, files[0])

# save the content in a local file in the working directory
with open(os.path.basename(files[0]), 'wb') as f:
  f.write(content)
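
The last step of get_file_content above strips the data: URL prefix by searching for 'base64,'. A slightly more defensive sketch of that single step, using only the standard library, might look like this. The helper name decode_data_url is hypothetical and not part of Selenium; it assumes the string comes from FileReader.readAsDataURL, as in the script above.

```python
import base64


def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes carried by a base64 `data:` URL.

    FileReader.readAsDataURL yields strings of the form
    `data:<mime>;base64,<payload>`; anything else (for example the
    error message passed back by reader.onerror) is rejected.
    """
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL: %r" % data_url[:40])
    # Split once at the first comma: header on the left, payload on the right.
    header, sep, payload = data_url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("data URL is not base64 encoded")
    return base64.b64decode(payload)
```

With such a helper, the final lines of get_file_content could be reduced to a single call such as `return decode_data_url(result)`.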

With Firefox, the files can be listed and retrieved directly by switching context and calling the browser API from a script:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import os, time, base64

def get_file_names_moz(driver):
  driver.command_executor._commands["SET_CONTEXT"] = ("POST", "/session/$sessionId/moz/context")
  driver.execute("SET_CONTEXT", {"context": "chrome"})
  try:
    return driver.execute_async_script("""
      var { Downloads } = Components.utils.import('resource://gre/modules/Downloads.jsm', {});
      Downloads.getList(Downloads.ALL)
        .then(list => list.getAll())
        .then(entries => entries.filter(e => e.succeeded).map(e => e.target.path))
        .then(arguments[0]);
      """)
  finally:
    # always switch back to the content context, even if the script fails
    driver.execute("SET_CONTEXT", {"context": "content"})

def get_file_content_moz(driver, path):
  driver.execute("SET_CONTEXT", {"context": "chrome"})
  result = driver.execute_async_script("""
    var { OS } = Cu.import("resource://gre/modules/osfile.jsm", {});
    OS.File.read(arguments[0]).then(function(data) {
      var base64 = Cc["@mozilla.org/scriptablebase64encoder;1"].getService(Ci.nsIScriptableBase64Encoder);
      var stream = Cc['@mozilla.org/io/arraybuffer-input-stream;1'].createInstance(Ci.nsIArrayBufferInputStream);
      stream.setData(data.buffer, 0, data.length);
      return base64.encodeToString(stream, data.length);
    }).then(arguments[1]);
    """, path)
  driver.execute("SET_CONTEXT", {"context": "content"})
  return base64.b64decode(result)

capabilities_moz = { \
    'browserName': 'firefox',
    'marionette': True,
    'acceptInsecureCerts': True,
    'moz:firefoxOptions': { \
      'args': [],
      'prefs': {
        # 'network.proxy.type': 1,
        # 'network.proxy.http': '12.157.129.35', 'network.proxy.http_port': 8080,
        # 'network.proxy.ssl':  '12.157.129.35', 'network.proxy.ssl_port':  8080,      
        'browser.download.dir': '',
        'browser.helperApps.neverAsk.saveToDisk': 'application/octet-stream,application/pdf', 
        'browser.download.useDownloadDir': True, 
        'browser.download.manager.showWhenStarting': False, 
        'browser.download.animateNotifications': False, 
        'browser.safebrowsing.downloads.enabled': False, 
        'browser.download.folderList': 2,
        'pdfjs.disabled': True
      }
    }
  }

# launch Firefox
# driver = webdriver.Firefox(capabilities=capabilities_moz)
driver = webdriver.Remote('http://127.0.0.1:5555/wd/hub', capabilities_moz)

# download a pdf file
driver.get("https://www.mozilla.org/en-US/foundation/documents")
driver.find_element_by_css_selector("[href$='.pdf']").click()

# list all the downloaded files (waits for at least one)
files = WebDriverWait(driver, 20, 1).until(get_file_names_moz)

# get the content of the last downloaded file
content = get_file_content_moz(driver, files[0])

# save the content in a local file in the working directory
with open(os.path.basename(files[0]), 'wb') as f:
  f.write(content)
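
Once the bytes are back on the test machine, the check the question asks for is ordinary Python. A minimal sketch, assuming the download is a PDF as in the examples above; looks_like_pdf and sha256_hex are hypothetical helper names chosen for illustration.

```python
import hashlib


def looks_like_pdf(content: bytes) -> bool:
    """True if the bytes start with the PDF magic number."""
    return content.startswith(b"%PDF-")


def sha256_hex(content: bytes) -> str:
    """Hex digest that can be compared against a known-good file."""
    return hashlib.sha256(content).hexdigest()
```

A test could then assert `looks_like_pdf(content)` right after calling get_file_content_moz, or compare `sha256_hex(content)` with the digest of a reference file when the output is expected to be byte-identical.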

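It seems you forgot this very important fact in your answer. - Alex
OK, I ran the example, but I get a timeout error on the line files = WebDriverWait(driver, 20, 1).until(get_file_list_cr). - Alex
That works for you, but not for me. I get the timeout error shown above. So what else can be done? How do I go about solving this? How can I debug it further...? - Alex
@FlorentB. The link you shared seems to point to the dotnet version of Selenium. The Java implementation has no setContext method. - Savvy
1
@Savvy, my bad, the command is not implemented in the Java client. See the example for taking a full-page screenshot to learn how to call a command. - Florent B.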

6

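WebDriver:

If you use WebDriver, your code uses the Selenium client and server code internally to communicate with the browser instance. Downloaded files are stored on the local machine and can be accessed directly from languages such as Java, Python, .NET, or Node.js.

Remote WebDriver [Selenium Grid]:

If you use a remote WebDriver, you are using the Grid concept. The main purpose of the Grid is to distribute tests across several physical or virtual machines (VMs). Your code uses the Selenium client to talk to the Selenium Grid server, which passes the instructions to a registered node with the requested browser. From there, the Grid node passes the instructions through the browser-specific driver to the browser instance. The download therefore lands on the file system (hard disk) of that node, but the user has no access to the file system of the virtual machine the browser runs on.

  • If we could access the file from JavaScript, we could convert it to a base64 string and return it to the client code. For security reasons, however, JavaScript is not allowed to read files from disk.

  • If the Selenium Grid Hub and Node are on the same system and on a public network, you can change the download path to some publicly served path, for example ../Tomcat/webapps/Root/CurrentTimeFolder/file.pdf. You can then access the file directly through its public URL.

For example, downloading a file from Tomcat's root folder: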

System.out.println("FireFox Driver Path « "+ geckodriverCloudRootPath);
File temp = File.createTempFile("geckodriver",  null);
temp.setExecutable(true);
FileUtils.copyURLToFile(new URL( geckodriverCloudRootPath ), temp);

System.setProperty("webdriver.gecko.driver", temp.getAbsolutePath() );
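capabilities.setCapability("marionette", true);

  • If the Selenium Grid Hub and Node are not on the same machine, you may not be able to download the file, because the Grid Hub will be on a public network [WAN] while the Node will be on the organization's private network [LAN].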
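
The public-folder idea from the list above can be sketched in Python, the language used in the question. This is only an illustration under assumptions: Python's built-in http.server stands in for Tomcat on the node, and serve_directory and fetch_file are hypothetical helper names, not Selenium APIs.

```python
import functools
import http.server
import threading
import urllib.request


def serve_directory(directory, host="127.0.0.1", port=0):
    """Serve `directory` over HTTP in a background thread (stand-in for Tomcat).

    Port 0 lets the OS pick a free port; read it from server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_file(url, timeout=10):
    """Download `url` and return its bytes, ignoring any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

On a real Grid, only fetch_file would run in the test code; the node would expose the browser's download directory through whatever web server is already available there, and the test would have to know the resulting URL.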

You can use the following code to change the browser's download path to a specific folder on the hard disk.

String downloadFilepath = "E:\\download";
    
HashMap<String, Object> chromePrefs = new HashMap<String, Object>();
chromePrefs.put("profile.default_content_settings.popups", 0);
chromePrefs.put("download.default_directory", downloadFilepath);
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("prefs", chromePrefs);
options.addArguments("--test-type");
options.addArguments("--disable-extensions"); //to disable browser extension popup

DesiredCapabilities cap = DesiredCapabilities.chrome();
cap.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
cap.setCapability(ChromeOptions.CAPABILITY, options);
RemoteWebDriver driver = new ChromeDriver(cap);



6
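@FlorentB's answer works for Chrome versions before 79. For newer versions the get_downloaded_files function needs to be updated, because downloads.Manager is no longer accessible. The updated version below should nevertheless also work with earlier versions.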
def get_downloaded_files(driver):

  if not driver.current_url.startswith("chrome://downloads"):
    driver.get("chrome://downloads/")

  return driver.execute_script( \
     "return  document.querySelector('downloads-manager')  "
     " .shadowRoot.querySelector('#downloadsList')         "
     " .items.filter(e => e.state === 'COMPLETE')          "
     " .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url); ")

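Because the script above falls back to fileUrl / file_url, the returned entries may be file:// URLs rather than plain paths. A small sketch for normalising them before passing them to get_file_content, assuming a Linux node (Windows drive letters would need extra handling); file_url_to_path is a hypothetical helper name.

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(value: str) -> str:
    """Convert a file:// URL to a POSIX path; return other values unchanged."""
    parsed = urlparse(value)
    if parsed.scheme != "file":
        return value
    # Percent-escapes such as %20 are decoded back to literal characters.
    return unquote(parsed.path)
```

It could be applied to every entry, for example `files = [file_url_to_path(f) for f in files]`.
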
2
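This is simply the Java version of @Florent's answer above. With a lot of help from him and after some digging and tweaking, I finally got it working in Java. I am posting it here in the hope that it saves others some time.

Firefox

First we need to create a custom Firefox driver, because we need the SET_CONTEXT command, which is not yet implemented in the Java client (as of Selenium 3.141.59).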
public class CustomFirefoxDriver extends RemoteWebDriver{


    public CustomFirefoxDriver(URL RemoteWebDriverUrl, FirefoxOptions options) throws Exception {
        super(RemoteWebDriverUrl, options);
        CommandInfo cmd = new CommandInfo("/session/:sessionId/moz/context", HttpMethod.POST);
        Method defineCommand = HttpCommandExecutor.class.getDeclaredMethod("defineCommand", String.class, CommandInfo.class);
        defineCommand.setAccessible(true);
        defineCommand.invoke(super.getCommandExecutor(), "SET_CONTEXT", cmd);
    }


    public Object setContext(String context) {
        return execute("SET_CONTEXT", ImmutableMap.of("context", context)).getValue();
    }
}

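The code below retrieves the content of a downloaded .xls file and saves it as a file (temp.xls) in the directory the Java class is run from. In Firefox this is fairly straightforward, because we can use the browser API.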
public String getDownloadedFileNameBySubStringFirefox(String Matcher) {

    String fileName = "";

    ((CustomFirefoxDriver) driver).setContext("chrome");

    String script = "var { Downloads } = Components.utils.import('resource://gre/modules/Downloads.jsm', {});"
            + "Downloads.getList(Downloads.ALL).then(list => list.getAll())"
            + ".then(entries => entries.filter(e => e.succeeded).map(e => e.target.path))"
            + ".then(arguments[0]);";

    String fileNameList = js.executeAsyncScript(script).toString();
    String name = fileNameList.substring(1, fileNameList.length() -1);

    if(name.contains(Matcher)) {
        fileName = name;
    }

    ((CustomFirefoxDriver) driver).setContext("content");

    return fileName;
}

public void getDownloadedFileContentFirefox(String fileIdentifier) {

    String filePath = getDownloadedFileNameBySubStringFirefox(fileIdentifier);
    ((CustomFirefoxDriver) driver).setContext("chrome");

    String script = "var { OS } = Cu.import(\"resource://gre/modules/osfile.jsm\", {});" + 
                    "OS.File.read(arguments[0]).then(function(data) {" + 
                    "var base64 = Cc[\"@mozilla.org/scriptablebase64encoder;1\"].getService(Ci.nsIScriptableBase64Encoder);" +
                    "var stream = Cc['@mozilla.org/io/arraybuffer-input-stream;1'].createInstance(Ci.nsIArrayBufferInputStream);" +
                    "stream.setData(data.buffer, 0, data.length);" +
                    "return base64.encodeToString(stream, data.length);" +
                    "}).then(arguments[1]);" ;

    Object base64FileContent = js.executeAsyncScript(script, filePath);//.toString();
    try {
        Files.write(Paths.get("temp.xls"), DatatypeConverter.parseBase64Binary(base64FileContent.toString()));
    } catch (IOException i) {
        System.out.println(i.getMessage());
    }

}

Chrome

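In Chrome we need a different approach to reach the same goal. We append a file input element to the Downloads page and pass the file location to it. Once the element points at the desired file, we can use it to read the file's content.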

public String getDownloadedFileNameBySubStringChrome(String Matcher) {
    String file = "";
    //The script below returns the list of files as a list of the form '[$FileName1, $FileName2...]'
    // with the most recently downloaded file listed first.
    String script = "return downloads.Manager.get().items_.filter(e => e.state === 'COMPLETE').map(e => e.file_url);" ;
    if(!driver.getCurrentUrl().startsWith("chrome://downloads/")) {
        driver.get("chrome://downloads/");
        }
    String fileNameList =  js.executeScript(script).toString();
    //Removing square brackets
    fileNameList = fileNameList.substring(1, fileNameList.length() -1);
    String [] fileNames = fileNameList.split(",");
    for(int i=0; i<fileNames.length; i++) {
        if(fileNames[i].trim().contains(Matcher)) {
            file = fileNames[i].trim();
            break;
        }
    }

    return file;

}


public void getDownloadedFileContentChrome(String fileIdentifier) {

    //This causes the user to be navigated to the Chrome Downloads page
    String fileName = getDownloadedFileNameBySubStringChrome(fileIdentifier);
    //Remove "file://" from the file path
    fileName = fileName.substring(7);

    String script =  "var input = window.document.createElement('INPUT'); " +
            "input.setAttribute('type', 'file'); " +
            "input.setAttribute('id', 'downloadedFileContent'); " +
            "input.hidden = true; " +
            "input.onchange = function (e) { e.stopPropagation() }; " +
            "return window.document.documentElement.appendChild(input); " ;
    WebElement fileContent = (WebElement) js.executeScript(script);
    fileContent.sendKeys(fileName);

    String asyncScript = "var input = arguments[0], callback = arguments[1]; " +
            "var reader = new FileReader(); " +
            "reader.onload = function (ev) { callback(reader.result) }; " +
            "reader.onerror = function (ex) { callback(ex.message) }; " +
            "reader.readAsDataURL(input.files[0]); " +
            "input.remove(); " ;

    String content = js.executeAsyncScript(asyncScript, fileContent).toString();
    int fromIndex = content.indexOf("base64,") +7 ;
    content = content.substring(fromIndex);

    try {
        Files.write(Paths.get("temp.xls"), DatatypeConverter.parseBase64Binary(content));
    } catch (IOException i) {
        System.out.println(i.getMessage());
    }

}

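I needed this setup because my test suite runs on a Jenkins server, and the Selenium Grid hub and node setup points to Docker containers (https://github.com/SeleniumHQ/docker-selenium) running on a different server. Once again, this is just a Java translation of @Florent's answer above; please refer to it for more information.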


Where does the js object come from? - EJC
Found it: IJavaScriptExecutor js = (IJavaScriptExecutor)driver; - EJC

1
I found this article on Medium. It refers to another tutorial that may help.

https://lindajosiah.medium.com/python-selenium-docker-downloading-and-saving-files-ebb9ab8b2039

I am using a Docker image for my Python download script, and a Docker stack for the Selenium Hub.

Source: https://github.com/SeleniumHQ/docker-selenium/blob/trunk/docker-compose-v2.yml

version: '2'
services:
  chrome:
    image: selenium/node-chrome:4.8.1-20230306
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
    ports:
      - "6900:5900"
    networks:
      - scraper-service
    volumes:
      - ./downloads:/home/seluser/Downloads  # <= link a local directory to the downloads location
  selenium-hub:
    image: selenium/hub:4.8.1-20230306
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"
    networks:
      - scraper-service
networks:
  scraper-service:
    external: true

Then I set the download directory in my Python script.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/home/seluser/Downloads/",  # <= link to the downloads location
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing_for_trusted_sources_enabled": False,
    "safebrowsing.enabled": False
})
chrome = webdriver.Remote(
      command_executor='http://selenium-hub:4444/wd/hub',
      options=options)

You can really mount any external volume this way.
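
With a shared volume, the remaining question is when the file is complete. A minimal polling sketch, assuming the test script can read the host-side ./downloads directory and relying on Chrome's habit of giving in-progress downloads a .crdownload suffix; wait_for_download is a hypothetical helper, not part of Selenium.

```python
import time
from pathlib import Path


def wait_for_download(directory, timeout=30.0, poll=0.5):
    """Return the newest finished file in `directory`, or raise TimeoutError.

    A download counts as finished once at least one regular file exists
    and no *.crdownload file (Chrome's in-progress marker) remains.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        entries = [p for p in directory.iterdir() if p.is_file()]
        partial = [p for p in entries if p.suffix == ".crdownload"]
        finished = [p for p in entries if p.suffix != ".crdownload"]
        if finished and not partial:
            # Pick the most recently modified file.
            return max(finished, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError("no finished download in %s" % directory)
        time.sleep(poll)
```

After triggering the download through the remote driver, the script could call `wait_for_download("./downloads")` and open the returned path directly.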


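Keep in mind the volume permissions between the container and the host or other containers. Selenium uses a specific user ID and group ID. Once that is fixed, the solution works with Compose. Thanks.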

0
If for some reason (e.g. pylint) you want to avoid accessing a protected member (elem._execute), then the line

elem._execute('sendKeysToElement', {'value': [ path ], 'text': path})

in @FlorentB's answer can be rewritten as:

elem.parent.execute('sendKeysToElement', {'value': [ path ], 'text': path, 'id': elem.id})

Source: lines 703, 708 and 727 of https://github.com/SeleniumHQ/selenium/blob/trunk/py/selenium/webdriver/remote/webelement.py


0

The following works in 2020 with Chrome and PHP php-webdriver:

$downloaddir = "/tmp/";
$host = 'http://ipaddress:4444/wd/hub';
try {
    $options = new ChromeOptions();
    $options->setExperimentalOption("prefs",["safebrowsing.enabled" => "true", "download.default_directory" => $downloaddir]);
    $options->addArguments( array("disable-extensions",'safebrowsing-disable-extension-blacklist','safebrowsing-disable-download-protection') );
    $caps = DesiredCapabilities::chrome();
    $caps->setCapability(ChromeOptions::CAPABILITY, $options);
    $caps->setCapability("unexpectedAlertBehaviour","accept");
    $driver = RemoteWebDriver::create($host, $caps);
    $driver->manage()->window()->setPosition(new WebDriverPoint(500,0));
    $driver->manage()->window()->setSize(new WebDriverDimension(1280,1000));
    $driver->get("https://file-examples.com/index.php/sample-documents-download/sample-rtf-download/");
    sleep(1);
    $driver->findElement(WebDriverBy::xpath("//table//tr//td[contains(., 'rtf')]//ancestor::tr[1]//a"))->click();
    sleep(1);
    $driver->get('chrome://downloads/');
    sleep(1);
    // $inject = "return downloads.Manager.get().items_.filter(e => e.state === 'COMPLETE').map(e => e.filePath || e.file_path); ";
    $inject = "return document.querySelector('downloads-manager').shadowRoot.querySelector('downloads-item').shadowRoot.querySelector('a').innerText;";
    $filename = $driver->executeScript(" $inject" );
    echo "File name: $filename<br>";
    $driver->executeScript( 
        "var input = window.document.createElement('INPUT'); ".
        "input.setAttribute('type', 'file'); ".
        "input.hidden = true; ".
        "input.onchange = function (e) { e.stopPropagation() }; ".
        "return window.document.documentElement.appendChild(input); " );
    $elem1 = $driver->findElement(WebDriverBy::xpath("//input[@type='file']"));
    $elem1->sendKeys($downloaddir.$filename);
    $result = $driver->executeAsyncScript( 
        "var input = arguments[0], callback = arguments[1]; ".
        "var reader = new FileReader(); ".
        "reader.onload = function (ev) { callback(reader.result) }; ".
        "reader.onerror = function (ex) { callback(ex.message) }; ".
        "reader.readAsDataURL(input.files[0]); ".
        "input.remove(); "
        , [$elem1]);
    $coding = 'base64,';
    $cstart = strpos( $result, 'base64,' );
    if ( $cstart !== false ) 
        $result = base64_decode(substr( $result, $cstart + strlen($coding) ));
    echo "File content: <br>$result<br>";
    $driver->quit();
} catch (Exception $e) {
    echo 'Caught exception: ',  $e->getMessage(), "\n";
} 

1
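While code-only answers are acceptable, it is generally more useful to the community if you also explain the code and help people understand how it solves the problem. That reduces follow-up questions and helps new developers understand the underlying concepts. Would you mind adding more detail to your post? - Jeremy Caney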
