How to get the HTTP status code with Selenium (Python code)

50

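I'm writing Selenium scripts in Python, but I couldn't find any information on the following:

How to get the HTTP status code from Selenium Python code

Or am I missing something? If anyone has found a way, feel free to post it.

13 Answers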

52

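It's not possible.

Unfortunately, Selenium was not designed to provide this information. There is a very lengthy discussion of this issue, but in short:

  1. Selenium is a browser emulation tool, not necessarily a testing tool.
  2. Selenium issues many GET and POST requests while rendering a page, and adding an interface for that would complicate the API in ways its authors resist.

We are left with the following workarounds:

  1. Look for error messages in the returned HTML.
  2. Use another tool, such as Requests (but see @Zeinab's answer for the drawbacks of that approach).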
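
The first workaround can be sketched as a plain string check against `driver.page_source`. The helper name `looks_like_error_page` and the default marker strings are assumptions made for illustration; real error pages differ by site and browser, so treat this as a starting point rather than a reliable detector.

```python
# Hypothetical helper: guess whether rendered HTML is an error page.
# The default markers are assumptions; adjust them for the site under test.
DEFAULT_MARKERS = ("404 not found", "page not found", "500 internal server error")


def looks_like_error_page(html, markers=DEFAULT_MARKERS):
    """Return True if any marker occurs in the HTML, ignoring case."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


if __name__ == "__main__":
    sample = "<html><body><h1>404 Not Found</h1></body></html>"
    print(looks_like_error_page(sample))
```

With a live session this would be called as `looks_like_error_page(driver.page_source)` right after `driver.get(...)`.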

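This is the only practical answer to the question as asked. Thanks! - Xonshiz
2
Your answer is wrong. Stefan Matei's answer and Jarad's answer both obtain the status code. - Peilonrayz
I somewhat agree with the "by design" comment. Typically, multiple requests are fired in the browser because of scripts included in the initial response. Unless the browser keeps the first status code, it is unclear which response's status code is meant. - Synru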

16

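I don't have much experience with Python. Here is a more detailed Java example:

https://dev59.com/n2w15IYBdhLWcg3wmc4J#39979509

The idea is to enable performance logging. This triggers "Network.enable" on chromedriver. You then fetch the performance log entries and parse the "Network.responseReceived" messages in them.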

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Enable performance logging (this makes chromedriver send "Network.enable").
    # Copy the dict so the shared DesiredCapabilities.CHROME is not mutated.
    # Newer chromedriver versions expect the key 'goog:loggingPrefs' instead.
    d = DesiredCapabilities.CHROME.copy()
    d['loggingPrefs'] = { 'performance':'ALL' }

    driver = webdriver.Chrome(executable_path="c:\\windows\\chromedriver.exe", service_args=["--verbose", "--log-path=D:\\temp3\\chromedriverxx.log"], desired_capabilities=d)

    driver.get('https://api.ipify.org/?format=text')

    print(driver.title)

    print(driver.page_source)

    # get_log() drains the buffer, so fetch the entries once and reuse them.
    performance_log = driver.get_log('performance')
    print(str(performance_log).strip('[]'))

    for entry in performance_log:
        print(entry)

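The output will contain "Network.responseReceived" for your URL, as well as for any other requests or redirect URLs executed while the page loaded. All you need to do is parse the log entries.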

'{"message":{"method":"Network.responseReceived","params":{"frameId":"9488.1","loaderId":"9488.1","requestId":"9488.1","response":{"connectionId":14,"connectionReused":false,"encodedDataLength":-1,"fromDiskCache":false,"fromServiceWorker":false,"headers":{"Connection":"keep-alive","Content-Length":"13","Content-Type":"text/plain","Date":"Wed, 12 Oct 2016 06:15:47 GMT","Server":"Cowboy","Via":"1.1 vegur"},"headersText":"HTTP/1.1 200 OK\\r\\nServer: Cowboy\\r\\nConnection: keep-alive\\r\\nContent-Type: text/plain\\r\\nDate: Wed, 12 Oct 2016 06:15:47 GMT\\r\\nContent-Length:13\\r\\nVia:1.1vegur\\r\\n\\r\\n","mimeType":"text/plain","protocol":"http/1.1","remoteIPAddress":"54.197.246.207","remotePort":443,"requestHeaders":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8","Accept-Encoding":"gzip, deflate, sdch, br","Accept-Language":"en-GB,en-US;q=0.8,en;q=0.6","Connection":"keep-alive","Host":"api.ipify.org","Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"},"requestHeadersText":"GET /?format=text HTTP/1.1\\r\\nHost: api.ipify.org\\r\\nConnection: keep-alive\\r\\nUpgrade-Insecure-Requests: 1\\r\\nUser-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36\\r\\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\\r\\nAccept-Encoding: gzip, deflate, sdch, br\\r\\nAccept-Language: en-GB,en-US;q=0.8,en;q=0.6\\r\\n\\r\\n","securityDetails":{"certificateId":1,"certificateValidationDetails":{"numInvalidScts":0,"numUnknownScts":0,"numValidScts":0},"cipher":"AES_128_GCM","keyExchange":"ECDHE_RSA","protocol":"TLS 
1.2","signedCertificateTimestampList":[]},"securityState":"secure","status":200,"statusText":"OK","timing":{"connectEnd":320.508999997401,"connectStart":3.08100000256673,"dnsEnd":3.08100000256673,"dnsStart":0,"proxyEnd":-1,"proxyStart":-1,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":465.725000001839,"requestTime":78246.775045,"sendEnd":320.995999994921,"sendStart":320.825999995577,"sslEnd":320.435000001453,"sslStart":141.675999999279,"workerReady":-1,"workerStart":-1},"url":"https://api.ipify.org/?format=text"},"timestamp":78247.242716,"type":"Document"}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948094, 'level': 'INFO', 'message': '{"message":{"method":"Network.dataReceived","params":{"dataLength":13,"encodedDataLength":171,"requestId":"9488.1","timestamp":78247.243137}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948094, 'level': 'INFO', 'message': '{"message":{"method":"Page.frameNavigated","params":{"frame":{"id":"9488.1","loaderId":"9488.1","mimeType":"text/plain","securityOrigin":"https://api.ipify.org","url":"https://api.ipify.org/?format=text"}}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948095, 'level': 'INFO', 'message': '{"message":{"method":"Network.loadingFinished","params":{"encodedDataLength":171,"requestId":"9488.1","timestamp":78247.242066}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948115, 'level': 'INFO', 'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":78247.264169}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948115, 'level': 'INFO', 'message': '{"message":{"method":"Page.frameStoppedLoading","params":{"frameId":"9488.1"}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 147625298116, 'level': 'INFO', 'message': 
'{"message":{"method":"Page.domContentEventFired","params":{"timestamp":78247.276475}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948122, 'level': 'INFO', 'message': '{"message":{"method":"Network.requestWillBeSent","params":{"documentURL":"https://api.ipify.org/?format=text","frameId":"9488.1","initiator":{"type":"other"},"loaderId":"9488.1","request":{"headers":{"Referer":"https://api.ipify.org/?format=text","User-Agent":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"},"initialPriority":"High","method":"GET","mixedContentType":"none","url":"https://api.ipify.org/favicon.ico"},"requestId":"9488.2","timestamp":78247.280131,"type":"Other","wallTime":1476252948.11805}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}

Then take "status":200 from the JSON response. You can also parse the response "headers".
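
The parsing step described above can be sketched roughly as follows. It assumes each entry is a dict whose `'message'` value is a JSON string shaped like the sample output (a `"message"` object with `"method"` and `"params"`); the function names `responses_from_log` and `status_for_url` are made up for this example.

```python
import json


def responses_from_log(entries):
    """Yield (url, status) for each Network.responseReceived entry."""
    for entry in entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue  # skip Page.* events, Network.dataReceived, and so on
        response = message["params"]["response"]
        yield response["url"], response["status"]


def status_for_url(entries, url):
    """Return the status of the first response whose URL equals `url`, else None."""
    for response_url, status in responses_from_log(entries):
        if response_url == url:
            return status
    return None
```

Called as `status_for_url(driver.get_log('performance'), 'https://api.ipify.org/?format=text')`, this would be expected to return the status of the main document. The exact string match is a simplification; redirects or normalized URLs would need looser matching.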


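On a Mac I get the error: selenium.common.exceptions.WebDriverException: Message: POST /session/4fd2b36a-6c9a-e34d-8e9a-022424c7f36f/log did not match a known command - user305883
1
@user305883 This only works with Chrome. You usually get this error with other browsers such as Firefox. For Firefox, you need to dump the log file and then parse it (Java example) - Stefan Matei
This does not work in Chrome today, at least when using Perl. It says this is not a W3C command. - nck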

11
import json
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chromedriver_path = "YOUR/PATH/TO/chromedriver.exe"
url = "https://selenium-python.readthedocs.io/api.html"
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}

browser = WebDriver(chromedriver_path, desired_capabilities=capabilities)

browser.get(url)
logs = browser.get_log('performance')

Option 1: if you only want a single status code, assuming the page whose status you want appears in the log with a content type containing 'text/html'

def get_status(logs):
    for log in logs:
        if log['message']:
            d = json.loads(log['message'])
            try:
                content_type = 'text/html' in d['message']['params']['response']['headers']['content-type']
                response_received = d['message']['method'] == 'Network.responseReceived'
                if content_type and response_received:
                    return d['message']['params']['response']['status']
            except KeyError:
                # Entry has no response, or no lowercase 'content-type' header.
                pass

Usage:

>>> get_status(logs)
200

Option 2: if you want to see all status codes in the relevant log entries

def get_status_codes(logs):
    statuses = []
    for log in logs:
        if log['message']:
            d = json.loads(log['message'])
            if d['message'].get('method') == "Network.responseReceived":
                statuses.append(d['message']['params']['response']['status'])
    return statuses

Usage:

>>> get_status_codes(logs)
[200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]

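Note 1: This is largely based on @Stefan Matei's answer, but some things have changed between Chrome versions, and I've added an idea for parsing the logs.

Note 2: ['content-type'] is not fully reliable. The casing may change. Check it for your use case.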
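
One way to work around the casing problem from Note 2 is a case-insensitive header lookup. This is a minimal sketch; `get_header` and `is_html` are hypothetical helper names, and `headers` is assumed to be the plain dict found at `d['message']['params']['response']['headers']`.

```python
def get_header(headers, name, default=None):
    """Look up a header by name, ignoring case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


def is_html(headers):
    """True if the Content-Type header, in any casing, mentions text/html."""
    return "text/html" in (get_header(headers, "content-type") or "").lower()
```

In Option 1, the `content_type` check could then be written as `is_html(d['message']['params']['response']['headers'])`.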


4

It appears you can get the response status code from the logs via the API.

from selenium import webdriver
import json
browser = webdriver.PhantomJS()
browser.get('http://www.google.fr')
har = json.loads(browser.get_log('har')[0]['message'])
har['log']['entries'][0]['response']['status']
har['log']['entries'][0]['response']['statusText']
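
To look beyond the first entry, the HAR dict can be walked in full. The sketch below assumes the usual HAR layout in which every entry carries `request.url` alongside the `response.status` and `response.statusText` fields used above; `har_statuses` is a name invented for this example.

```python
def har_statuses(har):
    """Return a list of (url, status, statusText) tuples from a HAR dict."""
    return [
        (
            entry["request"]["url"],
            entry["response"]["status"],
            entry["response"]["statusText"],
        )
        for entry in har["log"]["entries"]
    ]
```

Passing the `har` object built in the snippet above would list one tuple per captured request, which makes it easier to pick out the main document among images, scripts, and redirects.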

Is there anything browser-specific about the "log" part, or does this code work in all browsers? - Ywapom
1
I have only tested it with PhantomJS. I don't know about IE, but I think it should be possible with Chrome. - Mma
4
I get the error selenium.common.exceptions.WebDriverException: Message: unknown error: log type 'har' not found. - etayluz
1
@atb00ker, where in the code would you insert capabilities['loggingPrefs'] = {'har': 'ALL'}? - Marcin Kulik
@MarcinKulik Do from selenium.webdriver.common.desired_capabilities import DesiredCapabilities, then create a variable capabilities = DesiredCapabilities.FIREFOX, and you can use it to modify the browser's capabilities. Hope that helps. This may serve as an example: https://github.com/openwisp/docker-openwisp/blob/3cfe866459d146c0b56ffee7abd500962b21442c/tests/runtests.py - atb00ker

4
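I'll refer to a question I asked earlier: How to detect when Selenium loads a browser's error page
In short, unless you want to use an advanced solution such as a Squid proxy or browsermob, you have to settle for the following, less elegant solution.
Replace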
driver.get( "http://google.com" )

with

def goTo( url ):
    # Load the page first, then look for Firefox's error-page container.
    driver.get( url )
    if "errorPageContainer" in [ elem.get_attribute("id") for elem in driver.find_elements_by_css_selector("body > div") ]:
        raise Exception( "this page is an error" )

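You can get creative with the text actually displayed in the browser and extract the error code from it. This has to be customized per browser; the approach above works for Firefox.

The only case where this can be a problem is 404 (page not found), because many sites have their own error pages, and you would need to customize the check for each of them.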
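
Extracting a code from the displayed text could look roughly like this. The pattern is an assumption: it only recognizes a 4xx or 5xx number that directly follows the word "error" (for example "HTTP ERROR 404"), and the wording a given browser or site actually shows may well differ. `extract_error_code` is a hypothetical helper name.

```python
import re

# Assumed pattern: matches text such as "HTTP ERROR 404" or "Error 503".
ERROR_CODE_RE = re.compile(r"\berror\s+([45]\d{2})\b", re.IGNORECASE)


def extract_error_code(page_text):
    """Return the first 4xx/5xx code found after the word 'error', else None."""
    match = ERROR_CODE_RE.search(page_text)
    return int(match.group(1)) if match else None
```

It would be called with the page text, for instance `extract_error_code(driver.page_source)`, and a `None` result only means the pattern did not match, not that the page loaded successfully.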


3

To get the status code of a URL with Selenium, you can use JavaScript and the XMLHttpRequest object. The WebDriver class has an execute_async_script() method that you can call to run JavaScript in the browser:

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"C:\ChromeDriver\chromedriver.exe")
driver.get('https://stackoverflow.com/')

js = '''
let callback = arguments[0];
let xhr = new XMLHttpRequest();
xhr.open('GET', 'https://stackoverflow.com/', true);
xhr.onload = function () {
    if (this.readyState === 4) {
        callback(this.status);
    }
};
xhr.onerror = function () {
    callback('error');
};
xhr.send(null);
'''

status_code = driver.execute_async_script(js)
print(status_code)    # 200

driver.close()

More information about the execute_async_script method.
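
The same idea can be wrapped in a reusable function that passes the URL and HTTP method from Python instead of hard-coding them. This is a sketch under a few assumptions: `fetch_status` and `STATUS_JS` are names invented here, the request is a second, separate request rather than the page navigation itself, and the browser's same-origin rules still apply to it. Selenium injects the completion callback as the last element of `arguments`, after any values passed from Python.

```python
# JavaScript run in the page. The callback is the last entry of `arguments`;
# the URL and method passed from Python come before it.
STATUS_JS = """
const url = arguments[0];
const method = arguments[1];
const callback = arguments[arguments.length - 1];
const xhr = new XMLHttpRequest();
xhr.open(method, url, true);
xhr.onload = function () { callback(this.status); };
xhr.onerror = function () { callback('error'); };
xhr.send(null);
"""


def fetch_status(driver, url, method="GET"):
    """Issue a request from inside the page and return its status code."""
    return driver.execute_async_script(STATUS_JS, url, method)
```

After `driver.get('https://stackoverflow.com/')`, calling `fetch_status(driver, 'https://stackoverflow.com/')` would be expected to behave like the script above. Passing `method="HEAD"` may avoid downloading the body, provided the server supports it.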


The GET approach looks good. But for pages that use the POST method, is there a way to check the response code of a form submission? - bhattraideb

2
You can also check the last message in the log for an error status code: print(browser.get_log('browser')[-1]['message'])

1

In the meantime, there is a Python library called selenium-wire.

pip install selenium-wire

It lets you do something like this:

from seleniumwire import webdriver

url = 'https://stackoverflow.com'
driver = webdriver.Chrome()
driver.get(url)

for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
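
To pull out the status of one particular URL rather than printing everything, the captured requests can be filtered. The sketch below relies only on the `url`, `response`, and `status_code` attributes used in the loop above; `wire_status` is a hypothetical helper, and the exact string comparison is an assumption that may need loosening (for example for trailing slashes or redirects).

```python
def wire_status(requests, url):
    """Return the status code of the last captured response for `url`, else None."""
    status = None
    for request in requests:
        # Requests that never received a response have request.response unset.
        if request.url == url and request.response:
            status = request.response.status_code
    return status
```

With the session above this would be called as `wire_status(driver.requests, driver.current_url)`.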

1
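Never say something is impossible. The top-voted answer is terrible. Plenty of other answers point toward possible solutions, but I'll share the approach I implemented myself, which is based on another Stack Overflow answer.
Tested with Google Chrome. The details for Firefox or PhantomJS may differ.
I created a method that checks the response status code of any URL you have visited. I'm sure it could use some cleanup, but it works.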
import json
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}

driver = webdriver.Chrome(desired_capabilities=capabilities)


def get_status_code(url):
    for entry in driver.get_log('performance'):
        for k, v in entry.items():
            if k == 'message' and 'status' in v:
                msg = json.loads(v)['message']['params']
                for mk, mv in msg.items():
                    if mk == 'response':
                        response_url = mv['url']
                        response_status = mv['status']
                        if response_url == url:
                            return response_status


print(get_status_code(driver.current_url))

Output:

200


0

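I'm using Java here because I don't have much Python experience. Also, I don't know how to get only the HTTP status code. The following gives you the entire network traffic, from which you can capture the status code.

First, start your server

selenium.start("captureNetworkTraffic=true");

Then capture your traffic.
String traffic = selenium.captureNetworkTraffic("xml");

You can also get the output in JSON format.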

