How do I parse a raw HTTP request in Python 3?

Question

9

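I want to parse an HTTP request natively in Python 3. A linked question shows how to do this in Python 2, but it relies on deprecated modules (and on Python 2 itself), and I would like a Python 3 solution.

Mainly, I want to find out which resource is being requested and parse the headers of a simple request, for example: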

GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

Can anyone show me a basic way to parse this request?

- Startec

1

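Your first sentence suggests you know you should use a library (e.g. urllib3, requests). Then you say you are trying to do it in Python 3 but don't know how. Why not just use requests? - Jonathon Reinhart

@JonathonReinhart I work in an environment where third-party libraries are not allowed. - Startec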

1

urllib is not a third-party library. - OneCricketeer

It looks like this standard-library class can do what you need. https://docs.python.org/3/library/http.server.html#http.server.BaseHTTPRequestHandler.MessageClass - OneCricketeer

1

@cricket_007 He didn't mention urllib. He mentioned urllib3, which is a third-party library. - Startec

Try kiss-headers, a library built specifically for parsing headers correctly. https://pypi.org/project/kiss-headers/ - Ousret
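
Following up on the `http.server` pointer in the comments above, here is a minimal sketch of reusing the standard library's request parser without opening a socket. The class name `ParsedRequest` and the `error_code` / `error_message` attributes are hypothetical names chosen for this illustration; only `BaseHTTPRequestHandler.parse_request()` and the attributes it sets (`command`, `path`, `request_version`, `headers`) come from the standard library.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class ParsedRequest(BaseHTTPRequestHandler):
    """Hypothetical helper: reuse the stdlib request parser without a socket."""

    def __init__(self, raw_request: bytes):
        # Deliberately skip the parent __init__, which expects a live connection.
        self.rfile = BytesIO(raw_request)
        self.raw_requestline = self.rfile.readline()
        self.error_code = None
        self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record parse errors instead of writing a response to a client.
        self.error_code = code
        self.error_message = message


raw = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)
request = ParsedRequest(raw)
print(request.command, request.path, request.request_version)
print(dict(request.headers.items()))
```

This approach takes bytes rather than a string, and it yields the requested resource (`request.path`) as well as the headers. If parsing fails, `error_code` is set instead of an exception being raised, so it is worth checking before using the other attributes.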

3 Answers

2

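Header fields are separated from one another by a carriage return and line feed, and within each field the name and value are separated by a colon. So, assuming you already have the message as a string, this should be easy: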

fields = resp.split("\r\n")
fields = fields[1:]  # ignore the GET / HTTP/1.1
output = {}
for field in fields:
    key, value = field.split(':', 1)  # split each line into HTTP field name and value
    output[key] = value

Update 4/13

Using the example HTTP request from the linked post:

resp = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
    'Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n'
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\n'
    'Accept-Encoding: gzip,deflate,sdch\r\n'
    'Avail-Dictionary: GeNLY2f-\r\n'
    'Accept-Language: en-US,en;q=0.8\r\n'
)

fields = resp.split("\r\n")
fields = fields[1:] #ignore the GET / HTTP/1.1
output = {}
for field in fields:
    if not field:
        continue
    key,value = field.split(':', 1)
    output[key] = value    
print(output)

An extra check is needed to make sure field is not empty. Output:

{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-Dictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}

- Liam Kelly

1

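That code won't work. Patching it by adding maxsplit=1 to split() would actually be better. You probably want to split on \n rather than \r\n, which is more general. Then don't forget to strip the trailing \r, if there is one. - Ousret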

1

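You might want to consider a dedicated library such as kiss-headers to handle them properly. - Ousret

@Ousret - Updated the post to show that the code works even on the example request from the linked post. I did need to add a quick check for empty fields, but for sample code it does the job. As for using a library, that is a good default choice. - Liam Kelly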

1

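Look at this header: User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0 It will fail. ;) - Ousret

If the port number is explicitly included in the URL, you may need to skip the second line (i.e. the "Host" field and its value), for example with fields = fields[2:]; otherwise key,value = field.split(':') will raise an error. - Matthew Thomas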
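
The suggestions from the comments above (split on \n and strip a trailing \r, split each field only on the first colon, skip empty lines) can be combined into one small function. This is only a sketch under those assumptions, not a full HTTP parser; the name `parse_headers` is made up for the example, and repeated header names simply overwrite each other.

```python
def parse_headers(raw: str) -> dict:
    """Hypothetical helper: collect the header fields of a raw HTTP request."""
    headers = {}
    lines = raw.split("\n")
    for line in lines[1:]:  # skip the request line, e.g. "GET / HTTP/1.1"
        line = line.rstrip("\r")  # tolerate both \r\n and bare \n endings
        if not line:
            break  # an empty line marks the end of the header section
        name, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # not a "name: value" line; ignored in this sketch
        headers[name.strip()] = value.strip()
    return headers


raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost:8080\r\n"
    "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) "
    "Gecko/20100101 Firefox/50.0\r\n"
    "\r\n"
)
print(parse_headers(raw))
```

Because only the first colon is treated as the separator, values that contain colons themselves, such as a Host with a port or the rv:50.0 token in the User-Agent, are kept whole.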

0

Here are some Python packages that aim to parse the HTTP protocol properly:

https://dpkt.readthedocs.io/en/latest/api/api_auto.html#module-dpkt.http
https://h11.readthedocs.io/en/latest/
https://github.com/benoitc/http-parser/ (C-based backend)
https://github.com/MagicStack/httptools (C backend based on the Node.js parser)
https://github.com/silentsignal/netlib-offline (shameless plug)

- buherator

- Corey Goldberg · Accepted Answer

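You can do this with the email.message.Message class and the email module from the standard library.

Adapting the answer from the question you linked, here is a Python 3 example that parses HTTP headers.

Suppose you want to build a dictionary containing all the header fields: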

import email
import pprint

request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'

# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)

# construct a message from the request string. note: the return is already a dict-like object.
message = email.message_from_string(headers)

# construct a dictionary containing the headers
headers = dict(message.items())

# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)

If you run this at the Python prompt, the result looks like this:

{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
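
The example above discards the first line, but the question also asks for the requested resource. As a sketch building on the same `email` approach, the request line can be split into its three parts before the headers are parsed. The function name `parse_request` and the names of the returned values are hypothetical, and the sketch assumes a well-formed request line with \r\n line endings.

```python
import email


def parse_request(raw: str):
    """Hypothetical helper: return (method, target, version, headers)."""
    # The request line is everything before the first CRLF.
    request_line, _, header_block = raw.partition("\r\n")
    method, target, version = request_line.split(" ", 2)
    # The remaining text is parsed as RFC 822-style headers.
    message = email.message_from_string(header_block)
    return method, target, version, dict(message.items())


raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
method, target, version, headers = parse_request(raw)
print(method, target, version)
print(headers)
```

Here `target` holds the requested resource (`/index.html` for this input). A malformed request line with fewer than three parts would raise a ValueError, so real input would need additional checks.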