How to read HTML from a URL in Python 3

Question

python html url

95

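I have looked at previous similar questions, but they only left me more confused.

In Python 3.4, I want to read an HTML page from a given URL as a string.

In Perl, I do this with get() from LWP::Simple.

An example for Matplotlib 1.3.1 says: import urllib; u1=urllib.urlretrieve(url). However, urlretrieve cannot be found in Python 3.

I tried u1 = urllib.request.urlopen(url), which appears to return an HTTPResponse object, but I cannot print it, get its length, or index it.

u1.body does not exist. I cannot find a description of HTTPResponse in Python 3.

Is there an attribute of the HTTPResponse object that gives me the raw bytes of the HTML page?

(Irrelevant material from other questions includes urllib2 (which does not exist in my Python), csv parsers, etc.)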

Edit:

I found something in a previous question that partly (mostly) does the job:

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print (lines)

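I say "partly" because I do not want to read separate lines, just one long string.

I could concatenate the lines, but every printed line has a character "b" in front of it.

Where does that come from?

Again, I could strip the first character before concatenating, but that does seem clumsy.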

- user1067305

Here is the description of HTTPResponse objects in the Python 3 documentation. - martineau

7 Answers

114

Try the "requests" module; it is simpler.

#pip install requests for installation

import requests

url = 'https://www.google.com/'
r = requests.get(url)
r.text

More info here > http://docs.python-requests.org/en/master/

- Aaron T.

1

"import requests" is Python 2 syntax, right? - Fabien Snauwaert

8

What do you mean? The "import libname" statement works in Python 3 as well. - Sir Von Berker

1

From the website: "Requests officially supports Python 2.7 and 3.6+, and runs well on PyPy." - tenfishsticks

16

urllib.request.urlopen(url).read() should return the raw HTML page. In Python 3 the result is a bytes object, so call .decode() on it to get a string.
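A minimal sketch of what read() hands back, for illustration only. To stay runnable without network access it opens a data: URL (which urllib.request has supported since Python 3.4) instead of a real site; the sample markup and the variable names are assumptions, not part of the original answer.

```python
import urllib.request
from urllib.parse import quote

# Hypothetical sample page, embedded in a data: URL so no network is needed.
sample_html = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

with urllib.request.urlopen(url) as response:
    raw = response.read()      # a bytes object, not a str

html = raw.decode("utf-8")     # decoding turns the bytes into a str

print(type(raw).__name__)
print(type(html).__name__)
print(html)
```

With a real http:// or https:// URL the same two steps apply: read() for the bytes, then decode() for the text.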

- user2629998

2

@user1067305 Strange... request.urlopen() returns an HTTPResponse, and those do have a read() method... - user2629998

OK! I tried it this way: u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1'); junk = u2.read(); print(junk) - user1067305

15

import requests

url = requests.get("http://yahoo.com")
htmltext = url.text
print(htmltext)

This works much like urllib.urlopen.

- Ramandeep Singh

13

Reading an HTML page with urllib is quite simple. Since you want to read it as a single string, I will show you how.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request:

request = urllib.request.Request('http://www.w3schools.com')

When requesting a web page, always use "try/except", because things can easily go wrong. Use urlopen() to request the page.

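try:
    response = urllib.request.urlopen(request)
except urllib.error.URLError as err:
    # urlopen() raises URLError (or its subclass HTTPError) when the request fails
    print("something wrong:", err)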
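As a hedged sketch of the try/except advice above: urlopen() reports failures as urllib.error.URLError, and HTTP error statuses as its subclass urllib.error.HTTPError, so those can be caught by name rather than with a bare except. The helper name fetch_html, the 10-second timeout, and the UTF-8 decoding are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails.

    Hypothetical helper; decoding as UTF-8 is an assumption.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme.
        print("URL error:", err.reason)
    return None


# An unsupported scheme fails without touching the network.
result = fetch_html("nosuchscheme://example.invalid/")
print(result)
```

HTTPError is listed first because it is a subclass of URLError; in the other order the more specific handler would never run.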

type() is a handy function that tells us the type of a variable. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read() method of our response object stores the HTML in a variable as bytes. Using type() again confirms this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode() method to convert the bytes variable into a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

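If you want to split this string into separate lines, you can use the split() method. That way we can easily iterate over the lines to print the whole page or do any other processing.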

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)
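One possible refinement, sketched under the assumption that the page uses Windows-style "\r\n" line endings (the sample string below is made up): split('\n') leaves a trailing '\r' on each line and an empty last element, while str.splitlines() handles the common line endings.

```python
# Hypothetical decoded page with CRLF line endings.
html_str = "<html>\r\n<body>\r\nHello\r\n</body>\r\n</html>\r\n"

by_split = html_str.split("\n")        # keeps the '\r' and adds a final ''
by_splitlines = html_str.splitlines()  # strips '\r\n', no empty last item

print(by_split)
print(by_splitlines)
```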

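I hope this provides a more detailed answer. The Python documentation and tutorials are very good; I recommend using them as a reference, since they answer most questions you are likely to have.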

- Discoveringmypath

1

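Don't assume it is UTF-8 encoded; that is not a good idea. You should try reading the headers. - CpILL

@CpILL Good catch. I agree: although utf-8 is widely used, you could run into problems. - Discoveringmypath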

2

For Python 2

import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()

- agamike

3

Could you make it clear that this is for Python 2? I checked, and urllib.urlopen no longer exists in Python 3. - junhan
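For reference, a sketch of where the two Python 2 calls mentioned on this page live in Python 3: both urlopen and urlretrieve moved into urllib.request. The data: URL and the file name page.html are assumptions chosen so the example runs offline. The Python documentation lists urlretrieve as a legacy interface, so urlopen is probably the safer default.

```python
import os
import tempfile
import urllib.request

# Hypothetical offline stand-in for a real page: "<p>Hello</p>" as a data: URL.
url = "data:text/html;charset=utf-8,%3Cp%3EHello%3C%2Fp%3E"

# Python 2: urllib.urlopen(url)  ->  Python 3: urllib.request.urlopen(url)
with urllib.request.urlopen(url) as filehandle:
    page = filehandle.read().decode("utf-8")
print(page)

# Python 2: urllib.urlretrieve(url)  ->  Python 3: urllib.request.urlretrieve(url)
target = os.path.join(tempfile.mkdtemp(), "page.html")
saved_path, headers = urllib.request.urlretrieve(url, target)
with open(saved_path, encoding="utf-8") as f:
    saved = f.read()
print(saved_path)
```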

0

I think the b'' in front of those lines is there to indicate a byte string, which is what you asked for. To decode a bytes object:

b'Some html text'.decode()

This decodes using utf-8. However, it is better to decode using the encoding specified in the headers.
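A sketch of decoding with the charset from the headers, using only the standard library. The headers object on a urlopen() response is an email.message.Message, whose get_content_charset() returns the charset named in Content-Type, or None if there is none. The helper name read_text, the UTF-8 fallback, and the offline data: URL are assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def read_text(url, fallback="utf-8"):
    """Fetch `url` and decode it with the charset declared in Content-Type.

    Hypothetical helper; falls back to `fallback` when no charset is declared.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset)


# Offline stand-in for a server that declares ISO-8859-1 in its headers.
latin1_bytes = "café".encode("iso-8859-1")
print(latin1_bytes)  # the b prefix marks a bytes object, not a str
url = "data:text/html;charset=iso-8859-1," + quote(latin1_bytes)

text = read_text(url)
print(text)
```

A page can also declare its encoding in a meta tag inside the HTML itself, which this sketch does not inspect.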

I am not sure whether this works in Python 3.4, but you can do this:

import requests 
page = requests.get('https://www.mslscript.com')
html_text = page.text
encoded_html = html_text.encode(page.encoding)
decoded_html = encoded_html.decode(page.encoding)

Getting the text onto a single line is simple:

# Remove all the newline ('\n') chars
while '\n' in decoded_html:
    decoded_html = decoded_html.replace('\n','')

# Remove all the extra spaces,
#   you could even replace with ''
while '  ' in decoded_html:
    decoded_html = decoded_html.replace('  ',' ')

# Remove tabs '\t', maybe not.
while '\t' in decoded_html:
    decoded_html = decoded_html.replace('\t','')
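A more compact alternative, offered as a sketch rather than a drop-in replacement: a single regular expression can collapse every run of whitespace (newlines, carriage returns, tabs, repeated spaces) into one space. Unlike the loops above it leaves a space where a newline was, and like them it also alters whitespace inside elements such as pre. The sample string is made up.

```python
import re

# Hypothetical decoded page containing mixed whitespace.
html_text = "<html>\r\n  <body>\n\t<p>Hello   world</p>\n  </body>\n</html>\n"

# Replace each run of whitespace with one space, then trim the ends.
one_line = re.sub(r"\s+", " ", html_text).strip()

# The same result without the re module.
one_line_alt = " ".join(html_text.split())

print(one_line)
```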

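You can also use requests-async, a powerful library that is compatible with Python 3.6 and works particularly well with Trio. The latest version of requests works with py -3.7. If possible, you may want to upgrade to at least Python 3.8.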

- bauderr

- davidgh · Accepted Answer

Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to use decode to convert it to a string.

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)
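One caveat, sketched with made-up data: decode("utf8") raises UnicodeDecodeError when the bytes are not valid UTF-8. Passing errors="replace" is one hedged option for pages whose encoding is unknown; it substitutes U+FFFD for undecodable bytes instead of failing, at the cost of losing those characters.

```python
# Hypothetical page bytes that are ISO-8859-1, not UTF-8.
mybytes = "café".encode("iso-8859-1")

try:
    mystr = mybytes.decode("utf8")
    strict_failed = False
except UnicodeDecodeError as err:
    strict_failed = True
    print("strict decode failed:", err.reason)

# Lenient decode: undecodable bytes become the replacement character U+FFFD.
lenient = mybytes.decode("utf8", errors="replace")
print(lenient)
```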