How do I read the contents of a URL with Python?

Question

127

The following works when I paste it into my browser:

http://www.somesite.com/details.pl?urn=2344

But when I try to read the URL with Python, nothing happens:

 link = 'http://www.somesite.com/details.pl?urn=2344'
 f = urllib.urlopen(link)           
 myfile = f.readline()  
 print myfile

Do I need to encode the URL, or is there something I'm not seeing?

- Helen Neely

11 Answers

45

For Python 3 users, to save time, use the following code:

from urllib.request import urlopen

link = "https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html"

f = urlopen(link)
myfile = f.read()
print(myfile)

I know there are separate threads for the error "NameError: urlopen is not defined", but I thought this might save time.

- i.n.n.m

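This is not the best way to read data from a URL in Python 3, because it misses the benefits of the "with" statement. See my answer: https://dev59.com/ZmUp5IYBdhLWcg3wo4yd#56295038 - Freddie

It won't work in a while loop, only for a single call. That's bad, if you ask me. - greendino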

21

None of these answers work very well for Python 3 (tested on the latest version at the time of this post).

This is how you do it...

import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://www.python.org/') as f:
        print(f.read().decode('utf-8'))
except urllib.error.URLError as e:
    print(e.reason)

The above is for content served as UTF-8. Note that removing .decode('utf-8') does not make Python guess the encoding: f.read() then simply returns raw bytes.
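As a hedged illustration of the encoding point, the sketch below decodes the body with the charset the server declares in its Content-Type header and falls back to UTF-8 when none is given. The helper name read_text and the fallback choice are assumptions for illustration, not part of the answer; the data: URL is used only so the example runs without a network connection.

```python
from urllib.request import urlopen

def read_text(url, fallback="utf-8"):
    """Fetch url and decode the body with the charset the server declares.

    read_text and the UTF-8 fallback are illustrative assumptions.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # charset parameter of the Content-Type header, or None if absent
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")

# Assumption: a data: URL stands in for a real page.
# %E9 is the ISO-8859-1 byte for an accented "e".
sample = read_text("data:text/plain;charset=iso-8859-1,caf%E9")
print(ascii(sample))  # ascii() keeps the output printable on any terminal
```

With a real site you would call read_text('http://www.python.org/') in the same way; a wrong or missing charset declaration can still produce replacement characters.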

Documentation: https://docs.python.org/3/library/urllib.request.html#module-urllib.request

- Freddie

Thanks, the original code was written in Python 2, but your contribution here is noted. - Helen Neely

12

A solution that works with both Python 2.X and Python 3.X makes use of the Python 2 and 3 compatibility library six:

from six.moves.urllib.request import urlopen
link = "http://www.somesite.com/details.pl?urn=2344"
response = urlopen(link)
content = response.read()
print(content)

- Martin Thoma

1

We can read a website's HTML content as follows:

from urllib.request import urlopen
response = urlopen('http://google.com/')
html = response.read()
print(html)

- Akash Kinwad

2

This is the same as @i.n.n.m.'s answer. - PM0087

1

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Works on python 3 and python 2.
# when server knows where the request is coming from.

import sys
from contextlib import closing

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen

# closing() is needed because the Python 2 urlopen() result
# is not a context manager
with closing(urlopen('https://www.facebook.com/')) as url:
    data = url.read()

print(data)

# When the server does not know where the request is coming from
# (send a browser-like User-Agent header).
# Works on python 3.

import urllib.request

user_agent = \
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = 'https://www.facebook.com/'
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
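The second half of this snippet relies on urllib.request.Request carrying custom headers. A small sketch, assuming a hypothetical helper build_request and a placeholder example.com URL, shows one way to confirm the header is attached before anything is sent over the network:

```python
import urllib.request

def build_request(url, user_agent):
    """Return a Request that will send the given User-Agent header.

    build_request is a hypothetical helper used only for illustration.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Assumption: placeholder URL and User-Agent string; nothing is sent here.
request = build_request("https://example.com/", "Mozilla/5.0 (X11; Linux x86_64)")

# Request stores header names capitalized, so the key is "User-agent"
print(request.get_header("User-agent"))
print(request.full_url)
```

Passing the resulting object to urllib.request.urlopen() would then perform the actual request.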

- ARVIND CHAUHAN

0

from urllib.request import urlopen

# read() returns bytes; call decode() to get text (needed for non-ASCII content such as Chinese)
html = urlopen("https://blog.csdn.net/qq_39591494/article/details/83934260").read().decode('utf-8')
print(html)

- 荷兰哲学家Elvira

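Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please [edit] your answer to add some explanation, including the assumptions you've made. - codedge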

0

import requests
from bs4 import BeautifulSoup

link = "https://www.timeshighereducation.com/hub/sinorbis"

res = requests.get(link)
if res.status_code == 200:
    # pass the response body (res.text), not the Response object
    soup = BeautifulSoup(res.text, 'html.parser')

    # get the text content of the webpage
    text = soup.get_text()

    print(text)

Using BeautifulSoup's HTML parser, we can extract the text content of the webpage.

- Nirmal Sankalana

-1

I used the following code:

# Python 2 only: urllib.urlopen() does not exist in Python 3
import urllib

def read_text():
    quotes = urllib.urlopen("https://s3.amazonaws.com/udacity-hosted-downloads/ud036/movie_quotes.txt")
    contents_file = quotes.read()
    quotes.close()
    print(contents_file)

read_text()

- ggglni

-1

# retrieving data from url
# only for python 3

import urllib.request

def main():
    url = "http://docs.python.org"

    # retrieving data from URL
    webUrl = urllib.request.urlopen(url)
    print("Result code: " + str(webUrl.getcode()))

    # print data from URL
    print("Returned data: -----------------")
    data = webUrl.read().decode("utf-8")
    print(data)

if __name__ == "__main__":
    main()

- ksono

- woozyking · Accepted Answer

To answer your question:

import urllib

link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.read()
print(myfile)

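You need to use read(), not readline(): readline() returns only the first line of the response, while read() returns the whole body.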
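To make the difference concrete, here is a minimal Python 3 sketch. It uses a data: URL with a two-line body (an assumption made so the example runs without network access) and a hypothetical helper named first_line_and_rest; neither comes from the original question.

```python
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real page so no network is needed.
# %0A is a percent-encoded newline, so the body has two lines.
SAMPLE_URL = "data:text/plain;charset=utf-8,first%20line%0Asecond%20line"

def first_line_and_rest(url):
    """Return (one readline() result, the result of a following read())."""
    with urlopen(url) as response:
        first = response.readline()  # stops after the first newline
        rest = response.read()       # everything that is left
    return first, rest

first, rest = first_line_and_rest(SAMPLE_URL)
print(first)
print(rest)
```

If the first line of a page happens to be blank, a single readline() call prints what looks like nothing at all, which may be what the snippet in the question ran into.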

EDIT (2018-06-25): Since Python 3, the legacy urllib.urlopen() has been replaced by urllib.request.urlopen() (see the notes at https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen for details).

If you're using Python 3, see the answers by Martin Thoma or i.n.n.m within this question: https://dev59.com/ZmUp5IYBdhLWcg3wo4yd#28040508 (Python 2/3 compatibility) and https://dev59.com/ZmUp5IYBdhLWcg3wo4yd#45886824 (Python 3).

Or, just get this library here: http://docs.python-requests.org/en/latest/ and seriously use it :)

import requests

link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)