How to download all zip files from a website using Python

Question

How to download all zip files from a website using Python


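I am trying to download all the zip files from this web page: https://www.google.com/googlebooks/uspto-patents-grants-text.html. Please note that I am not a professional coder, so forgive me if I have made some silly mistakes.

Here is my code: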

from bs4 import BeautifulSoup            
import requests

url = "https://www.google.com/googlebooks/uspto-patents-grants-text.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")

for link in soup.find_all('a', href=True):
    href = link['href']

    if any(href.endswith(x) for x in ['.zip']):
    #if any(href.endswith('.zip')):
        print("Downloading '{}'".format(href))
        remote_file = requests.get(url + href)

        with open(href, 'wb') as f:
            for chunk in remote_file.iter_content(chunk_size=1024): 
                if chunk: 
                    f.write(chunk)

The error message I get when I run the code is:

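File "C:/Users/#USER#/#FILEPATH#/Python/patentzipscraper2.py", line 16, in <module>
    with open(href, 'wb') as f:
OSError: [Errno 22] Invalid argument: 'http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip'

However, when I enter that address in my browser, the zip file downloads fine. My guess is that this has something to do with the zip format, and that I cannot download or open these files directly, but I am not sure of the exact reason. My earlier code was written for files that can be downloaded directly (such as .txt).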

Any help on how to download these zip files would be appreciated.

- John Doe

Do you want to download all the data from 1976 to the present? - MishaVacic

You are trying to open a file for writing named 'http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip'. Most likely open doesn't like that name. - Steven Rumbalski

Another oddity is remote_file = requests.get(url + href), because url + href evaluates to

"https://www.google.com/googlebooks/uspto-patents-grants-text.htmlhttp://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip"

Shouldn't it be remote_file = requests.get(href)? - Steven Rumbalski
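To illustrate the point about url + href: a minimal sketch, assuming the hrefs on the page may be either absolute (as the one in the error message is) or relative. urljoin from the standard library handles both cases, whereas plain string concatenation glues two full URLs together. The helper name resolve and the relative path files/example.zip are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://www.google.com/googlebooks/uspto-patents-grants-text.html"


def resolve(href):
    """Return an absolute URL for an href found on page_url (hypothetical helper)."""
    return urljoin(page_url, href)


# An href that is already absolute comes back unchanged.
absolute = resolve("http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip")

# A relative href (made up for illustration) is resolved against the page's directory.
relative = resolve("files/example.zip")
```

With this, requests.get(resolve(href)) would work whichever kind of link the page contains.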

Maybe write it like this: with open(os.path.basename(href), 'wb') as f: so that the file is written to 'ipg150106.zip'. - Steven Rumbalski
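A small sketch of the file-name idea, under one substitution: it uses posixpath.basename on the parsed URL path rather than os.path.basename on the raw href. URLs always use forward slashes, and parsing first drops any query string. Either way, the module has to be imported before use. The helper name local_name is an assumption for illustration.

```python
import posixpath
from urllib.parse import urlparse


def local_name(href):
    """Return the last path segment of a URL, for use as a local file name."""
    return posixpath.basename(urlparse(href).path)


name = local_name("http://storage.googleapis.com/patents/grant_full_text/2015/ipg150106.zip")
```

The open call then becomes open(local_name(href), 'wb'), which passes a plain file name instead of a full URL.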

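Hi, I tried each of these changes, but I still get the same error (unless I use os.path.basename, in which case I get an error saying: with open(os.path.basename(href), 'wb') as f: NameError: name 'os' is not defined). Should I be writing the file in a different way? - John Doe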

1 Answer


- Dmitriy Fialkovskiy · Answer 1

Implement something like the following in your code:

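import urllib.request  # "import urllib" alone does not reliably load the request submodule

# Download the URL and save it under the given local file name.
# URLopener has been deprecated since Python 3.3; urlretrieve does the same job.
urllib.request.urlretrieve("http://yoursite.com/file.zip", "file.zip")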
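Putting the answer together with the suggestions from the comments, the whole task might look like the sketch below. It is not the asker's or the answerer's exact method: it swaps BeautifulSoup and requests for the standard library (html.parser and urllib.request) so it runs without extra packages. The names ZipLinkParser, find_zip_links, download and download_all are all made up for illustration, and the sketch assumes the page is plain HTML whose zip links appear as ordinary a tags.

```python
import os
import posixpath
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ZipLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # urljoin leaves absolute hrefs alone and resolves relative ones.
            self.links.append(urljoin(self.base_url, href))


def find_zip_links(html, base_url):
    """Return the .zip links found in an HTML string, in document order."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download(url, dest_dir="."):
    """Stream one file into dest_dir, named after the last URL path segment."""
    name = posixpath.basename(urlparse(url).path)
    target = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url) as response, open(target, "wb") as f:
        shutil.copyfileobj(response, f)  # copies in chunks, not all at once
    return target


def download_all(page_url):
    """Fetch the page, then download every .zip file it links to."""
    with urllib.request.urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for link in find_zip_links(html, page_url):
        print("Downloading '{}'".format(link))
        download(link)


# Example call (performs real network requests, so it is left commented out):
# download_all("https://www.google.com/googlebooks/uspto-patents-grants-text.html")
```

The two fixes that matter for the original error are both here: the request goes to the resolved link rather than to url + href, and the local file is opened under its base name rather than under the full URL.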