Python: downloading zip files from S3


I have uploaded zip files to S3 and now want to download and process them. I don't need to keep the files permanently, I only need to process them temporarily. How can I do this?


If you just want to download the file without unzipping anything, you can also use the `download_file` method, as shown in this answer: https://dev59.com/_rX3oIgBc1ULPQZFwZVW#71474927 - Aelius
6 Answers


Because working software > comprehensive documentation

Boto2

import zipfile
import boto
import io

# Connect to s3
# This will need your s3 credentials to be set up 
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = boto.s3.key.Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()  # placeholder: process each member here

Boto3

import zipfile
import boto3
import io

# this is just to demo. real use should use the config 
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested with Python 3 on MacOSX.


Thanks for your answer. Do you know how to do this with boto3? - jaycode
@brice When I actually try to run `with open(subfile, 'r') as file:` I get a "No such file or directory" error. - partydog
I don't believe this approach works for very large (~>2GB) zip files. When you try to read the zip with the line `with io.BytesIO(obj.get()["Body"].read()) as tf:`, you get the error "Python int too large to convert to C long". I haven't been able to find a reliable way to open an S3 zip file larger than 2GB. - Doug Bower
@partydog That's because it only prints the names of the files inside the zip. - Binx
How can we read this file into pandas? I tried passing the subfile as an argument, but it throws the following error - FileNotFoundError: [Errno 2] No such file or directory: - Mohseen Mulla
Found the answer to my own question, in case it helps anyone: `with zipfile.ZipFile(tf, mode='r') as zipf: for line in zipf.read("xyz.csv").split(b"\n"): print(line)` - Mohseen Mulla
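One way around the ~2GB `BytesIO` failure mentioned in the comments is to spool the object to a temporary file on disk instead of holding it in memory. A minimal sketch; the bucket and key names in the commented-out boto3 call are placeholders, and the demo below builds a local archive in place of the S3 download:

```python
import tempfile
import zipfile

def list_zip_members(fileobj):
    """Return the member names of a zip archive, given any seekable file object."""
    with zipfile.ZipFile(fileobj, mode="r") as zipf:
        return zipf.namelist()

# For archives too large for an in-memory BytesIO, stream the object to disk:
#
#   import boto3
#   s3 = boto3.client("s3")
#   with tempfile.TemporaryFile() as tmp:
#       s3.download_fileobj("my-bucket", "big-archive.zip", tmp)  # placeholder names
#       tmp.seek(0)
#       print(list_zip_members(tmp))
#
# Demo: a locally built archive stands in for the downloaded object.
with tempfile.TemporaryFile() as tmp:
    with zipfile.ZipFile(tmp, mode="w") as zipf:
        zipf.writestr("a.txt", "hello")
        zipf.writestr("b.txt", "world")
    tmp.seek(0)
    print(list_zip_members(tmp))  # ['a.txt', 'b.txt']
```

`TemporaryFile` gives you a disk-backed file object that is cleaned up automatically when the context manager exits, so nothing is kept permanently.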

Pandas provides a shortcut that eliminates most of the code in the top answer and lets you not worry about whether the file path is on s3, gcp, or your local machine. (Note: `get_filepath_or_buffer` is a pandas-internal helper and has been deprecated in recent pandas releases.)
import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
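For getting a member into pandas (asked in the comments on the accepted answer): a member doesn't have to be extracted to disk first, since `ZipFile.open()` returns a file object that CSV parsers accept directly. A minimal sketch using the stdlib `csv` module; `pd.read_csv(zipf.open(member))` works the same way. The in-memory archive here stands in for the S3 download:

```python
import csv
import io
import zipfile

def read_csv_member(zip_fileobj, member):
    """Parse one CSV member of a zip archive into a list of rows."""
    with zipfile.ZipFile(zip_fileobj, mode="r") as zipf:
        # zipf.open() yields a binary file object; wrap it for text-mode csv.
        # pandas users can pass the same object: pd.read_csv(zipf.open(member))
        with io.TextIOWrapper(zipf.open(member), encoding="utf-8") as f:
            return list(csv.reader(f))

# Demo: an in-memory archive stands in for the S3 download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zipf:
    zipf.writestr("xyz.csv", "a,b\n1,2\n")
buf.seek(0)
print(read_csv_member(buf, "xyz.csv"))  # [['a', 'b'], ['1', '2']]
```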

If speed is a concern, a good approach is to pick an EC2 instance reasonably close to your S3 bucket (in the same region) and use that instance to unzip/process your zip files. This reduces latency and lets you process them quite efficiently. You can delete each extracted file once the work is done.
Note: this only works if you're OK with using EC2 instances.
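The extract, process, delete cycle described above can be sketched like this; the commented boto3 call uses placeholder bucket/key names, and the demo runs against a locally built archive:

```python
import os
import tempfile
import zipfile

def extract_process_cleanup(zip_path, workdir):
    """Extract a local zip, process each member, then delete the extracted files."""
    processed = []
    with zipfile.ZipFile(zip_path, mode="r") as zipf:
        zipf.extractall(workdir)
        for name in zipf.namelist():
            processed.append(name)  # real processing of the extracted file goes here
            os.unlink(os.path.join(workdir, name))  # delete once done
    return processed

# On the EC2 instance, the archive would first be fetched, e.g.:
#   boto3.client("s3").download_file("my-bucket", "archive.zip", "/tmp/archive.zip")
# ("my-bucket" / "archive.zip" are placeholder names.)
#
# Demo with a locally built archive:
with tempfile.TemporaryDirectory() as workdir:
    zip_path = os.path.join(workdir, "archive.zip")
    with zipfile.ZipFile(zip_path, mode="w") as zipf:
        zipf.writestr("a.txt", "A")
    print(extract_process_cleanup(zip_path, workdir))  # ['a.txt']
```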


Read a specific file from inside a zip file in an S3 bucket.

import boto3
import os
import zipfile
import io


'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name )

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)


Adapted from @brice's answer.

I'm sure you've heard of boto, the Python interface to Amazon Web Services.
You can fetch the key (the file) from s3 into a local file:
import os
import boto
from zipfile import ZipFile

s3 = boto.connect_s3() # connect
bucket = s3.get_bucket(bucket_name) # get bucket
key = bucket.get_key(key_name) # get key (the file in s3)
key.get_file(local_name) # download to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    pass # do something with myzip

os.unlink(local_name) # delete it

You can also use tempfile. For more details, see Create and read temporary files.
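The tempfile route can be sketched as follows, assuming the boto2 `key` object from the answer above; `NamedTemporaryFile` removes the file automatically when the context manager exits, so the explicit `os.unlink` is no longer needed. The demo writes a local archive in place of the S3 download:

```python
import os
import tempfile
import zipfile

# With boto2 as above, the download would replace the local write, e.g.:
#   key.get_file(tmp); tmp.seek(0)
# Here a locally built archive stands in for the S3 download.
with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
    with zipfile.ZipFile(tmp, mode="w") as zipf:
        zipf.writestr("data.txt", "payload")
    tmp.seek(0)
    with zipfile.ZipFile(tmp, mode="r") as myzip:
        print(myzip.namelist())  # ['data.txt']
    saved_path = tmp.name
print(os.path.exists(saved_path))  # False: the temp file is gone, no os.unlink needed
```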


Adding to @brice's answer


If you want to read any particular data from the file line by line, here is the code:

with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break # to break off after the first line

Hope this helps!

