Downloading a folder from S3 using Boto3

61

Using the Boto3 Python SDK, I can download a single file with the bucket.download_file() method.

Is there a way to download an entire folder?


2
Possible duplicate - https://dev59.com/JlwZ5IYBdhLWcg3wC8j4#31960438 - Yoav Gaudin
3
Possible duplicate of Boto3 to download all files from a S3 Bucket. - Vincent de Lagabbe
10 Answers

93

Quick and dirty, but it works:

import boto3
import os 

def downloadDirectoryFroms3(bucketName, remoteDirectoryName):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName) 
    for obj in bucket.objects.filter(Prefix = remoteDirectoryName):
        if not os.path.exists(os.path.dirname(obj.key)):
            os.makedirs(os.path.dirname(obj.key))
        bucket.download_file(obj.key, obj.key) # save to same path
Assuming you want to download the directory foo/bar from S3, the for loop iterates over all objects whose key starts with Prefix=foo/bar.

1
But you didn't set credentials! - Arkady
9
Credentials can be set in ~/.aws/credentials or as environment variables. You can find more information here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html - Konstantinos Katsantonis
1
Credentials can be set in different ways. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html?highlight=credentials - Patrick Pötz
5
When creating the S3 resource, you can declare AWS credentials as follows: s3_resource = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key) - Vidura Dantanarayana
5
To make the download recursive (including subdirectories of the directory), only download a file when obj.key does not end with '/'. - Ioannis Tsiokos
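The recursive-download guard suggested in the comments can be sketched without touching S3 at all: S3 "folders" are zero-byte placeholder keys ending in '/', so a listing only needs those filtered out. The key list below is made up for illustration.

```python
def files_to_download(keys):
    """Filter out S3 'folder marker' keys (those ending in '/')."""
    return [k for k in keys if not k.endswith('/')]

# Hypothetical listing under Prefix='foo/bar'
keys = ["foo/bar/", "foo/bar/a.txt", "foo/bar/sub/", "foo/bar/sub/b.txt"]
print(files_to_download(keys))  # ['foo/bar/a.txt', 'foo/bar/sub/b.txt']
```

Applying this filter before calling bucket.download_file avoids errors when the loop hits a folder placeholder.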

51

Here is a slightly modified version of the answer provided by Konstantinos Katsantonis:

import os
import boto3
s3 = boto3.resource('s3') # assumes credentials & configuration are handled outside python in .aws directory or environment variables

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)

This also downloads nested subdirectories; I used it to download a directory containing over 3,000 files. You can find other solutions in Boto3 to download all files from a S3 Bucket, but I don't know whether they are any better.
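The target-path arithmetic in this answer can be checked in isolation: os.path.relpath strips the prefix from the key and os.path.join re-roots it under local_dir. The key and directory names below are made up for illustration.

```python
import os

key = "data/raw/2021/file.csv"   # hypothetical object key
s3_folder = "data/raw"           # the prefix being downloaded
local_dir = "downloads"          # where files should land locally

target = os.path.join(local_dir, os.path.relpath(key, s3_folder))
print(target)  # downloads/2021/file.csv on POSIX systems
```

Without the relpath step, the full key (including the prefix) would be appended under local_dir, duplicating the folder structure.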


19

You can also use cloudpathlib, which wraps boto3 for S3. For your use case it is quite simple:

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/folder/folder2/")
cp.download_to("local_folder")


Does anyone know whether AWS billing counts this as a single request?! - Alex
Probably not. Looping over each key with boto3 should cost the same (perhaps plus a call to list the objects, but that is needed in both cases). - hume
For me it only worked without the trailing /... in the example above it should be: cp = CloudPath("s3://bucket/folder/folder2") - Luiz Tauffer
@hume Can I pass a relative path to CloudPath? E.g.: "s3://bucket///device/"? - trungducng
1
@trungducng Just like a regular Path, there is a glob method you can use to loop over those files and call download_to on each one individually. https://cloudpathlib.drivendata.org/stable/api-reference/s3path/#cloudpathlib.s3.s3path.S3Path.glob - hume
2
Why isn't this the top solution!! This tool is really great. - user2755526

5

Another approach, building on @bjc's answer, uses the built-in Path library and parses the s3 uri for you:

import boto3
from pathlib import Path
from urllib.parse import urlparse

def download_s3_folder(s3_uri, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        s3_uri: the s3 uri to the top level of the files you wish to download
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(urlparse(s3_uri).hostname)
    s3_path = urlparse(s3_uri).path.lstrip('/')
    if local_dir is not None:
        local_dir = Path(local_dir)
    for obj in bucket.objects.filter(Prefix=s3_path):
        target = Path(obj.key) if local_dir is None else local_dir / Path(obj.key).relative_to(s3_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, str(target))
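The urlparse calls above are what split the URI into bucket and key prefix; this split can be verified without AWS. The URI below is made up for illustration.

```python
from urllib.parse import urlparse

s3_uri = "s3://my-bucket/models/v2/"   # hypothetical URI
parsed = urlparse(s3_uri)

print(parsed.hostname)          # my-bucket   (becomes the bucket name)
print(parsed.path.lstrip('/'))  # models/v2/  (becomes the key prefix)
```

Note that urlparse treats the bucket as the network location of the URI, which is why .hostname returns it; S3 bucket names are lowercase, so the lowercasing done by .hostname is harmless here.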

5

With boto3, you can set AWS credentials and download a dataset from S3:

import boto3
import os 

# set aws credentials 
s3r = boto3.resource('s3', aws_access_key_id='xxxxxxxxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
bucket = s3r.Bucket('bucket_name')

# downloading folder
prefix = 'dirname'
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/'):  # skip folder placeholder keys
        continue
    if os.path.dirname(obj.key):
        os.makedirs(os.path.dirname(obj.key), exist_ok=True)
    bucket.download_file(obj.key, obj.key)

If you cannot find your access_key and secret_access_key, refer to this page.
Hope this helps.
Thank you.


2
It's best to avoid putting keys in code files. At worst, put them in a separate protected file and import them. You can also use boto3 without caching any credentials and instead use s3fs, or rely solely on config files (https://www.reddit.com/r/aws/comments/73212m/has_anyone_found_a_way_to_hide_boto3_credentials/). - Zach Rieck
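As the comment suggests, a safer alternative to hardcoding keys is the shared credentials file that boto3 reads automatically. A minimal ~/.aws/credentials sketch (the key values are placeholders):

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

With this file in place, boto3.resource('s3') picks up the default profile and no secrets appear in source code.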

3

You can call the awscli cp command from Python to download an entire folder.

import os
import subprocess

remote_folder_name = 's3://my-bucket/my-dir'
local_path = '.'
if not os.path.exists(local_path):
    os.makedirs(local_path)
subprocess.run(['aws', 's3', 'cp', remote_folder_name, local_path, '--recursive'])

A few notes about this solution:

  1. You should install awscli (pip install awscli) and configure it. More information here.
  2. If you don't want to overwrite existing files that haven't changed, use sync instead of cp: subprocess.run(['aws', 's3', 'sync', remote_folder_name, local_path])
  3. Tested on Python 3.6. On earlier versions of Python you may need to replace subprocess.run with subprocess.call or os.system.
  4. The cli command this code executes is aws s3 cp s3://my-bucket/my-dir . --recursive

1

I ran into some problems with this version, so I modified the destination variable and added a variable to filter by file type.

import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en, datatype):
    """ Download the contents of a folder directory into a local area """

    success = True
    # Start of the copy process
    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    # Generator that lists all objects in the bucket, following pagination.
    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket)

    # Loop over the S3 objects
    for obj in all_s3_objects_gen:
        source = obj['Key']
        if source.startswith(s3_folder):
            # Translate the key into a Windows-style local path
            destination = path.join(local_dir, source).replace('/', '\\')

            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                # Only download files matching the requested extension
                if destination.endswith(datatype):
                    s3_client.download_file(aws_bucket, source, destination)
                    print('File "%s" copied successfully' % destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print(f"[DEBUG] Downloading: {source} --> {destination}")

    return success

1

I wrote a script to download files with a given extension (.csv in the code); you can change the extension to match the file type you need to download.

import boto3
import os
import shutil

session = boto3.Session(
    aws_access_key_id='',
    aws_secret_access_key='',
)


def download_directory(bucket_name, s3_folder_name):
    s3_resource = session.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    objs = list(bucket.objects.filter(Prefix=s3_folder_name))
    for obj in objs:
        print("Trying to download " + obj.key)
        if os.path.dirname(obj.key) and not os.path.exists(os.path.dirname(obj.key)):
            os.makedirs(os.path.dirname(obj.key))
        out_name = obj.key.split('/')[-1]
        if out_name.endswith(".csv"):
            bucket.download_file(obj.key, out_name)
            print(f"Downloaded {out_name}")
            dest_path = ('/').join(obj.key.split('/')[0:-1])
            shutil.move(out_name, dest_path)
            print(f"Moved File to {dest_path}")
        else:
            print(f"Skipping {out_name}")


download_directory("mybucket", "myfolder")

If you are not sure exactly how to do this, feel free to ask me for help.
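The extension check in this answer can be done a little more robustly with pathlib, which also ignores folder-marker keys for free (their suffix is empty). The keys below are made up for illustration.

```python
from pathlib import Path

# Hypothetical object keys from a bucket listing
keys = ["data/a.csv", "data/b.txt", "data/sub/c.csv", "data/sub/"]

# Path(...).suffix is '' for folder markers and '.txt'/'.csv' etc. for files
csv_keys = [k for k in keys if Path(k).suffix == ".csv"]
print(csv_keys)  # ['data/a.csv', 'data/sub/c.csv']
```

Unlike out_name[-4:] == ".csv", this also behaves correctly for names shorter than four characters.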


1
The solutions above are good and rely on the S3 resource.
The following solution achieves the same goal using the s3_client.
You may find it useful (I have tested it and it works well).
import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en):
    """ Download the contents of a folder directory into a local area """

    success = True

    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket)

    for obj in all_s3_objects_gen:
        source = obj['Key']
        if source.startswith(s3_folder):
            destination = path.join(local_dir, source)
            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                s3_client.download_file(aws_bucket, source, destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print('[DEBUG] Downloading: %s --> %s' % (source, destination))

    return success
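The get_all_s3_objects generator above follows list_objects_v2 pagination via IsTruncated and NextContinuationToken. Its loop can be exercised with a hypothetical stub client that serves two keys per page; no AWS access is needed.

```python
class FakeS3Client:
    """Stand-in for an S3 client, paginating a fixed key list."""
    def __init__(self, keys, page_size=2):
        self.keys = keys
        self.page_size = page_size

    def list_objects_v2(self, MaxKeys=1000, ContinuationToken=None, **kwargs):
        start = int(ContinuationToken or 0)
        page = self.keys[start:start + self.page_size]
        end = start + len(page)
        resp = {'Contents': [{'Key': k} for k in page],
                'IsTruncated': end < len(self.keys)}
        if resp['IsTruncated']:
            resp['NextContinuationToken'] = str(end)
        return resp

def get_all_s3_objects(s3, **base_kwargs):
    # Same pagination loop as in the answer above
    continuation_token = None
    while True:
        list_kwargs = dict(MaxKeys=1000, **base_kwargs)
        if continuation_token:
            list_kwargs['ContinuationToken'] = continuation_token
        response = s3.list_objects_v2(**list_kwargs)
        yield from response.get('Contents', [])
        if not response.get('IsTruncated'):
            break
        continuation_token = response.get('NextContinuationToken')

client = FakeS3Client(['a', 'b', 'c', 'd', 'e'])
keys = [o['Key'] for o in get_all_s3_objects(client, Bucket='demo')]
print(keys)  # ['a', 'b', 'c', 'd', 'e']
```

This shows why the generator is needed at all: a single list_objects_v2 call returns at most MaxKeys objects, so large prefixes require following the continuation token.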

0

Here is my approach, inspired by the answers of konstantinos-katsantonis and bjc.

import os
import boto3
from operator import attrgetter
from pathlib import Path

def download_s3_dir(bucketName, remote_dir, local_dir):
    assert remote_dir.endswith('/')
    assert local_dir.endswith('/')
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName) 
    objs = bucket.objects.filter(Prefix=remote_dir)
    sorted_objs = sorted(objs, key=attrgetter("key"))
    for obj in sorted_objs:
        path = Path(os.path.dirname(local_dir + obj.key))
        path.mkdir(parents=True, exist_ok=True)
        if not obj.key.endswith("/"):
            bucket.download_file(obj.key, str(path) + "/" + os.path.split(obj.key)[1])

It didn't work for me. I get: `AssertionError Traceback (most recent call last) Input In [34], in <cell line: 1>() ----> 1 download_s3_dir(bucket_name, remote_folder_name, local_path)Input In [23], in download_s3_dir(bucketName, remote_dir, local_dir) 5 def download_s3_dir(bucketName, remote_dir, local_dir): 6 assert remote_dir.endswith('/') ----> 7 assert local_dir.endswith('/') 8 s3_resource = boto3.resource('s3') 9 bucket = s3_resource.Bucket(bucketName) AssertionError:` - user88484
@user88484 Make sure your remote_dir and local_dir end with '/'. - Greg7000
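The AssertionError reported in the comments can be avoided by normalizing the arguments before the checks; a hypothetical helper:

```python
def ensure_trailing_slash(p):
    """Append '/' if missing, so the assertions in download_s3_dir pass."""
    return p if p.endswith('/') else p + '/'

print(ensure_trailing_slash("my/remote/dir"))   # my/remote/dir/
print(ensure_trailing_slash("my/remote/dir/"))  # my/remote/dir/
```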

Content provided by Stack Overflow; translated from the original English post.