Get non-ASCII filename from S3 notification event in Lambda

Question

python-2.7 amazon-s3 utf-8 aws-lambda python-unicode

9

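The key field in an AWS S3 notification event, which denotes the filename, is URL-escaped.

This is evident when the filename contains spaces or non-ASCII characters.

For example, I uploaded the following filename to S3: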

my file řěąλλυ.txt

The notification is received as:

{
  "Records": [
    {
      "s3": {
        "object": {
          "key": u"my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"
        }
      }
    }
  ]
}

I tried to decode it using:

key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf-8')

But that yields:

my file ÅÄÄÎ»Î»Ï.txt

When I then try to get the file from S3 using Boto, I get a 404 error.

- Alastair McCormack

3 Answers

8

If anyone else comes here looking for a JavaScript solution, here is what I ended up with:

function decodeS3EventKey (key = '') {
  return decodeURIComponent(key.replace(/\+/g, ' '))
}

With limited testing, it seems to work fine:

test+image+%C3%BCtf+%E3%83%86%E3%82%B9%E3%83%88.jpg decodes to test image ütf テスト.jpg
my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt decodes to my file řěąλλυ.txt

- Marco Lüthy

I spent a lot of time debugging the "space" issue. This confirmed that what I had found was correct. Thanks a lot! - vincent
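
For comparison, the same two-step idea (turn + into a space, then percent-decode as UTF-8) can be sketched in Python 3 with the standard library. This is only an illustration: the key is the one from the question, and the %2B case assumes a literal plus sign arrives percent-encoded.

```python
from urllib.parse import unquote, unquote_plus

key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# unquote() decodes the %XX escapes but leaves '+' untouched.
with_plus = unquote(key)

# unquote_plus() also turns '+' into a space, like the replace() call above.
decoded = unquote_plus(key)

# A percent-encoded plus (%2B) survives as a literal '+'.
literal_plus = unquote_plus("a%2Bb+c")

print(with_plus)
print(decoded)
print(literal_plus)
```

Using unquote() alone is what leaves the stray plus signs in the filename.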

4

For Python 3:

from urllib.parse import unquote_plus
result = unquote_plus('input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf')
print(result)

# prints 'input/пустой.pdf'

- valex
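
A minimal sketch of how that call might sit inside a Python 3 handler, in case it helps. The function name object_locations, the bucket name example-bucket and the sample event are made up for illustration; the bucket field is assumed from the usual S3 event layout and is not shown in the question.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Return (bucket, decoded key) pairs for every record in an S3 event."""
    locations = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # The key arrives URL-encoded; '+' stands for a space.
        key = unquote_plus(s3["object"]["key"])
        locations.append((bucket, key))
    return locations


# Hypothetical event, trimmed to the fields used here.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {
                    "key": "input/%D0%BF%D1%83%D1%81%D1%82%D0%BE%D0%B8%CC%86.pdf"
                },
            }
        }
    ]
}

print(object_locations(sample_event))
```

The decoded key is what you would then pass to the S3 client when fetching the object.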

- Alastair McCormack · Accepted Answer

Summary

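You need to convert the URL-encoded Unicode string to a byte str before URL-unquoting it, and then decode the result as UTF-8.

For example, for an S3 object with the filename my file řěąλλυ.txt: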

>>> import urllib

# Encode the Unicode string to a UTF-8 encoded [byte] string. The key shouldn't
# contain any non-ASCII at this point, but UTF-8 will be safer.
>>> utf8_urlencoded_key = event['Records'][0]['s3']['object']['key'].encode('utf-8')
>>> utf8_urlencoded_key
'my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt'

# The URL-escaped UTF-8 sequences are now converted to UTF-8 bytes.
# If you passed a Unicode object to unquote_plus, you'd have got a
# Unicode object containing the UTF-8 encoded bytes!
>>> key_utf8 = urllib.unquote_plus(utf8_urlencoded_key)
>>> key_utf8
'my file \xc5\x99\xc4\x9b\xc4\x85\xce\xbb\xce\xbb\xcf\x85.txt'

# Decode key_utf8 to a Unicode string. Note the u prefix: the UTF-8 bytes
# have been decoded to Unicode code points.
>>> key = key_utf8.decode('utf-8')
>>> key
u'my file \u0159\u011b\u0105\u03bb\u03bb\u03c5.txt'

>>> type(key)
<type 'unicode'>

>>> print(key)
my file řěąλλυ.txt
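
For readers on Python 3, the same three steps can be made visible with urllib.parse.unquote_to_bytes. This is a rough sketch, not what the session above ran; in practice unquote_plus does all of it in one call.

```python
from urllib.parse import unquote_to_bytes

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Step 1: undo the '+' for space substitution.
spaced = raw_key.replace("+", " ")

# Step 2: turn each %XX escape into the raw byte it stands for.
key_bytes = unquote_to_bytes(spaced)

# Step 3: interpret those bytes as UTF-8 to get the real filename.
key = key_bytes.decode("utf-8")

print(key_bytes)
print(key)
```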

Background

AWS has committed the cardinal sin of changing the default encoding - https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

The error you should have got from your decode() call is:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-19: ordinal not in range(128)

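The value of key is a Unicode object. In Python 2.x you can decode a Unicode object, even though it makes no sense. When you do, Python first tries to encode it to a [byte] str and then decodes that using the given encoding. In Python 2.x the default encoding should be ASCII, which of course cannot represent the characters used.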
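
The failure can be approximated in Python 3, which may make the mechanism easier to see. This sketch assumes that passing encoding="latin-1" to unquote_plus is a fair stand-in for what Python 2 did with a Unicode argument (one character per escaped byte); the explicit encode("ascii") plays the role of the implicit encoding step.

```python
from urllib.parse import unquote_plus

raw_key = "my+file+%C5%99%C4%9B%C4%85%CE%BB%CE%BB%CF%85.txt"

# Treat every %XX escape as one Latin-1 character, mimicking Python 2's
# unquote_plus when it was handed a unicode object.
mis_decoded = unquote_plus(raw_key, encoding="latin-1")

# Python 2 implicitly encoded the string with the default codec before
# decoding it. With ASCII as the default, that step fails.
try:
    mis_decoded.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)

# The damage is reversible: recover the bytes, then decode them as UTF-8.
repaired = mis_decoded.encode("latin-1").decode("utf-8")
print(repaired)
```

With the default encoding switched to UTF-8, the implicit encode succeeds instead of raising, so the mis-decoded string passes through silently and you end up with the garbled filename from the question.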

Had you got the proper UnicodeEncodeError from Python, you might have found suitable answers. On Python 3, you would not have been able to call .decode() on a str at all.