I have a bucket on S3 containing a large number of text files.
I want to search those text files for certain text. They contain only raw data, and each text file has a different name.
For example, my bucket contains objects such as:
abc/myfolder/abac.txt
xyx/myfolder1/axc.txt
and I want to search the above text files for the text "I am human".
How can I do this? Is it even possible?
The only way to do this is with CloudSearch, which can use S3 as a data source. It builds an index to enable fast retrieval. This should work well, but double-check the pricing model to make sure it won't be too expensive for you.
The other option is what Jack said - otherwise, you would need to transfer the files from S3 to EC2 and build a search application there.
Download the module: https://github.com/mixpeek/mixpeek-python
Import the module and your API keys:
from mixpeek import Mixpeek, S3
from config import mixpeek_api_key, aws
Instantiate the S3 class (which uses boto3 and requests):
s3 = S3(
    aws_access_key_id=aws['aws_access_key_id'],
    aws_secret_access_key=aws['aws_secret_access_key'],
    region_name='us-east-2',
    mixpeek_api_key=mixpeek_api_key
)
Upload one or more existing S3 files:
# upload all S3 files in bucket "demo"
s3.upload_all(bucket_name="demo")
# upload one single file called "prescription.pdf" in bucket "demo"
s3.upload_one(s3_file_name="prescription.pdf", bucket_name="demo")
Now simply search using the Mixpeek module:
# mixpeek api direct
mix = Mixpeek(
    api_key=mixpeek_api_key
)
# search
result = mix.search(query="Heartgard")
print(result)
Where result can be:
[
  {
    "_id": "REDACTED",
    "api_key": "REDACTED",
    "highlights": [
      {
        "path": "document_str",
        "score": 0.8759502172470093,
        "texts": [
          {
            "type": "text",
            "value": "Vetco Prescription\nVetcoClinics.com\n\nCustomer:\n\nAddress: Canine\n\nPhone: Australian Shepherd\n\nDate of Service: 2 Years 8 Months\n\nPrescription\nExpiration Date:\n\nWeight: 41.75\n\nSex: Female\n\n℞ "
          },
          {
            "type": "hit",
            "value": "Heartgard"
          },
          {
            "type": "text",
            "value": " Plus Green 26-50 lbs (Ivermectin 135 mcg/Pyrantel 114 mg)\n\nInstructions: Give one chewable tablet by mouth once monthly for protection against heartworms, and the treatment and\ncontrol of roundworms, and hookworms. "
          }
        ]
      }
    ],
    "metadata": {
      "date_inserted": "2021-10-07 03:19:23.632000",
      "filename": "prescription.pdf"
    },
    "score": 0.13313256204128265
  }
]
Then you parse the result.
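As a sketch of that parsing step (assuming the result has the shape shown above; `summarize_hits` is a hypothetical helper, not part of the Mixpeek module), you could pull out each filename and the matched terms like this:

```python
# Sketch: extract filenames and matched terms from a Mixpeek-style
# search result. The structure is assumed from the sample response above.
def summarize_hits(results):
    summary = []
    for doc in results:
        filename = doc["metadata"]["filename"]
        # keep only the "hit" fragments from each highlight
        hits = [
            fragment["value"]
            for highlight in doc["highlights"]
            for fragment in highlight["texts"]
            if fragment["type"] == "hit"
        ]
        summary.append({"filename": filename, "hits": hits})
    return summary
```

For the sample response above, this would yield `[{"filename": "prescription.pdf", "hits": ["Heartgard"]}]`.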
You can use Filestash (disclaimer: I am the author): install your own instance and connect it to your S3 bucket. If you have a large amount of data, give it some time to index the entire contents, and then you're good to go.
There is now a serverless and cheaper option available.
I would suggest you put the data into Parquet format in S3; it makes the data on S3 very small and super fast to query!
I know this is old, but hopefully someone will find my solution handy.
Here is a Python script that uses boto3.
import boto3

def search_word(info, search_for):
    # True if the search string occurs anywhere in the file body
    return search_for in info

aws_access_key_id = 'AKIAWG....'
aws_secret_access_key = 'p9yrNw.....'

client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)

bucket_name = 'my.bucket.name'
bucket_prefix = '2022/05/'
search_for = 'looking@emailaddress.com'

search_results = []
search_results_keys = []

# Note: list_objects_v2 returns at most 1000 keys per call; use a
# paginator if the prefix contains more objects than that.
response = client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=bucket_prefix,
)

for i in response['Contents']:
    obj = client.get_object(Bucket=bucket_name, Key=i['Key'])
    body = obj['Body'].read().decode('utf-8')
    key = i['Key']
    if search_word(body, search_for):
        search_results.append({key: body})
        search_results_keys.append(key)

# You can either print the keys (file names/paths), or a map where the
# key is the file name/path and the value is the text of the file.
print(search_results)
print(search_results_keys)