Aggregate a field in elasticsearch-dsl using Python

Question

19

Can someone tell me how to write Python statements that will aggregate (sum and count) the contents of my documents?

Script

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="attendance")
s = s.execute()

for tag in s.aggregations.per_tag.buckets:
    print (tag.key)

Output

File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'

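What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count the documents that have a value for that field?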

- VISQL

1

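Did you manage to answer your own question? I'm facing exactly the same problem now: I can't work out how to do a count aggregation in elasticsearch-dsl. - Jacobian

Yes, I've gotten over a few hurdles since then. With help from the coders of the DSL, I'm doing this with what I consider a workaround in Python. Unfortunately I haven't had time to do it the pure-DSL way, and I keep leveraging to_dict instead. I'll try to paste a good example. - VISQL

2 Answers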

2

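I don't have the rep to comment yet, but I wanted to make a small correction to Matthew's comment on VISQL's answer regarding from_dict. If you want to keep the search properties, use update_from_dict rather than from_dict.

According to the docs, from_dict creates a new search object, whereas update_from_dict modifies the existing one in place, which is what you want if the Search already has properties such as index, using, etc.

So you would declare the query body before the search and then create the search like this: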

query_body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

s = Search(using=client, index="airbnb", doc_type="sleep_overs").update_from_dict(query_body)

- ekmcd

- VISQL · Accepted Answer

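First of all, I notice now that what I wrote above doesn't actually define any aggregations. The documentation on how to use them is not very readable to me. I'll expand on what I wrote above, and I'm changing the index name to make for a nicer example.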

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="airbnb", doc_type="sleep_overs")
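# Don't execute yet: execute() returns a Response, and the aggregation
# below has to be added to the Search object first.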

# invalid! You haven't defined an aggregation.
#for tag in s.aggregations.per_tag.buckets:
#    print (tag.key)

# Lets make an aggregation
# 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator
# 'field' is also a keyword, and 'house_number' is a field in our ES index
s.aggs.bucket('by_house', 'terms', field='house_number', size=0)

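We're creating one bucket per house number, so the key of each bucket will be the house number. ElasticSearch (ES) always returns the count of documents that fall into each bucket. size=0 means return all buckets, since ES by default returns only 10 (or whatever your dev set it up to do).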

# This runs the query.
s = s.execute()

# let's see what's in our results

# The terms aggregation itself has no doc_count; each bucket does.
print(s.hits.total)
print(s.aggregations.by_house.buckets)

for item in s.aggregations.by_house.buckets:
    print(item.doc_count)
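
To make the bucket idea concrete, here is a minimal local sketch of what a terms aggregation computes. It never talks to Elasticsearch: the sample documents and the helper name terms_buckets are made up for illustration, and a real cluster has ordering and sharding details that this ignores.

```python
from collections import Counter

# Hypothetical documents standing in for the "airbnb" index.
docs = [
    {"house_number": 12, "guest": "ann"},
    {"house_number": 12, "guest": "bob"},
    {"house_number": 7, "guest": "cy"},
    {"guest": "dee"},  # no house_number, so it lands in no bucket
]


def terms_buckets(documents, field):
    """Group documents by `field` and count them, mimicking the shape of terms buckets."""
    counts = Counter(d[field] for d in documents if field in d)
    # most_common() orders by descending count
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


buckets = terms_buckets(docs, "house_number")
for bucket in buckets:
    print("%s %s" % (bucket["key"], bucket["doc_count"]))
```

Adding up the doc_count values in this sketch gives the number of documents that have the field at all, which is the kind of count the question asks about for emailAddress.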

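My mistake before was thinking that an Elasticsearch query has aggregations by default. In fact you define them yourself and then execute the query. Your response can then be split up by the aggregations you named.

The CURL for the above should look like this:
Note: I use SENSE, an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.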

POST /airbnb/sleep_overs/_search
{
// the size 0 here actually means to not return any hits, just the aggregation part of the result
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
// the size 0 here means to return all results, not just the default 10 results
                "field": "house_number",
                "size": 0
            }
        }
    }
}
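
The question also asks about sums, which the body above doesn't cover. As a rough sketch, the same body can be built as a Python dict with a sum metric nested inside the terms bucket. The helper name terms_with_sum and the numeric field "nights" are assumptions for illustration and not part of the original index. This version leaves the inner terms size at its default, so add the setting from the body above if you need it.

```python
import json


def terms_with_sum(bucket_name, bucket_field, sum_name, sum_field):
    """Return a search body: one terms bucket per value, with a sum metric inside each bucket."""
    return {
        "size": 0,  # no hits, only the aggregation part of the result
        "aggs": {
            bucket_name: {
                "terms": {"field": bucket_field},
                # sub-aggregation: computed separately for every bucket
                "aggs": {
                    sum_name: {"sum": {"field": sum_field}},
                },
            }
        },
    }


# "nights" is a hypothetical numeric field used only for this example.
body = terms_with_sum("by_house", "house_number", "total_nights", "nights")
print(json.dumps(body, indent=2))
```

A dict like this can be handed to the Search object through from_dict or update_from_dict, as in the other snippets in this thread.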

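Work-around. Someone on the DSL forum told me to forget about translating the query and just use this method. It's simpler, and you only have to write the tough part in CURL. That's why I call it a work-around.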

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# how simple: we just paste the CURL body here
body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

s = Search.from_dict(body)
s = s.index("airbnb")
s = s.doc_type("sleep_overs")
body = s.to_dict()

t = s.execute()

for item in t.aggregations.by_house.buckets:
    # item.key will be the house number
    print("%s %s" % (item.key, item.doc_count))

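Hope this helps. I now design everything in CURL, then use Python statements to pick the results apart and get what I want. This is very helpful for aggregations with multiple levels (sub-aggregations).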
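
For the multi-level case, here is a small sketch of picking apart a nested result with plain Python. The response dict is hand-written to resemble the "aggregations" part of a response converted with to_dict(). The values, the total_nights metric and the summarize helper are all made up for illustration.

```python
# Hypothetical response for a terms bucket with a sum sub-aggregation.
response = {
    "aggregations": {
        "by_house": {
            "buckets": [
                {"key": 12, "doc_count": 2, "total_nights": {"value": 5.0}},
                {"key": 7, "doc_count": 1, "total_nights": {"value": 3.0}},
            ]
        }
    }
}


def summarize(resp, bucket_name, metric_name):
    """Return (key, doc_count, metric value) for every bucket of one aggregation."""
    buckets = resp["aggregations"][bucket_name]["buckets"]
    return [(b["key"], b["doc_count"], b[metric_name]["value"]) for b in buckets]


rows = summarize(response, "by_house", "total_nights")
for key, count, total in rows:
    print("%s %s %s" % (key, count, total))
```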