Computing a running sum with a Hive UDF

Question

5

I am new to Hive, so I apologize in advance for any silly questions below. I have a table like this:

SELECT a.storeid, a.smonth, a.sales FROM table a;
1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

My target output is:

1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0
1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0

I wrote a simple Hive UDF sum class to achieve this and used SORT BY storeid, smonth in the query:

SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table SORT BY storeid, smonth) a;

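Obviously this does not produce the output above: there is only one mapper, the same UDF instance is called for every row, and so it produces a single running sum over the whole set. My goal is to reset the runningSum instance variable in the UDF class for each storeid, so that the evaluate function returns the output above. I have tried the following approaches:

1. Pass storeid as an extra argument, rsum(sales, storeid), so that the UDF class can handle the change of store itself.
2. Use two mappers, as in the following query: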

set mapred.reduce.tasks=2;
SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table DISTRIBUTE BY storeid SORT BY storeid, smonth) a;

1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0
1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0

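Why does 1002 always appear at the top? Besides the approaches above, I would appreciate suggestions for other ways of achieving the same result (such as subqueries or joins). Also, what is the time complexity of the approaches you suggest?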

- Code Warrior

4 Answers

4

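Alternatively, you can have a look at this Hive ticket, which contains several function extensions.
Among them is a cumulative sum implementation (GenericUDFSum).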

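This function (called "rsum") expects two arguments: the hash of the ID by which the records are partitioned across the reducers, and the corresponding value to be summed: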

select t.storeid, t.smonth, t.sales, rsum(hash(t.storeid),t.sales) as sales_sum 
  from (select storeid, smonth, sales from sm distribute by hash(storeid) 
    sort by storeid, smonth) t;

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0
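
The idea behind a two-argument rsum can be sketched in plain Java without any Hive dependency. This is only an illustration of the reset-on-key-change logic, not the code from the ticket: the class name KeyedRunningSum and the evaluate signature are assumptions, and a real Hive UDF would have to extend Hive's UDF classes. It also assumes that rows arrive grouped by key, which is what DISTRIBUTE BY plus SORT BY is meant to guarantee within each reducer.

```java
import java.util.Objects;

/** Hive-independent sketch of a keyed running sum; the class name is hypothetical. */
class KeyedRunningSum {
    private Object lastKey;     // key of the previous row
    private boolean started;    // false until the first row has been seen
    private double runningSum;  // sum accumulated for the current key

    /** Adds value to the running sum, restarting from zero whenever the key changes. */
    double evaluate(Object key, double value) {
        if (!started || !Objects.equals(key, lastKey)) {
            runningSum = 0.0;
            lastKey = key;
            started = true;
        }
        runningSum += value;
        return runningSum;
    }

    public static void main(String[] args) {
        // Rows in the order they would arrive after SORT BY storeid, smonth.
        int[] storeIds = {1001, 1001, 1001, 1002, 1002, 1002};
        double[] sales = {35000.0, 25000.0, 40000.0, 40000.0, 35000.0, 110000.0};
        KeyedRunningSum rsum = new KeyedRunningSum();
        for (int i = 0; i < storeIds.length; i++) {
            double total = rsum.evaluate(storeIds[i], sales[i]);
            System.out.println(storeIds[i] + "\t" + sales[i] + "\t" + total);
        }
    }
}
```

Because the state lives in instance variables, the result is only correct if all rows of one key reach the same instance consecutively. If the rows of a store were interleaved with those of another store, the sum would restart in the middle.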

- Lorand Bendig

0

This should do the trick:

SELECT 
    a.storeid, 
    a.smonth,
    a.sales,
    SUM(a.sales) 
OVER (
    PARTITION BY a.storeid 
    ORDER BY a.smonth asc 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM 
    table a;

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

- Grzegorz Skibinski

0

Select the storeid, smonth and sales fields from the table and, within each storeid group ordered by smonth, compute the running sum rsum of sales.
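
As a rough illustration of that description, here is a small Java sketch that sorts the rows by storeid and smonth and then accumulates sales within each store. It only models the logic on an in-memory list; the class and method names (WindowedRunningSum, Row, compute) are made up for this example and say nothing about how Hive actually executes such a query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** In-memory sketch of "running sum per store, ordered by month"; all names are hypothetical. */
class WindowedRunningSum {
    static final class Row {
        final int storeid;
        final int smonth;
        final double sales;
        double runningSum; // filled in by compute()

        Row(int storeid, int smonth, double sales) {
            this.storeid = storeid;
            this.smonth = smonth;
            this.sales = sales;
        }
    }

    /** Returns the rows sorted by (storeid, smonth) with runningSum set per store. */
    static List<Row> compute(List<Row> input) {
        List<Row> rows = new ArrayList<>(input); // copy the list so the caller's order is kept
        rows.sort(Comparator.comparingInt((Row r) -> r.storeid)
                            .thenComparingInt(r -> r.smonth));
        double sum = 0.0;
        for (int i = 0; i < rows.size(); i++) {
            Row r = rows.get(i);
            // Restart the sum at the first row of each store.
            if (i == 0 || rows.get(i - 1).storeid != r.storeid) {
                sum = 0.0;
            }
            sum += r.sales;
            r.runningSum = sum;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
            new Row(1001, 1, 35000.0), new Row(1002, 2, 35000.0),
            new Row(1001, 2, 25000.0), new Row(1002, 3, 110000.0),
            new Row(1001, 3, 40000.0), new Row(1002, 1, 40000.0));
        for (Row r : compute(table)) {
            System.out.println(r.storeid + "\t" + r.smonth + "\t" + r.sales + "\t" + r.runningSum);
        }
    }
}
```

In this sketch the cost is dominated by the sort, O(n log n) for n rows, followed by a single O(n) pass. The cost of a distributed Hive job depends on the execution plan and may look different.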

- Tutu Kumari

- Bector · Accepted Answer

Hive provides a better way to do this in a single line.
Follow the steps below to get your target output.

Create a Hive table that holds your dataset:

1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

Now just run the following command in your Hive terminal:

SELECT storeid, smonth, sales, SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) FROM table_name;

The output will look like this:

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0

I hope this helps you get the desired output.