Asof join using Julia data tools

Question


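I would like to be able to do something like pandas' merge_asof or QuestDB's ASOF JOIN in Julia. Crucially, I also need to apply a group-by operation.

I'd be happy to use any of Julia's Tables.jl-compatible tools. DataFrames' leftjoin gets close, but it requires exact key matches and does not support grouping (as far as I can tell). SplitApplyCombine.jl's leftgroupjoin lets you pass in your own comparison function, but I don't quite see how to use that function to specify the "closest less than" or "closest greater than" value.

For a simple example, on two tables left and right that each have a time column, I could use the following function: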

function find_nearest_before(val, data)
    findlast(x -> x <= val, data)
end

[find_nearest_before(t, right.time) for t in left.time]

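This gives, for each row of left, the index of the matching row in right. However, I don't quite see how to combine this with grouping.

EDIT

Adding an example to make the question clearer. The first table, sensor_pings, reports when a sensor sees something. The second table, in_sensor_FOV, tells us which object was actually in a sensor's field of view (FOV) at a given time. Assume a sensor has only one object in its FOV at a time (the converse is not necessarily true).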

julia> using TypedTables

julia> sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])
Table with 2 columns and 10 rows:
      time  sensor_id
    ┌────────────────
 1  │ 4     2
 2  │ 5     1
 3  │ 7     1
 4  │ 8     3
 5  │ 9     2
 6  │ 10    3
 7  │ 11    1
 8  │ 13    2
 9  │ 15    3
 10 │ 16    2

julia> in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])
Table with 3 columns and 8 rows:
     time  sensor_id  object_in_sensor_FOV
   ┌──────────────────────────────────────
 1 │ 1.3   3          a
 2 │ 2.6   1          b
 3 │ 3.8   2          c
 4 │ 5.9   3          b
 5 │ 7.3   2          c
 6 │ 8.0   2          a
 7 │ 12.3  3          c
 8 │ 14.7  1          b

The end result of the desired operation would look like this:

julia> Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2], object_in_sensor_FOV=[:c,:b,:b,:b,:a,:b,:b,:a,:c,:a])

Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

- natemcintosh

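Why can't you join the tables with inexact matching and then perform the group-by on the left key? That should be the same as doing it in one step. - Andrej Oskin

A good idea, but the grouping has to happen first. Imagine my right table has the columns time, region, and item, and I want to group by region. If I do the inexact match first, I could match a row from one region with a row from another. - natemcintosh

2 Answers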


Writing something like this is fairly easy; you just need to implement a double cursor.

using TypedTables
using Setfield

sensor_pings = Table(time=[4,5,7,8,9,10,11,13,15,16], sensor_id=[2,1,1,3,2,3,1,2,3,2])

in_sensor_FOV = Table(time=[1.3,2.6,3.8,5.9,7.3,8.0,12.3,14.7], sensor_id=[3,1,2,3,2,2,3,1], object_in_sensor_FOV=[:a,:b,:c,:b,:c,:a,:c,:b])

function mergeasof(t1, t2)
    objects = similar(t2.object_in_sensor_FOV, length(t1.time))
    d = ntuple(_ -> :z, 3) # :z is a sentinel value, means that there were no objects up to this moment. Can be anything
    i2 = 1
    # Double cursor
    for i1 in axes(t1, 1)
        tm1 = t1.time[i1]
        # updating `d` to the current time step 
        while i2 <= length(t2.time)
            t2.time[i2] > tm1 && break
            @set! d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = d[t1.sensor_id[i1]]
    end

    return Table(time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end

julia> mergeasof(sensor_pings, in_sensor_FOV)
Table with 3 columns and 10 rows:
      time  sensor_id  object_in_sensor_FOV
    ┌──────────────────────────────────────
 1  │ 4     2          c
 2  │ 5     1          b
 3  │ 7     1          b
 4  │ 8     3          b
 5  │ 9     2          a
 6  │ 10    3          b
 7  │ 11    1          b
 8  │ 13    2          a
 9  │ 15    3          c
 10 │ 16    2          a

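It should be quite fast, and it can be adapted to an arbitrary number of columns (it is just more tedious).

A few caveats, though:

This function requires both tables to be sorted by time.
It can be adapted to search forward, but that may be more tedious.
I relied on the fact that there are three sensors. If the number of sensors is known in advance, pass it to ntuple. If it is unknown, large, or the ids are arbitrary, you can use a Dict instead of ntuple:

d = Dict{Int, Symbol}()

The @set! should then be removed:

d[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]

and instead of

objects[i1] = d[t1.sensor_id[i1]]

you should use

objects[i1] = get(d, t1.sensor_id[i1], :z)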
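
Putting the Dict changes together, and trying the forward direction as well, might look like the sketch below. It is not part of the original answer: the names asof_backward and asof_forward are made up for illustration, the tables are plain NamedTuples of vectors rather than TypedTables, and both are assumed to be sorted by time.

```julia
# Sketch only: `asof_backward` and `asof_forward` are hypothetical names.
# Tables are NamedTuples of vectors, both assumed sorted by `time`.
sensor_pings = (time = [4, 5, 7, 8, 9, 10, 11, 13, 15, 16],
                sensor_id = [2, 1, 1, 3, 2, 3, 1, 2, 3, 2])
in_sensor_FOV = (time = [1.3, 2.6, 3.8, 5.9, 7.3, 8.0, 12.3, 14.7],
                 sensor_id = [3, 1, 2, 3, 2, 2, 3, 1],
                 object_in_sensor_FOV = [:a, :b, :c, :b, :c, :a, :c, :b])

# Backward search: latest t2 row at or before each t1 time, per sensor.
function asof_backward(t1, t2; default = :z)
    objects = Vector{Symbol}(undef, length(t1.time))
    latest = Dict{Int, Symbol}()   # sensor_id => most recent object so far
    i2 = 1
    for i1 in eachindex(t1.time)
        # Consume every t2 row whose time is <= the current t1 time.
        while i2 <= length(t2.time) && t2.time[i2] <= t1.time[i1]
            latest[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 += 1
        end
        objects[i1] = get(latest, t1.sensor_id[i1], default)
    end
    return (time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end

# Forward search: earliest t2 row at or after each t1 time, per sensor.
# Same double cursor, but both tables are walked from the end.
function asof_forward(t1, t2; default = :z)
    objects = Vector{Symbol}(undef, length(t1.time))
    upcoming = Dict{Int, Symbol}()  # sensor_id => nearest object still ahead
    i2 = length(t2.time)
    for i1 in reverse(eachindex(t1.time))
        # Consume every t2 row whose time is >= the current t1 time.
        while i2 >= 1 && t2.time[i2] >= t1.time[i1]
            upcoming[t2.sensor_id[i2]] = t2.object_in_sensor_FOV[i2]
            i2 -= 1
        end
        objects[i1] = get(upcoming, t1.sensor_id[i1], default)
    end
    return (time = t1.time, sensor_id = t1.sensor_id, object_in_sensor_FOV = objects)
end

back = asof_backward(sensor_pings, in_sensor_FOV)
fwd = asof_forward(sensor_pings, in_sensor_FOV)
```

On the example data, back.object_in_sensor_FOV reproduces the column shown above, while fwd.object_in_sensor_FOV contains the sentinel :z for pings that have no later row for the same sensor.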

- Andrej Oskin


- Nils Gudat · Accepted Answer

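Here is one way to do it with DataFrames. It is certainly not the peak of efficiency, but if your data is small enough that you can afford the first leftjoin, it may be good enough.

Start by joining in_sensor_FOV onto sensor_pings (this assumes DataFrames is loaded and both tables are DataFrames, for example via DataFrame(sensor_pings)):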

julia> df = leftjoin(sensor_pings, in_sensor_FOV, on = :sensor_id, makeunique = true);

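This gives you multiple rows for each ping in sensor_pings, one per in_sensor_FOV row with the same sensor_id, which is where this approach might fail if your data is large.

Then get the time difference: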

julia> transform!(df, [:time, :time_1] => ((x, y) -> x - y) => :time_diff);

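Now your findlast approach, iiuc, means we only consider rows where the time difference is positive. (The filter below is strict; findlast with x <= val also accepts an exact time match, which would correspond to .>= 0.)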

julia> df = df[df.time_diff .> 0.0, :];

Then we sort by sensor and time difference, group by sensor and ping time, and pick the first row of each group:

julia> res = combine(groupby(sort(df, [:sensor_id, :time_diff]), [:sensor_id, :time]), names(df[:, Not([:sensor_id, :time])]) .=> first .=> names(df[:, Not([:sensor_id, :time])]));

Result (sorted to produce the same output):

julia> sort(select(res, [:time, :sensor_id, :object_in_sensor_FOV]), :time)
10×3 DataFrame
 Row │ time   sensor_id  object_in_sensor_FOV 
     │ Int64  Int64      Symbol               
─────┼────────────────────────────────────────
   1 │     4          2  c
   2 │     5          1  b
   3 │     7          1  b
   4 │     8          3  b
   5 │     9          2  a
   6 │    10          3  b
   7 │    11          1  b
   8 │    13          2  a
   9 │    15          3  c
  10 │    16          2  a
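
If the intermediate join is too large, the same "latest row at or before the ping, within the same sensor" rule can be sketched without DataFrames. This is a different technique from the steps above: it groups the right table by sensor first and then does a binary search per ping with Base's searchsortedlast. The function name asof_by_group and the :z default are assumptions for illustration, and the tables are plain NamedTuples of vectors.

```julia
# Sketch only: `asof_by_group` is a hypothetical helper, not a library function.
sensor_pings = (time = [4, 5, 7, 8, 9, 10, 11, 13, 15, 16],
                sensor_id = [2, 1, 1, 3, 2, 3, 1, 2, 3, 2])
in_sensor_FOV = (time = [1.3, 2.6, 3.8, 5.9, 7.3, 8.0, 12.3, 14.7],
                 sensor_id = [3, 1, 2, 3, 2, 2, 3, 1],
                 object_in_sensor_FOV = [:a, :b, :c, :b, :c, :a, :c, :b])

function asof_by_group(left, right; default = :z)
    # Group first: per sensor, collect times and objects in time order.
    times = Dict{Int, Vector{Float64}}()
    objs = Dict{Int, Vector{Symbol}}()
    for i in sortperm(right.time)
        id = right.sensor_id[i]
        push!(get!(() -> Float64[], times, id), right.time[i])
        push!(get!(() -> Symbol[], objs, id), right.object_in_sensor_FOV[i])
    end
    # Then match: binary search within the ping's own sensor group only.
    return map(left.time, left.sensor_id) do t, id
        haskey(times, id) || return default
        k = searchsortedlast(times[id], t)  # last index with time <= t, 0 if none
        k == 0 ? default : objs[id][k]
    end
end

matched = asof_by_group(sensor_pings, in_sensor_FOV)
```

Because the grouping happens before the inexact match, a ping can never be paired with a row from another sensor, and only the right table needs to be sorted (which the helper does itself via sortperm).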