How can I split an array into chunks using jq?

Question


23

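I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input is [1,2,3,4,5,6,7,8,9,10] and I want to split it into chunks of length 3. The desired output from jq would be: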

[1,2,3]
[4,5,6]
[7,8,9]
[10]

In reality, my input array has nearly three million elements, all of them UUIDs.

- Echo Nolan

1

jq 'group_by(. % 3)' <<< '[1,2,3,4,5,6,7,8,9,10]' splits the array into three groups. Now if only I could get the length of the full input array... - l0b0

5 Answers

4

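The following stream-oriented definition of window/3, due to Cédric Connes (github:connesc), generalizes _nwise and illustrates a "boxing technique" that avoids the need for an end-of-stream marker. It can therefore be used even if the stream contains the non-JSON value nan. A definition of _nwise/1 in terms of window/3 is also included.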

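The first argument of window/3 is interpreted as a stream. $size is the window size, and $step specifies the number of values to skip. For example: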

window(1,2,3; 2; 1)

yields:

[1,2]
[2,3]

window/3 and _nwise/1

def window(values; $size; $step):
  def checkparam(name; value):
    if (value | isnormal) and value > 0 and (value | floor) == value
    then .
    else error("window \(name) must be a positive integer")
    end;
  checkparam("size"; $size)
| checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
| foreach ((values | [.]), null) as $item (
    {index: -1, items: [], ready: false};
    (.index + 1) as $index
    # Extract items that must be reused from the previous iteration
    | if (.ready | not) then .items
      elif $step >= $size or $item == null then []
      else .items[-($size - $step):]
      end
    # Append the current item unless it must be skipped
    | if ($index % $step) < $size then . + $item
      else .
      end
    | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
    if .ready then .items else empty end
  );

def _nwise($n): window(.[]; $n; $n);
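
As a quick check of the definitions above, the following sketch runs the sliding-window example and the _nwise/1 wrapper on the sample array from the question. It assumes jq 1.5 or later on PATH and a POSIX shell. The shell variable names (defs, sliding, chunks) are illustrative only, and the definitions are repeated without their comments so that the snippet is self-contained.

```shell
# Definitions of window/3 and _nwise/1, repeated from above (comments removed).
defs='
def window(values; $size; $step):
  def checkparam(name; value): if (value | isnormal) and value > 0 and (value | floor) == value then . else error("window \(name) must be a positive integer") end;
  checkparam("size"; $size)
| checkparam("step"; $step)
| foreach ((values | [.]), null) as $item (
    {index: -1, items: [], ready: false};
    (.index + 1) as $index
    | if (.ready | not) then .items
      elif $step >= $size or $item == null then []
      else .items[-($size - $step):]
      end
    | if ($index % $step) < $size then . + $item
      else .
      end
    | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
    if .ready then .items else empty end
  );
def _nwise($n): window(.[]; $n; $n);
'

# Overlapping windows of size 2, advancing one value at a time.
sliding=$(jq -nc "${defs} window(1,2,3; 2; 1)")
printf '%s\n' "$sliding"

# Non-overlapping chunks of 3; this user-level _nwise/1 shadows the builtin.
chunks=$(jq -nc "${defs} [range(1;11)] | _nwise(3)")
printf '%s\n' "$chunks"
```

With step equal to size the windows do not overlap, which is why _nwise/1 can be written as a one-line wrapper around window/3.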

Source:

https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58

- peak

1

Readers: if you want streaming, you can replace the last line with def streamsplit($n): window(inputs | .[1]; $n; $n) and pass the --stream option. This keeps the program's peak memory usage to no more than 2844k. - Echo Nolan

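@EchoNolan: Please note that window/3 is somewhat slower than the stream-oriented nwise/2 given elsewhere on this page. The same technique of using "inputs" can be applied there, but in either case remember to use the -n command-line option. - peak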

3

Here is a simple approach that worked for me:

def chunk(n):
    range(length/n|ceil) as $i | .[n*$i:n*$i+n];

Example usage:

jq -n \
'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
[range(5)] | chunk(2)'
[
  0,
  1
]
[
  2,
  3
]
[
  4
]

Bonus: it does not use recursion and does not depend on _nwise, so it also works with jaq.
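
For comparison with the question, here is a minimal sketch that applies the same chunk/1 definition to the sample input with a chunk size of 3. It assumes jq is on PATH; the shell variable name chunked is illustrative only.

```shell
# chunk/1 as defined above, applied to the question's sample array.
# length/n|ceil gives the number of chunks (10/3 rounds up to 4),
# and each chunk is the slice starting at index n*$i.
chunked=$(printf '%s\n' '[1,2,3,4,5,6,7,8,9,10]' |
  jq -c 'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
         chunk(3)')
printf '%s\n' "$chunked"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

Because the whole array is read before slicing, this sketch suits arrays that fit in memory.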

- Nitsan Avni

2

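If the array is too large to fit comfortably in memory, then I would adopt the strategy suggested by @CharlesDuffy: stream the array elements into a second invocation of jq that uses a stream-oriented version of nwise. For example: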

def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
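    # The type check guards isnan, which raises an error on non-numeric values such as strings.
    if (.[-1] | (type == "number" and isnan)) and length>1 then .[:-1]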
    elif length == $n then .
    else empty
    end);

The "driver" for the above would be:

nwise(inputs; 3)

Remember to use the -n command-line option. To create a stream from an arbitrary array:

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json

So the shell pipeline might look like this:

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json |
  jq -n -f nwise.jq

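This approach is quite performant. When grouping a stream of 3 million items into groups of 3 using nwise/2, running the second invocation of jq under

/usr/bin/time -lp

gives: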

user         5.63
sys          0.04
   1261568  maximum resident set size

Caveat: this definition uses nan as the end-of-stream marker. Since nan is not a JSON value, this cannot cause problems when handling JSON streams.
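
To see the two-stage pipeline end to end, here is a small sketch that feeds the question's 10-element sample array through both jq invocations. It assumes jq is on PATH and that the array elements are numbers; the shell variable names nwise_def and grouped are illustrative only, and the definition is inlined instead of being read from nwise.jq.

```shell
# nwise/2 following the definition above; nan marks the end of the stream.
nwise_def='
def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    if (.[-1] | isnan) and length>1 then .[:-1]
    elif length == $n then .
    else empty
    end);
'

# Stage 1 streams the top-level array into one element per line.
# Stage 2 regroups those elements into arrays of 3.
grouped=$(printf '%s\n' '[1,2,3,4,5,6,7,8,9,10]' |
  jq -cn --stream 'fromstream( inputs | (.[0] |= .[1:]) | select(. != [[]]) )' |
  jq -cn "${nwise_def} nwise(inputs; 3)")
printf '%s\n' "$grouped"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

Neither stage holds the whole array at once: the first emits elements as they are parsed, and the second keeps at most one group in its foreach state.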

- peak

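If you think it's useful, I'll bring it back. - Charles Duffy

Since you were candid about the hackery, I'd suggest keeping it: it shows how to use "while", "try" and "input" :-) - peak

In my experiments, a trailing empty list seems to be emitted when . == [nan] (that is, when the total number of items is evenly divisible by $n). Perhaps the line that checks for the end of the stream should be if .[-1] | isnan then if (. | length) > 1 then .[:-1] else empty end? See https://gist.github.com/charles-dyfis-net/3f596d0aa19f187b80e6ee167dfc55c3 for the test procedure. - Charles Duffy

@CharlesDuffy - Thanks for catching this. I've fixed it in a slightly different way, in case it makes a difference to speed. - peak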

1

To split the output into separate files, change the second jq invocation to, for example, jq -cn -f nwise.jq | awk '{print > "doc00" NR ".json"}'; see https://dev59.com/Hqnka4cB1Zd3GeqPIA4-#48801628. - mkjeldsen

1

The following is hackery, to be sure, but it is memory-efficient hackery, even for an arbitrarily long list:

jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'

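The first part of the pipeline streams in your input JSON file, emitting one line per element, on the assumption that the array consists of atomic values (where [] and {} count as atomic values here). Because it runs in streaming mode, it does not need to hold the entire content in memory, even though the input is a single document.

The second part of the pipeline repeatedly reads up to three items and assembles them into a list.

This should avoid needing more than three pieces of data in memory at any one time.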
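
As an illustration, the same two commands can be run on the question's sample array in place of huge.json. This is a sketch that assumes jq is on PATH; the shell variable name triples is illustrative only.

```shell
# Part 1: --stream emits [path, leaf] events; keep the leaves, one per line.
# Part 2: for each item read by `inputs`, pull up to two more with `try input`.
#         At end of input `input` raises an error, which `try` turns into
#         `empty`, so the last list may be shorter than three items.
triples=$(printf '%s\n' '[1,2,3,4,5,6,7,8,9,10]' |
  jq -c --stream 'select(length==2)|.[1]' |
  jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])')
printf '%s\n' "$triples"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

The chunk size is fixed by the number of `try input` terms, so a different size means editing the list expression.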

- Charles Duffy

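Using while here leads to an error. It is better to write: jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'. - peak

I had been ignoring the error at the end of processing, but this is a clear improvement. Thanks for the refinement. - Charles Duffy

See elsewhere on this page for a jq invocation that streams arrays in general. - peak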


- peak · Accepted Answer

There is an (undocumented) builtin, _nwise, that meets the functional requirements:

$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'

[1,2,3]
[4,5,6]
[7,8,9]
[10]

Also:

$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)' 
[1,2,3]
[4,5,6]
[7,8,9]
[10]

Incidentally, _nwise can be used on both arrays and strings.

(I believe it is undocumented because there was some doubt about an appropriate name.)
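
To illustrate the string case, here is a minimal sketch. Since _nwise is undocumented, it assumes a jq release that still provides _nwise/1 (behavior may differ between versions); the shell variable name pieces is illustrative only.

```shell
# _nwise/1 slices its input with .[0:$n], which works on strings as well as arrays.
pieces=$(jq -nc '"abcdefgh" | _nwise(3)')
printf '%s\n' "$pieces"
# "abc"
# "def"
# "gh"
```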

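TCO version

Unfortunately, the builtin version is carelessly defined and will not perform well on large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):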

def nwise($n):
 def _nwise:
   if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
 _nwise;

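For an array of 3 million items, this is quite performant: 3.91s on an old Mac, with a maximum resident size of 162746368.

Note that this version (using tail-call optimized recursion) is actually faster than the foreach-based version of nwise/2 shown elsewhere on this page.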
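
A minimal usage sketch of the optimized nwise/1 on the question's sample array follows. It assumes jq is on PATH; the shell variable name result is illustrative only.

```shell
# nwise/1 as defined above: emit the first $n items, then recurse on the rest.
result=$(printf '%s\n' '[1,2,3,4,5,6,7,8,9,10]' |
  jq -c 'def nwise($n):
           def _nwise:
             if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
           _nwise;
         nwise(3)')
printf '%s\n' "$result"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

Unlike the streaming pipelines, this reads the whole array into memory first, which is consistent with the larger resident size reported above.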