Split an IEnumerable<T> into fixed-sized chunks (return an IEnumerable<IEnumerable<T>> where the inner sequences are of fixed length)

Question

63

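I want to take an IEnumerable<T> and split it up into fixed-sized chunks.

I have the following code, but it seems inelegant because of all the list creation/copying it requires: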

private static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    List<T> partition = new List<T>(partitionSize);
    foreach (T item in items)
    {
        partition.Add(item);
        if (partition.Count == partitionSize)
        {
            yield return partition;
            partition = new List<T>(partitionSize);
        }
    }
    // Cope with items.Count % partitionSize != 0
    if (partition.Count > 0) yield return partition;
}

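Is there something more idiomatic?

EDIT: Although this has been marked as a duplicate of "Divide array into an array of subsequence array", it is not: that question deals with splitting an array, whereas this one is about IEnumerable<T>. In addition, that question requires the last subsequence to be padded. The two questions are closely related but are not the same.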

- Alastair Maw

3

There is already a similar question here on Stack Overflow, with several different solutions: https://dev59.com/Sm865IYBdhLWcg3whe9f - Colin Pear

https://dev59.com/RnRC5IYBdhLWcg3wAcXU - Dzmitry Martavoi

Answers are no longer allowed here, but try this link: https://dev59.com/z3A75IYBdhLWcg3wlqOh#29462069 - MBoros

1

Here is an elegant lazy partitioner implemented with C# 7 local functions: https://gist.github.com/pmunin/533c10f0020b21230177cfb5a2d75bb4 - Philipp Munin

8 Answers

20

It feels like you want two iterator blocks ("yield return methods"). I wrote this extension method:

static class Extensions
{
  public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
  {
    return new PartitionHelper<T>(items, partitionSize);
  }

  private sealed class PartitionHelper<T> : IEnumerable<IEnumerable<T>>
  {
    readonly IEnumerable<T> items;
    readonly int partitionSize;
    bool hasMoreItems;

    internal PartitionHelper(IEnumerable<T> i, int ps)
    {
      items = i;
      partitionSize = ps;
    }

    public IEnumerator<IEnumerable<T>> GetEnumerator()
    {
      using (var enumerator = items.GetEnumerator())
      {
        hasMoreItems = enumerator.MoveNext();
        while (hasMoreItems)
          yield return GetNextBatch(enumerator).ToList();
      }
    }

    IEnumerable<T> GetNextBatch(IEnumerator<T> enumerator)
    {
      for (int i = 0; i < partitionSize; ++i)
      {
        yield return enumerator.Current;
        hasMoreItems = enumerator.MoveNext();
        if (!hasMoreItems)
          yield break;
      }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();      
    }
  }
}

- Jeppe Stig Nielsen

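Yes, that's exactly right, although takemyoxygen's answer is more concise, so I have accepted that one, despite the limitation you mention about calling MoveNext() multiple times. (I think most enumerators are happy with that, aren't they?) - Alastair Maw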

2

This really is the best solution available at the moment!!! I tried many others! Reasons: no side effects (see http://blogs.msdn.com/b/pfxteam/archive/2012/11/16/plinq-and-int32-maxvalue.aspx), lazy/streaming, fast and memory-efficient. - SalientBrain

4

Calling ToList() on the returned items misses the whole point of the question. - MBoros

2

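@SalientBrain Calling ToList on each batch has some impact on memory efficiency when the batches are large. Although this is the best solution I have seen, unfortunately I don't think a fully streaming solution is possible (i.e. one where both the batches and the items within each batch are streamed). - Steven Rands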

2

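I know this post is 4 years old, but would it be appropriate to replace a large part of this implementation with LINQ's Take and Skip extension methods? - Gusdor

Personally I'm not that keen on this one: the .ToList() is very inefficient here because the batch size (although known in advance!) isn't taken into account, so its internal array gets grown n times, with the copying that involves. If you want something more resilient than the accepted answer, the MoreLinq approach is definitely simpler and more efficient. - Alastair Maw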

15

Maybe?

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    return items.Select((item, inx) => new { item, inx })
                .GroupBy(x => x.inx / partitionSize)
                .Select(g => g.Select(x => x.item));
}

There is also an existing implementation: morelinq's Batch.

- L.B

I see that Batch does basically what I was doing: http://code.google.com/p/morelinq/source/browse/MoreLinq/Batch.cs (it just uses arrays internally rather than lists). - Alastair Maw

12

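This approach loads everything into memory before returning any results, and uses even more memory for the hash table that does the grouping, so its memory footprint is high. - Alastair Maw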

8

The craziest solution (using Reactive Extensions):

public static IEnumerable<IList<T>> Partition<T>(this IEnumerable<T> items, int partitionSize)
{
    return items
            .ToObservable() // Converting sequence to observable sequence
            .Buffer(partitionSize) // Splitting it into the specified "partitions"
            .ToEnumerable(); // Converting it back to ordinary sequence
}

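I know I've changed the signature, but we all know anyway that each chunk will be some fixed-size collection.

By the way, if you use an iterator block, don't forget to split your implementation into two methods so that the arguments are validated eagerly!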
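A minimal sketch of that two-method split, for illustration only. The names `PartitionValidationSketch`, `PartitionChecked` and `PartitionIterator` are hypothetical, and the iterator body simply reuses the list-based loop from the question rather than the Rx pipeline above. The point is that an iterator block's body does not run until enumeration starts, so the checks go in a plain wrapper method:

```csharp
using System;
using System.Collections.Generic;

static class PartitionValidationSketch
{
    // Plain (non-iterator) wrapper: its body runs as soon as it is called,
    // so invalid arguments throw at the call site.
    public static IEnumerable<IList<T>> PartitionChecked<T>(
        this IEnumerable<T> items, int partitionSize)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (partitionSize <= 0) throw new ArgumentOutOfRangeException(nameof(partitionSize));
        return PartitionIterator(items, partitionSize);
    }

    // Iterator block: nothing in here executes until the first MoveNext().
    private static IEnumerable<IList<T>> PartitionIterator<T>(
        IEnumerable<T> items, int partitionSize)
    {
        var partition = new List<T>(partitionSize);
        foreach (var item in items)
        {
            partition.Add(item);
            if (partition.Count == partitionSize)
            {
                yield return partition;
                partition = new List<T>(partitionSize);
            }
        }
        // Emit the final, shorter partition if the count is not a multiple of partitionSize.
        if (partition.Count > 0) yield return partition;
    }
}
```

If the checks lived inside the iterator block instead, a call such as `items.PartitionChecked(0)` would appear to succeed and only throw later, wherever the result happened to be enumerated.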

- Sergey Teplyakov

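@AlastairMaw We have a real-world case that needs fixed-size collections. I have a query with more than 1000 values in an "IN (..)" clause, which fails with "ORA-01795: maximum number of expressions in a list is 1000". So I need to split the clause into chunks of at most 1000 items each, which are later combined with "OR" conditions. - ozanmut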

5

For an elegant solution, you can also take a look at MoreLinq.Batch.

It batches the source sequence into buckets of the specified size.

Example:

int[] ints = new int[] {1,2,3,4,5,6};
var batches = ints.Batch(2); // batches -> [0] : 1,2 ; [1]:3,4 ; [2] :5,6

- Tilak

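As other answers have mentioned, http://code.google.com/p/morelinq/source/browse/MoreLinq/Batch.cs does exactly what I was doing. Fine. - Alastair Maw

Yes, you're right. Your code is elegant and does the same thing. I didn't check the other links. I just use this library, so I mentioned it here as an alternative. - Tilak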

1

I've accepted takemyoxygen's answer, as I think it's more elegant and doesn't need the intermediate list copies. - Alastair Maw

1

public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> items, 
                                                       int partitionSize)
{
    int i = 0;
    return items.GroupBy(x => i++ / partitionSize).ToArray();
}

- JustAnotherUser

6

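This evaluates every item and loads everything into memory before returning any results, which rather defeats the purpose of using IEnumerable<T>. If I wanted that, I'd just pass in a List<T> to begin with and be done with it. - Alastair Maw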

1

Is the .Select(x => x) really necessary? - Jeppe Stig Nielsen

You need to evaluate the expression before leaving the method, otherwise you get wrong results. - nawfal

0

You can do this by using an overload of Enumerable.GroupBy and taking advantage of integer division.

return items.Select((element, index) => new { Element = element, Index = index })
    .GroupBy(obj => obj.Index / partitionSize, (_, partition) => partition);

- Adam Maras

1

Nice, but you have to write (_, partition) => partition.Select(x => x.Element) instead of (_, partition) => partition. - Roman Pekar

1

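This is relatively inefficient: it has to load the whole IEnumerable<T> into memory (assuming it is of finite length to begin with), and will probably use a space-wasting hash table to do the grouping. - Alastair Maw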

0

How about the Partitioner class in the System.Collections.Concurrent namespace?

- Christoffer

Maybe I'm being dim, but that example looks really bulky for such a simple task. How exactly is it more elegant than what I already have? - Alastair Maw

- takemyoxygen · Accepted Answer

You could try implementing the Batch method mentioned above yourself, like this:

    static class MyLinqExtensions 
    { 
        public static IEnumerable<IEnumerable<T>> Batch<T>( 
            this IEnumerable<T> source, int batchSize) 
        { 
            using (var enumerator = source.GetEnumerator()) 
                while (enumerator.MoveNext()) 
                    yield return YieldBatchElements(enumerator, batchSize - 1); 
        } 

        private static IEnumerable<T> YieldBatchElements<T>( 
            IEnumerator<T> source, int batchSize) 
        { 
            yield return source.Current; 
            for (int i = 0; i < batchSize && source.MoveNext(); i++) 
                yield return source.Current; 
        } 
    }

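I found this code at http://blogs.msdn.com/b/pfxteam/archive/2012/11/16/plinq-and-int32-maxvalue.aspx.

UPDATE: Please note that this implementation lazily evaluates not only the batches but also the items inside each batch, which means it produces correct results only when each batch is enumerated after all previous batches have been enumerated. For example: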

public static void Main(string[] args)
{
    var xs = Enumerable.Range(1, 20);
    Print(xs.Batch(5).Skip(1)); // should skip first batch with 5 elements
}

public static void Print<T>(IEnumerable<IEnumerable<T>> batches)
{
    foreach (var batch in batches)
    {
        Console.WriteLine($"[{string.Join(", ", batch)}]");
    }
}

This will output:

[2, 3, 4, 5, 6] //only first element is skipped.
[7, 8, 9, 10, 11]
[12, 13, 14, 15, 16]
[17, 18, 19, 20]

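So if your use case processes batches strictly in order, fully enumerating each batch before moving on to the next, the lazy solution above will work. Otherwise, if strictly sequential batch processing cannot be guaranteed (e.g. when you want to process batches in parallel), you will probably need a solution that enumerates batch contents eagerly, similar to the one in the question above or the one in MoreLINQ.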
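For comparison, here is a rough sketch of the eager alternative. `EagerBatchSketch` and `BatchEager` are made-up names for illustration, and this is not the actual MoreLINQ source. Each batch is copied into its own array before it is yielded, so batches stay valid regardless of the order in which they are consumed, at the cost of holding one batch in memory at a time:

```csharp
using System;
using System.Collections.Generic;

static class EagerBatchSketch
{
    public static IEnumerable<T[]> BatchEager<T>(this IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return BatchIterator(source, batchSize);
    }

    private static IEnumerable<T[]> BatchIterator<T>(IEnumerable<T> source, int batchSize)
    {
        T[] bucket = null;
        int count = 0;
        foreach (var item in source)
        {
            // Allocate lazily so an empty source never allocates a bucket.
            if (bucket == null) bucket = new T[batchSize];
            bucket[count++] = item;
            if (count == batchSize)
            {
                yield return bucket; // a full, already materialized batch
                bucket = null;
                count = 0;
            }
        }
        if (count > 0)
        {
            // Trim the last, partially filled bucket to its actual length.
            Array.Resize(ref bucket, count);
            yield return bucket;
        }
    }
}
```

With this version, `Enumerable.Range(1, 20).BatchEager(5).Skip(1)` yields [6..10], [11..15] and [16..20], because the first batch is fully read before it is skipped. On .NET 6 and later, the built-in `Enumerable.Chunk` provides the same kind of array-backed batching.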