在C#中过滤大型列表的最有效方法是什么？

Question

在C#中过滤大型列表的最有效方法是什么？

7

我们有一个ASP.NET MVC Web应用程序，通过Entity Framework连接到SQL Server数据库。这个应用程序的主要任务之一是允许用户快速搜索并筛选一个包含存档值的巨大数据库表。

表结构非常简单: Timestamp (DateTime)、StationId (int)、DatapointId (int)、Value (double)。该表大约有1000万到1亿行。我使用了覆盖索引等方法来优化DB表，但是当根据 DatapointId、StationId、Time 进行筛选并跳过和获取我想在页面上显示的部分时，用户体验仍然相当滞后。

因此，我尝试了另一种方法：由于我们的服务器拥有很多 RAM，我认为我们可以在Web应用程序启动时将整个归档表加载到List中，然后直接从此List获取数据而不是进行往返数据库。这个方法效果很好，将整个存档表(当前大约有1000万条记录)加载到List中只需约9秒钟。ArchiveRow是一个简单的对象，看起来像这样：

public class ArchiveResponse {
    public  int Length { get; set; }
    public int numShown { get; set; }
    public  int numFound { get; set; }
    public  int numTotal { get; set; }
    public  List<ArchiveRow> Rows { get; set; }
}

相应地：

public class ArchiveRow {
    public  int s { get; set; }
    public  int d { get; set; }
    public  DateTime t { get; set; }
    public  double v { get; set; }
}

当我现在试图使用Linq查询从列表中获取所需的数据时，查询速度已经比查询数据库快了，但是当按多个条件过滤时，它仍然相当慢。例如，当我按一个StationId和12个DatapointIds进行过滤时，检索25行窗口需要约5秒钟的时间。我已经从使用Where过滤器切换到使用连接，但我认为还有改进的空间。是否有更好的方法实现这样的缓存机制，同时尽可能保持内存消耗低？是否有其他集合类型更适合这个目的？

因此，以下是过滤并从ArchiveCache列表中获取相关数据的代码：

// Total number of entries in archive cache
var numTotal = ArchiveCache.Count();

// Initial Linq query
ParallelQuery<ArchiveCacheValue> query = ArchiveCache.AsParallel();

// The request may contain StationIds that the user is interested in,
// so here's the filtering by StationIds with a join:
if (request.StationIds.Count > 0)
{
    query = from a in ArchiveCache.AsParallel()
             join b in request.StationIds.AsParallel()
             on a.StationId equals b
             select a;
}

// The request may contain DatapointIds that the user is interested in,
// so here's the filtering by DatapointIds with a join:
if (request.DatapointIds.Count > 0)
{
    query = from a in query.AsParallel()
             join b in request.DatapointIds.AsParallel()
             on a.DataPointId equals b
             select a;
}

// Number of matching entries after filtering and before windowing
int numFound = query.Count();

// Pagination: Select only the current window that needs to be shown on the page
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// Number of entries on the current page that will be shown
int numShown = result.Count();

// Build a response object, serialize it to Json and return to client
// Note: The projection with the Rows is not a bottleneck, it is only done to
// shorten 'StationId' to 's' etc. At this point there are only 25 to 50 rows,
// so that is no problem and happens in way less than 1 ms
ArchiveResponse myResponse = new ArchiveResponse();
myResponse.Length = request.Length;
myResponse.numShown = numShown;
myResponse.numFound = numFound;
myResponse.numTotal = numTotal;
myResponse.Rows = result.Select(x => new archRow() { s = x.StationId, d = x.DataPointId, t = x.DateValue, v = x.Value }).ToList();

return JsonSerializer.ToJsonString(myResponse);

一些更多的细节：站点数量通常在5到50之间，很少超过50。数据点的数量小于7000。Web应用程序设置为64位，并在web.config中设置了<gcAllowVeryLargeObjects enabled="true" />。

我期待着进一步的改进和建议。也许有一种完全不同的基于数组或类似的方法，可以在不使用linq的情况下表现得更好？

- Rob

也许可以编辑并添加表设计的详细信息和执行效率差的 SQL 查询，这样人们就可以建议可能的改进措施。按照您的描述，人们原本会期望其性能完美。 - Alex K.

谢谢，Alex。我知道你的想法，但我故意没有描述数据库部分，因为我想明确改进这个缓存机制。 - Rob

1

我会给底层查询参数一些标识符或哈希，以便您可以从过滤器缓存数据集，窗口化将重复使用它（不要在标识符中包含分页信息）。然后，您可以缓存 query.Count()，每次都会进行完整迭代。挑战在于过滤参数的差异以及相同集合再次访问的可能性（用于下一页）。如果用户倾向于获取多个页面，请考虑返回批处理页面。 - Adam Houldsworth

1

LINQ将使用“IEnumberable<T>”接口来处理条目列表。对于1000万个条目，我想这可能会与使用“List<T>”有所不同。值得实现“List<T>”的LINQ扩展方法子集，并检查是否可以提高性能。 - dumetrulo

1

我唯一的其他建议是远离LINQ / IQueryable，尽量手动迭代一次。你可以使用像Parallel.ForEach这样的工具来帮助实现这一点。 - Adam Houldsworth

显示剩余3条评论

4个回答

3

我建议将这些值存储在一个struct ArchiveRow的array或vector中，以确保所有数据都在连续的内存中。这样可以大大受益于引用局部性（即高效利用L1缓存）。同时也避免了list（指针/引用）的开销。（更新：我快速查阅了一下C# List，看起来List与C++ vector（即数组）相同，而C# LinkedList与C++ list相同...有点混乱 - 即没有指针开销在这里C# List<>）。

尽可能使结构体最小化。在可能的情况下使用uint32或甚至uint16，datetime使用32位，根据您的值，甚至可以使用float代替double。此外，在结构体中将最宽的值放在最前面（以获得更好的对齐）。

即使是暴力方法（扫描整个数组，几百MB的连续内存），也应该在不到1秒的时间内完成。

为了进一步优化，您可以对数据进行排序（或更好的是从数据库中检索已排序的数据），这将使您能够进行二进制搜索，并且也将使结果集的值保持紧密在一起。例如：按StationId、DataPointId排序。

如果数据变得更大，您可以将相同的结构体存储在磁盘上（以二进制文件的形式），并通过内存映射访问它。

此外，您只需要扫描前25个条目。然后可以存储最后检查的位置（对于该会话/查询），当下一页被请求时，从那里开始获取下一个25个条目。这也将节省存储完整结果集所需的内存。

如果站点数量较少，并且数据按StationId排序，则还可以在导入时保留一个小列表或跳转表，其中包含每个StationId的起始位置，并直接跳转到该位置以扫描该站点的数据点。

将整个归档表（当前约有1000万个条目）加载到List中大约需要9秒。

如果您还没有这样做，请务必在列表上设置初始容量，以避免多次重新分配。

- Danny_ds

1

Evk的答案有一些优化，我只是想指出你原来的代码中一些非常不优化的地方，也许删除它们就可以加快速度：

你在内存中进行了三次查询！

// execute query first time
int numFound = query.Count();

var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// executing query second time!
int numShown = result.Count();

// executing query third time!    
// result.Select [...] .ToList()

你真的应该做一次计算，然后在结果上进行计数和分页。

- nvoigt

1

另一种优化方式。在内存中使用linq的join不是很高效。

所以我会创建来自request.StationIds和request.DatapointIds的哈希集，并使用简单的哈希集包含。像这样。

if (request.StationIds.Count > 0)
{
    var stationIdSet = new HashSet<int>(request.StationIds);
    query =  (from a in ArchiveCache
             stationIdSet.Contains(a.StationId)
             select a);
               // .AsParallel()
               // .MaxDegreeOfParallelism(...);
}

对于运行9m条记录和30个站点ID的顺序版本，其表现优于大约150-250毫秒的连接。

对于最大并行度为2的并行版本，连接和哈希集都表现更差。

注意：对于像这样的简单操作，AsParallel会产生开销，可能不是最佳选择。

- vasil oreshenski

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Evk · Accepted Answer

您可以针对特定的查询类型调整存储。首先，从内存存档中创建字典：

ArchiveCacheByDatapoint = ArchiveCache.GroupBy(c => c.DataPointId)
            .ToDictionary(c => c.Key, c => c.ToList());
ArchiveCacheByStation = ArchiveCache.GroupBy(c => c.StationId)
            .ToDictionary(c => c.Key, c => c.ToList());

然后在查询中使用这些字典：

bool hasStations = request.StationIds.Length > 0;
bool hasDatapoints = request.DatapointIds.Length > 0;            
int numFound = 0;
List<ArchiveCacheValue> result;
if (hasDatapoints && hasStations) {
    // special case - filter by both
    result = new List<ArchiveCacheValue>();
    // store station filter in hash set
    var stationsFilter = new HashSet<int>(request.StationIds);
    // first filter by datapoints, because you have more different datapoints than stations
    foreach (var datapointId in request.DatapointIds.OrderBy(c => c)) {                    
        foreach (var item in ArchiveCacheByDatapoint[datapointId]) {                        
            if (stationsFilter.Contains(item.StationId)) {
                // both datapoint and station matches filter - found item
                numFound++;
                if (numFound >= request.Start && result.Count < request.Length) {
                    // add to result list if matches paging criteria
                    result.Add(item);
                }
            }
        }
    }
}
else if (hasDatapoints) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();                
    foreach (var datapoint in request.DatapointIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByDatapoint[datapoint];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else if (hasStations) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var station in request.StationIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByStation[station];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else {
    // no need to do Count()
    numFound = ArchiveCache.Count;
    // no need to Skip\Take here really, ArchiveCache is list\array
    // so you can use indexes which will be faster
    result = ArchiveCache.Skip(request.Start).Take(request.Length).ToList();
}

// Number of entries on the current page that will be shown
int numShown = result.Count;

我已经测量了它，对于所有类型的查询（仅按部分查询、仅按数据点查询以及按部分和数据点同时查询），在我的机器上运行时间为1毫秒（有时高达10毫秒），共处理了1亿个项目。