绘制大型时间序列数据

8

问题概述:

有没有易于实现的算法可以减少表示时间序列所需的点数而不改变其在图中的显示方式?

动机背景:

我正在尝试交互式地可视化从嵌入式系统记录的10到15个通道的数据,采样率为大约20 kHz。日志可以覆盖1小时以上的时间,这意味着我要处理介于1e8和1e9之间的点。此外,我关心可能持续时间非常短(即小于1毫秒)的潜在异常,因此简单的抽稀不是一个选项。

毫不奇怪,如果您试图将比专用GPU内存更大的数据数组交给大多数绘图库,它们会有些难过。在我的系统上,情况实际上比这更糟;使用一个随机浮点向量作为测试案例,在我的刷新速率降至1 FPS以下之前,仅通过stock Matlab绘图函数和Python + matplotlib得到了大约5e7个点。

现有问题及解决方案:

这个问题与许多现有问题有些相似,例如:

但处理更大的数据集和/或更严格的保真度,代价是互动性(可以得到60 FPS的流畅平滑的平移和缩放,但实际上,我会满足1 FPS)。

显然,需要某种形式的数据减少。在搜寻现有工具解决我的问题时,我发现了两种范例:

  • 稀疏数据但跟踪异常值: 一个很好的例子是Matlab+dsplot,它将数据稀疏化到一定数量的均匀间隔点上,但随后通过高通FIR滤波器的标准差添加回异常值。虽然这可能是解决几类数据问题的可行方案,但如果在滤波器截止频率之后有大量频率内容,则可能存在困难,并且可能需要调整。

  • 绘制最小值和最大值: 使用此方法,将时间序列分成对应于每个水平像素的区间,并仅绘制每个区间中的最小值和最大值。Matlab + Plot (Big) 是一个很好的例子,但使用O(n)计算最小值和最大值使其在处理1e8或1e9个数据点时变得有些缓慢。使用mex函数或python中的二叉搜索树可以解决这个问题,但实现起来比较复杂。

是否有更简单的解决方案可以实现我的要求?

编辑(2018-02-18):重构问题,重点放在实现算法而不是工具上。


我并不认为这个问题是不相关的 - 我认为双方都有道理。那些都不重要,Camilo的回答很明智。如果你想绘制十亿个点,无论使用什么工具,速度都会很慢。如果你需要更快的速度,那么你需要一个算法来删除一些点。对于时间序列,您可以提出一个相当有说服力的算法,该算法删除所有对应于小于某个最小界限的跳跃的观察值。您可能需要一个额外的启发式方法来限制每千次删除的数量... 这比科学更像艺术 :-) - Colin T Bowers
在再次查看指南后,我认为离题标记可能是合适的。我已经大幅编辑了问题,将重点放在易于实现的算法上,而不是现有工具上。它仍然相当开放,但希望能更贴近主题。 - user2465116
1个回答

6
我曾经遇到一个问题,需要展示数百个传感器每分钟采集的几年时间序列数据。有时候(比如在清理数据时),我希望看到所有的异常值,而有时候我更关心趋势。因此,我编写了一个函数,可以使用两种方法来减少数据点的数量:Visvalingam和Douglas-Peucker。前者倾向于去除异常值,后者则保留它们。我已经优化了这个函数,使其能够处理大型数据集。
在意识到所有的绘图方法都无法处理这么多的数据点,并且那些能够处理的方法会以一种我无法控制的方式减少数据集之后,我才开始编写这个函数。该函数如下:
function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplification Reduce the number of points of the line described by X
%and Y to N. Preserving the most relevant ones.
%   Using an adapted method of visvalingam and Douglas-Peucker algorithms.
%   The number of points of the line is reduced iteratively until reaching
%   N non-NaN points. Repeated NaN points in original data are deleted but
%   non-repeated NaNs are preserved to keep line breaks.
%   The two available methods are
%
%   Visvalingam: The relevance of a point is proportional to the area of
%   the triangle defined by the point and its two neighbors.
%   
%   Douglas-Peucker: The relevance of a point is proportional to the
%   distance between it and the straight line defined by its two neighbors.
%   Note that the implementation here is iterative but NOT recursive as in 
%   the original algorithm. This allows to better handle large data sets.
%
%   DIFFERENCES: Visvalingam tend to remove outliers while Douglas-Peucker
%   keeps them.
%
%   INPUTS:
%         X: X coordinates of the line points
%         Y: Y coordinates of the line points
%    method: Either 'Visvalingam' or 'DouglasPeucker' (default)
%    option: Either 'silent' (default) or 'verbose' if additional outputs
%            of the calculations are desired.
%
% OUTPUTS:
%         X: X coordinates of the simplified line points
%         Y: Y coordinates of the simplified line points
%   indices: Indices to the positions of the points preserved in the
%            original X and Y. Therefore Output X is equal to the input
%            X(indices).
% relevance: Relevance of the returned points. It can be used to furder
%            simplify the line dinamically by keeping only points with 
%            higher relevance. But this will produce bigger distortions of 
%            the line shape than calling again lineSimplification with a 
%            smaller value for N, as removing a point changes the relevance
%            of its neighbors.
%
% Implementation by Camilo Rada - camilo@rada.cl
%

    if nargin < 3
        error('Line points positions X, Y and target point count N MUST be specified');
    end
    if nargin < 4
        method='DouglasPeucker';
    end
    if nargin < 5
        option='silent';
    end

    doDisplay=strcmp(option,'verbose');

    X=double(X(:));
    Y=double(Y(:));
    indices=1:length(Y);

    if length(X)~=length(Y)
        error('Vectors X and Y MUST have the same number of elements');
    end

    if N>=length(Y)
        relevance=ones(length(Y),1);
        if doDisplay
            disp('N is greater or equal than the number of points in the line. Original X,Y were returned. Relevances were not computed.')
        end
        return
    end
    % Removing repeated NaN from Y
    % We find all the NaNs with another NaN to the left
    repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
    %We also consider a repeated NaN the first element if NaN
    repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
    Y=Y(~repeatedNaNs);
    X=X(~repeatedNaNs);
    indices=indices(~repeatedNaNs);

    %Removing trailing NaN if any
    if isnan(Y(end))
        Y=Y(1:end-1);
        X=X(1:end-1);
        indices=indices(1:end-1);
    end

    pCount=length(X);

    if doDisplay
        disp(['Initial point count = ' num2str(pCount)])
        disp(['Non repeated NaN count in data = ' num2str(sum(isnan(Y)))])
    end

    iterCount=0;

    while pCount>N
        iterCount=iterCount+1;
        % If the vertices of a triangle are at the points (x1,y1) , (x2, y2) and
        % (x3,y3) the are uf such triangle is
        % area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
        % now the areas of the triangles defined by each point of X,Y and its two
        % neighbors are

        twiceTriangleArea =abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));

        switch method
            case 'Visvalingam'
                % In this case the relevance is given by the area of the
                % triangle formed by each point end the two points besides
                relevance=twiceTriangleArea/2;
            case 'DouglasPeucker'
                % In this case the relevance is given by the minimum distance
                % from the point to the line formed by its two neighbors
                neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
                relevance=twiceTriangleArea./neighborDistances;
            otherwise
                error(['Unknown method: ' method]);
        end
        relevance=[Inf; relevance; Inf];
        %We remove the pCount-N least relevant points as long as they are not contiguous

        [srelevance, sortorder]= sort(relevance,'descend');
        firstFinite=find(isfinite(srelevance),1,'first');
        startPos=uint32(firstFinite+N+1);
        toRemove=sort(sortorder(startPos:end));
        if isempty(toRemove)
            break;
        end

        %Now we have to deal with contigous elements, as removing one will
        %change the relevance of the neighbors. Therefore we have to
        %identify pairs of contigous points and only remove the one with
        %leeser relevance

        %Contigous will be true for an element if the next or the previous
        %element is also flagged for removal
        contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
        notContiguous=~contiguousToKeep;

        %And the relevances asoociated to the elements flagged for removal
        contRel=relevance(toRemove);

        % Now we rearrange contigous so it is sorted in two rows, therefore
        % if both rows are true in a given column, we have a case of two
        % contigous points that are both flagged for removal
        % this process is demenden of the rearrangement, as contigous
        % elements can end up in different colums, so it has to be done
        % twice to make sure no contigous elements are removed
         nContiguous=length(contiguousToKeep);

        for paddingMode=1:2
            %The rearragngement is only possible if we have an even number of
            %elements, so we add one dummy zero at the end if needed
            if paddingMode==1
                if mod(nContiguous,2)
                    pcontiguous=[contiguousToKeep; false];
                    pcontRel=[contRel; -Inf];
                else
                    pcontiguous=contiguousToKeep;
                    pcontRel=contRel;
                end
            else
                if mod(nContiguous,2)
                    pcontiguous=[false; contiguousToKeep];
                    pcontRel=[-Inf; contRel];
                else
                    pcontiguous=[false; contiguousToKeep(1:end-1)];
                    pcontRel=[-Inf; contRel(1:end-1)];                    
                end
            end

            contiguousPairs=reshape(pcontiguous,2,[]);
            pcontRel=reshape(pcontRel,2,[]);

            %finding colums with contigous element
            contCols=all(contiguousPairs);
            if ~any(contCols) && paddingMode==2
                break;
            end
            %finding the row of the least relevant element of each column
            [~, lesserElementRow]=max(pcontRel);

            %The index in contigous of the first element of each pair is
            if paddingMode==1
                firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
            else
                firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
            end            

            % and the index in contigous of the most relevant element of each
            % pair is
            lesserElementIdx=firstElementIdx+lesserElementRow-1;

            %now we set the least relevant element as NOT continous, so it is
            %removed
            contiguousToKeep(lesserElementIdx(contCols))=false;
        end
        %and now we delete the relevant continous points from the toRemove
        %list
        toRemove=toRemove(contiguousToKeep | notContiguous);

        if any(diff(toRemove(:))==1) && doDisplay
            warning([num2str(sum(diff(toRemove(:))==1)) ' continous elements removed in one iteration.'])
        end
        toRemoveLogical=false(pCount,1);
        toRemoveLogical(toRemove)=true;

        X=X(~toRemoveLogical);
        Y=Y(~toRemoveLogical);
        indices=indices(~toRemoveLogical);

        pCount=length(X);
        nRemoved=sum(toRemoveLogical);
        if doDisplay
            disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
        end
        if nRemoved==0
            break;
        end
    end
end

function d = ppDistance(p1,p2)
    d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end

这似乎是一个相当不错的启发式算法。它捕获了其他几个启发式算法所错过的瞬间。我看到的最大缺点是 O(n*log(n)) 的排序复杂度,这使得在缩放到可以绘制所有内容的点之前添加更多细节变得非常困难。 - user2465116
@user2465116 你可以使用相关性评分来实现该功能。所以当你缩放时,你可以根据每个点的相关性评分选择更多/更少的点而无需再次调用函数。 - Camilo Rada

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接