Downsampling the number of entries in a list (without interpolation)

Question

python, list, downsampling

8

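I have a Python list with a number of entries, which I need to downsample using either:

- A maximum number of rows. For example, limiting a list of 1234 entries to 1000.
- A proportion of the original rows. For example, making the list 1/3 of its original length.

(I need to be able to do it both ways, but only one is used at a time.)

I believe that for the maximum number of rows I can just calculate the proportion needed and pass that to the proportional downsizer: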

def downsample_to_max(self, rows, max_rows):
    return self.downsample_to_proportion(rows, max_rows / float(len(rows)))

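So I really only need one downsampling function. Any hints?

Edit: The list contains objects, not numeric values, so I do not need to interpolate. Dropping objects is fine.

Solution: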

def downsample_to_proportion(self, rows, proportion):

    counter = 0.0
    last_counter = None
    results = []

    for row in rows:

        counter += proportion

        if int(counter) != last_counter:
            results.append(row)
            last_counter = int(counter)

    return results

Thanks for the options.

- Dave
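
The accumulator in the solution above relies on floating-point addition, so the number of rows it keeps is not guaranteed to match the target exactly. A minimal sketch of an alternative, assuming plain functions rather than methods: pick evenly spaced indices with integer arithmetic so that exactly `max_rows` rows survive. The free-function form and the rounding used in `downsample_to_proportion` are assumptions made for illustration, not part of the original post.

```python
def downsample_to_max(rows, max_rows):
    """Keep at most max_rows items, evenly spaced, preserving order."""
    n = len(rows)
    if max_rows <= 0:
        return []
    if max_rows >= n:
        return list(rows)  # nothing to drop
    # i * n // max_rows is strictly increasing because n > max_rows,
    # so every chosen index is distinct and within range.
    return [rows[i * n // max_rows] for i in range(max_rows)]


def downsample_to_proportion(rows, proportion):
    """Keep roughly len(rows) * proportion items (rounded to nearest)."""
    return downsample_to_max(rows, int(round(len(rows) * proportion)))


rows = list(range(1234))
print(len(downsample_to_max(rows, 1000)))        # 1000
print(len(downsample_to_proportion(rows, 0.5)))  # 617
```

Here the proportional version is expressed through the maximum-rows version, the reverse of the direction proposed in the question, because an integer target count is easier to hit exactly.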

6 Answers

3

If the input is already a sequence type, slice syntax is more efficient than islice() + list():

def downsample_to_proportion(rows, proportion):
    return rows[::int(1 / proportion)]

- BlackJack
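
Because `int(1 / proportion)` truncates to a whole step, slicing can only hit proportions of the form 1/k exactly. A short sketch illustrating this, assuming a list input; the function name `downsample_by_slicing` and the range check are illustrative additions:

```python
def downsample_by_slicing(rows, proportion):
    """Slice-based downsampling; proportion must be in (0, 1]."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    step = int(1 / proportion)  # truncates, e.g. 1 / 0.81 -> 1
    return rows[::step]


rows = list(range(1234))
print(len(downsample_by_slicing(rows, 0.25)))         # step 4 -> 309 items
print(len(downsample_by_slicing(rows, 1000 / 1234)))  # step 1 -> all 1234 items
```

For the 1234-to-1000 case from the question, slicing alone therefore leaves the list unchanged.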

2

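This solution might be a bit overkill for the original poster, but I thought I would share the code I have been using to solve this and similar problems.

It is a bit lengthy (about 90 lines), but if you often have this need, want an easy-to-use one-liner, and need a pure-Python, dependency-free environment, then I reckon it might be of use.

Basically, all you have to do is pass your list to the function and tell it how long you want the new list to be, and the function will either:

- Downsize your list by dropping items if the new length is smaller, much like the previous answers suggested.
- Stretch/upscale your list (the opposite of downsizing) if the new length is larger, with the added option of choosing whether to:
  - linearly interpolate between the known values (automatically chosen if the list contains ints or floats)
  - duplicate each value so that it occupies a proportional share of the new list (automatically chosen if the list contains non-numbers)
  - pull the original values apart and leave gaps in between

Everything is collected inside one function, so if you need it, just copy and paste it into your script and you can start using it right away.

For instance, you might say: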

origlist = [0,None,None,30,None,50,60,70,None,None,100]
resizedlist = ResizeList(origlist, 21)
print(resizedlist)

and get:

[0, 5.00000000001, 9.9999999999900009, 15.0, 20.000000000010001, 24.999999999989999, 30, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70, 75.000000000010004, 79.999999999989996, 85.0, 90.000000000010004, 94.999999999989996, 100]

Note that minor inaccuracies can occur due to floating-point limitations. Also, I wrote this for Python 2.x, so to use it on Python 3.x just add a single line that says xrange = range.
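
If the same script has to run on both Python 2 and Python 3, the one-line alias can be guarded so that it only takes effect where `xrange` is missing. This is a small sketch of that idea, not something the answer itself prescribes:

```python
# Compatibility shim: define xrange on Python 3, leave Python 2 untouched.
try:
    xrange
except NameError:
    xrange = range

print(list(xrange(1, 4)))  # [1, 2, 3]
```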

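Here is a nifty trick for interpolating between positions in the sublists of a list. For instance, you can easily interpolate between RGB color tuples to create a color gradient of x steps. Assuming a list of 3 RGB color tuples and a variable named GRADIENTLENGTH, you can do it like this: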

crosssections = zip(*rgbtuples)
grad_crosssections = ( ResizeList(spectrum,GRADIENTLENGTH) for spectrum in crosssections )
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]
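
To make the transpose-resize-transpose pattern above runnable on its own, here is a self-contained sketch. `lerp_resize` is a hypothetical, simplified stand-in for ResizeList (linear interpolation only, no gap handling), and the three colors and the gradient length are made-up sample values:

```python
def lerp_resize(values, newlength):
    """Simplified stand-in for ResizeList: plain linear interpolation."""
    if newlength < 2 or len(values) < 2:
        raise ValueError("need at least two values and two output slots")
    last = len(values) - 1
    out = []
    for i in range(newlength):
        pos = i * last / (newlength - 1)  # fractional index into values
        lo = min(int(pos), last - 1)      # left neighbour, clamped at the end
        frac = pos - lo
        out.append(values[lo] + (values[lo + 1] - values[lo]) * frac)
    return out


rgbtuples = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # sample colors
GRADIENTLENGTH = 5

# Transpose into one sequence per channel, resize each, transpose back.
crosssections = zip(*rgbtuples)
grad_crosssections = [lerp_resize(channel, GRADIENTLENGTH) for channel in crosssections]
rgb_gradient = [list(each) for each in zip(*grad_crosssections)]
print(rgb_gradient)
```

Each channel is interpolated independently, which is why the list of tuples is transposed with `zip(*...)` before and after resizing.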

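It could probably use quite a few optimizations; I had to do a fair bit of experimentation. If you feel you can improve it, feel free to edit my post. Here is the code: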

def ResizeList(rows, newlength, stretchmethod="not specified", gapvalue=None):
    """
    Resizes (up or down) and returns a new list of a given size, based on an input list.
    - rows: the input list, which can contain any type of value or item (except if using the interpolate stretchmethod which requires floats or ints only)
    - newlength: the new length of the output list (if this is the same as the input list then the original list will be returned immediately)
    - stretchmethod: if the list is being stretched, this decides how to do it. Valid values are:
      - 'interpolate'
        - linearly interpolate between the known values (automatically chosen if list contains ints or floats)
      - 'duplicate'
        - duplicate each value so they occupy a proportional size of the new list (automatically chosen if the list contains non-numbers)
      - 'spread'
        - drags the original values apart and leaves gaps as defined by the gapvalue option
    - gapvalue: a value that will be used as gaps to fill in between the original values when using the 'spread' stretchmethod
    """
    #return input as is if no difference in length
    if newlength == len(rows):
        return rows
    #set auto stretchmode
    if stretchmethod == "not specified":
        if isinstance(rows[0], (int,float)):
            stretchmethod = "interpolate"
        else:
            stretchmethod = "duplicate"
    #reduce newlength 
    newlength -= 1
    #assign first value
    outlist = [rows[0]]
    writinggapsflag = False
    if rows[1] == gapvalue:
        writinggapsflag = True
    relspreadindexgen = (index/float(len(rows)-1) for index in xrange(1,len(rows))) #warning a little hacky by skipping first index cus is assigned auto
    relspreadindex = next(relspreadindexgen)
    spreadflag = False
    gapcount = 0
    for outlistindex in xrange(1, newlength):
        #relative positions
        rel = outlistindex/float(newlength)
        relindex = (len(rows)-1) * rel
        basenr,decimals = str(relindex).split(".")
        relbwindex = float("0."+decimals)
        #determine equivalent value
        if stretchmethod=="interpolate":
            #test for gap
            maybecurrelval = rows[int(relindex)]
            maybenextrelval = rows[int(relindex)+1]
            if maybecurrelval == gapvalue:
                #found gapvalue, so skipping and waiting for valid value to interpolate and add to outlist
                gapcount += 1
                continue
            #test whether to interpolate for previous gaps
            if gapcount > 0:
                #found a valid value after skipping gapvalues so this is where it interpolates all of them from last valid value to this one
                startvalue = outlist[-1]
                endindex = int(relindex)
                endvalue = rows[endindex]
                gapstointerpolate = gapcount 
                allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
                outlist.extend(allinterpolatedgaps[1:-1])
                gapcount = 0
                writinggapsflag = False
            #interpolate value
            currelval = rows[int(relindex)]
            lookahead = 1
            nextrelval = rows[int(relindex)+lookahead]
            if nextrelval == gapvalue:
                if writinggapsflag:
                    continue
                relbwval = currelval
                writinggapsflag = True
            else:
                relbwval = currelval + (nextrelval - currelval) * relbwindex #basenr pluss interindex percent interpolation of diff to next item
        elif stretchmethod=="duplicate":
            relbwval = rows[int(round(relindex))] #no interpolation possible, so just copy each time
        elif stretchmethod=="spread":
            if rel >= relspreadindex:
                spreadindex = int(len(rows)*relspreadindex)
                relbwval = rows[spreadindex] #spread values further apart so as to leave gaps in between
                relspreadindex = next(relspreadindexgen)
            else:
                relbwval = gapvalue
        #assign each value
        outlist.append(relbwval)
    #assign last value
    if gapcount > 0:
        #this last value also has to interpolate for previous gaps       
        startvalue = outlist[-1]
        endvalue = rows[-1]
        gapstointerpolate = gapcount 
        allinterpolatedgaps = ResizeList([startvalue,endvalue],gapstointerpolate+3)
        outlist.extend(allinterpolatedgaps[1:-1])
        outlist.append(rows[-1])
        gapcount = 0
        writinggapsflag = False
    else:
        outlist.append(rows[-1])
    return outlist
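
The loop above splits relindex into its whole and fractional parts by parsing `str(relindex)`. That relies on the float printing in plain decimal notation; very small or very large floats print in exponent form (for example `1e-07`), which contains no "." to split on. One possible alternative, shown here as a sketch with a hypothetical helper name `split_index`, is `math.modf`:

```python
import math


def split_index(relindex):
    """Return (whole part as int, fractional part) of a non-negative float."""
    frac, whole = math.modf(relindex)  # modf returns (fractional, integral)
    return int(whole), frac


print(split_index(2.75))   # (2, 0.75)
print(split_index(1e-07))  # (0, 1e-07)
```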

- Karim Bahgat

Thank you very much. This is an excellent implementation. - Jason Wiener

1

Keep a counter and increment it by the second value each time. Floor it each time, and yield the value at that index.

- Ignacio Vazquez-Abrams
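
One possible reading of this suggestion, sketched as a generator. It assumes that the counter advances by the reciprocal of the proportion and that iteration stops once the floored counter runs past the end of the list; the answer does not spell out either detail, and the function name is illustrative.

```python
import math


def downsample_by_counter(rows, proportion):
    """Yield rows[floor(counter)] while stepping the counter by 1 / proportion."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    step = 1.0 / proportion  # assumed increment
    counter = 0.0
    while True:
        index = math.floor(counter)
        if index >= len(rows):
            break
        yield rows[index]
        counter += step


print(list(downsample_by_counter(list(range(10)), 0.5)))  # [0, 2, 4, 6, 8]
```

Unlike slicing with an integer step, a fractional step lets the counter approximate proportions that are not of the form 1/k, at the cost of accumulating floating-point error on long lists.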

1

Could you please elaborate? Thanks. - Dave

1

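Start a counter at 0. While the counter is less than the length of the list: yield the element at the index given by the counter, then increment the counter by 1. - Ignacio Vazquez-Abrams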

1

Wouldn't random.choices() solve your problem? More examples can be found here.

- Code42
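
Note that `random.choices()` samples with replacement, so the same row can appear more than once and the original order is lost. If a random subset of distinct rows in their original order is acceptable, a sketch along these lines, using `random.sample` on the indices instead, may be closer to what the question asks for. The function name and the `seed` parameter are illustrative assumptions:

```python
import random


def random_downsample(rows, k, seed=None):
    """Pick k distinct rows at random, keeping them in their original order."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    indices = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in indices]


subset = random_downsample(list(range(1234)), 1000, seed=0)
print(len(subset))  # 1000
```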

0

Regarding the answer from Ignacio Vazquez-Abrams:

Print 3 numbers out of 7 available:

import math

msg_cache = [0, 1, 2, 3, 4, 5, 6]
msg_n = 3
inc = len(msg_cache) / msg_n
inc_total = 0
for _ in range(0, msg_n):
    msg_downsampled = msg_cache[math.floor(inc_total)]
    print(msg_downsampled)
    inc_total += inc

Output:

0
2
4

Very useful for downsampling a large number of log messages to a smaller subset.

- Contango

Content provided by Stack Overflow.

- tzaman · Accepted Answer

You can use islice from itertools:

from itertools import islice

def downsample_to_proportion(rows, proportion=1):
    return list(islice(rows, 0, len(rows), int(1/proportion)))

Usage:

x = list(range(1, 10))
print(downsample_to_proportion(x, 0.3))
# [1, 4, 7]
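
Since `islice` does not need to know the length of its input, the same idea also works for generators and other iterables that do not support `len()`. A minimal sketch, assuming a stop value of `None` (read until exhausted) and an illustrative function name:

```python
from itertools import islice


def downsample_iterable(iterable, proportion=1):
    """Lazily yield every int(1 / proportion)-th item of any iterable."""
    step = int(1 / proportion)
    return islice(iterable, 0, None, step)


squares = (n * n for n in range(10))  # a generator has no len()
print(list(downsample_iterable(squares, 0.5)))  # [0, 4, 16, 36, 64]
```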