Python list filtering: remove subsets from list of lists

Question

14

Using Python, how do you reduce a list of lists [[..],[..],..] by ordered subset matching?

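In the context of this question, a list L is a subset of list M if M contains all members of L, in the same order. For example, the list [1,2] is a subset of the list [1,2,3], but not of the list [2,1,3].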
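The definition above matches what is usually called a subsequence test. A minimal sketch, assuming plain Python lists of comparable items; the name `is_ordered_subset` is made up for illustration and is not part of the question:

```python
def is_ordered_subset(needle, haystack):
    """Return True if every item of needle occurs in haystack, in the same order."""
    remaining = iter(haystack)
    # `item in remaining` consumes the iterator up to and including the first
    # match, so each later item has to be found after the previous one.
    return all(item in remaining for item in needle)


print(is_ordered_subset([1, 2], [1, 2, 3]))  # True
print(is_ordered_subset([1, 2], [2, 1, 3]))  # False
```

Under this reading the matched elements do not have to be adjacent, only in the same relative order.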

Example input:

a. [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Expected result:

a. [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Further examples:

L = [[1, 2, 3, 4, 5, 6, 7], [1, 2, 5, 6]] - No reduce

L = [[1, 2, 3, 4, 5, 6, 7], ~~[1, 2, 3]~~, [1, 2, 4, 8]] - Yes reduce

L = [[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]] - No reduce

(Sorry for the confusion caused by the incorrect data set.)
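These three examples, together with expected result b, only fit together if the matched elements also have to be adjacent: [1, 2, 5, 6] occurs in order inside [1, 2, 3, 4, 5, 6, 7] yet is kept, while [1, 2, 3] is dropped. The following is a rough sketch of that stricter reading. It is an interpretation rather than something the question states outright, and `is_contiguous_sublist` and `reduce_lists` are hypothetical names:

```python
def is_contiguous_sublist(cand, target):
    """Return True if cand appears in target as one unbroken run."""
    n = len(cand)
    return any(cand == target[i:i + n] for i in range(len(target) - n + 1))


def reduce_lists(lists):
    """Drop every list that is a contiguous run inside some other list."""
    return [
        cand for i, cand in enumerate(lists)
        if not any(is_contiguous_sublist(cand, other)
                   for j, other in enumerate(lists) if i != j)
    ]


print(reduce_lists([[1, 2, 3, 4, 5, 6, 7], [1, 2, 5, 6]]))
# [[1, 2, 3, 4, 5, 6, 7], [1, 2, 5, 6]]
print(reduce_lists([[1, 2, 3, 4, 5, 6, 7], [1, 2, 3], [1, 2, 4, 8]]))
# [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 8]]
print(reduce_lists([[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]]))
# [[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]]
```

Note that this sketch compares positions rather than values, so two identical lists eliminate each other; expected result a, which drops [1, 2, 4, 5, 6], does not follow from this reading.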

- Oliver

1

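What is a superlist? It is any list that does not occur as a subset of another list. - hughdbrown

Shouldn't [1,2,4,5,6] be in the result? - dugres

No, by the definition in the question, [1,2,4,5,6] is a "subset" of [1, 2, 3, 4, 5, 6, 7]. - João Silva

I think you need to come up with a definitive set of test cases - I'd be happy to write code against them. It seems neither of my two answers is quite right. - quamrana

I don't get it. [1,2,4,5,6] is omitted from one test data set because of [1,2,3,4,5,6,7], but it is not omitted in this test data? [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 5, 6]] Am I misunderstanding the "no reduce" comment? - hughdbrown

What is the desired behavior when there are exact duplicates? What result is expected? - Michael Higgins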

10 Answers

7

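This code should be rather memory efficient. Beyond storing the initial list of lists, it uses negligible extra memory (no temporary sets or copies of lists are created).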

def is_subset(needle,haystack):
   """ Check if needle is ordered subset of haystack in O(n)  """

   if len(haystack) < len(needle): return False

   index = 0
   for element in needle:
      try:
         index = haystack.index(element, index) + 1
      except ValueError:
         return False
   else:
      return True

def filter_subsets(lists):
   """ Given list of lists, return new list of lists without subsets  """

   for needle in lists:
      if not any(is_subset(needle, haystack) for haystack in lists
         if needle is not haystack):
         yield needle

my_lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], 
            [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]    
print(list(filter_subsets(my_lists)))

>>> [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

Also, just for fun, a one-liner:

def filter_list(L):
    return [x for x in L if not any(set(x)<=set(y) for y in L if x is not y)]

- Kenan Banks

This line is nice: "index = haystack.index(element, index)". I shortened the list each time instead. - hughdbrown

1

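I guess this code would consider [1,1,1,1,1,1] a subset of [1]. You need to use "index = 1 + haystack.index(element, index)". - hughdbrown

@hugh, your example is caught by the length check first, but you are right. With this code, [1,1,1] is a subset of [2,1,3]. Changing it now. - Kenan Banks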
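To make the point of these two comments concrete, here is a small side-by-side sketch. The two function names are invented for the comparison; the only difference between them is the `+ 1` that moves the search start past the element just matched:

```python
def is_subset_no_advance(needle, haystack):
    """Buggy variant: restarts the search AT the last match."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index)  # no "+ 1"
        except ValueError:
            return False
    return True


def is_subset_advance(needle, haystack):
    """Corrected variant: restarts the search just AFTER the last match."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index) + 1
        except ValueError:
            return False
    return True


print(is_subset_no_advance([1, 1, 1], [2, 1, 3]))  # True (wrong)
print(is_subset_advance([1, 1, 1], [2, 1, 3]))     # False
```

Without the increment, the single 1 in the haystack is matched again for every 1 in the needle.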

Same problem as with @iElectric's solution, sequences go missing. Input: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Output: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]] - Oliver

1

1, 2, 4, 5, 6 is an ordered subset of 1, 2, 3, 4, 5, 6, 7. By your specification it should be removed. - Kenan Banks

This removes both copies of an exact duplicate. - Michael Higgins
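The question does not say what should happen to exact duplicates, so the following shows only one possible policy: keep the first copy and drop the later ones. It is a sketch under that assumption, with hypothetical function names:

```python
def is_ordered_subset(needle, haystack):
    """Return True if needle's items occur in haystack in the same order."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def filter_subsets_keep_first(lists):
    """Drop ordered subsets, but keep the first copy of exact duplicates."""
    result = []
    for i, needle in enumerate(lists):
        dominated = False
        for j, haystack in enumerate(lists):
            if i == j:
                continue
            if needle == haystack:
                # Exact duplicate: only an earlier copy eliminates this one.
                if j < i:
                    dominated = True
                    break
            elif is_ordered_subset(needle, haystack):
                dominated = True
                break
        if not dominated:
            result.append(needle)
    return result


print(filter_subsets_keep_first([[1, 2], [3, 4], [1, 2]]))  # [[1, 2], [3, 4]]
```

If both copies should survive, or both should go, only the `needle == haystack` branch needs to change.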

1

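A list is a superlist if it is not a subset of any other list. It is a subset of another list if every element of the list can be found, in order, in the other list.

Here's my code: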

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
    def is_sublist_of_list(cand, target):
        try:
            i = 0
            for c in cand:
                i = 1 + target.index(c, i)
            return True
        except ValueError:
            return False
    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

if __name__ == '__main__':
    lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
    superlists = super_lists(lists)
    print(superlists)

And here are the results:

[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

Edit: Results for your later data set.

>>> lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
>>> superlists = super_lists(lists)
>>> expected = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]
>>> assert(superlists == expected)
>>> print(superlists)
[[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]

- hughdbrown

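Same problem, sequences go missing. - Oliver

What is the same problem? What does "sequences go missing" mean? Does it mean the code fails to produce the desired result? If so, please provide an example. The code above produces the results shown. - hughdbrown

OK, I tried it on your new data set, and it produces exactly the result you expected/wanted. - hughdbrown

Asking this so late at night was not a good idea, I'm sorry. In the expected result I left out [1,2,4,5,6]. - Oliver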

0

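Edit: I really need to improve my reading comprehension. Here is the answer to what was actually asked. It exploits the fact that "A is a supersequence of B" implies "len(A) > len(B) or A == B".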

def advance_to(it, value):
    """Advances an iterator until it matches the given value. Returns False
    if not found."""
    for item in it:
        if item == value:
            return True
    return False

def has_supersequence(seq, super_sequences):
    """Checks if the given sequence has a supersequence in the list of
    supersequences.""" 
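    # Build a real list of iterators: on Python 3, map() returns a lazy
    # object and len() on it would fail when seq is empty
    candidates = [iter(s) for s in super_sequences]
    for next_item in seq:
        candidates = [it for it in candidates if advance_to(it, next_item)]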
    return len(candidates) > 0

def find_supersequences(sequences):
    """Finds the supersequences in the given list of sequences.

    Sequence A is a supersequence of sequence B if B can be created by removing
    items from A."""
    super_seqs = []
    for candidate in sorted(sequences, key=len, reverse=True):
        if not has_supersequence(candidate, super_seqs):
            super_seqs.append(candidate)
    return super_seqs

print(find_supersequences([[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
    [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
#Output: [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 8], [2, 3, 21]]

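If you also need to preserve the original order of the sequences, then find_supersequences() has to keep track of the positions of the sequences and sort the output afterwards.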
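One way the position tracking described above could look is sketched below. This is not the answerer's code: `find_supersequences_in_order` and `is_subsequence` are hypothetical names, and the subsequence test is a compact stand-in for the iterator-advancing helpers above.

```python
def is_subsequence(needle, haystack):
    """Return True if needle can be made by deleting items from haystack."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def find_supersequences_in_order(sequences):
    """Find the supersequences, returned in their original input order."""
    # Visit positions longest-first; sorted() is stable, so ties keep input order.
    by_length = sorted(range(len(sequences)),
                       key=lambda i: len(sequences[i]), reverse=True)
    kept = []  # positions of the sequences that survive
    for i in by_length:
        if not any(is_subsequence(sequences[i], sequences[k]) for k in kept):
            kept.append(i)
    # Restore the original ordering by sorting the surviving positions.
    return [sequences[i] for i in sorted(kept)]


print(find_supersequences_in_order(
    [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
     [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
# [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
```

Checking only against the already kept sequences is enough because the relation is transitive: anything contained in a dropped sequence is also contained in whichever kept sequence caused the drop.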

- Ants Aasma

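This does not respect the list order. For example, given [[1,2,3,4], [2,4,3], [3,4,5]] the result is [[1,2,3,4], [2,4,3]], whereas I would want it to return the initial input. - Oliver

@Triptych: he didn't state that in the original question. - Ants Aasma

I did point out that order matters, "order must be respected". But it's not a big deal. Thanks for the possible solution. - Oliver

@Oli_UK: If order is not important, using sets is the obvious choice, and an iterative solution would be a mistake. Can you clarify this? - hughdbrown

Neither solution works: no upvote. Your first solution would have been correct if it produced lists instead of sets. Here is a one-line fix: superlists = [cand for i, cand in enumerate(initial_lists) if not any(set(target).issuperset(set(cand)) for target in initial_lists[:i]+initial_lists[i+1:])] - hughdbrown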

0

list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

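for list1 in list0[:]:
    removed=False
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    list0.remove(list1)
                    removed=True
                    break
        if removed:
            # stop scanning: list1 is gone, a second remove() would raise ValueError
            break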

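This filters list0 in place, iterating over a copy of it. It is a good choice if the result is expected to be about the same size as the original and only a few "subsets" need removing.

If the result is expected to be small and the original is large, you might prefer this version, which is more memory friendly because it does not copy the original list.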

list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
result=[]

for list1 in list0:
    subset=False
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    subset=True
                    break
            if subset:
                break
    if not subset:
        result.append(list1)

- dugres

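If you keep track of where you are, you don't need to compare list1 and list2. Use enumerate() to store the index and build a sublist that leaves that list out: "for i, list1 in enumerate(list0):\n for list2 in (list0[:i] + list0[i+1:]):\n\n" - hughdbrown

True, but I'm not sure it's worth it. - dugres

Same problem as described on the other solutions. - Oliver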

0

This seems to work:

original=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

target=[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,aList):
        s=set(aList)
        if self.set.issubset(s):
            #print(self.list,'superseded by',aList)
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        for t in temp:
            t.compare(l)
        temp.append( SetAndList(l) )

    return [t.list for t in temp if t.isUnique]

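print(listReduce(original))
print(target)

This prints the computed list and the target for visual comparison.

Uncomment the print line in the compare method to see how the various lists get superseded.

Tested with Python 2.6.2.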

- quamrana

Does not reduce fully. Given [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] the output is [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]. It fails to reduce [1,2,3] into one of the larger groups. - Oliver

@OP: Please see my next answer, dated Aug 24. - quamrana

0

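I implemented a different issubseq because yours does not report that [1, 2, 4, 5, 6] is a subsequence of [1, 2, 3, 4, 5, 6, 7], for example (besides being painfully slow). The solution I came up with looks like this:

def is_subseq(a, b):
    if len(a) > len(b): return False
    start = 0
    for el in a:
        while start < len(b):
            if el == b[start]:
                # consume the matched element so it cannot match again
                start = start + 1
                break
            start = start + 1
        else:
            return False
    return True

def filter_partial_matches(sets):
    return [s for s in sets if all([not(is_subseq(s, ss)) for ss in sets if s != ss])]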

A simple test case, using your inputs and outputs:

>>> test = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
>>> another_test = [[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]
>>> filter_partial_matches(test)
[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
>>> filter_partial_matches(another_test)
[[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]

Hope it helps!

- João Silva

Same problem as commented on the other solutions, sequences go missing. - Oliver

0

A refined answer after the new test case:

original= [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,other):
        if self.set.issubset(other.set):
            #print(self.list,'superseded by',other.list)
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        s=SetAndList(l)
        for t in temp:
            t.compare(s)
            s.compare(t)
        temp.append( s )
        temp=[t for t in temp if t.isUnique]

    return [t.list for t in temp if t.isUnique]

print(listReduce(original))

You did not give the required output, but I am guessing this is right, since [1,2,3] does not appear in the output.

- quamrana

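After re-reading the question (which may have changed since I last read it), I see that my solution is still not right. I missed the requirement that "[1,2] is a subset of the list [1,2,3], but not of the list [2,1,3]". - quamrana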
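The gap between the set-based answers and the requirement quoted in this comment can be shown in a few lines. This is an illustrative sketch only; `is_ordered_subset` is a hypothetical helper, not code from any of the answers:

```python
def is_ordered_subset(needle, haystack):
    """Return True if needle's items occur in haystack in the same order."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


set_says = set([1, 2]).issubset(set([2, 1, 3]))
order_says = is_ordered_subset([1, 2], [2, 1, 3])
print(set_says)    # True: a set forgets the order of its elements
print(order_says)  # False: in [2, 1, 3] the 2 comes before the 1

# Sets also forget repeated elements.
print(set([1, 1]).issubset(set([1, 2])))  # True
print(is_ordered_subset([1, 1], [1, 2]))  # False: only one 1 is available
```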

0

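Thanks to all who suggested solutions and coped with my sometimes erroneous data sets. Using @hughdbrown's solution, I modified it to do what I wanted:

The modification was to use a sliding window over the target to ensure that the subset sequence was found. I think I should have used a more appropriate word than "Set" to describe my problem.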

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
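    def is_sublist_of_list(cand, target):
        try:
            start = target.index(cand[0])
        except (ValueError, IndexError):
            # cand[0] does not occur in target, or cand is empty
            return False

        # Slide a window of len(cand) over target, jumping to each
        # occurrence of cand[0]
        while start + len(cand) <= len(target):
            if cand == target[start:start + len(cand)]:
                return True
            try:
                start = target.index(cand[0], start + 1)
            except ValueError:
                return False
        return False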

    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    a = [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]
    return a

lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
expect = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

def test():
    out = super_lists(list(lists))

    print("In  : ", lists)
    print("Out : ", out)

    assert (out == expect)

Result:

In  :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
Out :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

- Oliver

One last try: I have simpler code in my latest submission. - hughdbrown

0

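So what you really wanted to know is whether a list is a substring, so to speak, of another, with all the matching elements consecutive. Here is code that converts the candidate and the target list to comma-separated strings and does a substring comparison to see whether the candidate appears within the target list.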

def is_sublist_of_any_list(cand, lists):
    def comma_list(l):
        return "," + ",".join(str(x) for x in l) + ","
    cand_str = comma_list(cand)
    # Compare list lengths (not string lengths) to skip targets that are too short
    return any(cand_str in comma_list(target) for target in lists if len(cand) <= len(target))


def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

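The function comma_list() puts leading and trailing commas on the list to ensure that integers are fully delimited. Otherwise, [1] would be a subset of [100], for example.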
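The effect of the delimiters can be checked directly. In this sketch `comma_list` mirrors the helper described above, while `bare_list` is a hypothetical counterpart without the outer commas, added only for comparison:

```python
def comma_list(items):
    """Join items with commas and wrap the result in commas."""
    return "," + ",".join(str(x) for x in items) + ","


def bare_list(items):
    """Same idea without the leading and trailing commas (for comparison only)."""
    return ",".join(str(x) for x in items)


print(bare_list([1]) in bare_list([100]))              # True: "1" is inside "100"
print(comma_list([1]) in comma_list([100]))            # False: ",1," is not inside ",100,"
print(comma_list([2, 3]) in comma_list([1, 2, 3, 4]))  # True: ",2,3," is inside ",1,2,3,4,"
```

The string approach assumes that str() of each element is unambiguous and never contains a comma itself, which holds for integers but not for arbitrary objects.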

- hughdbrown

- iElectric · Accepted Answer

This could be simplified, but:

l = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
l2 = l[:]

for m in l:
    for n in l:
        if set(m).issubset(set(n)) and m != n:
            l2.remove(m)
            break

print(l2)
[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]