Remove consecutive duplicates from a 2D list in Python?

3

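How can I remove consecutive duplicates from a 2D list based on a particular element (in this case, the second element)?

I tried several combinations using itertools, but none of them worked.

Can anyone suggest how to solve this?

Input: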


192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 16
192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 17
192.168.1.232  >>>>>   173.194.36.119 , 23 , 30 , 31
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 41
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 62
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 43
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 65
192.168.1.232  >>>>>   173.194.36.74 , 26 , 44 , 45
192.168.1.232  >>>>>   173.194.36.74 , 26 , 44 , 66
192.168.1.232  >>>>>   173.194.36.78 , 27 , 46 , 47

Output:


192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 16
192.168.1.232  >>>>>   173.194.36.119 , 23 , 30 , 31
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 41
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 43
192.168.1.232  >>>>>   173.194.36.78 , 27 , 46 , 47

This is the expected output.

Update


What is shown above is a pretty-printed form of the list.

The actual list looks like this.

>>> for x in connection_frame:
...     print x


['192.168.1.232', '173.194.36.64', 14, 15, 16]
['192.168.1.232', '173.194.36.64', 14, 15, 17]
['192.168.1.232', '173.194.36.119', 23, 30, 31]
['192.168.1.232', '173.194.36.98', 24, 40, 41]
['192.168.1.232', '173.194.36.98', 24, 40, 62]
['192.168.1.232', '173.194.36.74', 25, 42, 43]
['192.168.1.232', '173.194.36.74', 25, 42, 65]
['192.168.1.232', '173.194.36.74', 26, 44, 45]
['192.168.1.232', '173.194.36.74', 26, 44, 66]
['192.168.1.232', '173.194.36.78', 27, 46, 47]
['192.168.1.232', '173.194.36.78', 27, 46, 67]
['192.168.1.232', '173.194.36.78', 28, 48, 49]
['192.168.1.232', '173.194.36.78', 28, 48, 68]
['192.168.1.232', '173.194.36.79', 29, 50, 51]
['192.168.1.232', '173.194.36.79', 29, 50, 69]
['192.168.1.232', '173.194.36.119', 32, 52, 53]
['192.168.1.232', '173.194.36.119', 32, 52, 74]

2
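What actual data type are you working with? Are these rows strings, tuples, etc.? - aruisdante
@thecreator232 Does the order of the elements in the output matter? - thefourtheye
@aruisdante: Yes, it is actually a list nested inside another list. - thecreator232
It might help if you showed the actual data structure rather than an abstract representation, i.e. your input/output is not valid Python code. - aruisdante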
1
@thecreator232, if the order doesn't matter, how can we have meaningful consecutive entries? - wnnmaw
3 Answers

3

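If you want to preserve order and pop only consecutive entries, I don't know of any fancy built-in tool you can use. So here is the "brute force" approach: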

>>> remList = []
>>> for i in range(len(connection_frame)):
...     if (i != len(connection_frame)-1) and (connection_frame[i][1] == connection_frame[i+1][1]):
...         remList.append(i)
...
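>>> # Pop from the highest index down so earlier pops do not shift later indices
>>> for i in reversed(remList):
...     connection_frame.pop(i)
...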
['192.168.1.232', '173.194.36.119', 32, 52, 53]
['192.168.1.232', '173.194.36.79', 29, 50, 51]
['192.168.1.232', '173.194.36.78', 28, 48, 49]
['192.168.1.232', '173.194.36.78', 27, 46, 67]
['192.168.1.232', '173.194.36.78', 27, 46, 47]
['192.168.1.232', '173.194.36.74', 26, 44, 45]
['192.168.1.232', '173.194.36.74', 25, 42, 65]
['192.168.1.232', '173.194.36.74', 25, 42, 43]
['192.168.1.232', '173.194.36.98', 24, 40, 41]
['192.168.1.232', '173.194.36.64', 14, 15, 16]
>>>
>>> for conn in connection_frame:
...     print conn
...
['192.168.1.232', '173.194.36.64', 14, 15, 17]
['192.168.1.232', '173.194.36.119', 23, 30, 31]
['192.168.1.232', '173.194.36.98', 24, 40, 62]
['192.168.1.232', '173.194.36.74', 26, 44, 66]
['192.168.1.232', '173.194.36.78', 28, 48, 68]
['192.168.1.232', '173.194.36.79', 29, 50, 69]
['192.168.1.232', '173.194.36.119', 32, 52, 74]
>>>

Or, if you want to do it in one go with a list comprehension:

>>> new_frame = [conn for conn in connection_frame if not connection_frame.index(conn) in [i for i in range(len(connection_frame)) if (i != len(connection_frame)-1) and (connection_frame[i][1] == connection_frame[i+1][1])]]
>>>
>>> for conn in new_frame:
...     print conn
...
['192.168.1.232', '173.194.36.64', 14, 15, 17]
['192.168.1.232', '173.194.36.119', 23, 30, 31]
['192.168.1.232', '173.194.36.98', 24, 40, 62]
['192.168.1.232', '173.194.36.74', 26, 44, 66]
['192.168.1.232', '173.194.36.78', 28, 48, 68]
['192.168.1.232', '173.194.36.79', 29, 50, 69]
['192.168.1.232', '173.194.36.119', 32, 52, 74]
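
The one-liner above looks each row up with connection_frame.index(conn), which rescans the list for every row and always returns the first match, so two fully identical rows would be confused with each other. A minimal sketch of the same rule (keep a row only when the next row has a different second element) that pairs each row with its successor using zip. The function name keep_last_of_runs, the col parameter, and the shortened sample list are illustrative assumptions, not part of the original answer:

```python
def keep_last_of_runs(rows, col=1):
    """Return a new list keeping only the last row of each run of
    consecutive rows that share the same value in column `col`."""
    # Pair every row with its successor; the final row has no successor
    # (None) and is therefore always kept.
    successors = rows[1:] + [None]
    return [cur for cur, nxt in zip(rows, successors)
            if nxt is None or cur[col] != nxt[col]]


# Shortened copy of the question's data, for illustration only.
sample = [
    ['192.168.1.232', '173.194.36.64', 14, 15, 16],
    ['192.168.1.232', '173.194.36.64', 14, 15, 17],
    ['192.168.1.232', '173.194.36.119', 23, 30, 31],
    ['192.168.1.232', '173.194.36.98', 24, 40, 41],
    ['192.168.1.232', '173.194.36.98', 24, 40, 62],
]

for conn in keep_last_of_runs(sample):
    print(conn)
```

Like the answer's own code, this keeps the last row of each run rather than the first; the input list is left unchanged.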

@thecreator232, what did you need to change? Let me know so I can update it here. - wnnmaw
if (connection_frame[i][1] == connection_frame[i+1][1]) and (connection_frame[i][2] == connection_frame[i+1][2]) : connection_frame.remove(connection_frame[i+1]) - thecreator232
1
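@thecreator232 OK, I see what you're doing. Be careful about changing a list while iterating over it: you can get some strange results, and it is generally considered bad form, so I won't change it here. - wnnmaw
Thanks for the heads-up. - thecreator232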
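
One way to avoid changing a list while iterating over it is to leave connection_frame alone and build a new list. A minimal sketch, assuming the rule from the comment above: a row is dropped when both its second and third elements match the row just before it. The helper name drop_consecutive and the cols parameter are hypothetical:

```python
def drop_consecutive(rows, cols=(1, 2)):
    """Keep the first row of each run of consecutive rows whose values
    in all of `cols` are equal. The input list is not modified."""
    result = []
    prev_key = object()  # sentinel that never equals a real key
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key != prev_key:
            result.append(row)
        prev_key = key
    return result


# A few rows from the question's data, for illustration only.
sample = [
    ['192.168.1.232', '173.194.36.74', 25, 42, 43],
    ['192.168.1.232', '173.194.36.74', 25, 42, 65],
    ['192.168.1.232', '173.194.36.74', 26, 44, 45],
    ['192.168.1.232', '173.194.36.74', 26, 44, 66],
    ['192.168.1.232', '173.194.36.78', 27, 46, 47],
]

for conn in drop_consecutive(sample):
    print(conn)
```

Passing cols=(1,) compares only the second element, which collapses all four 173.194.36.74 rows into one.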

2

Use itertools.groupby()

import itertools

data = """192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 16
192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 17
192.168.1.232  >>>>>   173.194.36.119 , 23 , 30 , 31
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 41
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 62
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 43
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 65
192.168.1.232  >>>>>   173.194.36.74 , 26 , 44 , 45
192.168.1.232  >>>>>   173.194.36.74 , 26 , 44 , 66
192.168.1.232  >>>>>   173.194.36.78 , 27 , 46 , 47""".split("\n")

for k, g in itertools.groupby(data, lambda l:l.split()[2]):
  print next(g)

This prints

192.168.1.232  >>>>>   173.194.36.64 , 14 , 15 , 16
192.168.1.232  >>>>>   173.194.36.119 , 23 , 30 , 31
192.168.1.232  >>>>>   173.194.36.98 , 24 , 40 , 41
192.168.1.232  >>>>>   173.194.36.74 , 25 , 42 , 43
192.168.1.232  >>>>>   173.194.36.78 , 27 , 46 , 47

(This uses a list of strings, but it is easy to adapt to a list of lists.)
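
For the list-of-lists form shown in the question, the key function can index into each row instead of splitting a string. A minimal sketch, using a shortened copy of the question's data (rows picked for illustration) and operator.itemgetter as the key:

```python
from itertools import groupby
from operator import itemgetter

# Shortened copy of the question's list; 173.194.36.119 appears in two
# separate, non-adjacent runs.
connection_frame = [
    ['192.168.1.232', '173.194.36.64', 14, 15, 16],
    ['192.168.1.232', '173.194.36.64', 14, 15, 17],
    ['192.168.1.232', '173.194.36.119', 23, 30, 31],
    ['192.168.1.232', '173.194.36.98', 24, 40, 41],
    ['192.168.1.232', '173.194.36.98', 24, 40, 62],
    ['192.168.1.232', '173.194.36.119', 32, 52, 53],
    ['192.168.1.232', '173.194.36.119', 32, 52, 74],
]

# groupby() starts a new group whenever the key changes, so only
# adjacent rows with the same second element end up in the same group.
# next(group) takes the first row of each group.
result = [next(group)
          for _, group in groupby(connection_frame, key=itemgetter(1))]

for conn in result:
    print(conn)
```

Because groupby() only groups adjacent items, each of the two separate runs for 173.194.36.119 contributes its own row to the result.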

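If the OP's data structure really is a list of strings, this works. But he just said it may be a list of lists of strings (e.g. [['192.168.1.1', '>>>>', ...], ...]), which would make the answer slightly more complicated. Also, it will definitely remove non-consecutive duplicates. - aruisdante
@aruisdante: There is a note about that at the end of the answer (you may need to reload to see it). - NPE
Yes, I saw that, but this will remove non-consecutive duplicates, which is not what the OP wants. - aruisdante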
1
@aruisdante: No, this does not remove non-consecutive duplicates. Am I missing something here? - NPE
If data is a nested list, the code looks like this: result = (next(g) for _, g in groupby(data, key=lambda x: x[1])) - jfs

0

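pandas' groupby is an alternative to itertools.groupby. It also lets you track consecutive and non-consecutive elements of the original list, because it gives you row numbers instead of an iterator. Something like this: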

import pandas

df = pandas.DataFrame(connection_frame)
print df
Out:
                0                  1    2    3    4
0   '192.168.1.232'    '173.194.36.64'   14   15   16
1   '192.168.1.232'    '173.194.36.64'   14   15   17
2   '192.168.1.232'   '173.194.36.119'   23   30   31
3   '192.168.1.232'    '173.194.36.98'   24   40   41
4   '192.168.1.232'    '173.194.36.98'   24   40   62
5   '192.168.1.232'    '173.194.36.74'   25   42   43
6   '192.168.1.232'    '173.194.36.74'   25   42   65
7   '192.168.1.232'    '173.194.36.74'   26   44   45
8   '192.168.1.232'    '173.194.36.74'   26   44   66
9   '192.168.1.232'    '173.194.36.78'   27   46   47
10  '192.168.1.232'    '173.194.36.78'   27   46   67
11  '192.168.1.232'    '173.194.36.78'   28   48   49
12  '192.168.1.232'    '173.194.36.78'   28   48   68
13  '192.168.1.232'    '173.194.36.79'   29   50   51
14  '192.168.1.232'    '173.194.36.79'   29   50   69
15  '192.168.1.232'   '173.194.36.119'   32   52   53
16  '192.168.1.232'   '173.194.36.119'   32   52   74

Then you can group them by column 2 and print the groups

gps = df.groupby(2).groups
print gps
Out: 
{' 14': [0, 1],
 ' 23': [2],
 ' 24': [3, 4],
 ' 25': [5, 6],
 ' 26': [7, 8],
 ' 27': [9, 10],
 ' 28': [11, 12],
 ' 29': [13, 14],
 ' 32': [15, 16]}

See the row numbers for each key? There are many ways to remove consecutive duplicates within each list in gps. Here is one:

valid_rows = list()
for g in gps.values():
   old_row = g[0]
   valid_rows.append(old_row)
   for row_id in range(1, len(g)):
      new_row = g[row_id]
      if new_row - old_row != 1:
         valid_rows.append(new_row)
      old_row = new_row
print valid_rows
Out: [5, 3, 9, 7, 0, 2, 15, 13, 11]

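Finally, index the pandas DataFrame by valid_rows (.loc selects rows by label; the older .ix indexer has been removed from recent pandas releases).
print df.loc[sorted(valid_rows)]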
Out:


0   '192.168.1.232'    '173.194.36.64'   14   15   16
2   '192.168.1.232'   '173.194.36.119'   23   30   31
3   '192.168.1.232'    '173.194.36.98'   24   40   41
5   '192.168.1.232'    '173.194.36.74'   25   42   43
7   '192.168.1.232'    '173.194.36.74'   26   44   45
9   '192.168.1.232'    '173.194.36.78'   27   46   47
11  '192.168.1.232'    '173.194.36.78'   28   48   49
13  '192.168.1.232'    '173.194.36.79'   29   50   51
15  '192.168.1.232'   '173.194.36.119'   32   52   53
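
The row-number bookkeeping above does not depend on pandas itself. A rough pure-Python sketch of the same idea, using collections.defaultdict in place of groupby(...).groups. The function name first_rows_of_runs is hypothetical, and the sketch keys on column 1 (the question's "second element") rather than column 2 as in the answer; the sample is a shortened copy of the question's data:

```python
from collections import defaultdict


def first_rows_of_runs(rows, col=1):
    """Return the sorted indices of rows that start a run of consecutive
    rows sharing the same value in column `col`."""
    # Map each key to the row numbers where it occurs, in increasing order.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[col]].append(index)

    valid_rows = []
    for indices in groups.values():
        previous = None
        for index in indices:
            # A row is a consecutive duplicate only if the same key
            # also appeared on the row immediately before it.
            if previous is None or index - previous != 1:
                valid_rows.append(index)
            previous = index
    return sorted(valid_rows)


sample = [
    ['192.168.1.232', '173.194.36.64', 14, 15, 16],
    ['192.168.1.232', '173.194.36.64', 14, 15, 17],
    ['192.168.1.232', '173.194.36.119', 23, 30, 31],
    ['192.168.1.232', '173.194.36.98', 24, 40, 41],
    ['192.168.1.232', '173.194.36.98', 24, 40, 62],
    ['192.168.1.232', '173.194.36.119', 32, 52, 53],
    ['192.168.1.232', '173.194.36.119', 32, 52, 74],
]

for index in first_rows_of_runs(sample):
    print(sample[index])
```

Sorting the indices at the end restores the original row order, which the per-key grouping does not preserve by itself.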
