Pandas DataFrame unique values

Question

3

I need some help getting unique values from a pandas DataFrame.

I have the following:

    >>> df1
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

The goal is to remove unique values based on source+target or target+source. However, drop_duplicates does not achieve this.

>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

[Update]

Maybe "duplicates" is not the right word here, so let me explain further.

id  source  target
0   bng1.xxx.00 bdr2.xxx.00
1   bng1.xxx.00 bdr1.xxx.00
2   bdr3.yyy.00 bdr3.xxx.00
3   bdr3.xxx.00 bdr3.yyy.00
4   bdr2.xxx.00 bng1.xxx.00
5   bdr1.xxx.00 bng1.xxx.00

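Given the above, I want to remove entries whose source equals another entry's target and whose target equals that entry's source.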

0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair

The end goal is to keep rows 0, 1, 2 or rows 4, 5, 3.

- Cmarv

1

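I don't quite understand what you're trying to do. Please clarify what "remove unique values based on source+target or target+source" means. An example of the input and output would help. - Denziloe

I need to take the pair acc1.yyy + acx1.xxx and make sure no other entry matches it, or matches the pair acx1.xxx + acc1.yyy. - Cmarv

What about the metric column? If there are duplicates, which value should be used? Please edit your question again to include an example input and the desired output. - Denziloe

Metric. I've updated the post to reflect what I'm trying to achieve. - Cmarv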

2 Answers

1

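Your definition of duplicates differs from the one pandas uses. In pandas, two rows are duplicates only if their corresponding entries are identical. In the example below, the first two rows (index 0 and 1) are not duplicates because their corresponding columns hold different values, whereas the last two rows (index 2 and 3) are duplicates.

import pandas as pd

df = {'source':['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(df)
df
#      source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy
# 3  acc1.xxx  acc1.yyy
df.drop_duplicates()
#      source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy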

For the case you describe, create a new column that holds the sorted (source, target) pair as a tuple. Try the following:

# Sorting each pair makes (a, b) and (b, a) produce the same tuple.
df['src_tgt'] = [tuple(sorted(pair)) for pair in zip(df['source'], df['target'])]
df
#      source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 1  acx1.xxx  acc1.yyy  (acc1.yyy, acx1.xxx)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
# 3  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
#      source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
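
If the helper column is not wanted in the result, the same idea can be sketched with a temporary key and a boolean mask. This is only an illustration under the assumption that the four-row frame above is the input; the names `key` and `deduped` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy'],
})

# Order-insensitive key: the sorted (source, target) pair as a tuple.
# It lives in a separate Series, so df gets no extra column.
key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df['source'], df['target'])],
    index=df.index,
)

# duplicated() flags every repeat of a key after its first occurrence.
deduped = df[~key.duplicated()]
print(deduped)
```

Rows 0 and 2 remain, and the frame keeps only its original two columns.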

- Clock Slave

That didn't work for me. I've updated the post to reflect what I'm trying to achieve. - Cmarv

- jezrael · Accepted Answer

You need to sort the two columns within each row first:

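import numpy as np

# Sort each row's (source, target) pair so that reversed pairs become identical.
# np.sort is used here because apply(sorted, axis=1) may return a Series of lists on newer pandas.
df1[['source','target']] = np.sort(df1[['source','target']].to_numpy(), axis=1)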
print (df1)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000
1  acc1.yyy  acx1.xxx   10000

df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000
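
Note that the approach above overwrites `source` and `target` with their sorted values. A minimal sketch of a variant that leaves `df1` untouched, assuming the two-row frame from the question: the sorted pairs go into a separate frame (the name `pairs` is hypothetical) that is used only to build a mask.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy'],
    'metric': [10000, 10000],
})

# Row-wise sorted copy of the two columns; df1 itself is not modified.
pairs = pd.DataFrame(
    np.sort(df1[['source', 'target']].to_numpy(), axis=1),
    index=df1.index,
)

# Keep the first row of each unordered pair, with its original values.
df2 = df1[~pairs.duplicated()]
print(df2)
```

Only row 0 survives, and row 1 of `df1` still reads `acx1.xxx`, `acc1.yyy`.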

Edit:

It looks like the source column needs to be changed by removing its last 3 characters:

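import numpy as np

df1['source1'] = df1.source.str[:-3]
# Sort each row's (source1, target) pair; np.sort avoids relying on apply(sorted, axis=1) returning a DataFrame.
df1[['source1','target']] = np.sort(df1[['source1','target']].to_numpy(), axis=1)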
print (df1)
   id          source       target      source1
0   0  bng1.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00  bdr3.xxx.00
3   3  bdr3.xxx.00-00  bdr3.yyy.00  bdr3.xxx.00
4   4  bdr2.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
5   5  bdr1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00

df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
   id          source       target
0   0  bng1.xxx.00-00  bng1.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00
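
The question says that keeping either rows 0, 1, 2 or rows 4, 5, 3 would be acceptable. A rough sketch of how the `keep` argument selects between the two, assuming the plain six-row table from the question (no suffix to strip); `drop_reversed_pairs` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def drop_reversed_pairs(df, a, b, keep='first'):
    """Drop rows whose unordered (a, b) pair occurs more than once,
    keeping the first or last occurrence of each pair."""
    key = pd.Series(
        [tuple(sorted(pair)) for pair in zip(df[a], df[b])],
        index=df.index,
    )
    return df[~key.duplicated(keep=keep)]

df = pd.DataFrame({
    'source': ['bng1.xxx.00', 'bng1.xxx.00', 'bdr3.yyy.00',
               'bdr3.xxx.00', 'bdr2.xxx.00', 'bdr1.xxx.00'],
    'target': ['bdr2.xxx.00', 'bdr1.xxx.00', 'bdr3.xxx.00',
               'bdr3.yyy.00', 'bng1.xxx.00', 'bng1.xxx.00'],
})

kept_first = drop_reversed_pairs(df, 'source', 'target', keep='first')
kept_last = drop_reversed_pairs(df, 'source', 'target', keep='last')
print(kept_first.index.tolist())  # [0, 1, 2]
print(kept_last.index.tolist())   # [3, 4, 5]
```

Unlike sorting the columns in place, this keeps the original source and target values in the surviving rows.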