在Stata / Python中合并相似的行

Question

在Stata / Python中合并相似的行

3

我正在为图表分析做一些数据准备工作，我的数据如下所示。

country1   country2   pair      volume
USA         CHN       USA_CHN   10
CHN         USA       CHN_USA   5 
AFG         ALB       AFG_ALB   2
ALB         AFG       ALB_AFG   5

我希望将它们合并在一起，以便：

country1   country2   pair      volume
USA         CHN       USA_CHN   15
AFG         ALB       AFG_ALB   7

有没有一种简单的方法在Stata或Python中实现这个功能？我尝试过制作一个重复的数据框并将“pair”重命名为country2_country1，然后合并它们并删除重复的量，但这是一种繁琐的方法：我想知道是否有更好的方法。如果需要了解，我的数据格式是有向图，我正在将其转换为无向图。

- noiivice

3

你可以使用 sorted_pair = "_".join(sorted(pair.split("_"))) 创建一个“排序”的关键字，这样国家的顺序总是相同的；要更进一步地帮助你，我们需要看到你的一些代码。 - Jean-François Fabre

4个回答

1

这里有一个解决方案，利用pandas自动对齐索引的特性。

df1 = df.set_index(['country1'])
df2 = df.set_index(['country2'])
df1['volume'] += df2['volume']
df1.reset_index().query('country1 > country2')

  country1 country2     pair  volume
0      USA      CHN  USA_CHN      15
3      ALB      AFG  ALB_AFG       7

这里是基于@jean-françois-fabre评论的解决方案。

split_sorted = df.pair.str.split('_').map(sorted)
df_switch = pd.concat([split_sorted.str[0], 
                       split_sorted.str[1], 
                       df['volume']], axis=1, keys=['country1', 'country2', 'volume'])
df_switch.groupby(['country1', 'country2'], as_index=False, sort=False).sum()

输出

  country1 country2  volume
0      CHN      USA      15
1      AFG      ALB       7

- Ted Petrou

1

实际上，可以通过执行 df[['country1', 'country2']] = df[['country1', 'country2']].apply(sorted, 1) 来简化第二种方法，之后 groupby(...) 将按预期工作。 - Igor Raush

非常好。没有考虑到应用。 - Ted Petrou

1

在Stata中，您可以依靠字母顺序为每对数据提供一个独特的标识符。

clear 

input str3 (country1   country2)  volume
    USA         CHN         10 
    CHN         USA          5 
    AFG         ALB          2
    ALB         AFG          5
end 

gen first = cond(country1 < country2, country1, country2) 
gen second = cond(country1 < country2, country2, country1) 
collapse (sum) volume, by(first second) 

list 

     +-------------------------+
     | first   second   volume |
     |-------------------------|
  1. |   AFG      ALB        7 |
  2. |   CHN      USA       15 |
     +-------------------------+

你可以选择将数据合并回原始数据集中。

在此处有记录和讨论。

注：提供一个清晰的数据示例是有帮助的。将其作为input数据的代码呈现甚至更有帮助。

- Nick Cox

0

注意：正如Nick Cox在下面评论中所说，当国家数量很大时，这种解决方案会变得有些疯狂。（使用200个国家，您需要准确地存储一个200位数）

以下是使用纯Stata的简洁方法。

我将国家有效地转换为二进制“标志”，形成以下映射：

AFG  0001
ALB  0010
CHN  0100
USA  1000

这是通过给每个国家编号，然后计算 2^(country_number) 来实现的。当我们将这些二进制数字相加时，结果就是两个“旗帜”的组合。例如：

AFG + CHN = 0101
CHN + AFG = 0101

注意现在国家的顺序已经不重要了！

因此，我们现在可以愉快地添加旗帜并按结果折叠，同时逐步总结体积。

这是完整的代码（有很多注释，所以看起来比实际长度长得多！）

// Convert country names into numbers, storing the resulting
// name/number mapping in a label called "countries"
encode country1, generate(from_country) label(countries)
// Do it again for the other country, using the existing
// mappings where the countries already exist, and adding to the 
// existing mapping where they don't
encode country2, generate(to_country) label(countries)
// Add these numbers as if they were binary flags
// Thus CHN (3) + USA (4) becomes:
// 010 +
// 100
// ---
// 110
// This makes adding strings commutative and unique. This means that
// the new variable doesn't care which way round the countries are
// nor can it get confused by pairs of countries adding up to the same
// number.
generate bilateral = 2^from_country + 2^to_country
// The rest is easy. Collapse by the new summed variable
// taking (arbitrarily) the lowest of the from_countries
// and the highest of the to_countries
collapse (sum) volume (min) from_country (max) to_country, by(bilateral)
// Tell Stata that these new min and max countries still have the same
// label:
label values from_country "countries"
label values to_country "countries"

- LondonRob

1

假设你有200个不同的国家，那么bilateral的最大值将是2的该幂次方。你需要保留这么多个不同的值而不会出错。因此，我不建议你像默认使用float数据类型一样隐含地使用它。（相反，如果你将类型设置为double，使其成为默认值，那么这种情况就不会发生；但这并不是普遍适用的。）我喜欢这里的聪明技巧，但我认为这不是最好的方法。 - Nick Cox

@NickCox 公正的评论。我考虑过使用列来存储每个国家的旗帜，但是这变得有点烦人了。 - LondonRob

你的修改没有客观地反映事实。Stata可以处理这么大的数字，只是默认变量类型无法做到。Stata的观点是程序员可能需要考虑变量如何保存。 - Nick Cox

在这种情况下，我不明白你的反对意见。"far from universal" 是什么意思？ - LondonRob

你可以将 set type double 设置为自己的默认值（参见 help generate），但是我在关于变量类型的问题中看到的所有证据表明，很少有人注意到这一点，更少的人会这样做。 - Nick Cox

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Igor Raush · Accepted Answer

你的键必须由两个国家的集合组成，以便无论顺序如何进行比较都能相等。在Python/Pandas中，可以按照以下方式完成。

import pandas as pd
import io

# load in your data
s = """
country1   country2   pair      volume
USA        CHN        USA_CHN   10
CHN        USA        CHN_USA   5
AFG        ALB        AFG_ALB   2
ALB        AFG        ALB_AFG   5
"""
data = pd.read_table(io.BytesIO(s), sep='\s+')

# create your key (using frozenset instead of set, since frozenset is hashable)
key = data[['country1', 'country2']].apply(frozenset, 1)

# group by the key and aggregate using sum()
print(data.groupby(key).sum())

这将导致：

            volume
(CHN, USA)      15
(AFG, ALB)       7

这并不完全是您想要的，但您应该能够从这里将其整理成正确的形状。