pandas: group by and add columns from an apply operation

4

Given a DataFrame like this:

chrom   first_bp_intron last_bp_intron  unique_junction_reads
chr1    100 200 10
chr1    100 150 40
chr1    110 200 90

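Is there an elegant way to do the following? Group by the column first_bp_intron, divide each value in unique_junction_reads by the sum for its group, and store the result in a new column phi5. Then do the same with last_bp_intron to get a new column phi3: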

chrom   first_bp_intron last_bp_intron  unique_junction_reads   phi5    phi3
chr1    100 200 10  0.2 0.1
chr1    100 150 40  0.8 1.0
chr1    110 200 90  1.0 0.9

My approach is slow but it works. Here is what I do:

json = '{"chrom":{"4010":"chr2","4011":"chr2","4012":"chr2","4013":"chr2","4014":"chr2","4015":"chr2","4016":"chr2","4017":"chr2","4018":"chr2","4019":"chr2","4020":"chr2","4021":"chr2","4022":"chr2","4023":"chr2","4024":"chr2","4025":"chr2"},"first_bp_intron":{"4010":50149390,"4011":50170930,"4012":50280729,"4013":50318633,"4014":50464109,"4015":50692700,"4016":50693626,"4017":50699610,"4018":50723234,"4019":50724853,"4020":50733756,"4021":50755790,"4022":50758569,"4023":50765775,"4024":51012497,"4025":51015345},"last_bp_intron":{"4010":50170841,"4011":50280408,"4012":50318460,"4013":50463926,"4014":50692579,"4015":50693598,"4016":50699435,"4017":50723042,"4018":50724470,"4019":50733632,"4020":50755762,"4021":50758364,"4022":50765390,"4023":50779724,"4024":51017681,"4025":51017681},"unique_junction_reads":{"4010":1,"4011":3,"4012":6,"4013":6,"4014":15,"4015":8,"4016":8,"4017":5,"4018":40,"4019":86,"4020":85,"4021":64,"4022":81,"4023":53,"4024":12,"4025":9}}'

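import pandas as pd
from io import StringIO

# Wrap the literal JSON string in StringIO so read_json treats it as a file-like object
sj = pd.read_json(StringIO(json))

# Total reads per (chrom, 5' end) and per (chrom, 3' end); the group keys are passed as a list
five_prime_reads = sj.groupby(['chrom', 'first_bp_intron']).apply(lambda x: x.unique_junction_reads.sum())
three_prime_reads = sj.groupby(['chrom', 'last_bp_intron']).apply(lambda x: x.unique_junction_reads.sum())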


for (chrom, first_bp_intron, last_bp_intron), df in sj.groupby(['chrom', 'first_bp_intron', 'last_bp_intron']):
    # Print each junction's share of the reads at its 3' end (phi3) and 5' end (phi5)
    print(chrom, last_bp_intron, end=' ')
    print('\tphi3', (df.unique_junction_reads / three_prime_reads[(chrom, last_bp_intron)]).values, end=' ')
    print('\tphi5', (df.unique_junction_reads / five_prime_reads[(chrom, first_bp_intron)]).values)

But I'm sure there is a more elegant way to express this in pandas.
Here is the full IPython notebook showing what I'm trying to do: http://nbviewer.ipython.org/11418657
1 Answer

11

I would use groupby and transform, something like the following:

In [9]: by_first = df.groupby('first_bp_intron')
In [10]: df['phi5'] = by_first['unique_junction_reads'].transform(lambda x: x/x.sum())

In [11]: by_last = df.groupby('last_bp_intron')
In [12]: df['phi3'] = by_last['unique_junction_reads'].transform(lambda x: x/x.sum())

In [13]: df
Out[13]: 
  chrom  first_bp_intron  last_bp_intron  unique_junction_reads  phi5  phi3
0  chr1              100             200                     10   0.2   0.1
1  chr1              100             150                     40   0.8   1.0
2  chr1              110             200                     90   1.0   0.9
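For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is a variation rather than the exact code above: it assumes you also want chrom in the group keys (as in the question's own attempt), and it swaps the lambda for transform("sum") followed by a plain division. The names reads, five_prime_total and three_prime_total are only illustrative.

```python
import pandas as pd

# The three-row example frame from the question
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "first_bp_intron": [100, 100, 110],
    "last_bp_intron": [200, 150, 200],
    "unique_junction_reads": [10, 40, 90],
})

reads = df["unique_junction_reads"]

# transform("sum") broadcasts each group's total back onto the original rows,
# so the result lines up with df's index and can be divided directly.
five_prime_total = df.groupby(["chrom", "first_bp_intron"])["unique_junction_reads"].transform("sum")
three_prime_total = df.groupby(["chrom", "last_bp_intron"])["unique_junction_reads"].transform("sum")

df["phi5"] = reads / five_prime_total
df["phi3"] = reads / three_prime_total

print(df)
```

On the example data this should give the same phi5 and phi3 columns as the output above. Passing the string "sum" instead of a Python lambda is usually faster on large frames, though I have not benchmarked it on the data in the question.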

Awesome, transform() is exactly what I needed! But would you mind explaining the difference between transform and apply? - Olga Botvinnik
1
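If you replace transform with apply, you should get the same output. apply is the more general method; transform is the appropriate one when you want to return something that is indexed like the input. - Karl D.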
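
To make the "indexed like the input" point concrete, here is a small sketch on a cut-down version of the example data. It only covers the case where the function reduces each group to a single number. The names per_row and per_group are illustrative. For functions that return a same-shaped result, the index that apply hands back may depend on the pandas version and the group_keys setting, so that case is left out here.

```python
import pandas as pd

df = pd.DataFrame({
    "first_bp_intron": [100, 100, 110],
    "unique_junction_reads": [10, 40, 90],
})
grouped = df.groupby("first_bp_intron")["unique_junction_reads"]

# transform: one value per original row, carrying the original index (0, 1, 2)
per_row = grouped.transform("sum")

# apply with a reducing function: one value per group, indexed by the group key
per_group = grouped.apply(lambda s: s.sum())

print(per_row)
print(per_group)
```

Because per_row shares df's index, it can be assigned straight back as a new column, whereas per_group would first have to be mapped or merged back onto the rows.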
