Python Pandas - Conditional Merge

3

I have 2 dataframes in python pandas

Dataframe 1

User_id  zipcode
1        12345
2        23456
3        34567

Dataframe 2

ZipCodeLowerBound ZipCodeUpperBound Region
10000             19999             1
20000             29999             2
30000             39999             3

How can I use the pandas merge method to map Region onto Dataframe 1, subject to the condition if(df1.zipcode>=df2.ZipCodeLowerBound and df1.zipcode<=df2.ZipCodeUpperBound)?

4 Answers

2
This gives one column per region, holding a boolean mask of whether each zipcode in df1 falls inside that region's bounds:
df2 = df2.set_index('Region')
mask = df2.apply(lambda r: df1.zipcode.between(r['ZipCodeLowerBound'],
                                               r['ZipCodeUpperBound']),
                 axis=1).T
mask
Out[103]: 
Region      1      2      3
0        True  False  False
1       False   True  False
2       False  False   True

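You can then take the dot product of that matrix with its own column names, which uses each row as a mask and recovers the matching region: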
mask.dot(mask.columns)
Out[110]: 
0    1
1    2
2    3
dtype: int64
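
The same mask can also be built with NumPy broadcasting instead of apply. The following is a minimal sketch, not the answerer's code: the fourth user with zipcode 99999 is a made-up row added to show what happens when no region matches, and the nullable Int64 dtype is just one way to represent that gap.

```python
import pandas as pd

df1 = pd.DataFrame({'User_id': [1, 2, 3, 4],
                    'zipcode': [12345, 23456, 34567, 99999]})  # 99999 is hypothetical, matches no region
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})

z = df1['zipcode'].to_numpy()[:, None]      # shape (n_users, 1)
lo = df2['ZipCodeLowerBound'].to_numpy()    # shape (n_regions,)
hi = df2['ZipCodeUpperBound'].to_numpy()
mask = (z >= lo) & (z <= hi)                # shape (n_users, n_regions)

matched = mask.any(axis=1)                  # False where no region contains the zipcode
first_hit = mask.argmax(axis=1)             # position of the first matching region (0 if none)
regions = pd.Series(df2['Region'].to_numpy()[first_hit], index=df1.index)

# Blank out rows without a match and keep an integer dtype that allows missing values
result = df1.assign(Region=regions.where(matched).astype('Int64'))
print(result)
```

Like the apply-based mask, this compares every zipcode against every region, so memory grows with len(df1) * len(df2).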

0

One option is pyjanitor's conditional_join, which is efficient for range joins and performs better than a naive cross join:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df2
.conditional_join(
    df1, 
    ('ZipCodeLowerBound', 'zipcode', '<='), 
    ('ZipCodeUpperBound', 'zipcode', '>='))
.loc(axis=1)['User_id', 'zipcode', 'Region']
)

   User_id  zipcode  Region
0        1    12345       1
1        2    23456       2
2        3    34567       3

With the dev version, you can also select columns:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd

(df2
.conditional_join(
    df1, 
    ('ZipCodeLowerBound', 'zipcode', '<='), 
    ('ZipCodeUpperBound', 'zipcode', '>='),
    df_columns = 'Region')
)

   Region  User_id  zipcode
0       1        1    12345
1       2        2    23456
2       3        3    34567
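
For comparison, the naive cross join mentioned above can be sketched in plain pandas. This is only an illustration of the baseline, assuming pandas 1.2 or later (for how='cross') and the sample data from the question; it materializes every pair of rows before filtering, so it suits small frames only.

```python
import pandas as pd

df1 = pd.DataFrame({'User_id': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})

# Pair every user with every region: len(df1) * len(df2) rows
crossed = df1.merge(df2, how='cross')

# Keep only the pairs whose zipcode lies within the region's bounds (inclusive)
in_range = crossed['zipcode'].between(crossed['ZipCodeLowerBound'],
                                      crossed['ZipCodeUpperBound'])
result = crossed.loc[in_range, ['User_id', 'zipcode', 'Region']].reset_index(drop=True)
print(result)
```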

0
df1['Region'] = df1.User_id
df1.merge(df2, on='Region')

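If both dataframes share a column, you can merge the two datasets on it.

This is just an example; you can adapt the merge to your own condition and try it out.

This is the output after merging: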

   User_id  zipcode  Region  ZipCodeLowerBound  ZipCodeUpperBound
0        1    12345       1              10000              19999
1        2    23456       2              20000              29999
2        3    34567       3              30000              39999
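
The merge above works on the sample data only because User_id happens to equal Region. For a genuine range condition, one possible merge-based alternative is pd.merge_asof followed by a check on the upper bound. This is a sketch under assumptions, not part of the answer: it assumes the ranges in df2 do not overlap, and the fourth user with zipcode 45000 is a made-up row that falls outside every range.

```python
import pandas as pd

df1 = pd.DataFrame({'User_id': [1, 2, 3, 4],
                    'zipcode': [12345, 23456, 34567, 45000]})  # 45000 is hypothetical, outside all ranges
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})

# merge_asof requires both frames to be sorted on their join keys
left = df1.sort_values('zipcode')
right = df2.sort_values('ZipCodeLowerBound')

# For each zipcode, pick the last range whose lower bound is <= the zipcode
merged = pd.merge_asof(left, right,
                       left_on='zipcode', right_on='ZipCodeLowerBound')

# merge_asof checks only the lower bound, so enforce the upper bound separately
in_range = merged['zipcode'] <= merged['ZipCodeUpperBound']
result = merged.loc[in_range, ['User_id', 'zipcode', 'Region']]
print(result)
```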

0
import pandas as pd

df1 = pd.DataFrame({'User_id': [1,2,3],
                    'zipcode':[12345,23456,34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000,20000,30000],
                    'ZipCodeUpperBound': [19999,29999,39999],
                    'Region': [1,2,3]})

region = []
for z in df1.zipcode:
    # rows of df2 whose bounds contain this zipcode
    match = df2[(df2.ZipCodeLowerBound <= z) & (df2.ZipCodeUpperBound >= z)]
    region.append(int(match['Region'].iloc[0]))
df1['Region'] = region

print(df1)

Output:

   User_id  zipcode  Region
0        1    12345       1
1        2    23456       2
2        3    34567       3
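
The row-by-row loop can likely be replaced with a vectorized lookup through pd.IntervalIndex. The sketch below is an alternative technique rather than this answer's method, and it assumes the ranges in df2 do not overlap (get_indexer raises an error on overlapping intervals).

```python
import pandas as pd

df1 = pd.DataFrame({'User_id': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})

# One closed interval [lower, upper] per row of df2
intervals = pd.IntervalIndex.from_arrays(df2['ZipCodeLowerBound'].to_numpy(),
                                         df2['ZipCodeUpperBound'].to_numpy(),
                                         closed='both')

# Position of the interval containing each zipcode, or -1 if none contains it
pos = intervals.get_indexer(df1['zipcode'].to_numpy())

# Look up the region by position; rows with pos == -1 become missing values
region = pd.Series(df2['Region'].to_numpy()[pos], index=df1.index)
df1['Region'] = region.where(pos >= 0)
print(df1)
```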
