Pandas: DataFrames won't merge

3

I have two DataFrames (available here and here), as follows:

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
                engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I have only included the code for reading df, because that file has some unusual formatting issues.

df.dtypes

SICcode     object
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

merged.dtypes

2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

    SICcode     Catcode     Category                          SICname   MultSIC
0   111         A1500   Wheat, corn, soybeans and cash grain    Wheat   X
1   112         A1600   Other commodities (incl rice, peanuts)  Rice    X
2   115         A1500   Wheat, corn, soybeans and cash grain    Corn    X

merged.sort_values('SICcode')

      2012 NAICS Code  2002to2007 NAICS  SICcode
89             212210            212210     1011
93             212234            212234     1021
92             212231            212231     1031
90             212221            212221     1041
91             212222            212222     1044
96             212299            212299     1061
94             212234            212234     1061
119            213114            213114     1081
1770           541360            541360     1081
233            238910            238910     1081
95             212291            212291     1094
97             212299            212299     1099
3              111140            111140      111
6              111160            111160      112
4              111150            111150      115
0              111110            111110      116

I try to merge them with the following code:

merged = pd.merge(merged, df, how='right', on='SICcode')

The result is as follows:

2012 NAICS Code        0
2002to2007 NAICS       0
SICcode             1007
Catcode              991
Category            1007
SICname             1007
MultSIC              906
dtype: int64

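I suspect the problem lies in the formatting of df, but I don't know how to describe it (I have heard the term "whitespace", which may be relevant here), nor do I know how to fix it. Does anyone have any ideas?

1 Answer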

4
I believe this is the root cause of your problem:
In [47]: merged[merged.SICcode == 'Aux']
Out[47]:
      2012 NAICS Code  2002to2007 NAICS SICcode
1828         551114.0          551114.0     Aux

That non-numeric value causes the SICcode columns to have different dtypes:
In [61]: df.dtypes
Out[61]:
SICcode      int64
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

In [62]: merged.dtypes
Out[62]:
2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

In [63]: df.SICcode.unique()
Out[63]: array([ 111,  112,  115, ..., 9711, 9721, 9999], dtype=int64)

In [64]: merged.SICcode.head(10).unique()
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)
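
To find stray values like 'Aux' without scanning the column by eye, one option is to coerce the key to numeric and look at the rows that fail to convert. This is only a minimal sketch: the `toy` frame below is made-up data standing in for merged, not the real file.

```python
import pandas as pd

# Hypothetical toy data standing in for merged.SICcode (not the real file)
toy = pd.DataFrame({"SICcode": ["111", "112", "Aux", "1011"]})

# With errors='coerce', anything that cannot be parsed as a number becomes NaN
as_num = pd.to_numeric(toy["SICcode"], errors="coerce")

# Rows that were not missing originally but failed the conversion
bad_rows = toy[as_num.isna() & toy["SICcode"].notna()]
print(bad_rows)
```

Any rows that show up in `bad_rows` are the ones forcing the column to object dtype.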

So you can do this:
import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# Clean the data: convert SICcode to numeric; non-numeric values such as 'Aux' become NaN
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

# Left merge so that every row of df is kept
mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()

Output:

In [51]: mrg.head()
Out[51]:
   SICcode Catcode                                       Category  \
0      111   A1500           Wheat, corn, soybeans and cash grain
1      112   A1600  Other commodities (incl rice, peanuts, honey)
2      115   A1500           Wheat, corn, soybeans and cash grain
3      116   A1500           Wheat, corn, soybeans and cash grain
4      119   A1500           Wheat, corn, soybeans and cash grain

            SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
0             Wheat       X         111140.0          111140.0
1              Rice       X         111160.0          111160.0
2              Corn       X         111150.0          111150.0
3          Soybeans       X         111110.0          111110.0
4  Cash grains, NEC       X         111120.0          111120.0
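
To confirm that the keys really line up after cleaning, the `indicator=True` option of `merge` can help: it adds a `_merge` column saying whether each row matched. A minimal sketch with made-up data follows; the `left` and `right` frames, the `NAICS` column name, and the `Y0000` code are hypothetical stand-ins, not values from the real files.

```python
import pandas as pd

# Hypothetical stand-ins for df (integer keys) and merged (string keys)
left = pd.DataFrame({"SICcode": [111, 112, 9999],
                     "Catcode": ["A1500", "A1600", "Y0000"]})
right = pd.DataFrame({"SICcode": ["111", "112", "Aux"],
                      "NAICS": [111140, 111160, 551114]})

# Coerce the string keys to numbers; 'Aux' becomes NaN
right["SICcode"] = pd.to_numeric(right["SICcode"], errors="coerce")

# Optional: drop unparseable keys so both key columns are plain int64
right = right.dropna(subset=["SICcode"])
right["SICcode"] = right["SICcode"].astype("int64")

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = left.merge(right, on="SICcode", how="left", indicator=True)
n_matched = int((check["_merge"] == "both").sum())
n_unmatched = int((check["_merge"] == "left_only").sum())
print(check)
print(n_matched, "matched,", n_unmatched, "unmatched")
```

If nearly every row comes back as `left_only`, the key columns most likely still differ in dtype or formatting.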

1
@MichaelPerdue, glad I could help :) - MaxU - stand with Ukraine
