Complicated (for me) reshaping from wide to long in pandas

Question

7

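Individuals (indexed from 0 to 5) choose between two locations, A and B. My data is in wide format and contains characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).

For example, I have: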

In [281]:

df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

df_reshape_test

Out[281]:
    dist_to_A   dist_to_B   ind_var location location_var
0    0            50             3   A       10
1    0            50             8   A       10
2    0            50            10   A       10
3    50           0              1   B       14
4    50           0              3   B       14
5    50           0              4   B       14

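The variable 'location' is the location chosen by the individual. dist_to_A is the distance to location A from the location the individual chose (likewise for dist_to_B).

I'd like my data to have this form: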

    choice  dist_S  ind_var location    location_var
0    1        0       3         A           10
0    0       50       3         B           14
1    1        0       8         A           10
1    0       50       8         B           14
2    1        0      10         A           10
2    0       50      10         B           14
3    0       50       1         A           10
3    1        0       1         B           14
4    0       50       3         A           10
4    1        0       3         B           14
5    0       50       4         A           10
5    1        0       4         B           14

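Here choice == 1 indicates that the individual chose that location, and dist_S is the distance from the chosen location to the location in that row.

I read about the .stack method but couldn't figure out how to apply it in this case. Thanks for your time!

NOTE: this is just a simple example. The datasets I'm working with have a varying number of locations and a varying number of individuals per location, so I'm looking for a flexible solution if possible.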

- cd98

3 Answers

3

I'm a bit curious why you'd want it in this format. There's probably a better way to store your data, but here goes.

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df2) // 2 if you have more rows.

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:, df2.columns.get_loc('location')] = df2['location'].iloc[6:].map({'A': 'B', 'B': 'A'}).values

In [149]: df2 = df2.sort_index(kind='mergesort')  # stable sort keeps the chosen row first

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

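It won't generalize well, but there are probably alternative (better) ways to get around the uglier parts, like generating the choice column.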

- TomAugspurger

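Thanks for your answer! I agree it's a weird format, but I need it because Stata's asclogit command requires this form of dataset to run a conditional logit (i.e. with characteristics that vary across alternatives). I'll try your solution and wait to see if others respond as well. - cd98

OK. Have you looked at statsmodels? It's a Python package for econometrics/statistical inference. I'm not sure whether anyone has implemented conditional logit, but it covers all the basics. - TomAugspurger

I took a look at statsmodels, but so far they only have multinomial logit (characteristics that vary by individual). However, it looks like they're working on conditional logit (see this blog). - cd98

I tried to apply TomAugspurger's code to my real dataset (which has about 60 different locations to choose from instead of just A and B, and a different number of individuals choosing each location), but I haven't figured out how to make it work in my case. I'm looking into a MultiIndex over individual and choice to see if that solves it. - cd98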

1

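Hmm, sorry to hear it didn't work. Let me know if you have any other questions. By the way, pd.get_dummies() may be helpful for handling that many locations. - TomAugspurger

Hi, thanks again. get_dummies() looks like something I'll have to use later on. If you want to take a look, I've posted a simplified version of the question here. - cd98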
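For readers following the pd.get_dummies() suggestion in the comments above, here is a minimal sketch of what it produces on the example data. The `chose` prefix is only an illustrative name, not something from the thread, and this is one possible reading of the suggestion rather than the commenter's exact intent.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ind_var': [3, 8, 10, 1, 3, 4],
})

# One indicator column per distinct location; works unchanged for 60 locations.
# The 'chose' prefix is a made-up name for illustration.
dummies = pd.get_dummies(df['location'], prefix='chose', dtype=int)

# Attach the indicators to the original (still wide) frame.
df_wide = pd.concat([df, dummies], axis=1)
print(df_wide)
```

This gives a wide 0/1 indicator per location; it still has to be reshaped to long form to get one row per individual and alternative.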

2

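OK, this took longer than I expected, but here is a more general answer that works for an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if someone could chime in with better code.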

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

This gives

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

Then we do the following:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

Now I create a MultiIndex and reindex to get the long shape.

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

The long shape looks like this:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1
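Since the question asks for something flexible, the hard-coded loc_list and ind_list could probably be derived from the data instead. The following is a minimal sketch of that idea using pd.MultiIndex.from_product; it assumes every individual faces every location that appears in the `location` column, which may not hold in a real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dist_to_A': [0, 0, 0, 50, 50, 50],
    'dist_to_B': [50, 50, 50, 0, 0, 0],
    'location_var': [10, 10, 10, 14, 14, 14],
    'ind_var': [3, 8, 10, 1, 3, 4],
})
df['choice'] = 1

# Derive both lists from the data (assumption: all observed locations
# are available to every individual).
ind_list = df.index.tolist()
loc_list = sorted(df['location'].unique())

# Every (individual, location) pair, in the same order as the list comprehension.
idx = pd.MultiIndex.from_product([ind_list, loc_list], names=['ind', 'loc'])

# Rows for alternatives that were not chosen come back as NaN.
df_long = df.set_index([df.index, df['location']]).reindex(idx)
print(df_long)
```

With 60 locations, the same code yields 60 rows per individual without any list being typed by hand.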

Now fill in the NaN values using the dictionaries:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that's left is to create dist_S.
I'll cheat here and assume I can create a nested dictionary like this:

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(This says: if you're at location A, then location A is 0 km away and location B is 50 km away.)

def nested_f(x):
    # x is one row holding 'loc' (the alternative) and 'location' (the chosen one)
    return nested_loc[x['loc']][x['location']]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

which gives the desired result

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0
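The nested dictionary assumed above does not have to be typed by hand. A rough sketch of one way to build it from the dist_to_* columns follows; it assumes the distances are identical for everyone who chose the same location, as they are in the example, and that every distance column starts with `dist_to_`.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dist_to_A': [0, 0, 0, 50, 50, 50],
    'dist_to_B': [50, 50, 50, 0, 0, 0],
    'location_var': [10, 10, 10, 14, 14, 14],
    'ind_var': [3, 8, 10, 1, 3, 4],
})

prefix = 'dist_to_'
dist_cols = [c for c in df.columns if c.startswith(prefix)]

# One row per chosen location (assumption: distances are constant within a group).
dist_table = df.groupby('location')[dist_cols].first()

# Strip the prefix so the columns are plain location labels.
dist_table.columns = [c[len(prefix):] for c in dist_cols]

# {chosen location: {other location: distance}}
nested_loc = dist_table.to_dict(orient='index')
print(nested_loc)
```

The result has the same shape as the hand-written nested_loc, so nested_f can use it unchanged.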

- cd98


- Zhen Sun · Accepted Answer

In fact, pandas has a function called wide_to_long that conveniently does what you intend to do.

import numpy as np
import pandas as pd

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

# The `location` and `location_var` columns correspond to the choices;
# record them as dictionaries and drop them
# (just realized you had a cleaner way, copied from yours).

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape; suffix = r'\w+' is needed on current pandas because the
# default suffix pattern only matches numeric suffixes
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location', suffix = r'\w+')

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print(df_long.sort_index())

This gives the table you asked for:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course, you can generate a choice variable that takes the values 0 and 1 if that's what you want.
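As a sketch of that last step, the 0/1 choice column can be obtained by comparing each row's alternative with the location the individual actually chose. The names `alt`, `chosen` and `loc_var` are illustrative and not from the answer above, and the code assumes a pandas version whose wide_to_long accepts the `suffix` argument.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dist_to_A': [0, 0, 0, 50, 50, 50],
    'dist_to_B': [50, 50, 50, 0, 0, 0],
    'location_var': [10, 10, 10, 14, 14, 14],
    'ind_var': [3, 8, 10, 1, 3, 4],
})
df['ind'] = df.index

# Lookups: location-level variable, and each individual's chosen location.
loc_var = df.groupby('location')['location_var'].first()
chosen = df.set_index('ind')['location']

# Reshape the dist_to_* columns; 'alt' holds the alternative's label (A or B).
df_long = pd.wide_to_long(
    df.drop(columns=['location', 'location_var']),
    stubnames='dist_to_', i='ind', j='alt', suffix=r'\w+',
).reset_index().sort_values(['ind', 'alt'], ignore_index=True)

df_long = df_long.rename(columns={'dist_to_': 'dist_S'})

# 1 where the alternative is the one the individual chose, else 0.
df_long['choice'] = (df_long['alt'] == df_long['ind'].map(chosen)).astype(int)
df_long['location_var'] = df_long['alt'].map(loc_var)
print(df_long)
```

Nothing here is specific to two locations, so it should carry over to more alternatives as long as every dist_to_* column is present for every individual.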