Counting Likert scale results across multiple question columns with pandas

4
I have the following data frame:
       Question1        Question2         Question3          Question4
User1  Agree            Agree          Disagree         Strongly Disagree
User2  Disagree         Agree          Agree            Disagree
User3  Agree            Agree          Agree            Agree

Is there a way to convert the data frame above into the form below?
           Agree  Disagree  Strongly Disagree
Question1      2         1                  0
Question2      3         0                  0
Question3      2         1                  0
Question4      1         1                  1

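This is similar to my previous question, on how to group and extract a data frame from three columns. I looked at earlier questions and tried stack/pivot, but could not work it out. The actual data frame has more than 20 questions and a Likert scale with Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree.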
3 Answers

5

Use pd.get_dummies

pd.get_dummies(df.stack()).groupby(level=1).sum()

           Agree  Disagree  Strongly Disagree
Question1      2         1                  0
Question2      3         0                  0
Question3      2         1                  0
Question4      1         1                  1
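
For a self-contained check, the sketch below rebuilds the sample frame from the question and applies the same one-liner. The `LIKERT` list and the `reindex` step are assumptions added for illustration, not part of the answer above; they force all five scale levels to appear as columns, in a fixed order, even when nobody chose them.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
        "Question3": ["Disagree", "Agree", "Agree"],
        "Question4": ["Strongly Disagree", "Disagree", "Agree"],
    },
    index=["User1", "User2", "User3"],
)

# Assumed full scale in display order (hypothetical labels).
LIKERT = ["Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree"]

counts = (
    pd.get_dummies(df.stack())   # one indicator column per distinct answer
    .groupby(level=1)            # level 1 of the stacked index is the question
    .sum()
    .reindex(columns=LIKERT, fill_value=0)  # add unseen levels as zero columns
    .astype(int)
)
print(counts)
```

Reindexing after the count keeps the one-liner unchanged and only affects which columns are displayed.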

Taking it to another level
We can use numpy.bincount to speed things up, but we have to be careful with the dimensions.

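import numpy as np
import pandas as pd

v = df.values
f, u = pd.factorize(v.ravel())  # f: integer code per cell, u: distinct answers
n, m = u.size, v.shape[1]
# Column index of every cell; tile once per row (not once per distinct answer)
r = np.tile(np.arange(m), v.shape[0])
b0 = np.bincount(r * n + f)
# bincount stops at the largest index seen, so pad up to m * n bins
pad = np.zeros(n * m - b0.size, dtype=int)
b = np.append(b0, pad)

pd.DataFrame(b.reshape(m, n), df.columns, u)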

           Agree  Disagree  Strongly Disagree
Question1      2         1                  0
Question2      3         0                  0
Question3      2         1                  0
Question4      1         1                  1
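
The manual padding can also be avoided with the `minlength` argument of `np.bincount`. The sketch below is a variant of the idea, not the answer's exact code, and the helper name `likert_counts` is hypothetical. It tiles the column indexes once per row of the frame, and it assumes the frame has no missing values (`pd.factorize` codes them as -1, which `np.bincount` rejects).

```python
import numpy as np
import pandas as pd

def likert_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count each distinct answer per column using a single bincount."""
    v = df.to_numpy()
    rows, m = v.shape
    f, u = pd.factorize(v.ravel())     # f: integer code per cell, u: distinct answers
    n = u.size
    col = np.tile(np.arange(m), rows)  # column index of every cell, row-major order
    # Each (column, answer) pair gets its own bin: column * n + answer code.
    b = np.bincount(col * n + f, minlength=m * n)
    return pd.DataFrame(b.reshape(m, n), index=df.columns, columns=u)

# Four respondents and two distinct answers, so the two sizes differ.
sample = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree", "Agree"],
    }
)
result = likert_counts(sample)
print(result)
```

With `minlength=m * n` the result always has one bin per (column, answer) pair, so the reshape works even when the last answer never occurs in the last column.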

Another numpy option

v = df.values
n, m = v.shape
f, u = pd.factorize(v.ravel())

pd.DataFrame(
    np.eye(u.size, dtype=int)[f].reshape(n, m, -1).sum(0),
    df.columns, u
)

           Agree  Disagree  Strongly Disagree
Question1      2         1                  0
Question2      3         0                  0
Question3      2         1                  0
Question4      1         1                  1
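
To see why this works, here is a minimal sketch on bare integer codes. The `codes` array is written out by hand as an assumption of what `pd.factorize` produces for the sample frame (0 = Agree, 1 = Disagree, 2 = Strongly Disagree). Indexing an identity matrix with the codes gives one one-hot row per cell; reshaping to (rows, columns, levels) and summing over the row axis leaves a count per question and level.

```python
import numpy as np

# Answers in row-major order, one line per respondent.
codes = np.array([0, 0, 1, 2,   # User1
                  1, 0, 0, 1,   # User2
                  0, 0, 0, 0])  # User3
n_rows, n_cols, n_levels = 3, 4, 3

one_hot = np.eye(n_levels, dtype=int)[codes]  # shape (12, 3), a single 1 per row
counts = one_hot.reshape(n_rows, n_cols, n_levels).sum(axis=0)  # shape (4, 3)
print(counts)
```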

Speed differences

%%timeit
v = df.values
f, u = pd.factorize(v.ravel())
n, m = u.size, v.shape[1]
r = np.tile(np.arange(m), v.shape[0])
b0 = np.bincount(r * n + f)
pad = np.zeros(n * m - b0.size, dtype=int)
b = np.append(b0, pad)

pd.DataFrame(b.reshape(m, n), df.columns, u)
1000 loops, best of 3: 194 µs per loop

%%timeit
v = df.values
n, m = v.shape
f, u = pd.factorize(v.ravel())

pd.DataFrame(
    np.eye(u.size, dtype=int)[f].reshape(n, m, -1).sum(0),
    df.columns, u
)
1000 loops, best of 3: 195 µs per loop

%timeit pd.get_dummies(df.stack()).groupby(level=1).sum()
1000 loops, best of 3: 1.2 ms per loop

Thanks! This solved my problem perfectly, and the "extra bonus" part you provided in my previous question helped me sort the columns. - yangd01234

4

You can apply pd.Series.value_counts to each column. If you do this with apply, the indexes are aligned automatically:

df.apply(pd.Series.value_counts)
Out: 
                   Question1  Question2  Question3  Question4
Agree                    2.0        3.0        2.0          1
Disagree                 1.0        NaN        1.0          1
Strongly Disagree        NaN        NaN        NaN          1

Some post-processing:
df.apply(pd.Series.value_counts).fillna(0).astype('int')
Out: 
                   Question1  Question2  Question3  Question4
Agree                      2          3          2          1
Disagree                   1          0          1          1
Strongly Disagree          0          0          0          1
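
The question asks for questions as rows and answers as columns, which is the transpose of this result. A minimal sketch, assuming the sample frame from the question (rebuilt here so the snippet runs on its own; the name `table` is only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Question1": ["Agree", "Disagree", "Agree"],
        "Question2": ["Agree", "Agree", "Agree"],
        "Question3": ["Disagree", "Agree", "Agree"],
        "Question4": ["Strongly Disagree", "Disagree", "Agree"],
    },
    index=["User1", "User2", "User3"],
)

table = (
    df.apply(pd.Series.value_counts)  # answers as rows, questions as columns
    .fillna(0)                        # answers absent from a column become 0
    .astype(int)
    .T                                # questions as rows, answers as columns
)
print(table)
```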

1
df.apply(lambda x:x.value_counts()).fillna(0).astype(int)
#                   Question1  Question2  Question3  Question4
#Agree                      2          3          2          1
#Disagree                   1          0          1          1
#Strongly Disagree          0          0          0          1
