pandas: How do I convert a string column to an ordered category?

Question

8

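I'm using pandas for the first time. I have a column of survey responses, which can be Strongly agree, Agree, Disagree, Strongly disagree, or Neither.

Here is the output of describe() and value_counts() for that column: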

count      4996
unique        5
top       Agree
freq       1745
dtype: object
Agree                1745
Strongly agree        926
Strongly disagree     918
Disagree              793
Neither               614
dtype: int64

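I want to run a linear regression of this question against the overall score. However, I have a feeling I should first convert the column to a categorical variable, since it is inherently ordered. Is that correct? If so, how do I do it?

I have tried the following: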

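# Note: Categorical.from_array was removed in later pandas releases; pd.Categorical(...) replaces it.
df.EasyToUseQuestionFactor = pd.Categorical.from_array(df.EasyToUseQuestion)
print(df.EasyToUseQuestionFactor)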

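This produces output that looks roughly right, but the categories seem to be in the wrong order. Can I specify the order? Do I need to?

Here is the rest of my code so far: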

import pandas as pd
from statsmodels.formula.api import ols  # assumed source of ols

df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', df).fit()
print(lm1.rsquared)

- Richard

See here for the full categorical support that is on the way (it will be in 0.15.0, which has not been released yet): http://pandas-docs.github.io/pandas-docs-travis/categorical.html - Jeff

3 Answers

3

Yes, you should convert it to categorical data, and this should do the trick.

# Keys must match the response strings exactly, including capitalization.
likert_scale = {'Strongly agree': 2, 'Agree': 1, 'Neither': 0, 'Disagree': -1, 'Strongly disagree': -2}
df['categorical_data'] = df.EasyToUseQuestion.apply(lambda x: likert_scale[x])
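
A minimal sketch of the same lookup done with Series.map, which may be easier to debug because responses missing from the dictionary become NaN rather than raising KeyError. The sample frame, the stray lowercase 'agree' row, and the `score` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the labels mirror the value_counts() output in the question.
df = pd.DataFrame({'EasyToUseQuestion': ['Agree', 'Strongly agree', 'Neither',
                                         'Disagree', 'Strongly disagree', 'agree']})

likert_scale = {'Strongly agree': 2, 'Agree': 1, 'Neither': 0,
                'Disagree': -1, 'Strongly disagree': -2}

# Series.map looks each value up in the dict; unmatched values become NaN
# instead of raising KeyError.
df['score'] = df['EasyToUseQuestion'].map(likert_scale)

# Rows whose response had no entry in the mapping (here the lowercase 'agree').
unmapped = df[df['score'].isna()]
print(df)
print(unmapped)
```

Checking `unmapped` before fitting a model is a cheap way to catch spelling or capitalization mismatches between the data and the dictionary.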

- jay s

1

Thanks! I had to use .map instead of .apply, but other than that this works well. - Richard

1

pandas.factorize() gives you a numeric representation of an array.

factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().

import pandas as pd


df = pd.DataFrame({'answer' : ['strongly agree', 'strongly agree', 'agree', 'neither', 'disagree', 'strongly disagree']})

# df['category'] = pd.factorize(df['answer'])[0]
df['category'] = df['answer'].factorize()[0]

# print(df)

              answer  category
0     strongly agree         0
1     strongly agree         0
2              agree         1
3            neither         2
4           disagree         3
5  strongly disagree         4
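
One caveat worth illustrating: factorize numbers the values in order of first appearance, so the codes only line up with the agreement scale when the data happens to arrive in that order. The sketch below contrasts it with an explicitly ordered pd.Categorical; the `scale` list and the two new column names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'answer': ['agree', 'strongly disagree', 'neither',
                              'strongly agree', 'disagree']})

# factorize numbers values by first appearance, so 'agree' gets 0 here.
codes, uniques = pd.factorize(df['answer'])

# Assumed scale order, lowest to highest agreement.
scale = ['strongly disagree', 'disagree', 'neither', 'agree', 'strongly agree']
ordered = pd.Categorical(df['answer'], categories=scale, ordered=True)

# .codes gives each value's position within `scale`.
df['by_appearance'] = codes
df['by_scale'] = ordered.codes
print(df)
```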

- Ynjxsjmh

- neves · Accepted Answer

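There are now two ways to do this. With either one, your column becomes more readable and uses less memory. Because it becomes a categorical type, the values can still be sorted.

The preferred way is: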

df['grades'] = df['grades'].astype('category')
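
As a rough illustration of what this plain conversion produces, the sketch below (using a made-up Series with the question's response labels) inspects the inferred categories and the ordered flag.

```python
import pandas as pd

s = pd.Series(['Agree', 'Strongly agree', 'Neither', 'Disagree', 'Strongly disagree'])
cat = s.astype('category')

# Categories are inferred from the data and sorted lexically, not by agreement.
print(list(cat.cat.categories))

# The resulting dtype is unordered, so order-based operations such as min()/max() raise TypeError.
print(cat.cat.ordered)
```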

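astype used to accept a categories argument, but that is no longer supported. So if you want the categories in a non-lexical order, or you want extra categories that do not appear in your data, you must use the solution below.

This suggestion comes from the documentation.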

In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"],
   ....:                             ordered=True)
In [29]: s_cat = s.astype(cat_type)
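
Applied to the survey column from the question, the same idea might look like the sketch below. The sample rows and the least-to-most-agreement order are assumptions for illustration, not something the documentation prescribes.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample using the response labels from the question.
df = pd.DataFrame({'EasyToUseQuestion': ['Agree', 'Strongly disagree', 'Neither',
                                         'Strongly agree', 'Disagree']})

# Assumed order, from least to most agreement.
likert = CategoricalDtype(
    categories=['Strongly disagree', 'Disagree', 'Neither', 'Agree', 'Strongly agree'],
    ordered=True,
)
df['EasyToUseQuestion'] = df['EasyToUseQuestion'].astype(likert)

# Sorting and comparisons now follow the declared order, not the alphabet.
sorted_answers = df['EasyToUseQuestion'].sort_values().tolist()
at_least_agree = (df['EasyToUseQuestion'] >= 'Agree').tolist()

# Integer position of each answer on the scale (0 = 'Strongly disagree').
codes = df['EasyToUseQuestion'].cat.codes.tolist()
print(sorted_answers, at_least_agree, codes)
```

Whether 'Neither' belongs in the middle of the scale is a modelling decision; pandas simply uses whatever order the categories list declares.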

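An extra tip for building the list of categories: use df.column_name.unique() to get all the values that already exist in the column, then add the ones that are missing.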
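
A small sketch of that tip, assuming a hypothetical `grades` column and an assumed full scale of F through A in which some grades never occur in the data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical column in which nobody received 'C' or 'F'.
df = pd.DataFrame({'grades': ['B', 'A', 'D', 'A']})

existing = list(df['grades'].unique())   # values in order of first appearance
full_scale = ['F', 'D', 'C', 'B', 'A']   # assumed complete scale, worst to best

# Any existing value missing from the category list would become NaN on conversion.
missing_from_scale = set(existing) - set(full_scale)

dtype = CategoricalDtype(categories=full_scale, ordered=True)
df['grades'] = df['grades'].astype(dtype)

# Unused categories are still part of the dtype and show up with a count of zero.
counts = df['grades'].value_counts()
print(df['grades'].cat.categories.tolist())
print(counts)
```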