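What is the closest equivalent in Python pandas to an R factor variable?
This question seems to be from a year ago, but since it is still open, here is an update. pandas has introduced a categorical dtype that behaves very much like factors in R. See this link for more information: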
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
Below is a code snippet copied from the link above, showing how to create a "factor" variable in pandas.
In [1]: s = Series(["a","b","c","a"], dtype="category")
In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a < b < c]
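As a rough illustration of how the categorical dtype mirrors an R factor, the sketch below inspects the levels and the integer codes behind them, then builds an ordered variant. The variable names and the size levels are made up for the example, and it assumes a reasonably recent pandas release.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The levels, comparable to levels(f) in R.
levels = list(s.cat.categories)

# The integer codes behind each value (0-based, unlike R's 1-based codes).
codes = list(s.cat.codes)

# An ordered categorical, roughly like factor(..., ordered = TRUE) in R.
# The level names here are hypothetical.
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["small", "large", "medium"], dtype=size_dtype)

# Comparisons follow the declared level order, not alphabetical order.
is_bigger_than_small = list(sizes > "small")

print(levels)
print(codes)
print(is_bigger_than_small)
```

Declaring the order explicitly matters: without `ordered=True`, comparisons such as `>` are not defined for a categorical.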
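import pandas as pd
df = pd.read_csv('path_to_your_file')
# factorize returns (codes, uniques); keep only the integer codes.
# sort=True numbers the levels in sorted order rather than order of appearance.
df['new_factor'], _ = pd.factorize(df['old_categorical'], sort=True)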
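To see what pd.factorize returns, here is a minimal sketch on a small made-up Series (the data is hypothetical). It returns the integer codes together with the distinct values they refer to, which the snippet above discards with `_`.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green"])

# sort=True assigns codes in sorted order of the distinct values.
codes, uniques = pd.factorize(colors, sort=True)

code_list = list(codes)      # one integer per row
label_list = list(uniques)   # position i holds the label for code i

print(code_list)
print(label_list)
```

Unlike the category dtype, the result is a plain integer array, so the link back to the labels lives only in `uniques`.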
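def factor(var):
    # Map each distinct value to an integer code. Set iteration order is
    # arbitrary, so the codes may differ between runs.
    codes = {value: code for code, value in enumerate(set(var))}
    return [codes[x] for x in var]

# Pass the whole column: Series.apply would call factor on each element.
df['new_factor1'] = factor(df['old_categorical1'])
# DataFrame.apply passes each column to factor in turn.
df[['new_factor2', 'new_factor3']] = df[['old_categorical2', 'old_categorical3']].apply(factor)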
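import numpy as np
import matplotlib.pyplot as plt

# C: array containing category data
# V: array containing numerical data
H = np.unique(C)  # distinct category levels, sorted
mydict = {}
for h in H:
    mydict[h] = V[C == h]  # numerical values belonging to level h
# One box per level; recent matplotlib releases prefer tick_labels over labels.
plt.boxplot(list(mydict.values()), labels=list(mydict.keys()))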
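The grouping loop above can also be written with a pandas groupby on a categorical column. This is only a sketch on a made-up DataFrame; the column names `group` and `value` are assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df["group"] = df["group"].astype("category")

# Collect the values belonging to each level, like mydict above.
# observed=True restricts the groups to levels that actually occur.
values_by_level = {
    level: grp.to_numpy()
    for level, grp in df.groupby("group", observed=True)["value"]
}

print(values_by_level)
```

The resulting dictionary can be handed to boxplot in the same way as mydict. pandas also provides `df.boxplot(column="value", by="group")`, which does the grouping and plotting in one call.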
pandas.Factor can be used to add it as a factor column, but I don't think it is exactly equivalent, especially in the case of missing data. - agstudy
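On the missing-data point, the following sketch shows how the categorical dtype in recent pandas versions represents a missing value; the sample values are made up. NaN is not stored as a level and instead gets the sentinel code -1, which is similar in spirit to how R keeps NA out of levels() by default.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", "a"], dtype="category")

levels = list(s.cat.categories)   # NaN is not listed as a category
codes = list(s.cat.codes)         # the missing entry gets the code -1
n_missing = int(s.isna().sum())   # missing entries are still detectable

# value_counts skips missing values by default.
counts = s.value_counts()

print(levels)
print(codes)
print(n_missing)
print(counts)
```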