Most elegant way to get back from pandas.df_dummies

Question

10

Starting from a dataframe with numerical and nominal data:

>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
         'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
         'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

The get_dummies function converts categorical variables into dummy/indicator variables:

>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1

What is the most elegant back_from_dummies function to get from df_dummies back to df?

>>> (back_from_dummies(df_dummies) == df).all()
Budget    True
m         True
qj        True
dtype: bool

- user3313834

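Back to df? Not sure what you mean by that. - David Maust

I simply mean going back to, i.e. restoring, the original frame. - user3313834

Thanks. Just wanted to confirm. - David Maust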

https://dev59.com/MF8d5IYBdhLWcg3wgiT4#55757342 - TBhavnani

3 Answers

3

First, separate the columns:

In [11]: import numpy as np
         from collections import defaultdict
         pos = defaultdict(list)
         vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
             if "_" in c:
                 k, v = c.split("_", 1)
                 pos[k].append(i)
                 vals[k].append(v)
             else:
                 pos["_"].append(i)

In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This lets you slice out a separate frame for each dummified column:

In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0

Now we can use NumPy's argmax:

In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])

*Note: pandas' idxmax returns labels; we need positions so that we can use Categoricals.

In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]
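
To make the note about labels versus positions concrete, here is a small self-contained sketch. The toy frame `block` is made up for illustration. `idxmax(axis=1)` gives column labels such as `m_M1`, while `np.argmax` gives integer positions, which is the form `pd.Categorical.from_codes` expects.

```python
import numpy as np
import pandas as pd

# Toy dummy block for one original column "m" (illustrative data only).
block = pd.DataFrame({"m_M1": [1, 0, 0], "m_M2": [0, 1, 0], "m_M7": [0, 0, 1]})

labels = block.idxmax(axis=1)                 # column labels, e.g. "m_M1"
codes = np.argmax(block.to_numpy(), axis=1)   # integer positions within the block

# from_codes needs integer codes plus the list of category values,
# here taken from the part of each column name after the first underscore.
categories = [c.split("_", 1)[1] for c in block.columns]
restored = pd.Categorical.from_codes(codes, categories)
```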

Now we can put it all together:

In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})

In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9

And put the non-dummy columns back:

In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70

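As a function:

import numpy as np
import pandas as pd
from collections import defaultdict

def reverse_dummy(df_dummies):
    pos = defaultdict(list)   # prefix -> positions of its dummy columns
    vals = defaultdict(list)  # prefix -> category values, in column order

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            # Split on the first underscore only: "m_M1" -> ("m", "M1")
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            # Columns without an underscore are treated as non-dummy columns
            pos["_"].append(i)

    # argmax finds the position of the 1 in each row; from_codes maps it to a value
    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    # Put the non-dummy columns back
    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df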

In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
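
One caveat worth keeping in mind: `np.argmax` returns 0 for a row that is all zeros, so such a row is silently mapped to the first category. That can happen if the dummies were built with `drop_first=True`, or if the original value was missing. The sketch below is one possible guard and is not part of the answer above. It marks all-zero rows with the code -1, which `pd.Categorical.from_codes` treats as missing. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy dummy block where the last row has no category set (illustrative data only).
block = pd.DataFrame({"m_M1": [1, 0, 0], "m_M2": [0, 1, 0]})
values = block.to_numpy()

codes = np.argmax(values, axis=1)
# argmax alone would report 0 for the all-zero row; use -1 (missing) instead.
codes = np.where(values.sum(axis=1) == 0, -1, codes)

cat = pd.Categorical.from_codes(codes, ["M1", "M2"])
```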

- Andy Hayden

2

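Similar to @David, I find that idxmax does most of the work for you. However, there is no foolproof way to guarantee you won't run into trouble when converting the columns back, because in some cases it is hard to tell which columns are dummies and which are not. I find this can be greatly alleviated by using a separator that is very unlikely to occur in your data by chance. For example, _ is often used in multi-word column names, so I use __ (double underscore) as the separator; I have never run into it in a real column name.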

Also, note that pd.get_dummies moves all the dummy columns to the end. This means you can't necessarily recover the original order of the columns.
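
If the original order matters, one simple option is to remember it before encoding and reindex afterwards. A minimal sketch, assuming you still have access to the original column list. The names `original_order` and `restored` are illustrative, and `restored` here only stands in for the output of whatever reversal function you use.

```python
import pandas as pd

df = pd.DataFrame({
    "m": ["M1", "M2", "M7"],
    "Budget": [39, 15, 13],
    "qj": ["q23", "q4", "q9"],
})
original_order = list(df.columns)   # saved before encoding

dummies = pd.get_dummies(df, prefix_sep="__", dtype=int)
# Non-encoded columns come first; the dummy columns are appended after them.
dummy_order = list(dummies.columns)

# Stand-in for a reversed frame whose columns came back in a different order.
restored = df[["qj", "m", "Budget"]]
restored = restored[original_order]  # put the columns back in the saved order
```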

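Here is an example of my approach. You can recognize the dummy columns as the ones that contain sep. We get the groups of dummy columns with df.filter, which lets us match column names using a regular expression (just the part of the name before sep works; there are other ways to do this part as well). The rename step strips off the beginning of the column names (e.g. m__) so that what remains is the value. Then idxmax extracts the name of the column that holds the 1. This gives us the dataframe that results from undoing pd.get_dummies on one of the original columns; we concatenate the dataframes from reversing pd.get_dummies on each of the columns, together with other_cols (the columns that were not "dummified").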

In [1]: import pandas as pd

In [2]: df = pd.DataFrame.from_dict({'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
   ...:          'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
   ...:          'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}})

In [3]: df
Out[3]: 
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

In [4]: sep = '__'

In [5]: dummies = pd.get_dummies(df, prefix_sep=sep)

In [6]: dummies
Out[6]: 
   Budget  m__M1  m__M2  m__M7  qj__q23  qj__q4  qj__q9
0      39      1      0      0        1       0       0
1      15      0      1      0        0       1       0
2      13      0      0      1        0       0       1
3      53      1      0      0        1       0       0
4      82      0      1      0        1       0       0
5      70      1      0      0        0       0       1

In [7]: dfs = []
   ...: 
   ...: dummy_cols = list(set(col.split(sep)[0] for col in dummies.columns if sep in col))
   ...: other_cols = [col for col in dummies.columns if sep not in col]
   ...: 
   ...: for col in dummy_cols:
   ...:     dfs.append(dummies.filter(regex=col).rename(columns=lambda name: name.split(sep)[1]).idxmax(axis=1))
   ...: 
   ...: df = pd.concat(dfs + [dummies[other_cols]], axis=1)
   ...: df.columns = dummy_cols + other_cols
   ...: df
   ...: 
Out[7]: 
    qj   m  Budget
0  q23  M1      39
1   q4  M2      15
2   q9  M7      13
3  q23  M1      53
4  q23  M2      82
5   q9  M1      70
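
A caveat on the `dummies.filter(regex=col)` step: the regex is searched anywhere in the column name, so a prefix such as `m` would also pick up an unrelated column that merely contains an `m`. A slightly stricter sketch, using the same `__` separator idea, matches on `prefix + sep` at the start of the name and keeps the prefixes in first-seen order instead of going through a `set`. The function name `undo_dummies` is made up for this illustration.

```python
import pandas as pd

def undo_dummies(dummies, sep="__"):
    """Reverse pd.get_dummies for columns named '<prefix><sep><value>'."""
    # Prefixes in first-seen order (dict keys preserve insertion order).
    prefixes = list(dict.fromkeys(
        col.split(sep, 1)[0] for col in dummies.columns if sep in col))
    other_cols = [col for col in dummies.columns if sep not in col]

    restored = {}
    for prefix in prefixes:
        start = prefix + sep
        cols = [col for col in dummies.columns if col.startswith(start)]
        # idxmax returns the label of the column holding the 1; strip the prefix.
        restored[prefix] = dummies[cols].idxmax(axis=1).str[len(start):]

    return pd.concat([pd.DataFrame(restored), dummies[other_cols]], axis=1)

df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})
dummies = pd.get_dummies(df, prefix_sep="__", dtype=int)
result = undo_dummies(dummies)
```

Like the original, this still relies on the separator never appearing in a genuine column name.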

- Nathan

- David Maust · Accepted Answer

idxmax will do it pretty easily.

import pandas as pd
from itertools import groupby

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build pairs (category, column_name)
    dummy_tuples = [(col.split("_", 1)[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns that do not have a _
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each category column group use idxmax to find the value.
    # groupby only merges adjacent items, which holds for get_dummies output.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select columns for each category
        dummy_df = df[[col[1] for col in cols]]

        # Find max value among columns
        max_columns = dummy_df.idxmax(axis=1)

        # Remove category_ prefix (split once so values containing _ survive)
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return dataframe of the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
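
For readers on a recent pandas: as far as I know, pandas 1.5 and later ship `pd.from_dummies`, which reverses `get_dummies` for a frame that contains only dummy columns. A minimal sketch under that assumption follows. The non-dummy column has to be split off first, and the `_`-based column split is the same heuristic used above, so it shares the same weakness for names that legitimately contain `_`.

```python
import pandas as pd

df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})
df_dummies = pd.get_dummies(df, dtype=int)

# Heuristic split: columns containing "_" are assumed to be dummy columns.
dummy_cols = [c for c in df_dummies.columns if "_" in c]
other_cols = [c for c in df_dummies.columns if "_" not in c]

# from_dummies only accepts dummy data, so pass the dummy columns alone
# (assumes pandas >= 1.5), then re-attach the untouched columns.
restored = pd.from_dummies(df_dummies[dummy_cols], sep="_")
restored = pd.concat([restored, df_dummies[other_cols]], axis=1)
```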