How should missing values be handled when plotting with seaborn?

Question

python python-2.7 pandas data-analysis seaborn

18

I replaced the missing values with NaN using a lambda function, as follows:

data = data.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

Here, data is the DataFrame I am working with.

I then tried to plot one of its attributes, 'alcconsumption', with seaborn, using the following code:

seaborn.distplot(data['alcconsumption'])

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

It gave me the following error:

AttributeError: max must be larger than min in range parameter.

- datavinci

1

Why not drop them before plotting? - cel

How? I mean, which function should I use? - datavinci

6

data['alcconsumption'].dropna() - mwaskom

If my suggestion was useful, would you consider marking it as the accepted answer? - vestland

4 Answers

4

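This is a known issue with matplotlib/pylab histograms!

See, for example, https://github.com/matplotlib/matplotlib/issues/6483, among other places, where various workarounds are suggested. Two favourites (for example from https://dev59.com/VmIk5IYBdhLWcg3wp_mM#19090183) are: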

import numpy as np
nbins=100
A=data['alcconsumption']
Anan=A[~np.isnan(A)] # Remove the NaNs

seaborn.distplot(Anan,hist=True,bins=nbins)

Alternatively, specify the bin edges (in this case by using Anan to compute them):

Amin=min(Anan)
Amax=max(Anan)
seaborn.distplot(A,hist=True,bins=np.linspace(Amin,Amax,nbins+1)) # nbins+1 edges give nbins bins
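
As a rough illustration of why both workarounds help, the sketch below uses a small made-up series (a hypothetical stand-in for data['alcconsumption']) and plain NumPy instead of seaborn. It shows that NaN propagates through min/max, so no finite histogram range can be inferred until the NaNs are removed or the edges are given explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['alcconsumption']: a numeric column with gaps.
A = pd.Series([0.5, 2.0, np.nan, 7.5, np.nan, 4.0])

# NaN propagates through NumPy's min/max, so the inferred range is not finite.
raw_min = np.min(A.to_numpy())
raw_max = np.max(A.to_numpy())

# Dropping the NaNs first gives well-defined bounds.
Anan = A.dropna()
Amin, Amax = Anan.min(), Anan.max()

# nbins + 1 evenly spaced edges describe nbins bins.
nbins = 4
edges = np.linspace(Amin, Amax, nbins + 1)
counts, _ = np.histogram(Anan, bins=edges)

print(raw_min, raw_max)   # both are nan
print(Amin, Amax)         # finite bounds
print(counts)             # one count per bin
```

The same cleaned series (Anan here) is what would be passed to the plotting call.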

- jtlz2

3

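I would make sure the missing values are handled before plotting the data. Whether or not to use dropna() depends entirely on the nature of your dataset. Is alcconsumption a single series or part of a dataframe? In the latter case, using dropna() would also remove the corresponding rows in the other columns. Are the missing values few or many? Are they scattered throughout the series, or do they tend to occur in clusters? Is there perhaps reason to believe that the dataset has a trend?

If the missing values are few and scattered, you can safely use dropna(). In other cases, I would fill the missing values with the previously observed value (1), or even fill them with interpolated values (2). But be careful! Replacing a large share of the data with filled or interpolated observations can seriously distort your dataset and lead to very wrong conclusions.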
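
To make the series-versus-dataframe point concrete, here is a minimal sketch on a made-up frame (the 'income' column and all values are assumptions for illustration only). It contrasts dropping NaNs from a single column with dropping rows from the whole frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each column has a gap in a different row.
data = pd.DataFrame({
    "alcconsumption": [1.0, np.nan, 3.0, 4.0],
    "income": [10.0, 20.0, np.nan, 40.0],
})

# Series.dropna() only shortens the one column being plotted.
col_only = data["alcconsumption"].dropna()

# DataFrame.dropna() removes every row that has a NaN in any column.
all_rows = data.dropna()

# subset= limits the check to the column of interest.
by_subset = data.dropna(subset=["alcconsumption"])

print(len(col_only), len(all_rows), len(by_subset))
```

For a single-column plot, the first form is usually enough; the other two matter when the remaining columns are used later.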

Here are some examples that use your snippet...

seaborn.distplot(data['alcconsumption'],hist=True,bins=100)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

On a synthetic dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def sample(rows, names):
    ''' Function to create data sample with random returns

    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets

    Example
    =======

    >>> sample(rows = 2, names = ['A', 'B'])

                  A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars) 
    df_temp = df_temp.set_index(rng)


    return df_temp

df = sample(rows = 15, names = ['A', 'B'])
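df['A'] = df['A'].astype(float)        # NaN needs a float column
df.loc[df.index[8:12], 'A'] = np.nan   # avoid chained assignment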
df

Output:

            A   B
2017-01-01 -63.0  10
2017-01-02  49.0  79
2017-01-03 -55.0  59
2017-01-04  89.0  34
2017-01-05 -13.0 -80
2017-01-06  36.0  90
2017-01-07 -41.0  86
2017-01-08  10.0 -81
2017-01-09   NaN -61
2017-01-10   NaN -80
2017-01-11   NaN -39
2017-01-12   NaN  24
2017-01-13 -73.0 -25
2017-01-14 -40.0  86
2017-01-15  97.0  60

1. Forward fill with `pandas.DataFrame.fillna(method='ffill')`

ffill "fills forward", meaning it replaces each NaN with the value from the previous row.

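# Keep df intact so that df['A'] is still available later;
# newer pandas prefers .ffill() over fillna(method='ffill')
A_ffill = df['A'].ffill()
sns.distplot(A_ffill, hist=True,bins=5)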
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

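2. Interpolation with `pandas.DataFrame.interpolate()`

Values are interpolated according to the chosen method. Time interpolation works on daily and higher-resolution data and interpolates over the given length of interval.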

df['A'] = df['A'].interpolate(method = 'time')
sns.distplot(df['A'], hist=True,bins=5)
plt.xlabel('AlcoholConsumption')
plt.ylabel('Frequency(normalized 0->1)')

As you can see, the two methods give very different results. I hope this is useful to you. If not, let me know and I will take another look.
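
The difference between the two approaches can also be checked numerically without plotting. The sketch below uses a tiny made-up daily series (the values are assumptions, not the data above) with a two-day gap.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap.
idx = pd.date_range("2017-01-01", periods=6, freq="D")
s = pd.Series([10.0, 20.0, np.nan, np.nan, 50.0, 60.0], index=idx)

# Forward fill repeats the last observed value across the gap.
filled = s.ffill()

# Time interpolation draws a straight line between the neighbouring observations.
interp = s.interpolate(method="time")

print(pd.DataFrame({"original": s, "ffill": filled, "time": interp}))
```

Forward fill piles extra mass on one existing value, while interpolation creates new in-between values, which is why the resulting histograms differ.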

- vestland

2

This may not be a solution to the question as asked, but I use the following code to check for missing values:

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
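
A numeric companion to the heatmap, sketched on a made-up frame (column names and values are assumptions): the same df.isnull() mask can be summed to count the missing values per column, or averaged to get their share.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in column A only.
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1, 2, 3, 4],
})

# True counts as 1, so summing the boolean mask counts NaNs per column.
missing_count = df.isnull().sum()

# The mean of the mask is the fraction of rows that are missing.
missing_share = df.isnull().mean()

print(missing_count)
print(missing_share)
```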

- PlutoSenthil

2

Please add more context to this answer; code dumps are discouraged. https://meta.stackoverflow.com/questions/358727/are-there-any-guidelines-to-handle-one-line-correct-code-only-answers-in-vario - rayryeng


- ZicoNuna · Accepted Answer

5

You can use the following line to select the non-NaN values for a distribution plot with seaborn:

seaborn.distplot(data['alcconsumption'].notnull(),hist=True,bins=100)

- ZicoNuna

1

As of version 0.11, seaborn says: "This function is deprecated and will be removed in a future version." https://seaborn.pydata.org/generated/seaborn.distplot.html - Marc Maxmeister

Shouldn't this be seaborn.distplot(data[data['alcconsumption'].notnull()]['alcconsumption'],hist=True,bins=100)? I believe data['alcconsumption'].notnull() outputs boolean values. - user1442363
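
The point raised in the last comment can be verified on a small made-up frame (the values are assumptions for illustration): notnull() returns a boolean mask, and it has to be used as an index to obtain the actual non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data frame.
data = pd.DataFrame({"alcconsumption": [1.5, np.nan, 3.0, np.nan, 0.5]})

# notnull() yields True/False per row, not the values themselves.
mask = data["alcconsumption"].notnull()

# Indexing with the mask selects the non-missing values.
values = data.loc[mask, "alcconsumption"]

print(mask.tolist())
print(values.tolist())
```

Plotting mask directly would draw a histogram of booleans. Here values matches data['alcconsumption'].dropna(), and it is the series that would be passed to the plotting function, for example seaborn.histplot in recent seaborn versions.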
