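Is there a way to do this? I can't seem to find a simple way to interface a pandas Series with plotting a CDF.
I believe the functionality you are looking for is in the hist method of a Series object, which wraps matplotlib's hist() function.
Here is the relevant documentation.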
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example:
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
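To make explicit what cumulative=True combined with normalization computes, here is a minimal NumPy sketch of the same quantity, assuming equal-width bins. The names counts, edges, and cum_frac are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
ser = pd.Series(rng.normal(size=1000))

# Count the observations in each of 100 equal-width bins.
counts, edges = np.histogram(ser, bins=100)

# Running total of the counts divided by the sample size: the fraction of
# observations at or below each bin's right edge.
cum_frac = counts.cumsum() / counts.sum()
```

The last entry of cum_frac is 1, which matches the documentation's statement that the last bin of a normalized cumulative histogram equals 1.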
histtype='step' is also a parameter in the pyplot.hist documentation, which was truncated above. - Dan Frank
import pandas as pd
# If you are in jupyter
%matplotlib inline
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
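If you are sampling from a continuous distribution or have many distinct values, here is an alternative example:
import numpy as np
# Define your series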
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
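Note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then groupby() + agg('count') is unnecessary, because the count is always 1.
In that case, a percentile rank gives the cumulative distribution function directly.
Use your best judgment when taking this kind of shortcut! :)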
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df['value'].rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
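When the sample does contain repeated values, the rank shortcut can still reproduce the count-based CDF if ties are ranked with method="max" rather than method="average". The sketch below is an illustration under that assumption, reusing the discrete series from earlier in this answer; the names ecdf_counts, ecdf_rank, and ecdf_rank_unique are hypothetical.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name="value")

# Reference: empirical CDF from counts, F(x) = (# observations <= x) / n.
ecdf_counts = s.value_counts().sort_index().cumsum() / len(s)

# Shortcut: with method="max", tied values all receive the highest rank in
# their group, so rank / n equals the fraction of observations <= x.
ecdf_rank = s.rank(method="max", pct=True)

# One CDF value per distinct x, for comparison with the reference.
ecdf_rank_unique = ecdf_rank.groupby(s).first()
```

With method="average", the five occurrences of 5 would each get the mid-rank 5/11 instead of 7/11, so the two results would differ wherever there are ties.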
I came here looking for a chart with both bars and a CDF line.
This can be achieved as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
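# Plain histogram on the left axis (`normed` was removed from matplotlib; use `density`)
n, bins, patches = ax.hist(series, bins=100, density=False)
# Cumulative step histogram on the right axis
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
# Set the x-limits before saving so the saved figure reflects them
ax.set_xlim((ax.get_xlim()[0], series.max()))
plt.savefig('test.png')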
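I saw an elegant solution here that uses seaborn.
A CDF (cumulative distribution function) plot is basically a graph with the sorted values on the X axis and the cumulative distribution on the Y axis. So I would create a new series with the sorted values as the index and the cumulative distribution as the values.
First, create an example series: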
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
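Now, before proceeding, append the last (and largest) value again. This step is especially important for small sample sizes in order to get an unbiased CDF: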
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as the index and the cumulative distribution as the values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as a step plot:
ser_cdf.plot(drawstyle='steps')
order has been deprecated; use ser.sort_values() instead. - Lukas
ser[len(ser)] = ser.iloc[-1] does not work on pandas 0.19. - jlandercy
This is the simplest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )
I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
cdf = series.value_counts().sort_index().cumsum() / series.shape[0] - Itamar Mushkin
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide the desired number of bins.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
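The vectorized function above rescans the whole sample for every x. As an alternative sketch (not the original answer's method), np.searchsorted on the sorted sample gives the same empirical CDF in one pass; the helper name ecdf_at is hypothetical.

```python
import numpy as np

def ecdf_at(x, data):
    """Fraction of `data` less than or equal to each entry of `x`."""
    sorted_data = np.sort(np.asarray(data, dtype=float))
    # side="right" counts the elements <= x, ties included.
    return np.searchsorted(sorted_data, x, side="right") / sorted_data.size

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
heights = rng.normal(size=100)
xs = np.sort(heights)
ys = ecdf_at(xs, heights)        # heights of the ECDF at each sorted sample point
```

The pair (xs, ys) can be passed to plt.plot in place of the np.vectorize call.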
(1) collections.Counter makes the process easier; (2) remember to sort value (in ascending order) before computing the PDF, CDF, and CCDF.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
s = pd.Series(np.random.randint(1000, size=(1000)))
dic = dict(Counter(s))
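# Build the frame from the counts in `dic` (not from `s`), sorted by value in ascending order
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])
# Calculate PDF, CDF, and CCDF: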
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we sort value before computing the PDF, CDF, and CCDF. Well, suppose we do not sort; what happens? (Note that the rows above were in ascending order of value; below we randomize the order.)
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (shuffle all rows):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
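If value is not sorted (either ascending or descending would do), then when you plot with the x axis in ascending order, the y values will of course be scrambled.
What if we sort value after, rather than before, computing the PDF, CDF, and CCDF: would that solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (shuffle all rows):
df = df.sample(frac=1)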
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
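A small deterministic sketch (toy data, not from the original answer) suggests the answer is no: the cumulative sum runs in row order, so sorting afterwards only rearranges running totals that were already computed in the wrong order. The names late and early are illustrative.

```python
import pandas as pd

# Three distinct values listed in a deliberately unsorted order.
df = pd.DataFrame({"value": [3, 1, 2], "frequency": [1, 2, 1]})
df["pdf"] = df["frequency"] / df["frequency"].sum()

# Cumulating first and sorting afterwards: each row keeps the running total
# from the unsorted order, so the curve is not monotone in `value`.
late = df.assign(cdf=df["pdf"].cumsum()).sort_values("value")

# Sorting first and cumulating afterwards gives a proper CDF.
early = df.sort_values("value")
early = early.assign(cdf=early["pdf"].cumsum())
```

Here late["cdf"] reads 0.75, 1.0, 0.25 along increasing value, while early["cdf"] reads 0.5, 0.75, 1.0.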
It doesn't have to be complicated. All it takes is:
import matplotlib.pyplot as plt
import numpy as np
x = series.dropna().sort_values()
y = np.linspace(0, 1, len(x))
plt.plot(x, y)
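One detail worth noting: np.linspace(0, 1, len(x)) assigns height 0 to the smallest value, whereas the usual empirical CDF convention assigns k/n to the k-th smallest value. The sketch below contrasts the two on toy data; the example series and the names y_ecdf and y_linspace are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([2.0, np.nan, 1.0, 4.0, 3.0])   # toy data, NaN included on purpose

x = series.dropna().sort_values().to_numpy()
n = len(x)

# ECDF convention: the k-th smallest value gets height k / n.
y_ecdf = np.arange(1, n + 1) / n

# The linspace version used above runs from 0 to 1 in n points instead.
y_linspace = np.linspace(0, 1, n)
```

For large samples the two curves are nearly indistinguishable; the difference matters mainly for small n.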
...within the realm of pandas. Use seaborn's kdeplot and set cumulative=True. - TomAugspurger