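This problem calls for a z-score, or standard score. As others have mentioned, it takes the historical average into account, but it also takes the standard deviation of the historical data into account, which makes it more robust than using the average alone.
In your case the z-score is calculated with the following formula, where the trend is a rate such as views per day.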
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
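With a z-score, the higher or lower the score, the more abnormal the trend. If the z-score is highly positive, the trend is rising abnormally; if it is highly negative, the trend is falling abnormally. So once you have calculated the z-score for every candidate trend, the 10 highest z-scores correspond to the most abnormally increasing trends.
For more information about z-scores, see Wikipedia.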
Code
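from math import sqrt

def zscore(obs, pop):
    # Size of the population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Population standard deviation.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # z-score of the observation against the population.
    return (obs - avg) / std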
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
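To tie the scores back to the ranking step, here is a minimal sketch of scoring several candidates and keeping the highest ones. The topic names, the view counts, and the helper `top_trends` are made up for illustration; `zscore` repeats the function above so the snippet runs on its own.

```python
from math import sqrt

def zscore(obs, pop):
    # Same population z-score as above, repeated so this snippet is self-contained.
    number = float(len(pop))
    avg = sum(pop) / number
    std = sqrt(sum((c - avg) ** 2 for c in pop) / number)
    return (obs - avg) / std

def top_trends(today, history, n=10):
    # Hypothetical helper: rank candidates by today's z-score, highest first.
    scores = {name: zscore(today[name], history[name]) for name in today}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up views per day for three candidate topics.
history = {
    "python": [2, 4, 4, 4, 5, 5, 7, 9],
    "java": [21, 22, 19, 18, 17, 22, 20, 20],
    "rust": [1, 2, 0, 3, 1, 3, 1, 2],
}
today = {"python": 12, "java": 20, "rust": 1}

print(top_trends(today, history, n=2))  # ['python', 'java']
```

Note that a candidate whose history is constant has a standard deviation of zero, so this sketch would raise ZeroDivisionError for it.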
Notes
You can use this method with a sliding window (e.g. the last 30 days) if you do not want to take too much history into account. This makes short-term trends more pronounced and can cut down on processing time.
You could also use a z-score on values such as the change in views from one day to the next, to locate abnormal increases or decreases in views per day. This is like using the slope or derivative of the views-per-day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them. You therefore only need to keep these three values for the history, not each data value. The following code demonstrates this.
from math import sqrt

class zscore:
    def __init__(self, pop=()):
        # Running totals: population size, sum of values, sum of squared values.
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)

    def update(self, value):
        # Fold one new value into the running totals.
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2

    def avg(self):
        return self.total / self.number

    def std(self):
        # Population standard deviation from the totals: sqrt(E[x^2] - E[x]^2).
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)

    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method, your workflow would be as follows. For each topic, tag, or page, create three floating point fields in your database: the total number of days, the sum of views, and the sum of views squared. If you have historic data, initialize these fields from it; otherwise initialize them to zero. At the end of each day, calculate the z-score of the day's number of views against the historic data stored in the three fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the three fields with the day's value and repeat the process the next day.
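The daily loop described above might look roughly like this. A plain dict stands in for the three database fields, and the names `stats`, `score_day`, and `end_of_day` are hypothetical. The sketch assumes every topic already has a non-constant history, so the standard deviation is not zero.

```python
from math import sqrt

# Stand-in for the database: topic -> [number of days, sum of views, sum of views squared].
stats = {
    "python": [8.0, 40.0, 232.0],   # from history [2, 4, 4, 4, 5, 5, 7, 9]
    "java": [8.0, 159.0, 3183.0],   # from history [21, 22, 19, 18, 17, 22, 20, 20]
}

def score_day(fields, views):
    # z-score of today's views against the stored running totals.
    number, total, sqr_total = fields
    avg = total / number
    std = sqrt(sqr_total / number - avg ** 2)
    return (views - avg) / std

def end_of_day(todays_views, x=10):
    # 1. Score every topic against its history, before today's value is added.
    scores = {t: score_day(stats[t], v) for t, v in todays_views.items()}
    # 2. The X highest z-scores are the day's hottest trends.
    hottest = sorted(scores, key=scores.get, reverse=True)[:x]
    # 3. Fold today's value into the three fields for tomorrow.
    for t, v in todays_views.items():
        stats[t][0] += 1.0
        stats[t][1] += v
        stats[t][2] += v ** 2
    return hottest

print(end_of_day({"python": 12, "java": 20}, x=1))  # ['python']
```

Scoring before updating matters here: if today's value were folded in first, it would pull the average toward itself and dampen its own z-score.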
Addition
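The plain z-score described above does not take the order of the data into account, so against the sequence [1, 1, 1, 1, 9, 9, 9, 9] an observation of '1' and an observation of '9' get z-scores of the same magnitude. For trend finding, the most recent data should clearly carry more weight than older data, so we want the '1' observation to have a larger magnitude score than the '9' observation. To achieve this I propose a floating average z-score. To be clear, this method is not guaranteed to be statistically sound, but it should be useful for trend finding and similar problems. The main difference between the standard z-score and the floating average z-score is that a floating average is used to calculate the average population value and the average population value squared. See the code for details: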
Code
from math import sqrt

class fazscore:
    def __init__(self, decay, pop=()):
        self.sqrAvg = self.avg = 0
        self.decay = decay
        for x in pop:
            self.update(x)

    def update(self, value):
        # While both averages are still zero, seed them with the incoming value.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float(value ** 2)
        else:
            # Floating averages: old average weighted by decay, new value by (1 - decay).
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self

    def std(self):
        return sqrt(self.sqrAvg - self.avg ** 2)

    def score(self, obs):
        # Zero deviation: return a signed infinity (nan if obs equals the average).
        if self.std() == 0:
            return (obs - self.avg) * float("infinity")
        return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
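As a side-by-side check of the motivating claim, the sketch below scores '1' and '9' against [1, 1, 1, 1, 9, 9, 9, 9] with both approaches. The function names `plain_z` and `floating_z` are illustrative only, and `floating_z` is a simplified stand-in for `fazscore` that seeds the averages with the first value only and assumes a non-zero deviation.

```python
from math import sqrt

def plain_z(obs, pop):
    # Ordinary population z-score: the order of pop does not matter.
    n = float(len(pop))
    avg = sum(pop) / n
    return (obs - avg) / sqrt(sum((c - avg) ** 2 for c in pop) / n)

def floating_z(obs, pop, decay):
    # Floating averages of x and x^2, seeded with the first value.
    avg, sqr_avg = float(pop[0]), float(pop[0] ** 2)
    for x in pop[1:]:
        avg = avg * decay + x * (1 - decay)
        sqr_avg = sqr_avg * decay + (x ** 2) * (1 - decay)
    return (obs - avg) / sqrt(sqr_avg - avg ** 2)

pop = [1, 1, 1, 1, 9, 9, 9, 9]

# Plain z-score: equal magnitude for both observations.
print(plain_z(1, pop), plain_z(9, pop))  # -1.0 1.0

# Floating average: '1' now deviates more from the recent data than '9' does.
print(floating_z(1, pop, 0.8), floating_z(9, pop, 0.8))
```

A decay closer to 1 makes the averages forget more slowly, which moves the result back toward the plain z-score behaviour.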
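Update
As David Kemp correctly pointed out, if the history is a series of constant values and a z-score is requested for an observation that differs from them, the result should be non-zero. In fact, the value returned should be infinity. So I changed this line: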
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
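This change is reflected in the fazscore solution code. If you do not want to deal with infinite values, an acceptable alternative is to change the line to: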
if self.std() == 0: return obs - self.avg