Poisson regression in xgboost fails for low frequencies


I am trying to implement a boosted Poisson regression model in xgboost, but I am finding the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue:

import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**1]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#               (0, 0)   (0, 1)   (1, 0)   (1, 1)
#1.000000e-08  11598.0  11598.0  11598.0  11598.0
#1.000000e-07   1161.0   1161.0   1161.0   1161.0
#1.000000e-06    118.0    118.0    118.0    118.0
#1.000000e-05     12.0     12.0     12.0     12.0
#1.000000e-04      2.0      2.0      3.0      3.0
#1.000000e-03      1.0      2.0      3.0      4.0
#1.000000e-02      1.0      2.0      3.0      4.0
#1.000000e-01      1.0      2.0      3.0      4.0
#1.000000e+00      1.0      2.0      3.0      4.0

Notice that at low frequencies the predictions seem to blow up. This may have something to do with the Poisson lambda * weight dropping below 1 (and indeed, increasing the weight above 1000 does shift the blow-up to lower frequencies), but I would still expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.

What would cause this to happen? Is there a parameter available that would mitigate the effect?
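For intuition on the scale of the problem: count:poisson works on a log link, so the booster must walk the margin from log(0.5) ≈ -0.7 (xgboost's default base_score) down to roughly log(mult), about -18.4 for mult = 1e-8, in bounded per-round steps. A rough numpy sketch of that walk (a simplification, not xgboost's actual solver):

```python
import numpy as np

# Rough sketch (not xgboost itself): count damped Newton steps on the
# log-link margin m, starting from xgboost's default base_score of 0.5,
# until exp(m) falls to roughly the target frequency y. For the Poisson
# deviance, grad = mu - y and hess = mu, so the Newton direction is
# (y - mu) / mu; eta damps it as boosting does. (Real xgboost also caps
# each Poisson update via max_delta_step, default 0.7.)
def rounds_to_reach(y, eta=0.3, start=np.log(0.5)):
    m, n = start, 0
    while np.exp(m) > 1.05 * y and n < 10_000:
        mu = np.exp(m)
        m += eta * (y - mu) / mu
        n += 1
    return n

# far more rounds are needed for a frequency of 1e-8 than for 1e-2
slow, fast = rounds_to_reach(1e-8), rounds_to_reach(1e-2)
```

With early_stopping_rounds in play, training can halt while the margin is still far above the target, which is consistent with the inflated fitted values in the table above.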
1 Answer

After some investigation, I found a solution. Documenting it here in case anyone else runs into the same problem. It turns out I needed to add an offset term equal to the (natural) log of the mean frequency. In case it isn't immediately obvious why: the initial prediction starts at a frequency of 0.5, and many boosting iterations are needed just to rescale the predictions down to the mean frequency.

See the code below for an update of the toy example. As I suggested in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.

import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    ## adding an offset term equal to the log of the mean frequency
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]]))
    dmat.set_base_margin(np.repeat(offset, 4))

    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**1]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)]
## round to 1 decimal point to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df

# --- result ---
#              (0, 0)  (0, 1)  (1, 0)  (1, 1)
#1.000000e-08     2.5     2.5     2.5     2.5
#1.000000e-07     2.5     2.5     2.5     2.5
#1.000000e-06     2.5     2.5     2.5     2.5
#1.000000e-05     2.5     2.5     2.5     2.5
#1.000000e-04     2.4     2.5     2.5     2.6
#1.000000e-03     1.0     2.0     3.0     4.0
#1.000000e-02     1.0     2.0     3.0     4.0
#1.000000e-01     1.0     2.0     3.0     4.0
#1.000000e+00     1.0     2.0     3.0     4.0
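One detail worth spelling out: because count:poisson uses a log link, set_base_margin shifts the raw prediction on the log scale, so a base margin of log(mean frequency) rescales every fitted frequency multiplicatively; it is not the same as adding the offset to the labels y. A quick numpy check of the identity, with hypothetical boosted scores f:

```python
import numpy as np

# With a log link the fitted frequency is exp(base_margin + f(x)).
# Setting base_margin = log(mean_freq) therefore multiplies every
# prediction by mean_freq -- it does NOT shift the labels.
mean_freq = 2.5e-6                     # illustrative low mean frequency
f = np.array([-0.9, -0.2, 0.2, 0.5])   # hypothetical boosted scores

with_margin = np.exp(np.log(mean_freq) + f)
rescaled = mean_freq * np.exp(f)
assert np.allclose(with_margin, rescaled)
```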

Thanks for the hard work and for documenting this to start the discussion. I notice you are using xgb.DMatrix; if I were not using this type of matrix, would dmat.set_base_margin(np.repeat(offset, 4)) be the same as adding offset to our dependent variable y? I also wonder whether 'est__eval_metric' should be set to 'poisson-nloglik'. - pabz
