Setting thresholds in PySpark multinomial logistic regression

Question


apache-spark machine-learning pyspark logistic-regression apache-spark-ml

5

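I would like to perform a multinomial logistic regression, but I cannot set the `threshold` and `thresholds` parameters correctly. Consider the following DataFrame: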

from pyspark.ml.linalg import DenseVector

test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                  (0, DenseVector([3.1, -2.0, -2.9])),
                  (1, DenseVector([1.0, 0.8, 0.3])),
                  (1, DenseVector([4.2, 1.4, -1.7])),
                  (0, DenseVector([-1.9, 2.5, -2.3])),
                  (2, DenseVector([2.6, -0.2, 0.2])),
                  (1, DenseVector([0.3, -3.4, 1.8])),
                  (2, DenseVector([-1.0, -3.5, 4.7]))],
                 ['label', 'features'])
)

My labels have 3 classes, so I have to set thresholds (plural, default None) rather than threshold (singular, default 0.5). So I write:

from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)

Then I want to fit the model on my DataFrame:

test_logit = test_logit_abst.fit(test_train_df)

But when I execute this last command, I get an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
    264     def _fit(self, dataset):
--> 265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
267

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
--> 262         return self._java_obj.fit(dataset._jdf)
263
    264     def _fit(self, dataset):

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
1134
   1135         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'

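The error says that threshold is set. This seems strange, because the documentation states that setting thresholds (plural) clears threshold (singular), so the value 0.5 should have been removed. So, since there is no clearThreshold() function, how do I clear threshold?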

To that end, I tried to clear threshold this way:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)

This time the fit command worked, and I even obtained the model's intercepts and coefficients:

test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])

test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)

But if I try to get thresholds (plural) from test_logit_abst, I get an error:

test_logit_abst.getThresholds()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
    363         if not self.isSet(self.thresholds) and self.isSet(self.threshold):
    364             t = self.getOrDefault(self.threshold)
--> 365             return [1.0-t, t]
    366         else:
    367             return self.getOrDefault(self.thresholds)

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

What does this mean?

Furthermore, strangely (and I cannot understand why), reversing the order of the parameter settings produces the first error I posted above:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)

Why does changing the order of the "set" calls also change the outcome?

- Vanni Rovera

1

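Cannot reproduce any of your errors in Spark 2.1.1 or 2.2.0. Do you get them when merely declaring the models, as you show in your post, or do you mean that you get them when you try to actually fit the models with data? If the latter, please edit your post to clarify this and show the actual commands that produce the errors. - desertnaut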

1

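All the errors I got appeared after the code I wrote: in the first two cases they occur when declaring the model, and in the third case the error appears only when I run the command logit_abst.getThresholds(). I have edited the third case to make it clearer. - Vanni Rovera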

OK, which Spark version? - desertnaut

I am working with Spark 2.2.0. - Vanni Rovera

Notice to everyone: I have updated the question based on the four replies above. Please see the discussion with @desertnaut. - Vanni Rovera

1 Answer

Content provided by Stack Overflow.

- desertnaut · Accepted Answer

This is a messy situation indeed...

The short answer is:

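setThresholds (plural) not clearing threshold (singular) seems to be a bug
For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you would expect (and arguably you don't need it)
If all you need is some "default" thresholds of 0.5, then you don't have a problem: simply don't use any threshold-related argument or setThresholds statement
If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column of the transformed DataFrame (setThreshold(s) works fine for binary classification, though)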
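
As a rough sketch of the manual post-processing mentioned in the last point: the function below takes one row's class probabilities and a list of per-class thresholds, and picks the class with the largest ratio p/t, which is the rule the Spark documentation describes for the thresholds Param. It is plain Python rather than Spark code; the name predict_with_thresholds, the example numbers, and the restriction to strictly positive thresholds are assumptions made for illustration.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t ratio."""
    if len(probabilities) != len(thresholds):
        raise ValueError("need exactly one threshold per class")
    if any(t <= 0 for t in thresholds):
        # Assumption of this sketch: zero thresholds are not handled
        raise ValueError("this sketch only handles strictly positive thresholds")
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    # max() returns the first maximal index, so ties go to the lowest class index
    return max(range(len(ratios)), key=ratios.__getitem__)


# Hypothetical probabilities for one row of a 3-class problem
row_probs = [0.30, 0.25, 0.45]

plain_argmax = predict_with_thresholds(row_probs, [1.0, 1.0, 1.0])
stricter_on_2 = predict_with_thresholds(row_probs, [0.2, 0.2, 0.6])
print(plain_argmax, stricter_on_2)
```

With equal thresholds this reduces to a plain argmax; raising the threshold of one class makes that class harder to predict. In practice, something like this could be wrapped in a UDF and applied to the probability column.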

And now for the long answer...

Let's start with binary classification, adapting the toy data from the docs:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
     Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
     Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
     Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
     Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])

We don't need to set thresholds (plural) here, since threshold=0.7 would suffice, but it will be useful for illustrating the differences with setThreshold(s) below.

blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data

Here is the result:

+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction                             |probability                             |prediction| 
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0  |[-1.138455151184087,1.138455151184087]    |[0.242604109995602,0.757395890004398]   |1.0       |
|[1.0,2.0]|0.0  |[-0.6056346859838877,0.6056346859838877]  |[0.35305562698104337,0.6469443730189567]|0.0       | 
|[2.0,1.0]|1.0  |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0       | 
|[3.0,3.0]|0.0  |[1.6453673835702176,-1.6453673835702176]  |[0.8382639556951765,0.16173604430482344]|0.0       | 
+---------+-----+------------------------------------------+----------------------------------------+----------+

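What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the second row, where the prediction is 0.0 even though the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we set for this class (0.7), so the row is not classified as such.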
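
The decisions in the table can be reproduced by hand. A minimal sketch in plain Python, assuming the binary rule is "predict 1.0 only when P(label = 1.0) exceeds the threshold" (the helper name binary_prediction is made up here; the probabilities are copied from the probability column above):

```python
def binary_prediction(p_one, threshold=0.5):
    """Predict 1.0 only if the probability of class 1.0 is above the threshold."""
    return 1.0 if p_one > threshold else 0.0


# Second element of each `probability` vector in the table above, i.e. P(label = 1.0)
p_one_per_row = [0.757395890004398, 0.6469443730189567,
                 0.4339236440385302, 0.16173604430482344]

with_threshold = [binary_prediction(p, threshold=0.7) for p in p_one_per_row]
with_default = [binary_prediction(p) for p in p_one_per_row]

print(with_threshold)  # matches the prediction column: [1.0, 0.0, 0.0, 0.0]
print(with_default)    # [1.0, 1.0, 0.0, 0.0]: only the second row differs
```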

Now let's try what looks like the same operation, but using setThreshold(s) instead:

blor2 = (LogisticRegression()
  .setThreshold(0.7)
  .setThresholds([0.3, 0.7]) ) # works OK

blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'

Nice, huh?

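setThresholds (plural) has indeed cleared the threshold value (0.7) we set in the previous line, as the documentation claims, but it seemingly did so only to restore it to its default value of 0.5...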

Omitting .setThreshold(0.7) gives the first error you have reported yourself (not shown).
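
The "equivalent to 0.7" in the error message above refers to the single threshold that a two-element thresholds list corresponds to. The Spark documentation for getThreshold describes this as 1 / (1 + thresholds(0) / thresholds(1)); the sketch below just evaluates that formula in plain Python (the function name equivalent_threshold is an assumption, not part of the Spark API):

```python
import math


def equivalent_threshold(thresholds):
    """Single binary threshold equivalent to a two-element thresholds list."""
    if len(thresholds) != 2:
        raise ValueError("only defined for binary classification")
    t0, t1 = thresholds
    return 1.0 / (1.0 + t0 / t1)


t = equivalent_threshold([0.3, 0.7])
print(math.isclose(t, 0.7))  # True: [0.3, 0.7] corresponds to a threshold of 0.7
print(math.isclose(t, 0.5))  # False: hence the clash with a threshold of 0.5
```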

Reversing the order of the parameter settings resolves the issue (!!!), and moreover makes both getThreshold (singular) and getThresholds (plural) work (in contrast with your case):

blor2 = (LogisticRegression()
  .setThresholds([0.3, 0.7])
  .setThreshold(0.7) )

blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
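
The odd-looking 0.30000000000000004 is consistent with the getThresholds source visible in the traceback in the question: when thresholds is not set but threshold is, it returns [1.0-t, t]. That suggests that, by this point, only threshold is actually set. A small sketch of that branch in plain Python (the function thresholds_from_threshold is a stand-in written for illustration, not the PySpark method):

```python
def thresholds_from_threshold(t):
    """Mirror the `[1.0-t, t]` branch shown in the traceback in the question."""
    return [1.0 - t, t]


print(thresholds_from_threshold(0.7))  # [0.30000000000000004, 0.7]
print(thresholds_from_threshold(0.8))  # [0.19999999999999996, 0.8]
```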

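Now let's move on to the multinomial case; we will again stick to the example in the docs, with data from the Spark GitHub repository (they should also be available locally, in $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working in a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.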

data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)

As in the binary case above, where the elements of our thresholds (plural) sum to 1, let's ask for a threshold of 0.8 for class 2:

mlor = (LogisticRegression()
       .setFamily("multinomial")
       .setThresholds([0, 0.2, 0.8])
       .setThreshold(0.8) )
mlorModel= mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]

Looks fine, but let's ask for predictions on the (training) dataset:

mlorModel.transform(mdf).show(truncate=False)

I have singled out only one row; it should be the second from the end of the full output:

+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+ 
|label|features                                            |rawPrediction                                            |probability                                                    |prediction| 
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0       | 
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+

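Scrolling to the right, you will see that, although the predicted probability for class 2.0 here is below the threshold we set (0.8), the row is nevertheless predicted as 2.0, in contrast with the binary case demonstrated above...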
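
The prediction for this row is what a plain argmax over the probability vector gives, with no threshold involved. A quick check in plain Python, using the numbers from the row above (the variable names are only illustrative):

```python
# `probability` vector of the row shown above
probability = [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755]

# Index of the largest probability, i.e. a plain argmax
predicted_class = max(range(len(probability)), key=probability.__getitem__)

print(predicted_class)       # 2, matching the prediction column (2.0)
print(probability[2] < 0.8)  # True: the winning probability is below 0.8
```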

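So, what to do? Simply remove all threshold-related statements; you don't need them. Even setFamily is unnecessary, since the algorithm will detect by itself that you have more than 2 classes. This will give the same results as above: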

mlor = LogisticRegression() # works OK - no family, no threshold(s)

To summarize:

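1. In both the binary and the multinomial cases, what the algorithm actually returns is a vector of probabilities, of length equal to the number of classes, with elements that sum to 1.
2. In the binary case only, Spark allows you to go a step further and, instead of simply selecting the class with the highest probability as the prediction, apply a user-defined threshold; this can be useful, for example, when dealing with imbalanced data.
3. This threshold(s) setting has in fact no effect in the multinomial case, where Spark will always return the class with the highest probability as the prediction.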
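
The first point is easy to check against the outputs shown earlier. A small sketch in plain Python, with probability vectors copied from the two tables above (math.isclose is used because floating-point sums are rarely exactly 1):

```python
import math

probability_vectors = [
    [0.242604109995602, 0.757395890004398],      # binary case, first row
    [0.35305562698104337, 0.6469443730189567],   # binary case, second row
    [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755],  # multinomial row
]

for vec in probability_vectors:
    # Length equals the number of classes; the elements sum to (approximately) 1
    print(len(vec), math.isclose(sum(vec), 1.0))
```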

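Although the documentation is somewhat confusing (a point I have argued elsewhere) and there may be some bugs, regarding (3) I would say that this design choice makes sense; as nicely argued elsewhere: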

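The statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.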

Although the above argument was made for the binary case, it fully holds for the multinomial one as well...