Sparklyr: extracting conditional probabilities from a naive Bayes model


I fit a naive Bayes model with ml_naive_bayes in sparklyr, like this:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = 'local')
d <- structure(list(response = c(0L, 0L, 1L, 1L, 1L, 1L, 0L), state = structure(c(3L, 
2L, 2L, 1L, 2L, 3L, 3L), .Label = c("CA", "IL", "NY"), class = "factor"), 
    job_level = c("a", "a", "a", "b", "b", "a", "c"), sex = structure(c(2L, 
    1L, 2L, 1L, 2L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("response", 
"state", "job_level", "sex"), class = "data.frame", row.names = c(NA, 
-7L))
d_tbl <- copy_to(sc, d, "d")

nb_formula <- formula(response ~ state + job_level + sex)
model <- ml_naive_bayes(d_tbl, nb_formula)

If I print the model, I can see the conditional probabilities:

> model
Call: ml_naive_bayes(d_tbl, nb_formula)

A-priority probabilities:
[1] 0.4285714 0.5714286

Conditional probabilities:
                 [,1]      [,2]
state_IL    0.1666667 0.2857143
state_NY    0.3333333 0.1428571
job_level_b 0.0000000 0.2857143
job_level_c 0.1666667 0.0000000
sex_m       0.3333333 0.2857143

How can I extract these conditional probabilities into an object of their own? I can't find them in names(model) or str(model):
> names(model)
 [1] "pi"                          "theta"                      
 [3] "features"                    "response"                   
 [5] "data"                        "ml.options"                 
 [7] "categorical.transformations" "model.parameters"           
 [9] ".call"                       ".model"    
> 
> str(model)
List of 10
 $ pi                         : num [1:2] -0.847 -0.56
 $ theta                      : num [1:5, 1:2] -1.79 -1.1 -Inf -1.79 -1.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "state_IL" "state_NY" "job_level_b" "job_level_c" ...
  .. ..$ : NULL
 $ features                   : chr [1:5] "state_IL" "state_NY" "job_level_b" "job_level_c" ...
 $ response                   : chr "response"
 $ data                       :Classes 'spark_jobj', 'shell_jobj' <environment: 0x7fd3a0b46958> 
 $ ml.options                 :List of 7
  ..$ id.column      : chr "idaf71584c7394"
  ..$ response.column: chr "responseaf7133826d6"
  ..$ features.column: chr "featuresaf715b7dad40"
  ..$ output.column  : chr "outputaf7117f973ad"
  ..$ model.transform: NULL
  ..$ only.model     : logi FALSE
  ..$ na.action      : chr "na.omit"
  ..- attr(*, "class")= chr "ml_options"
 $ categorical.transformations:<environment: 0x7fd3a1568d58> 
 $ model.parameters           :List of 6
  ..$ features: chr "featuresaf715b7dad40"
  ..$ labels  : NULL
  ..$ response: chr "responseaf7133826d6"
  ..$ output  : chr "outputaf7117f973ad"
  ..$ id      : chr "idaf71584c7394"
  ..$ model   : chr "org.apache.spark.ml.classification.NaiveBayes"
 $ .call                      : language ml_naive_bayes(d_tbl, nb_formula)
 $ .model                     :Classes 'spark_jobj', 'shell_jobj' <environment: 0x7fd3a196fb40> 
 - attr(*, "class")= chr [1:2] "ml_model_naive_bayes" "ml_model"

Is there perhaps a method similar to sdf_predict that can extract them?
1 Answer


If you look at the print method used for this object

sparklyr:::print.ml_model_naive_bayes

you can see that the conditional probabilities are simply the exponential of theta:
printf("Conditional probabilities:\n")
print(exp(x$theta))

So you should be able to do
exp(model$theta)
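
As a minimal sketch that runs without a Spark connection, the block below rebuilds stand-ins for model$theta and model$pi from the numbers printed in the question, then exponentiates them. The names theta, log_prior, cond_prob and cond_prob_df are illustrative, and the column labels class_1 and class_2 are placeholders I chose, since the model stores no column names. With a live model you would use model$theta and model$pi directly.

```r
# Probabilities copied from the printed model in the question.
printed <- matrix(
  c(0.1666667, 0.2857143,
    0.3333333, 0.1428571,
    0.0000000, 0.2857143,
    0.1666667, 0.0000000,
    0.3333333, 0.2857143),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("state_IL", "state_NY", "job_level_b", "job_level_c", "sex_m"),
    NULL
  )
)

# Stand-ins for model$theta and model$pi, which are stored on the log scale.
theta     <- log(printed)                    # log(0) gives -Inf, as in str(model)
log_prior <- log(c(0.4285714, 0.5714286))

# The extraction step from the answer: undo the log.
cond_prob <- exp(theta)
prior     <- exp(log_prior)

# Placeholder column labels (an assumption; theta has none).
colnames(cond_prob) <- c("class_1", "class_2")

# Optional: a tidy data frame with one row per feature.
cond_prob_df <- data.frame(feature = rownames(cond_prob), cond_prob,
                           row.names = NULL)
print(cond_prob_df)
```

The same exp() call applies to model$pi, which is why the "A-priority probabilities" line and the log-scale values in str(model) describe the same quantities.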

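As a follow-up: do you know what the two columns in the output of exp(model$theta) represent? Are they P(feature|response==F) and P(feature|response==T)? I can't seem to find any documentation explaining this model output. - Steve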
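
One way to probe the column question is to recompute the numbers by hand in base R. The sketch below rebuilds the example data, dummy-codes it the way the feature names suggest (the first level of each variable is dropped), and divides each per-class feature count by the total feature count in that class. This is an inference from the printed values, not documented sparklyr behavior: it assumes a multinomial estimate with no smoothing, and the helper names x, counts and manual are made up for illustration.

```r
# Same data as in the question, written out explicitly.
d <- data.frame(
  response  = c(0L, 0L, 1L, 1L, 1L, 1L, 0L),
  state     = c("NY", "IL", "IL", "CA", "IL", "NY", "NY"),
  job_level = c("a", "a", "a", "b", "b", "a", "c"),
  sex       = c("m", "f", "m", "f", "m", "f", "m")
)

# 0/1 indicator columns matching the feature names in the model output.
x <- cbind(
  state_IL    = d$state == "IL",
  state_NY    = d$state == "NY",
  job_level_b = d$job_level == "b",
  job_level_c = d$job_level == "c",
  sex_m       = d$sex == "m"
) * 1

# Sum each indicator within each response class (rows "0" and "1").
counts <- rowsum(x, group = d$response)

# Share of each feature among all feature counts in its class,
# transposed so that features are rows, as in exp(model$theta).
manual <- t(counts / rowSums(counts))
print(round(manual, 7))

# Class proportions, for comparison with exp(model$pi).
prior <- as.numeric(table(d$response)) / nrow(d)
print(prior)
```

Under these assumptions the column for response 0 reproduces [,1] of the printed table and the column for response 1 reproduces [,2], and the class proportions match the printed priors. That suggests the columns follow the response classes in the order 0, 1, and that each entry is a feature's share within a class rather than a Bernoulli-style P(feature = 1 | class).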
