PySpark Random Forest feature importance mapping after column transformations

5

I want to be able to plot the feature importances of some tree-based models using the column names. I am using PySpark.

Since I have both text categorical variables and numeric variables, I have to use a pipeline approach, roughly like this -

  1. use StringIndexer to index the string columns
  2. use OneHotEncoder on the indexed categorical columns
  3. use a VectorAssembler to create the features column containing the feature vector

    Some sample code from the docs for steps 1,2,3 -

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

    categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                          "relationship", "race", "sex", "native_country"]
    stages = []  # stages in our Pipeline
    for categoricalCol in categoricalColumns:
        # Category indexing with StringIndexer
        stringIndexer = StringIndexer(inputCol=categoricalCol,
                                      outputCol=categoricalCol + "Index")
        # Use OneHotEncoderEstimator (renamed to OneHotEncoder in Spark 3.x)
        # to convert categorical variables into binary SparseVectors
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                         outputCols=[categoricalCol + "classVec"])
        # Add stages. These are not run here, but will run all at once later on.
        stages += [stringIndexer, encoder]

    numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
                   "capital_loss", "hours_per_week"]
    assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [assembler]

    # Create a Pipeline.
    pipeline = Pipeline(stages=stages)
    # Run the feature transformations.
    #  - fit() computes feature statistics as needed.
    #  - transform() actually transforms the features.
    pipelineModel = pipeline.fit(dataset)
    dataset = pipelineModel.transform(dataset)
    
  4. finally train the model (a minimal training sketch follows this list)

    after training and eval, I can use model.featureImportances to get the feature rankings; however, I don't get the feature/column names, just the feature indices, something like this -

    print(dtModel_1.featureImportances)
    
    (38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
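For completeness, a minimal sketch of step 4, assuming the transformed dataset has a numeric label column named "label" (the label column name and the classifier parameters here are illustrative, not from the original post; the variable name dtModel_1 matches the snippet above):

    from pyspark.ml.classification import RandomForestClassifier

    # Train a random forest on the assembled "features" column.
    # "label" is an assumed column name; adjust to your dataset.
    rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
    dtModel_1 = rf.fit(dataset)
    print(dtModel_1.featureImportances)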
    
How can I map these back to the original column names and values, so that I can plot them?
3 Answers

13

Extract the metadata as shown here by user6910411:

from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in (
        chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
    )
)

and combine it with the feature importances:

[
    (name, dtModel_1.featureImportances[idx])
    for idx, name in attrs
    if dtModel_1.featureImportances[idx]
]
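
Since the goal is a plot, here is a minimal matplotlib sketch built on the (name, importance) pairs from the comprehension above (the variable names pairs, names and scores are illustrative):

import matplotlib.pyplot as plt

# Keep only the features with nonzero importance, as above.
pairs = [
    (name, dtModel_1.featureImportances[idx])
    for idx, name in attrs
    if dtModel_1.featureImportances[idx]
]
pairs.sort(key=lambda p: p[1])  # smallest ends up at the bottom of the chart
names, scores = zip(*pairs)

plt.barh(names, scores)
plt.xlabel("importance")
plt.tight_layout()
plt.show()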

Yes, I was actually able to figure it out. I did it slightly differently: I created a pandas dataframe with the idx and feature name, then converted it to a dictionary, which I broadcast. Code below. - aamirr
`pandasDF = pd.DataFrame(dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"] + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx"); feature_dict = dict(zip(pandasDF["idx"], pandasDF["name"])); feature_dict_broad = sc.broadcast(feature_dict)` - aamirr

3
The transformed dataset's metadata has the required attributes. Here is an easy way -
  1. create a pandas dataframe (the feature list is generally not huge, so there are no memory issues in storing it as a pandas DF)

    import pandas as pd

    pandasDF = pd.DataFrame(
        dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"]
        + dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
    ).sort_values("idx")
    
  2. Then create a broadcast dictionary for the mapping; broadcasting is necessary if the lookup will run in a distributed context, e.g. inside a UDF (a usage sketch follows the code).

    feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"])) 
    
    feature_dict_broad = sc.broadcast(feature_dict)
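
A hedged usage sketch for the broadcast dictionary, assuming dtModel_1 is the trained model from the question and that its featureImportances is a SparseVector (the name named_importances is illustrative):

    importances = dtModel_1.featureImportances
    # For a SparseVector, .indices lists the slots with nonzero importance.
    named_importances = [
        (feature_dict_broad.value[int(i)], importances[int(i)])
        for i in importances.indices
    ]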
    

When I do this, it doesn't show my numeric column names; it just says "numeric_feature_1", "numeric_feature_2"... I did some transformations on my numeric variables. Would that make them disappear? - Chuck

2

When creating the assembler you used a list of variables (assemblerInputs), and that order is preserved in the "features" column, so you can simply build a pandas DataFrame:

features_imp_pd = (
    pd.DataFrame(
        dtModel_1.featureImportances.toArray(),
        index=assemblerInputs,
        columns=['importance'])
)

Note that this one-to-one alignment only holds when every assembler input occupies a single slot in the vector; the one-hot encoded classVec columns in the question each expand into one slot per category, so for that pipeline the metadata-based approaches above are needed.
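
If the alignment holds, a short follow-up sketch for plotting directly from this DataFrame (pandas' plot.barh needs matplotlib installed):

import matplotlib.pyplot as plt

# Sort ascending so the most important feature ends up at the top of the chart.
ax = features_imp_pd.sort_values("importance").plot.barh(legend=False)
ax.set_xlabel("importance")
plt.tight_layout()
plt.show()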
