修剪决策树

Question

修剪决策树

23

以下是决策树的一小段，因为它非常庞大。

当节点中的最低值小于5时，如何停止树生长。这是生成决策树的代码。在SciKit - Decision Tree中，我们可以看到实现此操作的唯一方法是通过min_impurity_decrease，但我不确定它具体是如何工作的。

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a dataFrame
df = pd.DataFrame({'Feature 1':X[:,0],
                                  'Feature 2':X[:,1],
                                  'Feature 3':X[:,2],
                                  'Feature 4':X[:,3],
                                  'Feature 5':X[:,4],
                                  'Feature 6':X[:,5],
                                  'Class':y})


y_train = df['Class']
X_train = df.drop('Class',axis = 1)

dt = DecisionTreeClassifier( random_state=42)                
dt.fit(X_train, y_train)

from IPython.display import display, Image
import pydotplus
from sklearn import tree
from sklearn.tree import _tree
from sklearn import tree
import collections
import drawtree
import os  

os.environ["PATH"] += os.pathsep + 'C:\\Anaconda3\\Library\\bin\\graphviz'

dot_data = tree.export_graphviz(dt, out_file = 'thisIsTheImagetree.dot',
                                 feature_names=X_train.columns, filled   = True
                                    , rounded  = True
                                    , special_characters = True)

graph = pydotplus.graph_from_dot_file('thisIsTheImagetree.dot')  

thisIsTheImage = Image(graph.create_png())
display(thisIsTheImage)
#print(dt.tree_.feature)

from subprocess import check_call
check_call(['dot','-Tpng','thisIsTheImagetree.dot','-o','thisIsTheImagetree.png'])

更新

我认为min_impurity_decrease可以在某种程度上帮助达成目标。因为微调min_impurity_decrease实际上会修剪决策树。有人能否详细解释一下min_impurity_decrease是什么？

我正在尝试理解scikit learn中的方程式，但我不确定right_impurity和left_impurity的值是多少。

N = 256
N_t = 256
impurity = ??
N_t_R = 242
N_t_L = 14
right_impurity = ??
left_impurity = ??

New_Value = N_t / N * (impurity - ((N_t_R / N_t) * right_impurity)
                    - ((N_t_L / N_t) * left_impurity))
New_Value

更新2

我们不是按照特定值来修剪，而是根据特定条件进行修剪。例如，我们在6/4和5/5处进行分裂，但不在6000/4或5000/5处进行分裂。假设一个值与其节点中相邻的值相比下降了一定百分比，我们将对其进行修剪，而不是按照某个特定值。

      11/9
   /       \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

- user9238790

该值代表什么？min_impurity_decrease适用于在某个节点可能发生的分裂，并且不考虑当前节点的值，而是考虑如果分裂节点后子节点中纯度的增加。 - SBylemans

左侧和右侧的杂质分别是左子节点和右子节点中样本的杂质（由 criterion 参数计算）。 - SBylemans

我认为你可能无法使用SciKit中的决策树来完成它，除非你知道当值小于5时的最大深度或样本数。也许在构建后遍历树是可能的吗？分类器对象的tree_中包含了这棵树。 - SBylemans

你需要指定你使用的准则：基尼系数或熵。你不能实现自己的函数。 - SBylemans

@KenSyme min_samples_split 是节点中的总样本数。问题是节点中的最小值。例如 [400, 1]，您希望它停止分裂。使用 min_samples_split，它仍然会分裂。 - user9238790

显示剩余5条评论

4个回答

2

编辑：正如@SBylemans和@Viktor在评论中指出的那样，这是不正确的。我不会删除答案，因为其他人可能也认为这是解决方案。

将min_samples_leaf设置为5。

min_samples_leaf：

要求在叶节点上的最小样本数：

更新：我认为不能使用min_impurity_decrease来完成。考虑以下情况：

      11/9
   /         \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

根据你的规则，你不想分裂节点6/4，因为4小于5，但你想要分裂5/5节点。然而，分裂6/4节点可以获得0.48的信息增益，而分裂5/5节点没有信息增益。

- Seljuk Gulcan

他并没有说5代表节点中的样本数量。 - SBylemans

@Selcuk，这是不正确的！min_samples_leaf在叶节点中。问题是，查看节点中的最小值，如果它低于某个值，则停止分裂。 - user9238790

@Victor，我明白了。我误解了问题，但是我认为您想要的无法使用 min_impurity_decrease 来完成。请查看答案更新部分中的示例。 - Seljuk Gulcan

很好的解释 - user9238790

@SelçukGülcan，我在更新部分稍微修改了问题。我们能否将其更改为另一个的百分比？例如，在6/4和5/5处拆分，但不在6000/4或5000/5处拆分。假设，如果一个值低于某个百分比，而不是某个特定值。 - user9238790

1

有趣的是，min_impurity_decrease 看起来似乎不允许在您提供的代码片段中展示任何节点的增长（分裂后的杂质之和等于预分裂的杂质，因此没有杂质减少）。但是，虽然它无法给您完全符合要求的结果（如果最低值小于5，则终止节点），但可能会给您类似的结果。

如果我的测试正确，官方文档使其看起来比实际情况更复杂。只需从潜在父节点中取出较小的值，然后减去所提议的新节点的较小值之和-这是总体杂质减少量。然后将其除以整个树中的样本总数-这将为您提供如果节点被分裂所达到的分数杂质减少。

如果你有1000个样本，而一个节点的值低于5（即5个“不纯度”），那么5/1000代表了如果该节点被完美地分裂，你可以实现的最大不纯度减少。因此，将min_impurity_decrease设置为0.005将近似停止叶子节点的不纯度小于5。实际上，它会停止大多数不纯度略高于5的叶子节点（取决于所提议的分裂导致的不纯度），因此它只是一种近似，但据我所知，这是在不进行后修剪的情况下最接近的结果。

- James

1

在Scikit learn库中，您有一个名为ccp_alpha的参数作为DescissionTreeClassifier的参数。使用此参数，您可以对DecessionTrees进行后复杂度修剪。查看此https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html。

- kdkarthik

如果您能详细阐述答案或者添加一些例子，我肯定会点赞这个回答。 - gneusch

相信这会对你有所帮助 https://www.youtube.com/watch?v=SLOyyFHbiqo - kdkarthik

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- David Dale · Accepted Answer

使用min_impurity_decrease或任何其他内置停止标准无法直接限制叶子的最低值（特定类别出现次数）。

如果不更改scikit-learn源代码，我认为你唯一能实现这一点的方法是对树进行后修剪。为此，您可以遍历树并删除具有小于5个最小类计数（或您可以考虑的任何其他条件）的节点的所有子节点。我将继续您的示例：

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    if inner_tree.value[index].min() < threshold:
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are shildren, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(dt.tree_.children_left < 0))
# start pruning from the root
prune_index(dt.tree_, 0, 5)
sum(dt.tree_.children_left < 0)

这段代码将首先打印出 74，然后是 91。这意味着该代码已经通过实际删除其祖先的链接创建了 17 个新叶节点。树在之前看起来像：

现在看起来像：

所以你可以看到它确实减少了很多。