Why is the pickle of a sklearn decision tree so large (30,000 times bigger than the original)?

Question

Tags: python, scikit-learn, pickle, decision-tree

7

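Why can pickling a sklearn decision tree generate a pickle that is thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work, where a random forest estimator (with 100 decision trees), trained on a dataset of about 1,000,000 samples and 7 features, generated a pickle larger than 2 GB.

I was able to trace the issue down to the pickling of a single decision tree, and I was able to reproduce it with the generated dataset below.

For the memory estimates I used the pympler library. The sklearn version used is 1.0.1.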

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
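    # max_features=None considers all features, which is what 'auto' meant for
    # regressors in sklearn 1.0.1; later releases removed the 'auto' option
    estimator = DecisionTreeRegressor(max_depth=50, max_features=None, min_samples_split=5)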
    estimator.fit(X, y)
    return X, y, estimator

# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")
    
_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

Output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84

- pietroppeter

If I had to guess, I would say that maybe the full subtree gets pickled for every node in the tree, which would be very redundant. - pietroppeter

An interesting article on how to use pickletools to "disassemble" a pickle: https://rushter.com/blog/pickle-serialization-internals/ - pietroppeter
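To probe the redundancy guess from the comment above, here is a minimal sketch that uses the standard-library pickletools module to count the opcodes in a pickle. The `container` and `shared` objects are hypothetical stand-ins, not sklearn trees. The sketch only illustrates that pickle memoizes an object that is referenced twice instead of serializing it twice.

```python
import pickle
import pickletools
from collections import Counter


def opcode_histogram(obj) -> Counter:
    """Count how often each pickle opcode appears in the pickle of `obj`."""
    payload = pickle.dumps(obj)
    # genops yields (opcode, argument, position) triples for a pickle byte string
    return Counter(opcode.name for opcode, _arg, _pos in pickletools.genops(payload))


# Hypothetical stand-in for a "subtree" that is referenced from two places
shared = list(range(1000))
container = {"left": shared, "right": shared}

hist = opcode_histogram(container)
payload_size = len(pickle.dumps(container))
single_size = len(pickle.dumps(shared))

print(hist.most_common(5))
print(f"container pickle: {payload_size} bytes, single list pickle: {single_size} bytes")
```

If the second reference were serialized in full, the container pickle would be roughly twice the size of the single list; instead a memo lookup opcode appears and the two sizes stay close.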

2 Answers

1

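As pointed out by @pygeek's answer and the subsequent comments, the wrong assumption in the question is that pickle substantially increases the size of the object. The real issue is that pympler.asizeof does not correctly estimate the size of the tree object.

Indeed, a DecisionTreeRegressor object has a tree_ attribute that holds several arrays of length tree_.node_count. help(sklearn.tree._tree.Tree) shows that there are 8 such arrays (value, children_left, children_right, feature, impurity, threshold, n_node_samples, weighted_n_node_samples). The underlying type of each array (except possibly the value array, see the note below) should be a 64-bit integer or a 64-bit float, since the underlying Tree object is a Cython object. A better estimate of the size of a DecisionTree is therefore estimator.tree_.node_count*8*8 bytes: 8 arrays with 8 bytes per node each.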

Computing this estimate for the models above:

def print_tree_estimate(tree):
    print(f"A tree with max_depth {tree.max_depth} can have up to {2**(tree.max_depth -1)} nodes")
    print(f"This tree has node_count {tree.node_count} and a size estimate is {readable_size(tree.node_count*8*8)}")
    
print_tree_estimate(model100K.tree_)
print()
print_tree_estimate(model1M.tree_)

This outputs:

A tree with max_depth 37 can have up to 68719476736 nodes
This tree has node_count 80159 and a size estimate is 4.9 MB

A tree with max_depth 46 can have up to 35184372088832 nodes
This tree has node_count 807881 and a size estimate is 49.3 MB

Indeed, these estimates match the sizes of the pickled objects.
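One way to cross-check the estimate without pympler is to sum the nbytes of the per-node arrays and compare the result with the pickle size. The following is a minimal sketch on a small synthetic dataset (the data and parameters are assumptions chosen only to keep it fast). It assumes a 64-bit platform, and newer sklearn versions may store extra per-node fields, so the pickle can be somewhat larger than the sum.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small synthetic data set (hypothetical, only for illustration)
rng = np.random.default_rng(0)
X = rng.random((2_000, 7))
y = X[:, 0] + rng.normal(scale=0.1, size=2_000)

estimator = DecisionTreeRegressor(min_samples_split=5, random_state=0).fit(X, y)
tree = estimator.tree_

# The eight per-node arrays exposed by the Cython Tree object
array_names = [
    "value", "children_left", "children_right", "feature",
    "impurity", "threshold", "n_node_samples", "weighted_n_node_samples",
]
array_bytes = sum(getattr(tree, name).nbytes for name in array_names)
pickle_bytes = len(pickle.dumps(estimator))

print(f"node_count:      {tree.node_count}")
print(f"sum of nbytes:   {array_bytes}")
print(f"node_count*8*8:  {tree.node_count * 8 * 8}")
print(f"pickle size:     {pickle_bytes}")
```

On such a single-output regressor the sum of nbytes should equal node_count*8*8, and the pickle should be only slightly larger because of serialization overhead.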

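Note also that a guaranteed way to limit the size of a DecisionTree is to limit max_depth, since the number of nodes of a binary tree is bounded in terms of its depth. With the root at depth 0, the exact bound is 2**(max_depth + 1) - 1 nodes, so the 2**(max_depth - 1) figure printed above understates it; either way, the trees above have far fewer nodes than this theoretical bound. (Parameters such as max_leaf_nodes also bound the node count directly.)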
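As a worked check on the bound discussed above, here is a short sketch. It assumes the convention that the root sits at depth 0 and that every internal node has exactly two children; the helper names are hypothetical and not part of sklearn.

```python
def max_nodes_for_depth(depth: int) -> int:
    """Upper bound on the node count of a binary tree whose root has depth 0."""
    return 2 ** (depth + 1) - 1


def max_nodes_for_leaves(max_leaf_nodes: int) -> int:
    """Node count of a tree with that many leaves when every split is binary."""
    return 2 * max_leaf_nodes - 1


# Depths and node counts reported in the output above
for depth, node_count in [(37, 80159), (46, 807881)]:
    bound = max_nodes_for_depth(depth)
    print(f"depth {depth}: bound {bound}, actual {node_count}, "
          f"fraction used {node_count / bound:.2e}")
```

The fraction of the bound actually used is tiny, which is why the depth limit alone says little about the real size of these trees.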

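Note: the estimate above is valid only for a decision tree regressor with a single output and no classes. estimator.tree_.value is an array of shape [node_count, n_outputs, max_n_classes], so for n_outputs > 1 and/or max_n_classes > 1 the size estimate needs to take these into account, and a more accurate estimate is estimator.tree_.node_count*8*(7 + n_outputs*max_n_classes).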
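The generalized formula in the note can be written as a small helper. This is a sketch: the function name estimate_tree_bytes is hypothetical, and it assumes every array entry is 8 bytes wide, as stated above.

```python
def estimate_tree_bytes(node_count: int, n_outputs: int = 1, max_n_classes: int = 1) -> int:
    """Rough size estimate in bytes, assuming every array entry is 8 bytes wide."""
    fixed_arrays = 7  # all per-node arrays except `value`
    value_entries = n_outputs * max_n_classes  # entries per node in `value`
    return node_count * 8 * (fixed_arrays + value_entries)


# Node counts reported above for the two regressors (single output, no classes)
print(estimate_tree_bytes(80159))   # 5130176 bytes, about 4.9 MB
print(estimate_tree_bytes(807881))  # 51704384 bytes, about 49.3 MB

# Hypothetical classifier with the same node count and 3 classes
print(estimate_tree_bytes(80159, n_outputs=1, max_n_classes=3))
```

With the default arguments the helper reduces to node_count*8*8, so it reproduces the estimates computed earlier for the two regressors.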

- pietroppeter
