HoeffdingTree#
- class capymoa.classifier.HoeffdingTree[source]#
Bases:
MOAClassifier
Hoeffding Tree.
Hoeffding Tree (VFDT) [1] is a tree classifier. A Hoeffding tree is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating examples does not change over time. Hoeffding trees exploit the fact that a small sample is often enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations (in our case, examples) needed to estimate some statistic within a prescribed precision (in our case, the goodness of an attribute).
A theoretically appealing feature of Hoeffding Trees, not shared by other incremental decision tree learners, is that they come with sound performance guarantees. Using the Hoeffding bound, one can show that their output is asymptotically nearly identical to that of a non-incremental learner trained on infinitely many examples.
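To make the bound concrete: for a statistic with range R estimated from n observations, the Hoeffding bound states that with probability 1 - delta the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean. A minimal illustrative sketch (the hoeffding_bound helper below is not part of the capymoa API):

import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability 1 - delta, the mean observed over n samples of a
    # random variable with the given range lies within this epsilon of
    # the true mean.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Information gain on a two-class problem has range log2(2) = 1.
# With delta = 1e-3 (see the confidence parameter below), the bound
# tightens as a leaf observes more examples:
for n in (200, 1_000, 10_000):
    print(f"n={n:>6}  epsilon={hoeffding_bound(1.0, 1e-3, n):.4f}")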
>>> from capymoa.classifier import HoeffdingTree
>>> from capymoa.datasets import ElectricityTiny
>>> from capymoa.evaluation import prequential_evaluation
>>>
>>> stream = ElectricityTiny()
>>> classifier = HoeffdingTree(stream.get_schema())
>>> results = prequential_evaluation(stream, classifier, max_instances=1000)
>>> print(f"{results['cumulative'].accuracy():.1f}")
84.4
- __init__(
- schema: Schema | None = None,
- random_seed: int = 0,
- grace_period: int = 200,
- split_criterion: str | SplitCriterion = 'InfoGainSplitCriterion',
- confidence: float = 1e-3,
- tie_threshold: float = 0.05,
- leaf_prediction: str | int = 'NaiveBayesAdaptive',
- nb_threshold: int = 0,
- numeric_attribute_observer: str = 'GaussianNumericAttributeClassObserver',
- binary_split: bool = False,
- max_byte_size: float = 33554433,
- memory_estimate_period: int = 1000000,
- stop_mem_management: bool = True,
- remove_poor_attrs: bool = False,
- disable_prepruning: bool = True,
- )
Construct Hoeffding Tree.
- Parameters:
schema – Stream schema.
random_seed – Seed for reproducibility.
grace_period – Number of instances a leaf should observe between split attempts.
split_criterion – Split criterion to use.
confidence – Significance level to calculate the Hoeffding bound. The significance level is given by 1 - delta. Values closer to zero imply longer split decision delays.
tie_threshold – Threshold below which a split will be forced to break ties.
leaf_prediction – Prediction mechanism used at the leaves.
nb_threshold – Number of instances a leaf should observe before allowing Naive Bayes.
numeric_attribute_observer – The Splitter or Attribute Observer (AO) used to monitor the class statistics of numeric features and perform splits.
binary_split – If True, only allow binary splits.
max_byte_size – The max size of the tree, in bytes.
memory_estimate_period – Interval (number of processed instances) between memory consumption checks.
stop_mem_management – If True, stop growing as soon as memory limit is hit.
remove_poor_attrs – If True, disable poor attributes to reduce memory usage.
disable_prepruning – If True, disable merit-based tree pre-pruning.
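An illustrative construction sketch; the parameter values below are arbitrary, and 'MajorityClass' is assumed to be an accepted leaf_prediction option alongside the documented 'NaiveBayesAdaptive' default:

from capymoa.classifier import HoeffdingTree
from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()
tree = HoeffdingTree(
    schema=stream.get_schema(),
    grace_period=100,       # attempt a split every 100 instances per leaf
    confidence=1e-2,        # looser bound, so split decisions are made earlier
    tie_threshold=0.1,      # break ties between near-equal attributes sooner
    leaf_prediction='MajorityClass',  # assumed option name; default is 'NaiveBayesAdaptive'
)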
- predict(instance)[source]#
Predict the label of an instance.
The base implementation calls predict_proba() and returns the label with the highest probability.
- Parameters:
instance – The instance to predict the label for.
- Returns:
The predicted label or None if the classifier is unable to make a prediction.
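A minimal usage sketch, assuming the stream interface shown in the quickstart above (get_schema() and next_instance()):

from capymoa.classifier import HoeffdingTree
from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()
classifier = HoeffdingTree(stream.get_schema())

# Train on a few instances, then predict the label of the next one.
for _ in range(100):
    classifier.train(stream.next_instance())
prediction = classifier.predict(stream.next_instance())
print(prediction)  # predicted label, or None if no prediction is possible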
- predict_proba(instance)[source]#
Return probability estimates for each label.
- Parameters:
instance – The instance to estimate the probabilities for.
- Returns:
An array of probabilities for each label or None if the classifier is unable to make a prediction.
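A short sketch of the relationship with predict() described above: the returned array has one entry per class, and the entry with the highest probability corresponds to the label that predict() returns.

import numpy as np
from capymoa.classifier import HoeffdingTree
from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()
classifier = HoeffdingTree(stream.get_schema())
classifier.train(stream.next_instance())

instance = stream.next_instance()
proba = classifier.predict_proba(instance)
print(proba)                  # one probability estimate per class
print(int(np.argmax(proba)))  # the class with the highest estimate is what predict() picks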
- train(instance)[source]#
Train the classifier with a labeled instance.
- Parameters:
instance – The labeled instance to train the classifier with.
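Combined with predict(), this supports a simple test-then-train (prequential) loop. A minimal sketch, assuming instances expose their label index via y_index and that predict() returns a matching index, as in the capymoa tutorials:

from capymoa.classifier import HoeffdingTree
from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()
classifier = HoeffdingTree(stream.get_schema())

correct = total = 0
while stream.has_more_instances() and total < 1000:
    instance = stream.next_instance()
    if classifier.predict(instance) == instance.y_index:  # test first ...
        correct += 1
    classifier.train(instance)                            # ... then train
    total += 1
print(f"Test-then-train accuracy: {correct / total:.3f}")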
- random_seed: int#
The random seed for reproducibility.
When implementing a classifier, ensure random number generators are seeded.