EFDT
class capymoa.classifier.EFDT
Bases: MOAClassifier
Extremely Fast Decision Tree (EFDT) Classifier
Also known as the Hoeffding AnyTime Tree (HATT) classifier. Despite the name, EFDT is typically slower than a vanilla Hoeffding Tree at processing data, because EFDT continually re-evaluates the splits it has already made. In return, EFDT has theoretical properties that ensure it converges faster than the vanilla Hoeffding Tree to the structure that a batch decision tree learner (such as Classification and Regression Trees, CART) would build. These guarantees hold only for stationary data streams. On non-stationary data, EFDT is somewhat robust to concept drift because it continually revisits and updates its internal tree structure. Even so, the Hoeffding Adaptive Tree may be a better option in such cases, as it was specifically designed to handle non-stationarity.
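The difference between the two split rules can be sketched in plain Python. This is a simplified illustration and not the CapyMOA or MOA implementation: the function names and the gain values are hypothetical, tie-breaking is left out, and the merit is assumed to be information gain on a two-class problem (range 1.0). A vanilla Hoeffding Tree waits until the best attribute beats the runner-up by more than the Hoeffding bound, whereas EFDT splits as soon as the best attribute beats the option of not splitting, and later re-checks that choice.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: sqrt(R^2 * ln(1 / delta) / (2 * n))."""
    return math.sqrt(value_range**2 * math.log(1.0 / delta) / (2.0 * n))


def hoeffding_tree_should_split(best: float, second_best: float, eps: float) -> bool:
    # Vanilla Hoeffding Tree: the best attribute must beat the runner-up by more than eps.
    return best - second_best > eps


def efdt_should_split(best: float, no_split: float, eps: float) -> bool:
    # EFDT: the best attribute only has to beat the "do not split" option.
    return best - no_split > eps


def efdt_should_replace_split(best: float, current: float, eps: float) -> bool:
    # EFDT re-evaluation at an internal node: replace the installed split
    # when another attribute now beats it by more than eps.
    return best - current > eps


# Hypothetical merits observed at a leaf after 200 instances.
eps = hoeffding_bound(value_range=1.0, delta=0.001, n=200)
best_gain, second_best_gain, no_split_gain = 0.20, 0.15, 0.0

print(round(eps, 4))                                                 # 0.1314
print(hoeffding_tree_should_split(best_gain, second_best_gain, eps)) # False
print(efdt_should_split(best_gain, no_split_gain, eps))              # True
```

With these numbers the vanilla rule keeps waiting while the EFDT rule splits immediately, which is why EFDT commits to structure earlier and then needs the extra re-evaluation work to correct early mistakes.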
Reference:
Example usage:
>>> from capymoa.datasets import ElectricityTiny
>>> from capymoa.classifier import EFDT
>>> from capymoa.evaluation import prequential_evaluation
>>> stream = ElectricityTiny()
>>> schema = stream.get_schema()
>>> learner = EFDT(schema)
>>> results = prequential_evaluation(stream, learner, max_instances=1000)
>>> results["cumulative"].accuracy()
84.39999999999999
__init__(
    schema: Schema | None = None,
    random_seed: int = 0,
    grace_period: int = 200,
    min_samples_reevaluate: int = 200,
    split_criterion: str | SplitCriterion = 'InfoGainSplitCriterion',
    confidence: float = 0.001,
    tie_threshold: float = 0.05,
    leaf_prediction: str = 'NaiveBayesAdaptive',
    nb_threshold: int = 0,
    numeric_attribute_observer: str = 'GaussianNumericAttributeClassObserver',
    binary_split: bool = False,
    max_byte_size: float = 33554433,
    memory_estimate_period: int = 1000000,
    stop_mem_management: bool = True,
    remove_poor_attrs: bool = False,
    disable_prepruning: bool = True,
)
Construct an Extremely Fast Decision Tree (EFDT) Classifier
Parameters:
schema – The schema of the stream.
random_seed – The random seed passed to the MOA learner.
grace_period – Number of instances a leaf should observe between split attempts.
min_samples_reevaluate – Number of instances a node should observe before re-evaluating the best split.
split_criterion – Split criterion to use. Defaults to InfoGainSplitCriterion.
confidence – Allowed error probability (delta) used to calculate the Hoeffding bound; the corresponding confidence level is 1 - delta. Values closer to zero imply longer split decision delays.
tie_threshold – Threshold below which a split will be forced to break ties.
leaf_prediction – Prediction mechanism used at the leaves (“MajorityClass” or 0, “NaiveBayes” or 1, “NaiveBayesAdaptive” or 2).
nb_threshold – Number of instances a leaf should observe before allowing Naive Bayes.
numeric_attribute_observer – The Splitter or Attribute Observer (AO) used to monitor the class statistics of numeric features and perform splits.
binary_split – If True, only allow binary splits.
max_byte_size – The max size of the tree, in bytes.
memory_estimate_period – Interval (number of processed instances) between memory consumption checks.
stop_mem_management – If True, stop growing as soon as memory limit is hit.
remove_poor_attrs – If True, disable poor attributes to reduce memory usage.
disable_prepruning – If True, disable merit-based tree pre-pruning.
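To give a feel for how confidence and tie_threshold interact, the sketch below inverts the standard Hoeffding bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), to estimate how many instances a node must observe before the bound drops below a target value. It is an illustration only, not part of the CapyMOA API: the helper name `instances_for_bound` is hypothetical, and the merit range R = 1.0 assumes information gain on a two-class problem.

```python
import math


def instances_for_bound(epsilon: float, delta: float, value_range: float = 1.0) -> int:
    """Smallest n for which the Hoeffding bound is at most `epsilon`.

    Obtained by solving epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n)) for n.
    """
    n = value_range**2 * math.log(1.0 / delta) / (2.0 * epsilon**2)
    return math.ceil(n)


# Defaults from the signature: confidence=0.001, tie_threshold=0.05.
n_default = instances_for_bound(epsilon=0.05, delta=0.001)

# A much smaller confidence value (delta) needs more evidence for the same bound.
n_strict = instances_for_bound(epsilon=0.05, delta=1e-7)

print(n_default)  # 1382
print(n_strict)   # 3224
```

Under these assumptions, a node whose candidate splits stay tied would wait roughly 1382 instances before the bound falls below the default tie_threshold, and more than twice as long with delta = 1e-7, which matches the note that values of confidence closer to zero delay split decisions.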