0. Getting started with CapyMOA#

This notebook shows some basic usage of CapyMOA for supervised learning (classification and regression).

  • There are more detailed notebooks and documentation available; our goal here is just to present some high-level functions and demonstrate a subset of CapyMOA’s functionalities.

  • For simplicity, we simulate data streams in the following examples by reading data from files and employing synthetic generators.


More information about CapyMOA can be found at https://www.capymoa.org

Last updated on 03/05/2024

1. Classification#

  • Classification for data streams traditionally assumes that instances arrive at the classifier incrementally and that an instance’s label becomes available before the next instance arrives

  • It is common to simulate this behavior using a while loop, often referred to as the test-then-train loop, which contains four distinct steps:

    1. Fetch the next instance from the stream

    2. Make a prediction

    3. Train the model with the instance

    4. Update the mechanism that keeps track of evaluation metrics

Some remarks about the test-then-train loop:

  • We must not train before testing, i.e. steps 2 and 3 should not be interchanged, as this would invalidate our interpretation of how the model performs on unseen data, leading to unreliable evaluations of its efficacy.

  • Steps 3 and 4 can be completed in any order without altering the result.

  • What if labels are not immediately available? Then you might want to read about delayed labeling and partially labeled data; see A Survey on Semi-supervised Learning for Delayed Partially Labelled Data Streams

  • More information on classification for data streams is available in section 2.2 Classification of the Machine Learning for Data Streams book
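The loop structure above can be sketched without any streaming library at all. The snippet below is a toy, library-agnostic illustration (`MajorityClassLearner` is a hypothetical stand-in, not a CapyMOA class); it shows why prediction must happen before training:

```python
# Library-agnostic sketch of the test-then-train loop.
# MajorityClassLearner is a toy stand-in for a real stream classifier.
class MajorityClassLearner:
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        # Predict the most frequent label seen so far (None before any training).
        return max(self.counts, key=self.counts.get) if self.counts else None

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([0.1], 1), ([0.4], 0), ([0.2], 1), ([0.9], 1)]  # (features, label) pairs
learner = MajorityClassLearner()
correct = total = 0
for x, y in stream:            # 1. fetch the next instance
    pred = learner.predict(x)  # 2. predict BEFORE training
    learner.train(x, y)        # 3. train on the now-labelled instance
    total += 1                 # 4. update the evaluation metric
    correct += (pred == y)

print(f"test-then-train accuracy = {100 * correct / total:.1f}%")
```

If steps 2 and 3 were swapped, the model would be evaluated on instances it had already seen, inflating the reported accuracy.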

[1]:
from capymoa.stream import stream_from_file
from capymoa.evaluation import ClassificationEvaluator
from capymoa.classifier import OnlineBagging

elec_stream = stream_from_file(path_to_csv_or_arff="../data/electricity.csv")
ob_learner = OnlineBagging(schema=elec_stream.get_schema(), ensemble_size=5)
ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

while elec_stream.has_more_instances():
    instance = elec_stream.next_instance()
    prediction = ob_learner.predict(instance)
    ob_learner.train(instance)
    ob_evaluator.update(instance.y_index, prediction)

print(ob_evaluator.accuracy())
79.05190677966102

1.1 High-level evaluation functions#

  • If our goal is just to evaluate learners, it would be tedious to keep writing the test-then-train loop. Thus, it makes sense to encapsulate that loop inside high-level evaluation functions.

  • Furthermore, sometimes we are interested in cumulative metrics and sometimes in windowed metrics. For example, if we want to know how accurate our model has been so far, considering all the instances it has seen, we would look at its cumulative metrics. However, we might also be interested in how well the model performs over every window of n instances, so that we can, for example, identify periods in which it was really struggling to produce correct predictions.

  • In this example, we use the prequential_evaluation function, which provides us with both the cumulative and the windowed metrics! If you are only interested in the test-then-train evaluation or the windowed evaluation, there are high-level functions for those as well (see the remarks below).

  • Some remarks:

    • If you want to know more about other high-level evaluation functions, evaluators, or which metrics are available, check the complete Evaluation documentation at http://www.capymoa.org

    • The results from evaluation functions such as prequential_evaluation follow a standard format, also discussed thoroughly in the Evaluation documentation at http://www.capymoa.org

    • Sometimes authors refer to the cumulative metrics as test-then-train metrics, such as test-then-train accuracy (or TTT accuracy for short). They all refer to the same concept.

    • Shouldn’t we recreate the stream object elec_stream? No, high-level evaluators automatically restart() streams when they are reused.
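The distinction between cumulative and windowed metrics can be sketched in plain Python. Below, a stream of prediction outcomes (1 = correct, 0 = wrong; real evaluators track this internally) is summarized both ways; note how only the windowed view exposes the weak period in the middle:

```python
# Cumulative vs. windowed accuracy over a stream of outcomes.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]  # model struggles mid-stream

window_size = 4
cumulative = []
windowed = []
correct = 0
for i, ok in enumerate(outcomes, start=1):
    correct += ok
    cumulative.append(100 * correct / i)      # accuracy over everything seen so far
    if i % window_size == 0:                  # accuracy over the last window only
        window = outcomes[i - window_size:i]
        windowed.append(100 * sum(window) / window_size)

print("cumulative (final):", round(cumulative[-1], 1))  # smooth overall view
print("windowed:", windowed)                            # reveals the weak period
```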

In the example below, prequential_evaluation is used with a HoeffdingTree classifier on the elec_stream data stream.

[2]:
from capymoa.evaluation import prequential_evaluation
from capymoa.classifier import HoeffdingTree

ht = HoeffdingTree(schema=elec_stream.get_schema(), grace_period=50)

# Obtain the results from the high-level function.
# Note that we need to specify a window_size as we obtain both windowed and cumulative results.
# The results from a high-level evaluation function are represented as a dictionary.
results_ht = prequential_evaluation(stream=elec_stream, learner=ht, window_size=4500)

print(f"Cumulative accuracy = {results_ht['cumulative'].accuracy()}, wall-clock time: {results_ht['wallclock']}")

# The windowed results are conveniently stored in a pandas DataFrame.
display(results_ht['windowed'].metrics_per_window())
Cumulative accuracy = 78.85549081920904, wall-clock time: 1.498734951019287
| | classified instances | classifications correct (percent) | Kappa Statistic (percent) | Kappa Temporal Statistic (percent) | Kappa M Statistic (percent) | F1 Score (percent) | F1 Score for class 0 (percent) | F1 Score for class 1 (percent) | Precision (percent) | Precision for class 0 (percent) | Precision for class 1 (percent) | Recall (percent) | Recall for class 0 (percent) | Recall for class 1 (percent) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4500.0 | 83.377778 | 64.450021 | -2.888583 | 57.644394 | 82.379745 | 86.793785 | 77.577938 | 83.203661 | 83.776414 | 82.630907 | 81.571987 | 90.036630 | 73.107345 |
| 1 | 9000.0 | 84.044444 | 67.822827 | 4.900662 | 65.280464 | 83.919275 | 85.382736 | 82.436399 | 83.986147 | 84.556452 | 83.415842 | 83.852510 | 86.225329 | 81.479691 |
| 2 | 13500.0 | 83.977778 | 68.053669 | -9.077156 | 66.465116 | 84.256688 | 83.698847 | 84.247324 | 84.324989 | 88.947621 | 79.702356 | 84.188498 | 79.035013 | 89.341983 |
| 3 | 18000.0 | 80.511111 | 58.612501 | -35.339506 | 53.524112 | 79.999177 | 84.546256 | 73.624060 | 81.732853 | 78.347485 | 85.118220 | 78.337522 | 91.810180 | 64.864865 |
| 4 | 22500.0 | 82.200000 | 61.229846 | -11.559889 | 54.566081 | 81.108879 | 86.277197 | 74.675941 | 82.791475 | 81.225806 | 84.357143 | 79.493314 | 91.998539 | 66.988088 |
| 5 | 27000.0 | 71.022222 | 36.254807 | -127.972028 | 30.267380 | 69.937686 | 78.587849 | 55.189003 | 73.186694 | 69.161850 | 77.211538 | 66.964885 | 90.988593 | 42.941176 |
| 6 | 31500.0 | 76.511111 | 48.257452 | -92.883212 | 39.703366 | 74.732167 | 82.190396 | 65.513866 | 76.515018 | 76.505646 | 76.524390 | 73.030507 | 88.787768 | 57.273246 |
| 7 | 36000.0 | 73.666667 | 45.062428 | -99.159664 | 34.130072 | 72.531630 | 78.116343 | 66.945607 | 72.559253 | 77.929256 | 67.189250 | 72.504028 | 78.304332 | 66.703724 |
| 8 | 40500.0 | 72.311111 | 46.648342 | -86.806597 | 36.135315 | 75.494304 | 70.019249 | 74.277457 | 76.362982 | 90.541381 | 62.184583 | 74.645166 | 57.081208 | 92.209124 |
| 9 | 45000.0 | 81.244444 | 62.378190 | -20.571429 | 62.050360 | 81.233849 | 82.313495 | 80.037843 | 81.132844 | 84.436801 | 77.828887 | 81.335105 | 80.294358 | 82.375852 |
| 10 | 45312.0 | 80.244444 | 60.314484 | -29.027576 | 59.664247 | 80.195732 | 81.552189 | 78.737144 | 80.087763 | 83.510412 | 76.665114 | 80.303993 | 79.683698 | 80.924287 |
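Since the windowed results are a pandas DataFrame, they can be sliced like any other DataFrame, e.g. to locate the weakest window. The snippet below uses a small stand-in DataFrame with the same column names as the output above (the data here is illustrative, copied from the first three windows):

```python
import pandas as pd

# Stand-in for results_ht['windowed'].metrics_per_window(): same column
# names as above, but only two columns and three rows for illustration.
df = pd.DataFrame({
    "classified instances": [4500.0, 9000.0, 13500.0],
    "classifications correct (percent)": [83.377778, 84.044444, 83.977778],
})

# Pick out the windowed accuracy column and find the weakest window.
acc = df["classifications correct (percent)"]
worst = df.loc[acc.idxmin(), "classified instances"]
print(f"lowest windowed accuracy = {acc.min():.2f}% at instance {worst:.0f}")
```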

1.2 Comparing results among classifiers#

  • CapyMOA provides plot_windowed_results as an easy visualization function for quickly comparing windowed metrics

  • In the example below, we create three classifiers: HoeffdingAdaptiveTree, HoeffdingTree and AdaptiveRandomForest, and plot the results using plot_windowed_results

  • More details about plot_windowed_results options are described in the documentation: http://www.capymoa.org

[3]:
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.base import MOAClassifier
from moa.classifiers.trees import HoeffdingAdaptiveTree
from capymoa.classifier import HoeffdingTree
from capymoa.classifier import AdaptiveRandomForestClassifier

# Create the wrapper for HoeffdingAdaptiveTree (from MOA)
HAT = MOAClassifier(schema=elec_stream.get_schema(), moa_learner=HoeffdingAdaptiveTree, CLI="-g 50")
HT = HoeffdingTree(schema=elec_stream.get_schema(), grace_period=50)
ARF = AdaptiveRandomForestClassifier(schema=elec_stream.get_schema(), ensemble_size=10, number_of_jobs=4)

results_HAT = prequential_evaluation(stream=elec_stream, learner=HAT, window_size=4500)
results_HT = prequential_evaluation(stream=elec_stream, learner=HT, window_size=4500)
results_ARF = prequential_evaluation(stream=elec_stream, learner=ARF, window_size=4500)

# Comparing models based on their cumulative accuracy
print(f"HAT accuracy = {results_HAT['cumulative'].accuracy()}")
print(f"HT accuracy = {results_HT['cumulative'].accuracy()}")
print(f"ARF accuracy = {results_ARF['cumulative'].accuracy()}")

# Plotting the results. Note that we overrode the ylabel, but that doesn't change the metric.
plot_windowed_results(results_HAT, results_HT, results_ARF, ylabel='Accuracy', xlabel="# Instances (window)")
HAT accuracy = 82.36228813559322
HT accuracy = 78.85549081920904
ARF accuracy = 87.5684145480226
../_images/notebooks_00_getting_started_6_1.png

2. Regression#

  • The API for regression algorithms is very similar to that of classification algorithms. We can use the same high-level evaluation and visualization functions for both regression and classification.

  • As with classification, we can also use MOA objects through a generic API.

[4]:
from capymoa.datasets import Fried
from moa.classifiers.trees import FIMTDD
from capymoa.base import MOARegressor
from capymoa.regressor import KNNRegressor

fried_stream = Fried()  # Downloads the Fried dataset into the data dir in case it is not there yet.
fimtdd = MOARegressor(schema=fried_stream.get_schema(), moa_learner=FIMTDD())
knnreg = KNNRegressor(schema=fried_stream.get_schema(), k=3, window_size=1000)

results_fimtdd = prequential_evaluation(stream=fried_stream, learner=fimtdd, window_size=5000)
results_knnreg = prequential_evaluation(stream=fried_stream, learner=knnreg, window_size=5000)

results_fimtdd['windowed'].metrics_per_window()
# Selecting the metric so that we don't use the default one.
# Note that the metric is different from the ylabel parameter, which just overrides the y-axis label.
plot_windowed_results(results_fimtdd, results_knnreg, metric="coefficient of determination")
Downloading fried.arff
100% [............................................................................] 922613 / 922613
../_images/notebooks_00_getting_started_8_1.png
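The "coefficient of determination" (R²) selected as the plot metric above measures how much better the model is than always predicting the mean of the targets. A minimal from-scratch sketch, with made-up target and prediction values:

```python
# Coefficient of determination (R^2) computed from scratch:
# R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals
# and SS_tot is the total sum of squares around the mean target.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.3, 2.9, 6.8]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.4f}")  # 1.0 is a perfect fit; 0 means no better than the mean
```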

3. Concept Drift#

  • One of the most challenging and defining aspects of data streams is the phenomenon known as concept drift.

  • CapyMOA provides a simple yet comprehensive API for simulating, visualizing, and assessing concept drift.

  • In the example below, we focus on a simple way of simulating and visualizing a drifting stream. A separate notebook focuses entirely on how concept drift can be simulated, detected, and assessed (see Tutorial 4: Simulating Concept Drifts with the DriftStream API)

3.1 Plotting Drift Detection results#

  • This example uses the DriftStream building API, specifically the positional version, where drifts are specified by their exact location in the stream.

  • The DriftStream object integrates with the visualization function: it carries meta-information about the drifts, which is passed along with the stream and thus becomes available to plot_windowed_results

  • The following plot contains two drifts, one abrupt and one gradual: the abrupt drift is located at instance 5000, and the gradual drift starts at instance 9000 and ends at instance 12000. This information is provided to the stream via GradualDrift(start=9000, end=12000)

  • More details concerning Concept Drift in CapyMOA can be found in the documentation: http://www.capymoa.org

[5]:
from capymoa.classifier import OnlineBagging
from capymoa.stream.generator import SEA
from capymoa.stream.drift import Drift, AbruptDrift, GradualDrift, DriftStream

# Generating a synthetic stream with 1 abrupt drift and 1 gradual drift.
stream_sea2drift = DriftStream(stream=[SEA(function=1),
                                AbruptDrift(position=5000),
                                SEA(function=3),
                                GradualDrift(start=9000, end=12000),
                                SEA(function=1)])

OB = OnlineBagging(schema=stream_sea2drift.get_schema(), ensemble_size=10)

# Since this is a synthetic stream, max_instances is needed to determine the number of instances to generate.
results_sea2drift_OB = prequential_evaluation(stream=stream_sea2drift, learner=OB, window_size=100, max_instances=15000)

# print(stream_sea2drift.drifts)
plot_windowed_results(results_sea2drift_OB, ylabel='Accuracy')
None
../_images/notebooks_00_getting_started_11_1.png