0. Getting started with CapyMOA#

This notebook shows some basic usage of CapyMOA for supervised learning (classification and regression).

  • There are more detailed notebooks and documentation available; our goal here is just to present some high-level functions and demonstrate a subset of CapyMOA’s functionalities.

  • For simplicity, we simulate data streams in the following examples by reading data from files and employing synthetic generators.


More information about CapyMOA can be found at https://www.capymoa.org

Last updated on 03/05/2024

1. Classification#

  • Classification for data streams traditionally assumes that instances arrive at the classifier incrementally and that an instance’s label becomes available before the next instance arrives

  • It is common to simulate this behavior using a while loop, often referred to as the test-then-train loop, which contains four distinct steps:

    1. Fetch the next instance from the stream

    2. Make a prediction

    3. Train the model with the instance

    4. Update the mechanism that keeps track of evaluation metrics

Some remarks about the test-then-train loop:

  • We must not train before testing, i.e. steps 2 and 3 should not be interchanged, as this would invalidate our interpretation of how the model performs on unseen data, leading to unreliable evaluations of its efficacy.

  • Steps 3 and 4 can be completed in any order without altering the result.

  • What if labels are not immediately available? Then you might want to read about delayed labeling and partially labeled data; see A Survey on Semi-supervised Learning for Delayed Partially Labelled Data Streams

  • More information on classification for data streams is available in section 2.2 Classification of the Machine Learning for Data Streams book
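The loop structure above can be sketched without any streaming library at all. The snippet below is a toy, library-agnostic illustration (`MajorityClassLearner` is a hypothetical stand-in, not a CapyMOA class); it shows why prediction must happen before training:

```python
# Library-agnostic sketch of the test-then-train loop.
# MajorityClassLearner is a toy stand-in for a real stream classifier.
class MajorityClassLearner:
    def __init__(self):
        self.counts = {}

    def predict(self, x):
        # Predict the most frequent label seen so far (None before any training).
        return max(self.counts, key=self.counts.get) if self.counts else None

    def train(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([0.1], 1), ([0.4], 0), ([0.2], 1), ([0.9], 1)]  # (features, label) pairs
learner = MajorityClassLearner()
correct = total = 0
for x, y in stream:            # 1. fetch the next instance
    pred = learner.predict(x)  # 2. predict BEFORE training
    learner.train(x, y)        # 3. train on the now-labelled instance
    total += 1                 # 4. update the evaluation metric
    correct += (pred == y)

print(f"test-then-train accuracy = {100 * correct / total:.1f}%")
```

If steps 2 and 3 were swapped, the model would be evaluated on instances it had already seen, inflating the reported accuracy.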

[1]:
from capymoa.stream import stream_from_file
from capymoa.evaluation import ClassificationEvaluator
from capymoa.classifier import OnlineBagging

elec_stream = stream_from_file(path_to_csv_or_arff="../data/electricity.csv")
ob_learner = OnlineBagging(schema=elec_stream.get_schema(), ensemble_size=5)
ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

while elec_stream.has_more_instances():
    instance = elec_stream.next_instance()
    prediction = ob_learner.predict(instance)
    ob_learner.train(instance)
    ob_evaluator.update(instance.y_index, prediction)

print(ob_evaluator.accuracy())
79.05190677966102

1.1 High-level evaluation functions#

  • If our goal is just to evaluate learners, it would be tedious to keep writing the test-then-train loop. Thus, it makes sense to encapsulate that loop inside high-level evaluation functions.

  • Furthermore, sometimes we are interested in cumulative metrics and sometimes in windowed metrics. For example, if we want to know how accurate our model has been so far, considering all the instances it has seen, we would look at its cumulative metrics. However, we might also be interested in how well the model performs over every window of n instances, so that we can, for example, identify periods in which it was really struggling to produce correct predictions.

  • In this example, we use the prequential_evaluation function, which provides us with both the cumulative and the windowed metrics! If you are only interested in the test-then-train evaluation or the windowed evaluation, there are high-level functions for those as well (see the remarks below).

  • Some remarks:

    • If you want to know more about other high-level evaluation functions, evaluators, or which metrics are available, check the complete Evaluation documentation at http://www.capymoa.org

    • The results from evaluation functions such as prequential_evaluation follow a standard format, also discussed thoroughly in the Evaluation documentation at http://www.capymoa.org

    • Sometimes authors refer to the cumulative metrics as test-then-train metrics, such as test-then-train accuracy (or TTT accuracy for short). They all refer to the same concept.

    • Shouldn’t we recreate the stream object elec_stream? No, high-level evaluators automatically restart() streams when they are reused.
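The distinction between cumulative and windowed metrics can be sketched in plain Python. Below, a stream of prediction outcomes (1 = correct, 0 = wrong; real evaluators track this internally) is summarized both ways; note how only the windowed view exposes the weak period in the middle:

```python
# Cumulative vs. windowed accuracy over a stream of outcomes.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]  # model struggles mid-stream

window_size = 4
cumulative = []
windowed = []
correct = 0
for i, ok in enumerate(outcomes, start=1):
    correct += ok
    cumulative.append(100 * correct / i)      # accuracy over everything seen so far
    if i % window_size == 0:                  # accuracy over the last window only
        window = outcomes[i - window_size:i]
        windowed.append(100 * sum(window) / window_size)

print("cumulative (final):", round(cumulative[-1], 1))  # smooth overall view
print("windowed:", windowed)                            # reveals the weak period
```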

In the example below, prequential_evaluation is used with a HoeffdingTree classifier on the elec_stream data stream.

[2]:
from capymoa.evaluation import prequential_evaluation
from capymoa.classifier import HoeffdingTree

ht = HoeffdingTree(schema=elec_stream.get_schema(), grace_period=50)

# Obtain the results from the high-level function.
# Note that we need to specify a window_size as we obtain both windowed and cumulative results.
# The results from a high-level evaluation function are represented as a dictionary.
results_ht = prequential_evaluation(stream=elec_stream, learner=ht, window_size=4500)

print(f"Cumulative accuracy = {results_ht['cumulative'].accuracy()}, wall-clock time: {results_ht['wallclock']}")

# The windowed results are conveniently stored in a pandas DataFrame.
display(results_ht['windowed'].metrics_per_window())
Cumulative accuracy = 78.85549081920904, wall-clock time: 1.498734951019287
| | classified instances | classifications correct (percent) | Kappa Statistic (percent) | Kappa Temporal Statistic (percent) | Kappa M Statistic (percent) | F1 Score (percent) | F1 Score for class 0 (percent) | F1 Score for class 1 (percent) | Precision (percent) | Precision for class 0 (percent) | Precision for class 1 (percent) | Recall (percent) | Recall for class 0 (percent) | Recall for class 1 (percent) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4500.0 | 83.377778 | 64.450021 | -2.888583 | 57.644394 | 82.379745 | 86.793785 | 77.577938 | 83.203661 | 83.776414 | 82.630907 | 81.571987 | 90.036630 | 73.107345 |
| 1 | 9000.0 | 84.044444 | 67.822827 | 4.900662 | 65.280464 | 83.919275 | 85.382736 | 82.436399 | 83.986147 | 84.556452 | 83.415842 | 83.852510 | 86.225329 | 81.479691 |
| 2 | 13500.0 | 83.977778 | 68.053669 | -9.077156 | 66.465116 | 84.256688 | 83.698847 | 84.247324 | 84.324989 | 88.947621 | 79.702356 | 84.188498 | 79.035013 | 89.341983 |
| 3 | 18000.0 | 80.511111 | 58.612501 | -35.339506 | 53.524112 | 79.999177 | 84.546256 | 73.624060 | 81.732853 | 78.347485 | 85.118220 | 78.337522 | 91.810180 | 64.864865 |
| 4 | 22500.0 | 82.200000 | 61.229846 | -11.559889 | 54.566081 | 81.108879 | 86.277197 | 74.675941 | 82.791475 | 81.225806 | 84.357143 | 79.493314 | 91.998539 | 66.988088 |
| 5 | 27000.0 | 71.022222 | 36.254807 | -127.972028 | 30.267380 | 69.937686 | 78.587849 | 55.189003 | 73.186694 | 69.161850 | 77.211538 | 66.964885 | 90.988593 | 42.941176 |
| 6 | 31500.0 | 76.511111 | 48.257452 | -92.883212 | 39.703366 | 74.732167 | 82.190396 | 65.513866 | 76.515018 | 76.505646 | 76.524390 | 73.030507 | 88.787768 | 57.273246 |
| 7 | 36000.0 | 73.666667 | 45.062428 | -99.159664 | 34.130072 | 72.531630 | 78.116343 | 66.945607 | 72.559253 | 77.929256 | 67.189250 | 72.504028 | 78.304332 | 66.703724 |
| 8 | 40500.0 | 72.311111 | 46.648342 | -86.806597 | 36.135315 | 75.494304 | 70.019249 | 74.277457 | 76.362982 | 90.541381 | 62.184583 | 74.645166 | 57.081208 | 92.209124 |
| 9 | 45000.0 | 81.244444 | 62.378190 | -20.571429 | 62.050360 | 81.233849 | 82.313495 | 80.037843 | 81.132844 | 84.436801 | 77.828887 | 81.335105 | 80.294358 | 82.375852 |
| 10 | 45312.0 | 80.244444 | 60.314484 | -29.027576 | 59.664247 | 80.195732 | 81.552189 | 78.737144 | 80.087763 | 83.510412 | 76.665114 | 80.303993 | 79.683698 | 80.924287 |
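Since the windowed results are a pandas DataFrame, they can be sliced like any other DataFrame, e.g. to locate the weakest window. The snippet below uses a small stand-in DataFrame with the same column names as the output above (the data here is illustrative, copied from the first three windows):

```python
import pandas as pd

# Stand-in for results_ht['windowed'].metrics_per_window(): same column
# names as above, but only two columns and three rows for illustration.
df = pd.DataFrame({
    "classified instances": [4500.0, 9000.0, 13500.0],
    "classifications correct (percent)": [83.377778, 84.044444, 83.977778],
})

# Pick out the windowed accuracy column and find the weakest window.
acc = df["classifications correct (percent)"]
worst = df.loc[acc.idxmin(), "classified instances"]
print(f"lowest windowed accuracy = {acc.min():.2f}% at instance {worst:.0f}")
```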

1.2 Comparing results among classifiers#

  • CapyMOA provides plot_windowed_results as an easy visualization function for quickly comparing windowed metrics

  • In the example below, we create three classifiers: HoeffdingAdaptiveTree, HoeffdingTree and AdaptiveRandomForest, and plot the results using plot_windowed_results

  • More details about plot_windowed_results options are described in the documentation: http://www.capymoa.org

[3]:
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.base import MOAClassifier
from moa.classifiers.trees import HoeffdingAdaptiveTree
from capymoa.classifier import HoeffdingTree
from capymoa.classifier import AdaptiveRandomForestClassifier

# Create the wrapper for HoeffdingAdaptiveTree (from MOA)
HAT = MOAClassifier(schema=elec_stream.get_schema(), moa_learner=HoeffdingAdaptiveTree, CLI="-g 50")
HT = HoeffdingTree(schema=elec_stream.get_schema(), grace_period=50)
ARF = AdaptiveRandomForestClassifier(schema=elec_stream.get_schema(), ensemble_size=10, number_of_jobs=4)

results_HAT = prequential_evaluation(stream=elec_stream, learner=HAT, window_size=4500)
results_HT = prequential_evaluation(stream=elec_stream, learner=HT, window_size=4500)
results_ARF = prequential_evaluation(stream=elec_stream, learner=ARF, window_size=4500)

# Comparing models based on their cumulative accuracy
print(f"HAT accuracy = {results_HAT['cumulative'].accuracy()}")
print(f"HT accuracy = {results_HT['cumulative'].accuracy()}")
print(f"ARF accuracy = {results_ARF['cumulative'].accuracy()}")

# Plotting the results. Note that we overrode the ylabel, but that doesn't change the metric.
plot_windowed_results(results_HAT, results_HT, results_ARF, ylabel='Accuracy', xlabel="# Instances (window)")
HAT accuracy = 82.36228813559322
HT accuracy = 78.85549081920904
ARF accuracy = 87.5684145480226
../_images/notebooks_00_getting_started_6_1.png

2. Regression#

  • The API for regression algorithms is very similar to that of classification algorithms. We can use the same high-level evaluation and visualization functions for both regression and classification.

  • As with classification, we can also use MOA objects through a generic API.

[4]:
from capymoa.datasets import Fried
from moa.classifiers.trees import FIMTDD
from capymoa.base import MOARegressor
from capymoa.regressor import KNNRegressor

fried_stream = Fried()  # Downloads the Fried dataset into the data dir in case it is not there yet.
fimtdd = MOARegressor(schema=fried_stream.get_schema(), moa_learner=FIMTDD())
knnreg = KNNRegressor(schema=fried_stream.get_schema(), k=3, window_size=1000)

results_fimtdd = prequential_evaluation(stream=fried_stream, learner=fimtdd, window_size=5000)
results_knnreg = prequential_evaluation(stream=fried_stream, learner=knnreg, window_size=5000)

results_fimtdd['windowed'].metrics_per_window()
# Selecting the metric so that we don't use the default one.
# Note that the metric is different from the ylabel parameter, which just overrides the y-axis label.
plot_windowed_results(results_fimtdd, results_knnreg, metric="coefficient of determination")
Downloading fried.arff
100% [............................................................................] 922613 / 922613
../_images/notebooks_00_getting_started_8_1.png
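The "coefficient of determination" (R²) selected as the plot metric above measures how much better the model is than always predicting the mean of the targets. A minimal from-scratch sketch, with made-up target and prediction values:

```python
# Coefficient of determination (R^2) computed from scratch:
# R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals
# and SS_tot is the total sum of squares around the mean target.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.3, 2.9, 6.8]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.4f}")  # 1.0 is a perfect fit; 0 means no better than the mean
```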

3. Concept Drift#

  • One of the most challenging and defining aspects of data streams is the phenomenon known as concept drift.

  • CapyMOA provides a simple yet comprehensive API for simulating, visualizing, and assessing concept drift.

  • In the example below, we focus on a simple way of simulating and visualizing a drifting stream. A separate notebook focuses entirely on how concept drift can be simulated, detected, and assessed (see Tutorial 4: Simulating Concept Drifts with the DriftStream API)

3.1 Plotting Drift Detection results#

  • This example uses the DriftStream building API, specifically the positional version, where drifts are specified by their exact location in the stream.

  • The DriftStream object integrates with the visualization function: it carries meta-information about the drifts, which is passed along with the stream and thus becomes available to plot_windowed_results

  • The following plot contains two drifts, one abrupt and one gradual: the abrupt drift is located at instance 5000, and the gradual drift starts at instance 9000 and ends at instance 12000. This information is provided to the stream via GradualDrift(start=9000, end=12000)

  • More details concerning Concept Drift in CapyMOA can be found in the documentation: http://www.capymoa.org

[5]:
from capymoa.classifier import OnlineBagging
from capymoa.stream.generator import SEA
from capymoa.stream.drift import Drift, AbruptDrift, GradualDrift, DriftStream

# Generating a synthetic stream with 1 abrupt drift and 1 gradual drift.
stream_sea2drift = DriftStream(stream=[SEA(function=1),
                                AbruptDrift(position=5000),
                                SEA(function=3),
                                GradualDrift(start=9000, end=12000),
                                SEA(function=1)])

OB = OnlineBagging(schema=stream_sea2drift.get_schema(), ensemble_size=10)

# Since this is a synthetic stream, max_instances is needed to determine the number of instances to generate.
results_sea2drift_OB = prequential_evaluation(stream=stream_sea2drift, learner=OB, window_size=100, max_instances=15000)

# print(stream_sea2drift.drifts)
plot_windowed_results(results_sea2drift_OB, ylabel='Accuracy')
None
../_images/notebooks_00_getting_started_11_1.png