2. Using sklearn with CapyMOA

In this tutorial we demonstrate how to use scikit-learn learners directly in CapyMOA.

  • The primary requirement is that the scikit-learn learner implements partial_fit()
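
As a rough illustration of the partial_fit() requirement, the sketch below checks a few scikit-learn estimators for that method. The helper name supports_partial_fit is our own choice for this example; it is not part of CapyMOA or scikit-learn.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB


def supports_partial_fit(estimator) -> bool:
    """Return True if the estimator exposes a callable partial_fit()."""
    return callable(getattr(estimator, "partial_fit", None))


# Hypothetical helper applied to three well-known estimators.
support = {
    type(est).__name__: supports_partial_fit(est)
    for est in (SGDClassifier(), GaussianNB(), LogisticRegression())
}
print(support)
```

Estimators that only offer fit(), such as LogisticRegression, cannot be updated one instance at a time and are therefore not suitable for this kind of streaming use.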


More information about CapyMOA can be found at https://www.capymoa.org

Last updated on 03/05/2024

1. Using raw sklearn objects

  • This example shows how a scikit-learn model can be used with our Instance representation in a simple test-then-train loop

  • In this case, we need to adapt the data to the format that sklearn expects
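
To make the data adaptation concrete, here is a minimal sketch that uses only NumPy and scikit-learn. The arrays x and y_index are made-up stand-ins for instance.x and instance.y_index. scikit-learn estimators expect a 2D feature array of shape (n_samples, n_features), so a single instance has to be presented as a batch of one.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Made-up stand-ins for instance.x and instance.y_index.
x = np.array([0.1, 0.5, 0.3])
y_index = 1

# A single instance becomes a batch with one row.
X_batch = x.reshape(1, -1)     # shape (1, n_features)
y_batch = np.array([y_index])  # shape (1,)

model = SGDClassifier(random_state=0)
model.partial_fit(X_batch, y_batch, classes=[0, 1])

# predict() returns an array with one entry per row, so take element 0.
prediction = model.predict(X_batch)[0]
```

Wrapping the feature vector in a list, as in [instance.x], has the same effect as the reshape used here.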

[1]:
from capymoa.evaluation import ClassificationEvaluator
from capymoa.datasets import ElectricityTiny

from sklearn import linear_model

# Toy dataset with only 1000 instances
elec_stream = ElectricityTiny()

# Creates a sklearn classifier
sklearn_SGD = linear_model.SGDClassifier()

ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

# Counter for partial fits
partial_fit_count = 0
while elec_stream.has_more_instances():
    instance = elec_stream.next_instance()

    prediction = -1
    if partial_fit_count > 0:  # scikit-learn does not allow calling predict() on a model that has not been fit
        prediction = sklearn_SGD.predict([instance.x])[0]
    ob_evaluator.update(instance.y_index, prediction)
    sklearn_SGD.partial_fit([instance.x], [instance.y_index], classes=elec_stream.schema.get_label_indexes())
    partial_fit_count += 1

ob_evaluator.accuracy()
[1]:
84.7
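
The two details in the loop above, the guard before predict() and the classes argument, can be seen in isolation in the following sketch. It uses a made-up two-feature input and assumes a binary problem with labels 0 and 1.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
X = np.array([[0.0, 1.0]])  # made-up single instance

# Calling predict() before any fit raises NotFittedError.
try:
    model.predict(X)
    raised = False
except NotFittedError:
    raised = True

# The first partial_fit() call must be told every possible class label,
# because a single instance cannot reveal all of them.
model.partial_fit(X, [0], classes=[0, 1])
classes_after_fit = model.classes_.tolist()
```

After that first call the model knows both labels even though it has only seen an instance of class 0.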

2. Using a generic SKClassifier wrapper

  • Instead of using sklearn's SGDClassifier directly, here we wrap it in CapyMOA's SKClassifier and run the same test-then-train loop

  • There is also an SKRegressor available in CapyMOA

[2]:
from sklearn import linear_model
from capymoa.base import SKClassifier
from capymoa.evaluation import ClassificationEvaluator

# Opening a file as a stream
elec_stream = ElectricityTiny()

# Creating a learner
sklearn_SGD = SKClassifier(schema=elec_stream.get_schema(), sklearner=linear_model.SGDClassifier())

# Creating the evaluator
sklearn_SGD_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())

while elec_stream.has_more_instances():
    instance = elec_stream.next_instance()

    prediction = sklearn_SGD.predict(instance)
    sklearn_SGD_evaluator.update(instance.y_index, prediction)
    sklearn_SGD.train(instance)

sklearn_SGD_evaluator.accuracy()
[2]:
84.7
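
To give an idea of what a wrapper of this kind has to take care of, here is a hypothetical minimal adapter. It is not CapyMOA's SKClassifier implementation, and the class name PartialFitAdapter and its method signatures are assumptions made for illustration only. It hides the reshaping, the classes argument, and the check for an unfitted model.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


class PartialFitAdapter:
    """Hypothetical adapter; NOT CapyMOA's SKClassifier implementation."""

    def __init__(self, sklearner, classes):
        self.sklearner = sklearner
        self.classes = list(classes)
        self._fitted = False

    def train(self, x, y_index):
        # Present the single instance as a batch of one row.
        row = np.asarray(x, dtype=float).reshape(1, -1)
        self.sklearner.partial_fit(row, [y_index], classes=self.classes)
        self._fitted = True

    def predict(self, x):
        # No prediction is possible before the first training step.
        if not self._fitted:
            return None
        row = np.asarray(x, dtype=float).reshape(1, -1)
        return int(self.sklearner.predict(row)[0])


adapter = PartialFitAdapter(SGDClassifier(random_state=0), classes=[0, 1])
first_prediction = adapter.predict([0.2, 0.8])  # model not trained yet
adapter.train([0.2, 0.8], 1)
later_prediction = adapter.predict([0.2, 0.8])
```

With the bookkeeping moved into the adapter, the calling loop no longer needs a partial-fit counter.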

3. Using prequential evaluation and SKClassifier

  • Instead of writing the instance loop by hand, we can use the prequential_evaluation() function

[3]:
from capymoa.evaluation import prequential_evaluation

elec_stream = ElectricityTiny()

sklearn_SGD = SKClassifier(schema=elec_stream.get_schema(), sklearner=linear_model.SGDClassifier())

results_sklearn_SGD = prequential_evaluation(stream=elec_stream, learner=sklearn_SGD, window_size=4500)

results_sklearn_SGD.cumulative.accuracy()
[3]:
84.7
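
For readers who want to see the idea behind prequential (test-then-train) evaluation spelled out, the sketch below implements it on synthetic data with plain scikit-learn. The function prequential_accuracy is our own illustration and makes no claim about how CapyMOA implements prequential_evaluation(); in particular, counting the first, unpredictable instance as an error is an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def prequential_accuracy(X, y, model, classes, window_size):
    """Test-then-train over (X, y).

    Returns the cumulative accuracy and a list of per-window accuracies,
    both in percent.
    """
    correct_flags = []
    fitted = False
    for x_i, y_i in zip(X, y):
        row = x_i.reshape(1, -1)
        # Test first: an unfitted model counts as a wrong prediction here.
        correct_flags.append(bool(fitted and model.predict(row)[0] == y_i))
        # Then train on the same instance.
        model.partial_fit(row, [y_i], classes=classes)
        fitted = True
    flags = np.array(correct_flags, dtype=float)
    windowed = [
        100.0 * flags[start:start + window_size].mean()
        for start in range(0, len(flags), window_size)
    ]
    return 100.0 * flags.mean(), windowed


# Synthetic, linearly separable data (not one of the CapyMOA datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

cumulative, windowed = prequential_accuracy(
    X, y, SGDClassifier(random_state=0), classes=[0, 1], window_size=50
)
```

The cumulative value summarises the whole stream, while the windowed values show how accuracy evolves over time.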

4. Further abstractions

  • We can wrap popular algorithms to make them even easier to use

  • So far, one can use the following wrappers:

    • PassiveAggressiveClassifier

    • SGDClassifier

    • PassiveAggressiveRegressor

    • SGDRegressor

  • In the following example we show how one can use SGDClassifier and PassiveAggressiveClassifier

Observation: this code can take up to 3 minutes if ``max_instances`` is not set, because it will then process all 100k instances from ``RTG_2abrupt`` with both SGD and PA. We set the ``max_instances`` parameter so that fewer instances are processed and the example runs faster.

[4]:
%%time
from capymoa.classifier import SGDClassifier, PassiveAggressiveClassifier
from capymoa.evaluation import prequential_evaluation_multiple_learners
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.datasets import RTG_2abrupt

RTG_2abrupt_stream = RTG_2abrupt()

sklearn_SGD = SGDClassifier(schema=RTG_2abrupt_stream.get_schema())
sklearn_PA = PassiveAggressiveClassifier(schema=RTG_2abrupt_stream.get_schema())

results = prequential_evaluation_multiple_learners(stream=RTG_2abrupt_stream,
                                                   learners={'SGD': sklearn_SGD, 'PA': sklearn_PA},
                                                   max_instances=10000,
                                                   window_size=1000)

plot_windowed_results(results['SGD'], results['PA'], metric='accuracy')
CPU times: user 16.8 s, sys: 166 ms, total: 17 s
Wall time: 16.8 s
[Figure: windowed accuracy of the SGD and PA learners]