1. Evaluating supervised learners in CapyMOA#
This notebook further explores high-level evaluation functions, Data Abstraction, and Classifiers.

High-level evaluation functions

- We demonstrate how to use prequential_evaluation() and how to further encapsulate prequential evaluation using prequential_evaluation_multiple_learners().
- We also discuss how these evaluation functions relate to how research has developed in the field, and how evaluation is commonly performed and presented.

Supervised Learning

- We clarify important information concerning the usage of Classifiers and their predictions.
- We add some examples using Regressors, which highlight the fact that their evaluation is identical to Classifiers (i.e. same high-level evaluation functions).

More information about CapyMOA can be found at https://www.capymoa.org

Last update: 25/07/2024
1. The difference between Evaluators#
The following example implements a while loop that updates a ClassificationWindowedEvaluator and a ClassificationEvaluator for the same learner.

- The ClassificationWindowedEvaluator updates the metrics according to tumbling windows, which 'forget' old correct and incorrect predictions. This allows us to observe how well the learner performs on shorter windows.
- The ClassificationEvaluator updates the metrics taking into account all the correct and incorrect predictions made. It is useful for observing the overall performance after processing hundreds of thousands of instances.

Two important points:

- Regarding window_size in ClassificationEvaluator: a ClassificationEvaluator also allows us to specify a window size, but there it only controls the frequency at which the cumulative metrics are calculated.
- If we access metrics directly (not through metrics_per_window()) in a ClassificationWindowedEvaluator, we will be looking at the metrics corresponding to the last window.
For further insight into the specifics of the Evaluators, please refer to the documentation: https://www.capymoa.org
[1]:
from capymoa.datasets import Electricity
from capymoa.evaluation import ClassificationWindowedEvaluator, ClassificationEvaluator
from capymoa.classifier import AdaptiveRandomForestClassifier

stream = Electricity()
ARF = AdaptiveRandomForestClassifier(schema=stream.get_schema(), ensemble_size=10)

# The window_size in ClassificationWindowedEvaluator specifies the number of instances used per evaluation
windowedEvaluatorARF = ClassificationWindowedEvaluator(schema=stream.get_schema(), window_size=4500)

# The window_size in ClassificationEvaluator just specifies the frequency at which the cumulative metrics are stored
classificationEvaluatorARF = ClassificationEvaluator(schema=stream.get_schema(), window_size=4500)

while stream.has_more_instances():
    instance = stream.next_instance()
    prediction = ARF.predict(instance)
    windowedEvaluatorARF.update(instance.y_index, prediction)
    classificationEvaluatorARF.update(instance.y_index, prediction)
    ARF.train(instance)

# Showing only the 'classifications correct (percent)' (i.e. accuracy)
print('[ClassificationWindowedEvaluator] Windowed accuracy reported for every window_size windows')
print(windowedEvaluatorARF.accuracy())
print(f'[ClassificationEvaluator] Cumulative accuracy: {classificationEvaluatorARF.accuracy()}')

# We could report the cumulative accuracy every window_size instances with the following code,
# but that is normally not very insightful.
# display(classificationEvaluatorARF.metrics_per_window())
[ClassificationWindowedEvaluator] Windowed accuracy reported for every window_size windows
[89.57777777777778, 89.46666666666667, 90.2, 89.71111111111111, 88.68888888888888, 88.48888888888888, 87.6888888888889, 88.88888888888889, 89.28888888888889, 91.06666666666666]
[ClassificationEvaluator] Cumulative accuracy: 89.32953742937853
2. High-level evaluation functions#
In CapyMOA, for supervised learning, there is one primary evaluation function designed to handle the manipulation of Evaluators: prequential_evaluation(). This function streamlines the process, ensuring users need not update the Evaluators directly. Essentially, it executes the evaluation loop and updates the relevant Evaluators:

- prequential_evaluation() utilises a ClassificationEvaluator and a ClassificationWindowedEvaluator

Previously, CapyMOA included two other functions: cumulative_evaluation() and windowed_evaluation(). However, since prequential_evaluation() incorporates the functionality of both, we decided to remove those functions and focus on prequential_evaluation().

It's important to note that prequential_evaluation() is applicable to Regression and Prediction Intervals besides Classification. The functionality and interpretation remain the same across these cases, but the metrics differ (see the sketch below).
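As a minimal sketch of the prediction-interval case, the same prequential_evaluation() call applies; here we assume the MVE learner from capymoa.prediction_interval and the coverage metric (check the API documentation for the exact names available in your version).

from capymoa.datasets import Fried
from capymoa.prediction_interval import MVE
from capymoa.evaluation import prequential_evaluation

pi_stream = Fried()
# MVE wraps a base regressor to produce prediction intervals (assumed API)
pi_learner = MVE(schema=pi_stream.get_schema())
pi_results = prequential_evaluation(stream=pi_stream, learner=pi_learner, window_size=5000)

# Prediction-interval results expose interval metrics such as coverage (assumed metric name)
print(pi_results.cumulative.coverage())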
Result of a high-level function

- The return from prequential_evaluation() is a PrequentialResults object, which provides access to the cumulative and windowed metrics as well as some other measurements (like wall-clock and CPU time).

Common characteristics for all high-level evaluation functions

- prequential_evaluation() specifies a max_instances parameter, which by default is None. Depending on the source of the data (e.g. a real stream or a synthetic stream), the function may otherwise never stop! The intuition behind this is that streams are infinite, so we process them as such. Therefore, it is a good idea to specify max_instances unless you are using a snapshot of a stream (i.e. a Dataset like Electricity); see the sketch below.
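A minimal sketch of bounding the evaluation on a generator: we assume the SEA generator from capymoa.stream.generator and the NaiveBayes classifier, but any generator and learner pair would behave the same.

from capymoa.stream.generator import SEA
from capymoa.classifier import NaiveBayes
from capymoa.evaluation import prequential_evaluation

synthetic_stream = SEA()  # a synthetic generator never runs out of instances
nb = NaiveBayes(schema=synthetic_stream.get_schema())

# Without max_instances, this evaluation would never terminate on a generator,
# so we cap it at 10,000 instances.
results = prequential_evaluation(stream=synthetic_stream, learner=nb, window_size=1000, max_instances=10000)
print(results.cumulative.accuracy())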
Evaluation practices in the literature (and in practice)

Interested readers might want to peruse section 6.1.1 (Error Estimation) of the Machine Learning for Data Streams book. We further expand on the relationships between the literature and our evaluation functions in the documentation: https://www.capymoa.org
2.1 prequential_evaluation()#
prequential_evaluation() performs a windowed evaluation and a cumulative evaluation at once. Internally, it maintains a ClassificationWindowedEvaluator (for the windowed metrics) and a ClassificationEvaluator (for the cumulative metrics). This gives us access to the cumulative and windowed results without running two separate evaluation functions.

- The results returned from prequential_evaluation() allow accessing the Evaluator objects ClassificationWindowedEvaluator (attribute windowed) and ClassificationEvaluator (attribute cumulative) directly.
- Notice that the computational overhead of training and assessing the same model twice outweighs the minimal overhead of updating the two Evaluators within the function. Thus, it is advisable to use the prequential_evaluation() function instead of creating separate while loops for evaluation.
- Advanced users might intuitively request metrics directly from the results object, which will return the cumulative metrics. For example, assuming results = prequential_evaluation(...), results.accuracy() will return the cumulative accuracy. IMPORTANT: there are no IDE hints for these metrics as they are accessed dynamically via __getattr__. It is advisable that users access metrics explicitly through results.cumulative (or results['cumulative']) and results.windowed (or results['windowed']).
- Invoking results.metrics_per_window() on a results object will return the dataframe with the windowed results.
- results.write_to_file() will output the cumulative and windowed results to a directory.
- results.cumulative.metrics_dict() will return all the cumulative metric identifiers and their corresponding values in a dictionary structure.
- Invoking plot_windowed_results() with a PrequentialResults object will plot its windowed results.
- For plotting and analysis purposes, one might want to set store_predictions=True and store_y=True on the prequential_evaluation() function, which will include all the predictions and the ground-truth y in the PrequentialResults object. It is important to note that this can be costly in terms of memory depending on the size of the stream.
[2]:
from capymoa.evaluation import prequential_evaluation
from capymoa.classifier import HoeffdingTree
from capymoa.datasets import ElectricityTiny
from capymoa.evaluation.visualization import plot_windowed_results

elec_stream = ElectricityTiny()
ht = HoeffdingTree(schema=elec_stream.get_schema(), grace_period=50)

results_ht = prequential_evaluation(stream=elec_stream, learner=ht, window_size=100, optimise=True, store_predictions=False, store_y=False)

print("\tDifferent ways of accessing metrics:")
print(f"results_ht['wallclock']: {results_ht['wallclock']} results_ht.wallclock(): {results_ht.wallclock()}")
print(f"results_ht['cpu_time']: {results_ht['cpu_time']} results_ht.cpu_time(): {results_ht.cpu_time()}")
print(f"results_ht.cumulative.accuracy() = {results_ht.cumulative.accuracy()}")
print(f"results_ht.cumulative['accuracy'] = {results_ht.cumulative['accuracy']}")
print(f"results_ht['cumulative'].accuracy() = {results_ht['cumulative'].accuracy()}")
print(f"results_ht.accuracy() = {results_ht.accuracy()}")

print("\n\tAll the cumulative results:")
print(results_ht.cumulative.metrics_dict())

print("\n\tAll the windowed results:")
display(results_ht.metrics_per_window())
# OR display(results_ht.windowed.metrics_per_window())

# results_ht.write_to_file() -> this will save the results to a directory

plot_windowed_results(results_ht, metric="accuracy")
Different ways of accessing metrics:
results_ht['wallclock']: 0.014730691909790039 results_ht.wallclock(): 0.014730691909790039
results_ht['cpu_time']: 0.05554199999999909 results_ht.cpu_time(): 0.05554199999999909
results_ht.cumulative.accuracy() = 83.85000000000001
results_ht.cumulative['accuracy'] = 83.85000000000001
results_ht['cumulative'].accuracy() = 83.85000000000001
results_ht.accuracy() = 83.85000000000001
All the cumulative results:
{'instances': 2000.0, 'accuracy': 83.85000000000001, 'kappa': 66.04003700899992, 'kappa_t': -14.946619217081869, 'kappa_m': 59.010152284263974, 'f1_score': 83.03346476507683, 'f1_score_0': 86.77855096193205, 'f1_score_1': 79.25497752087348, 'precision': 83.24177714270593, 'precision_0': 85.82995951417004, 'precision_1': 80.65359477124183, 'recall': 82.82619238745067, 'recall_0': 87.74834437086093, 'recall_1': 77.90404040404042}
All the windowed results:
 | instances | accuracy | kappa | kappa_t | kappa_m | f1_score | f1_score_0 | f1_score_1 | precision | precision_0 | precision_1 | recall | recall_0 | recall_1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100.0 | 89.0 | 75.663717 | 31.250000 | 64.516129 | 87.841244 | 91.603053 | 84.057971 | 87.582418 | 92.307692 | 82.857143 | 88.101604 | 90.909091 | 85.294118 |
1 | 200.0 | 80.0 | 49.367089 | -42.857143 | 67.213115 | 78.947368 | 60.000000 | 86.666667 | 88.235294 | 100.000000 | 76.470588 | 71.428571 | 42.857143 | 100.000000 |
2 | 300.0 | 71.0 | 16.953036 | -141.666667 | 29.268293 | 58.514754 | 81.290323 | 35.555556 | 58.114035 | 82.894737 | 33.333333 | 58.921037 | 79.746835 | 38.095238 |
3 | 400.0 | 85.0 | 66.637011 | -36.363636 | 77.941176 | 84.021504 | 77.611940 | 88.721805 | 86.376882 | 89.655172 | 83.098592 | 81.791171 | 68.421053 | 95.161290 |
4 | 500.0 | 87.0 | 73.684211 | -8.333333 | 80.000000 | 87.218591 | 88.495575 | 85.057471 | 87.916667 | 83.333333 | 92.500000 | 86.531513 | 94.339623 | 78.723404 |
5 | 600.0 | 84.0 | 64.221825 | -14.285714 | 54.285714 | 82.965706 | 88.059701 | 75.757576 | 85.615079 | 81.944444 | 89.285714 | 80.475382 | 95.161290 | 65.789474 |
6 | 700.0 | 85.0 | 70.000000 | 16.666667 | 70.588235 | 85.880856 | 83.146067 | 86.486486 | 85.000000 | 74.000000 | 96.000000 | 86.780160 | 94.871795 | 78.688525 |
7 | 800.0 | 99.0 | 97.954173 | 94.117647 | 97.674419 | 98.987342 | 98.823529 | 99.130435 | 99.137931 | 100.000000 | 98.275862 | 98.837209 | 97.674419 | 100.000000 |
8 | 900.0 | 78.0 | 57.446809 | -15.789474 | 56.862745 | 81.751825 | 80.357143 | 75.000000 | 83.582090 | 67.164179 | 100.000000 | 80.000000 | 100.000000 | 60.000000 |
9 | 1000.0 | 96.0 | 91.922456 | 50.000000 | 92.727273 | 95.998775 | 95.555556 | 96.363636 | 96.185065 | 97.727273 | 94.642857 | 95.813205 | 93.478261 | 98.148148 |
10 | 1100.0 | 83.0 | 1.162791 | -142.857143 | 0.000000 | 50.583013 | 90.607735 | 10.526316 | 50.555556 | 91.111111 | 10.000000 | 50.610501 | 90.109890 | 11.111111 |
11 | 1200.0 | 76.0 | 10.979228 | -100.000000 | 7.692308 | 66.740576 | 86.046512 | 14.285714 | 87.755102 | 75.510204 | 100.000000 | 53.846154 | 100.000000 | 7.692308 |
12 | 1300.0 | 87.0 | 66.529351 | -62.500000 | 59.375000 | 85.391617 | 91.275168 | 74.509804 | 91.975309 | 83.950617 | 100.000000 | 79.687500 | 100.000000 | 59.375000 |
13 | 1400.0 | 91.0 | 64.285714 | 57.142857 | 52.631579 | 84.639017 | 94.736842 | 68.965517 | 95.000000 | 90.000000 | 100.000000 | 76.315789 | 100.000000 | 52.631579 |
14 | 1500.0 | 92.0 | 62.686567 | 42.857143 | 50.000000 | 84.076433 | 95.454545 | 66.666667 | 95.652174 | 91.304348 | 100.000000 | 75.000000 | 100.000000 | 50.000000 |
15 | 1600.0 | 89.0 | 73.170732 | 21.428571 | 65.625000 | 87.145717 | 92.307692 | 80.701754 | 90.000000 | 88.000000 | 92.000000 | 84.466912 | 97.058824 | 71.875000 |
16 | 1700.0 | 89.0 | 78.000000 | 8.333333 | 76.086957 | 89.196535 | 89.523810 | 88.421053 | 89.393939 | 85.454545 | 93.333333 | 89.000000 | 94.000000 | 84.000000 |
17 | 1800.0 | 72.0 | 45.141066 | -47.368421 | 9.677419 | 77.094241 | 63.157895 | 77.419355 | 81.578947 | 100.000000 | 63.157895 | 73.076923 | 46.153846 | 100.000000 |
18 | 1900.0 | 58.0 | 24.677188 | -200.000000 | -31.250000 | 65.568421 | 58.823529 | 57.142857 | 65.329768 | 88.235294 | 42.424242 | 65.808824 | 44.117647 | 87.500000 |
19 | 2000.0 | 86.0 | 66.410749 | 26.315789 | 58.823529 | 84.238820 | 90.140845 | 75.862069 | 87.938596 | 84.210526 | 91.666667 | 80.837790 | 96.969697 | 64.705882 |
2.4 Evaluating a single stream using multiple learners#
prequential_evaluation_multiple_learners() further encapsulates experiments by executing multiple learners on a single stream.

- This function behaves as if we invoked prequential_evaluation() multiple times, but internally it only iterates through the Stream once. In situations where accessing each Instance of the Stream is costly, this function is more convenient than invoking prequential_evaluation() multiple times.
- This function does not calculate wallclock or cpu_time, because the training and testing of each learner is interleaved, which makes timing estimations unreliable. Thus, the results dictionaries do not contain the keys wallclock and cpu_time.
[7]:
from capymoa.evaluation import prequential_evaluation_multiple_learners
from capymoa.datasets import Electricity
from capymoa.classifier import AdaptiveRandomForestClassifier, OnlineBagging
from capymoa.evaluation.visualization import plot_windowed_results

stream = Electricity()

# Define the learners + an alias (dictionary key)
learners = {
    'OB': OnlineBagging(schema=stream.get_schema(), ensemble_size=10),
    'ARF': AdaptiveRandomForestClassifier(schema=stream.get_schema(), ensemble_size=10)
}

results = prequential_evaluation_multiple_learners(stream, learners, window_size=4500)

print(f"OB final accuracy = {results['OB'].cumulative.accuracy()} and ARF final accuracy = {results['ARF'].cumulative.accuracy()}")
plot_windowed_results(results['OB'], results['ARF'], metric="accuracy")
OB final accuracy = 82.4174611581921 and ARF final accuracy = 89.32953742937853
3. Regression#
We introduce a simple example using regression just to show how similar assessing regressors with the high-level evaluation functions is.

- In the example below, we use prequential_evaluation() exactly as we did for classification.
- One difference between Classification and Regression evaluation in CapyMOA is that the Evaluators are different: instead of ClassificationEvaluator and ClassificationWindowedEvaluator, the functions use RegressionEvaluator and RegressionWindowedEvaluator.
[8]:
from capymoa.datasets import Fried
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.evaluation import prequential_evaluation
from capymoa.regressor import KNNRegressor, AdaptiveRandomForestRegressor

stream = Fried()
kNN_learner = KNNRegressor(schema=stream.get_schema(), k=5)
ARF_learner = AdaptiveRandomForestRegressor(schema=stream.get_schema(), ensemble_size=10)

kNN_results = prequential_evaluation(stream=stream, learner=kNN_learner, window_size=5000)
ARF_results = prequential_evaluation(stream=stream, learner=ARF_learner, window_size=5000)

print(f"{kNN_results['learner']} [cumulative] RMSE = {kNN_results['cumulative'].rmse()} and \
{ARF_results['learner']} [cumulative] RMSE = {ARF_results['cumulative'].rmse()}")

plot_windowed_results(kNN_results, ARF_results, metric='rmse')
kNNRegressor [cumulative] RMSE = 2.7229994765160916 and AdaptiveRandomForestRegressor [cumulative] RMSE = 2.3894271579519426
3.1 Evaluating a single stream using multiple learners (Regression)#
prequential_evaluation_multiple_learners() also works for multiple regressors; the example below shows how it can be used.
[9]:
from capymoa.evaluation import prequential_evaluation_multiple_learners

# Define the learners + an alias (dictionary key)
learners = {
    'kNNReg_k5': KNNRegressor(schema=stream.get_schema(), k=5),
    'kNNReg_k2': KNNRegressor(schema=stream.get_schema(), k=2),
    'kNNReg_k5_median': KNNRegressor(schema=stream.get_schema(), CLI='-k 5 -m'),
    'ARFReg_s5': AdaptiveRandomForestRegressor(schema=stream.get_schema(), ensemble_size=5)
}

results = prequential_evaluation_multiple_learners(stream, learners)

print('Cumulative results for each learner:')
for learner_id in learners.keys():
    if learner_id in results:
        cumulative = results[learner_id]['cumulative']
        print(f"{learner_id}, RMSE: {cumulative.rmse():.2f}, adjusted R2: {cumulative.adjusted_r2():.2f}")

# Tip: invoking metrics_header() from an Evaluator will show us all the metrics available,
# e.g. results['kNNReg_k5']['cumulative'].metrics_header()

plot_windowed_results(results['kNNReg_k5'], results['kNNReg_k2'], results['kNNReg_k5_median'],
                      results['ARFReg_s5'], metric="rmse")
plot_windowed_results(results['kNNReg_k5'], results['kNNReg_k2'], results['kNNReg_k5_median'],
                      results['ARFReg_s5'], metric="adjusted_r2")
Cumulative results for each learner:
kNNReg_k5, RMSE: 2.72, adjusted R2: 0.70
kNNReg_k2, RMSE: 3.08, adjusted R2: 0.62
kNNReg_k5_median, RMSE: 2.94, adjusted R2: 0.65
ARFReg_s5, RMSE: 2.58, adjusted R2: 0.73
3.2 Plotting predictions vs. ground truth over time (Regression)#
- In Regression, it is sometimes desirable to plot predictions vs. ground truth to observe what is happening with the Stream. If we create a custom loop and use the Evaluators directly, it is trivial to store the ground truth and the predictions, and then plot them (see the sketch after this list). However, to make life easier, the plot_predictions_vs_ground_truth() function can be used.
- For massive streams with millions of instances, it can be unbearable to plot everything at once, so we can specify a plot_interval (the range we want to investigate) to plot_predictions_vs_ground_truth(). By default (plot_interval=None), the plot function will attempt to plot everything, which is seldom a good idea.
[10]:
from capymoa.evaluation import prequential_evaluation
from capymoa.evaluation.visualization import plot_predictions_vs_ground_truth
from capymoa.regressor import KNNRegressor, AdaptiveRandomForestRegressor
from capymoa.datasets import Fried

stream = Fried()
kNN_learner = KNNRegressor(schema=stream.get_schema(), k=5)
ARF_learner = AdaptiveRandomForestRegressor(schema=stream.get_schema(), ensemble_size=10)

# When we specify store_predictions and store_y, the results will also include all the predictions and all the ground truth y.
# This is useful for debugging and outputting the predictions elsewhere.
kNN_results = prequential_evaluation(stream=stream, learner=kNN_learner, window_size=5000, store_predictions=True, store_y=True)

# We don't need to store the ground truth for every experiment, since it is always the same for the same stream
ARF_results = prequential_evaluation(stream=stream, learner=ARF_learner, window_size=5000, store_predictions=True)

# Plot only the first 200 predictions (see plot_interval)
plot_predictions_vs_ground_truth(kNN_results, ARF_results, ground_truth=kNN_results['ground_truth_y'], plot_interval=(0, 200))