{ "cells": [ { "cell_type": "markdown", "id": "5dbd70f8-ca5f-455b-bc3b-94c909431b60", "metadata": {}, "source": [
"# 0. Getting started with CapyMOA\n",
"\n",
"This notebook shows some basic usage of CapyMOA for supervised learning (classification and regression).\n",
"\n",
"* More detailed notebooks and documentation are available; our goal here is just to present some high-level functions and demonstrate a subset of CapyMOA's functionality.\n",
"* For simplicity, the following examples simulate data streams using datasets and synthetic generators. One could also read data directly from a CSV or ARFF file (see the [stream_from_file](https://capymoa.org/api/stream.html#capymoa.stream.stream_from_file) function).\n",
"\n",
"---\n",
"\n",
"*More information about CapyMOA can be found at* https://www.capymoa.org\n",
"\n",
"**Last updated on 25/07/2024**"
] }, { "attachments": {}, "cell_type": "markdown", "id": "deff0053-1130-43f1-bf69-02bd34bc8e03", "metadata": {}, "source": [
"## 1. Classification\n",
"\n",
"* Classification for data streams traditionally assumes that instances become available\n",
"  to the classifier incrementally and that the label of each instance arrives before\n",
"  the next instance does\n",
"* It is common to simulate this behavior using a **while loop**, often referred\n",
"  to as a **test-then-train loop**, which contains 4 distinct steps:\n",
"    1. Fetch the next instance from the stream\n",
"    2. Make a prediction\n",
"    3. Train the model with the instance\n",
"    4. Update a mechanism to keep track of metrics\n",
"\n",
"**Some remarks about the test-then-train loop**:\n",
"\n",
"* We must not train before testing: steps 2 and 3 must not be swapped, because the model would then be evaluated on instances it has already seen, making the result unreliable as an estimate of performance on unseen data.\n",
"* Steps 3 and 4 can be completed in any order without altering the result, since the prediction from step 2 has already been made.
\n",
"* What if labels are not immediately available? Then you might want to read about delayed labeling and partially labeled data; see [A Survey on Semi-supervised Learning for Delayed Partially Labelled Data Streams](https://dl.acm.org/doi/full/10.1145/3523055).\n",
"* More information on classification for data streams is available in section **2.2 Classification** of the [Machine Learning for Data Streams](https://moa.cms.waikato.ac.nz/book-html/) book.\n"
] }, { "cell_type": "code", "execution_count": 1, "id": "38a4df75", "metadata": { "execution": { "iopub.execute_input": "2024-09-23T00:26:27.428242Z", "iopub.status.busy": "2024-09-23T00:26:27.427587Z", "iopub.status.idle": "2024-09-23T00:26:27.434558Z", "shell.execute_reply": "2024-09-23T00:26:27.434095Z" }, "nbsphinx": "hidden" }, "outputs": [], "source": [
"# This cell is hidden on capymoa.org. See docs/contributing/docs.rst\n",
"from util.nbmock import mock_datasets, is_nb_fast\n",
"\n",
"if is_nb_fast():\n",
"    mock_datasets()"
] }, { "cell_type": "code", "execution_count": 2, "id": "4728fd23-a4ca-46e9-8571-dc091e4e0d50", "metadata": { "execution": { "iopub.execute_input": "2024-09-23T00:26:27.436319Z", "iopub.status.busy": "2024-09-23T00:26:27.436090Z", "iopub.status.idle": "2024-09-23T00:26:31.488386Z", "shell.execute_reply": "2024-09-23T00:26:31.487891Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "82.06656073446328\n" ] } ], "source": [
"from capymoa.datasets import Electricity\n",
"from capymoa.evaluation import ClassificationEvaluator\n",
"from capymoa.classifier import OnlineBagging\n",
"\n",
"elec_stream = Electricity()\n",
"ob_learner = OnlineBagging(schema=elec_stream.get_schema(), ensemble_size=5)\n",
"ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())\n",
"\n",
"while elec_stream.has_more_instances():\n",
"    # 1. Fetch the next instance from the stream\n",
"    instance = elec_stream.next_instance()\n",
"    # 2. Predict before training (test-then-train)\n",
"    prediction = ob_learner.predict(instance)\n",
"    # 3. Train the model with the instance\n",
"    ob_learner.train(instance)\n",
"    # 4. Update the evaluator with the true label index and the prediction\n",
"    ob_evaluator.update(instance.y_index, prediction)\n",
"\n",
"# Cumulative accuracy over the whole stream\n",
"print(ob_evaluator.accuracy())"
] }, { "cell_type": "markdown", "id": "5934db15-b6e8-4084-978f-4af71b34df46", "metadata": {}, "source": [
"### 1.1 High-level evaluation functions\n",
"\n",
"* If our goal is just to evaluate learners, it would be tedious to keep writing **test-then-train loops**. \n",
"Thus, it makes sense to encapsulate that loop inside **high-level evaluation functions**. \n",
"\n",
"* Furthermore, sometimes we are interested in **cumulative metrics** and sometimes we care about **windowed metrics**. For example, if we want to know how accurate our model is so far, considering all the instances it has seen, then we would look at its **cumulative metrics**. However, we might also be interested in how well the model performs over each window of **n** instances, so that we can, for example, identify periods in which our model was really struggling to produce correct predictions. \n",
"\n",
"* In this example, we use the ```prequential_evaluation``` function, which provides us with both the cumulative and the windowed metrics! \n",
"\n",
"* Some remarks:\n",
"    * If you want to know more about other **high-level evaluation functions**, **evaluators**, or which **metrics** are available, check the **01_evaluation** notebook\n",
"    * The **results** from evaluation functions such as **prequential_evaluation** follow a standard format, which is discussed thoroughly in the **Evaluation documentation** at http://www.capymoa.org\n",
"    * Sometimes authors refer to the **cumulative** metrics as **test-then-train** metrics, such as **test-then-train accuracy** (or TTT accuracy for short). They all refer to the same concept.\n",
"    * Shouldn't we recreate the stream object ```elec_stream```?
No: by default, `prequential_evaluation()` automatically calls ```restart()``` on a stream when it is reused.\n",
"\n",
"In the example below, `prequential_evaluation` is used with a `HoeffdingTree` classifier on the `Electricity` data stream."
] }, { "cell_type": "code", "execution_count": 3, "id": "1ac9ffd4-6dd0-436b-8c35-eb61393f985d", "metadata": { "execution": { "iopub.execute_input": "2024-09-23T00:26:31.490311Z", "iopub.status.busy": "2024-09-23T00:26:31.490099Z", "iopub.status.idle": "2024-09-23T00:26:31.733709Z", "shell.execute_reply": "2024-09-23T00:26:31.733253Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cumulative accuracy = 81.6604872881356, wall-clock time: 0.2248706817626953\n" ] }, { "data": { "text/html": [ "
\n",
"|  | instances | accuracy | kappa | kappa_t | kappa_m | f1_score | f1_score_0 | f1_score_1 | precision | precision_0 | precision_1 | recall | recall_0 | recall_1 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 4500.0 | 87.777778 | 74.440796 | 24.242424 | 68.856172 | 87.222016 | 84.550562 | 89.889706 | 87.149807 | 84.078212 | 90.221402 | 87.294344 | 85.028249 | 89.560440 |\n",
"| 1 | 9000.0 | 83.666667 | 66.963969 | 2.649007 | 64.458414 | 83.538542 | 81.657100 | 85.279391 | 83.752489 | 84.373388 | 83.131589 | 83.325685 | 79.110251 | 87.541118 |\n",
"| 2 | 13500.0 | 85.644444 | 71.282626 | 2.269289 | 70.009285 | 85.663875 | 85.304823 | 85.968723 | 85.634554 | 83.780161 | 87.488948 | 85.693216 | 86.886006 | 84.500427 |\n",
"| 3 | 18000.0 | 81.977778 | 61.953129 | -25.154321 | 57.021728 | 81.463331 | 76.168087 | 85.510095 | 82.841248 | 85.488127 | 80.194370 | 80.130502 | 68.680445 | 91.580559 |\n",
"| 4 | 22500.0 | 86.177778 | 70.202882 | 13.370474 | 64.719229 | 85.389296 | 80.931944 | 89.159986 | 86.648480 | 88.058706 | 85.238254 | 84.166185 | 74.872377 | 93.459993 |\n",
"| 5 | 27000.0 | 78.088889 | 53.951820 | -72.377622 | 47.272727 | 77.186522 | 71.634062 | 82.150615 | 77.962693 | 77.521793 | 78.403594 | 76.425652 | 66.577540 | 86.273764 |\n",
"| 6 | 31500.0 | 79.066667 | 55.619360 | -71.897810 | 46.263548 | 77.829775 | 72.504378 | 83.100108 | 78.081099 | 74.237896 | 81.924301 | 77.580064 | 70.849971 | 84.310157 |\n",
"| 7 | 36000.0 | 74.955556 | 49.002474 | -89.411765 | 37.354086 | 74.661963 | 70.719667 | 78.120753 | 74.256346 | 66.390244 | 82.122449 | 75.072035 | 75.653141 | 74.490929 |\n",
"| 8 | 40500.0 | 74.555556 | 50.130218 | -71.664168 | 41.312148 | 76.116886 | 74.818562 | 74.286998 | 76.196815 | 65.523883 | 86.869748 | 76.037125 | 87.186058 | 64.888191 |\n",
"| 9 | 45000.0 | 84.377778 | 68.535062 | -0.428571 | 68.390288 | 84.268304 | 82.949309 | 85.585401 | 84.249034 | 82.648623 | 85.849445 | 84.287584 | 83.252191 | 85.322976 |\n",
"| 10 | 45312.0 | 84.266667 | 68.237903 | -2.757620 | 67.876588 | 84.118966 | 82.587309 | 85.650588 | 84.121918 | 82.627953 | 85.615883 | 84.116013 | 82.546706 | 85.685320 |\n", "