{ "cells": [ { "cell_type": "markdown", "id": "154b5160-fd11-4ed9-82f3-17b7bf7abf0d", "metadata": {}, "source": [ "# Drift Detection in CapyMOA\n", "\n", "This tutorial shows how to perform drift detection with CapyMOA:\n", "\n", "* Testing the available drift detectors on a synthetic stream\n", "* Example using ADWIN\n", "* Evaluating detectors based on known drift locations\n", "* Multivariate drift detection with ABCD\n", "\n", "---\n", "\n", "*More information about CapyMOA can be found at* https://www.capymoa.org\n", "\n", "**last update on 25/07/2024**" ] }, { "cell_type": "code", "execution_count": 1, "id": "78dc8927-1bc3-4ce2-b352-ecf50ab56480", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import capymoa.drift.detectors as detectors" ] }, { "cell_type": "markdown", "id": "432b8844-6f91-412d-ad36-3a640affc223", "metadata": {}, "source": [ "## Basic example" ] }, { "cell_type": "markdown", "id": "93224151-66bd-4124-ba0f-4ad486d5810a", "metadata": {}, "source": [ "- Creating dummy data" ] }, { "cell_type": "code", "execution_count": 2, "id": "3406740a-f265-4434-aae8-05db48de7e56", "metadata": {}, "outputs": [], "source": [ "# Values are 0 or 1 before index 999 and 6 to 11 from index 999 onward (an abrupt drift)\n", "data_stream = np.random.randint(2, size=2000)\n", "for i in range(999, 2000):\n", "    data_stream[i] = np.random.randint(6, high=12)" ] }, { "cell_type": "markdown", "id": "e6aca673-6eab-42b0-8981-e5c73491243e", "metadata": {}, "source": [ "- Basic drift detection example" ] }, { "cell_type": "code", "execution_count": 3, "id": "bca87b8f-91f3-4eaf-a011-e8b274bda1f5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ADWIN 2\n", "CUSUM 1\n", "DDM 1\n", "EWMAChart 1\n", "GeometricMovingAverage 1\n", "HDDMAverage 126\n", "HDDMWeighted 89\n", "PageHinkley 1\n", "RDDM 2\n", "SEED 2\n", "STEPD 1\n", "ABCD 1\n", "dtype: int64\n" ] } ], "source": [ "# Count the detections of every detector exported by the module, using default parameters\n", "all_detectors = detectors.__all__\n", "\n", "n_detections = {k: 0 for k in all_detectors}\n", "for detector_name in all_detectors:\n", "    detector = getattr(detectors, 
detector_name)()\n", "\n", "    for i in range(2000):\n", "        detector.add_element(float(data_stream[i]))\n", "        if detector.detected_change():\n", "            n_detections[detector_name] += 1\n", "\n", "print(pd.Series(n_detections))" ] }, { "cell_type": "markdown", "id": "5ca1d03f-a7a2-421a-b800-ebd6a7918791", "metadata": {}, "source": [ "## Example using ADWIN" ] }, { "cell_type": "code", "execution_count": 4, "id": "5c9ade00-778f-481c-a51c-49dd199cd145", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Change detected in data: 1 - at index: 24\n", "Change detected in data: 6 - at index: 1010\n" ] } ], "source": [ "from capymoa.drift.detectors import ADWIN\n", "\n", "# Create a fresh ADWIN detector instead of reusing the last detector from the loop above\n", "detector = ADWIN(delta=0.001)\n", "\n", "for i in range(2000):\n", "    detector.add_element(data_stream[i])\n", "    if detector.detected_change():\n", "        print(\n", "            \"Change detected in data: \" + str(data_stream[i]) + \" - at index: \" + str(i)\n", "        )" ] }, { "cell_type": "code", "execution_count": 5, "id": "dcf26795-09bb-4508-ab36-878c4f145197", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1011, 2025, 3011]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Detection indices\n", "detector.detection_index" ] }, { "cell_type": "code", "execution_count": 6, "id": "cd7ae6be-17cb-4dec-b630-32e41531b020", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1009, 1010, 2022, 2023, 2024, 3009, 3010]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Warning indices\n", "detector.warning_index" ] }, { "cell_type": "code", "execution_count": 7, "id": "0b8a45e7-d983-4f69-a0d8-035619da3b11", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4000" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instance counter\n", "detector.idx" ] }, { "cell_type": "markdown", "id": "71ba1b8b-751c-4ab4-a9a3-331de462b0f9", "metadata": {}, 
"source": [ "## Evaluating drift detectors\n", "\n", "Assuming the drift locations are known, you can evaluate detectors using the **EvaluateDetector** class.\n", "\n", "This class takes a parameter called **max_delay**, the maximum number of instances after a drift within which a detection still counts as detecting that drift. After **max_delay** instances, we assume that the change is obvious and has been missed by the detector." ] }, { "cell_type": "code", "execution_count": 8, "id": "598a89e7-8460-415f-8a92-6854509e4697", "metadata": {}, "outputs": [], "source": [ "from capymoa.drift.eval_detector import EvaluateDetector" ] }, { "cell_type": "code", "execution_count": 9, "id": "4a2df820-9314-42e8-bf37-575c837ffabe", "metadata": {}, "outputs": [], "source": [ "# Named `evaluator` to avoid shadowing Python's built-in eval()\n", "evaluator = EvaluateDetector(max_delay=200)" ] }, { "cell_type": "markdown", "id": "98799472-d6cb-4cc1-912e-49d1004b3c84", "metadata": {}, "source": [ "The `calc_performance` method of EvaluateDetector takes two arguments for evaluating detectors:\n", "- `trues`: the locations of the drifts\n", "- `preds`: the locations of the detections" ] }, { "cell_type": "code", "execution_count": 10, "id": "352a52da-71e0-4f7b-bf74-09230086b91a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "mean_time_to_detect 11.0\n", "missed_detection_ratio 0.0\n", "mean_time_btw_false_alarms NaN\n", "no_alarms_per_episode 0.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trues = np.array([1000])\n", "preds = detector.detection_index\n", "\n", "evaluator.calc_performance(preds, trues)" ] }, { "cell_type": "markdown", "id": "928328fa-c4bc-41b2-8920-8ee59b1a2818", "metadata": {}, "source": [ "## Multivariate Drift Detection" ] }, { "cell_type": "code", "execution_count": 11, "id": "17658d67-c038-4293-9013-90734fd50b20", "metadata": {}, "outputs": [], "source": [ "from capymoa.drift.detectors import ABCD\n", "from capymoa.datasets import ElectricityTiny\n", "\n", "detector = ABCD()\n", "\n", "## Opening a file as a 
stream\n", "stream = ElectricityTiny()" ] }, { "cell_type": "code", "execution_count": 12, "id": "e9cb0128-5eb1-421c-ac6b-ed06dd85ad31", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Change detected at index: 2283\n" ] } ], "source": [ "i = 0\n", "loss_values = []\n", "# has_more_instances is a method and must be called; the bare attribute is always truthy\n", "while stream.has_more_instances() and i < 5000:\n", "    i += 1\n", "    instance = stream.next_instance()\n", "    detector.add_element(instance)\n", "    loss_values.append(detector.loss())\n", "    if detector.detected_change():\n", "        print(\"Change detected at index: \" + str(i))" ] }, { "cell_type": "code", "execution_count": 13, "id": "aae1b2a8-f45d-4c25-aa10-7ee72acfac04", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A 2-dimensional stream\n" ] } ], "source": [ "import numpy as np\n", "from capymoa.drift.detectors import ABCD\n", "from capymoa.datasets import ElectricityTiny\n", "\n", "detector = ABCD(model_id=\"pca\")\n", "\n", "# Synthetic stream: feature 0 shifts from U(0, 0.5) to U(0.5, 1.0) at index 3000; feature 1 stays U(0, 1.0)\n", "stream_change = np.hstack([np.random.uniform(0, 0.5, 3000), np.random.uniform(0.5, 1.0, 3000)])\n", "stream_nochange = np.random.uniform(0, 1.0, len(stream_change))\n", "stream = np.vstack([stream_change, stream_nochange]).T\n", "print(f\"A {stream.shape[-1]}-dimensional stream\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "849708ef-0725-43e5-84e8-5c6c34b52430", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Change detected at index: 3063\n" ] } ], "source": [ "i = 0\n", "loss_values = []\n", "while i < len(stream):\n", "    instance = stream[i]\n", "    i += 1\n", "    detector.add_element(instance)\n", "    loss_values.append(detector.loss())\n", "    if detector.detected_change():\n", "        print(\"Change detected at index: \" + str(i))" ] }, { "cell_type": "code", "execution_count": 15, "id": "fd1914b1-0bd0-4d14-b076-1a96ddff305f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Reconstruction loss')" ] }, 
"execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "# Smooth the per-instance loss with a 10-instance rolling mean before plotting\n", "plt.plot(pd.Series(loss_values).rolling(10).mean())\n", "plt.title(\"ABCD with PCA\")\n", "plt.xlabel(\"# Instances\")\n", "plt.ylabel(\"Reconstruction loss\")" ] }, { "cell_type": "markdown", "id": "b8c75d0d-266f-4c55-a73d-6c23458efb8b", "metadata": {}, "source": [ "We see that a value of 1 as the maximum reconstruction error is very conservative. Decreasing the `maximum_absolute_value` parameter makes the applied statistical test more sensitive, so the change is detected sooner." ] }, { "cell_type": "code", "execution_count": 16, "id": "47940ff5-27a7-4b3f-8467-47803307b031", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Change detected at index: 3024\n" ] } ], "source": [ "detector = ABCD(model_id=\"pca\", maximum_absolute_value=0.3)\n", "\n", "i = 0\n", "loss_values = []\n", "while i < len(stream):\n", "    instance = stream[i]\n", "    i += 1\n", "    detector.add_element(instance)\n", "    loss_values.append(detector.loss())\n", "    if detector.detected_change():\n", "        print(\"Change detected at index: \" + str(i))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 5 }