{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "b773bf8e-c420-44e1-80a6-99f75dd12268", "metadata": {}, "source": [ "# 7. Pipelines and Transformers\n", "\n", "This notebook showcases the current version of data processing pipelines in CapyMOA.\n", "\n", "* Includes an example of how preprocessing can be accomplished via pipelines and transformers.\n", "* Transformers transform an instance, e.g., via standardization or normalization.\n", "* Pipelines bundle transformers and can also act as classifiers or regressors.\n", "\n", "Please note that this feature is still under development; some functionality might not yet be available or might change in future releases.\n", "\n", "---\n", "\n", "*More information about CapyMOA can be found at* https://www.capymoa.org\n", "\n", "**notebook last updated on 25/07/2024**" ] }, { "attachments": {}, "cell_type": "markdown", "id": "55d070de-8697-4f98-a11b-eab4e3d5c281", "metadata": {}, "source": [ "## 1. Running Online Bagging without any preprocessing\n", "\n", "First, let us have a look at a simple test-then-train classification example without pipelines. For each instance of the data stream, we:\n", "- make a prediction,\n", "- update the evaluator with the prediction and the true label,\n", "- and then train the classifier on the instance." 
] }, { "cell_type": "code", "execution_count": 1, "id": "14681f54-23a1-4f93-9145-abf484c91c54", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "82.06656073446328" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Test-then-train loop\n", "from capymoa.datasets import Electricity\n", "from capymoa.classifier import OnlineBagging\n", "from capymoa.evaluation import ClassificationEvaluator\n", "\n", "# Opening the Electricity dataset as a stream\n", "elec_stream = Electricity()\n", "\n", "# Creating a learner\n", "ob_learner = OnlineBagging(schema=elec_stream.get_schema(), ensemble_size=5)\n", "\n", "# Creating the evaluator\n", "ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema())\n", "\n", "while elec_stream.has_more_instances():\n", "    instance = elec_stream.next_instance()\n", "\n", "    # Test first: predict and score the instance before the learner sees its label\n", "    prediction = ob_learner.predict(instance)\n", "    ob_evaluator.update(instance.y_index, prediction)\n", "    # Then train on the same instance\n", "    ob_learner.train(instance)\n", "\n", "ob_evaluator.accuracy()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0c1360ef-0583-4c87-8645-1e2d701fffca", "metadata": {}, "source": [ "## 2. Online Bagging using pipelines and transformers\n", "\n", "If we want to perform some preprocessing, such as normalization or feature transformation, or a combination of both, we can chain multiple `Transformer`s within a pipeline. The last step of a pipeline is a learner, such as a CapyMOA classifier or regressor.\n", "\n", "Like classifiers and regressors, pipelines support `train` and `predict`. Hence, we can use them in the same way as we would use other CapyMOA learners. Internally, the pipeline object passes an incoming instance from one transformer to the next. It then returns the prediction of the classifier / regressor using the transformed instance.\n", "\n", "Creating a pipeline consists of the following steps:\n", "1. Create a stream instance\n", "2. Initialize the transformers\n", "3. 
Initialize the learner\n", "4. Create the pipeline. Here, we use a `ClassifierPipeline`\n", "5. Use the pipeline the same way as any other learner." ] }, { "cell_type": "code", "execution_count": 2, "id": "ae9bb646-e0d1-4de6-b5a1-cff0f0a1b172", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "77.5048552259887" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from capymoa.stream.preprocessing import MOATransformer\n", "from capymoa.stream.preprocessing import ClassifierPipeline\n", "from capymoa.stream import Stream\n", "from moa.streams.filters import AddNoiseFilter, NormalisationFilter\n", "from moa.streams import FilteredStream\n", "\n", "elec_stream = Electricity()\n", "\n", "# Creating the transformers\n", "normalisation_transformer = MOATransformer(schema=elec_stream.get_schema(), moa_filter=NormalisationFilter())\n", "add_noise_transformer = MOATransformer(schema=normalisation_transformer.get_schema(), moa_filter=AddNoiseFilter())\n", "\n", "# Creating a learner\n", "ob_learner = OnlineBagging(schema=add_noise_transformer.get_schema(), ensemble_size=5)\n", "\n", "# Creating and populating the pipeline\n", "pipeline = ClassifierPipeline(transformers=[normalisation_transformer, add_noise_transformer],\n", " learner=ob_learner)\n", "\n", "# Creating the evaluator\n", "ob_evaluator = ClassificationEvaluator(schema=elec_stream.get_schema()) \n", "\n", "while elec_stream.has_more_instances():\n", " instance = elec_stream.next_instance()\n", " prediction = pipeline.predict(instance)\n", " ob_evaluator.update(instance.y_index, prediction)\n", " pipeline.train(instance)\n", "\n", "ob_evaluator.accuracy()" ] }, { "cell_type": "markdown", "id": "676f53b7-0839-47a5-88f9-393b2007855e", "metadata": {}, "source": [ "We can also get a textual representation of the pipeline:" ] }, { "cell_type": "code", "execution_count": 3, "id": "31a481db-d23b-4fc8-a689-fc5c14df5fff", "metadata": {}, "outputs": [ { "data": { "text/plain": 
[ "'Transformer(NormalisationFilter) | Transformer(AddNoiseFilter) | OnlineBagging'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str(pipeline)" ] }, { "cell_type": "markdown", "id": "df255274-83cd-41df-a1da-04778bc427aa", "metadata": {}, "source": [ "### 2.1 Alternative syntax\n", "* An alternative syntax to define the pipeline is shown below\n", "* Since the pipeline behaves like a learner, it can be used with high-level evaluation functions like `prequential_evaluation`" ] }, { "cell_type": "code", "execution_count": 4, "id": "50cb066b-e3e4-4ffd-ad9d-65631d5462e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AdaptiveRandomForest: 88.55049435028248\n", "Transformer(NormalisationFilter) | AdaptiveRandomForest: 88.07821327683615\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from capymoa.evaluation import prequential_evaluation\n", "from capymoa.classifier import AdaptiveRandomForestClassifier\n", "from capymoa.evaluation.visualization import plot_windowed_results\n", "\n", "elec_stream = Electricity()\n", "\n", "# Creating a transformer\n", "normalisation_transformer = MOATransformer(schema=elec_stream.get_schema(), moa_filter=NormalisationFilter())\n", "\n", "# Creating an ARF classifier as a baseline\n", "arf = AdaptiveRandomForestClassifier(schema=normalisation_transformer.get_schema(), ensemble_size=5)\n", "\n", "# Alternative syntax. \n", "pipeline_arf = ClassifierPipeline()\n", "pipeline_arf.add_transformer(normalisation_transformer)\n", "pipeline_arf.set_learner(AdaptiveRandomForestClassifier(schema=add_noise_transformer.get_schema(), ensemble_size=5))\n", "\n", "results_arf_pipeline = prequential_evaluation(stream=elec_stream, learner=pipeline_arf, window_size=4500)\n", "results_arf_baseline = prequential_evaluation(stream=elec_stream, learner=arf, window_size=4500)\n", "\n", "print(f\"{arf}: {results_arf_baseline['cumulative'].accuracy()}\")\n", "print(f\"{pipeline_arf}: {results_arf_pipeline['cumulative'].accuracy()}\")\n", "plot_windowed_results(results_arf_pipeline, results_arf_baseline, metric='accuracy')" ] }, { "cell_type": "markdown", "id": "5cc06f0d-d2a0-4fc6-aa9f-ea80e0d224cd", "metadata": {}, "source": [ "## 3. 
RegressorPipeline\n", "\n", "* The regression version of the pipeline is quite similar to the classification one" ] }, { "cell_type": "code", "execution_count": 8, "id": "f3c8271b-bbb7-4ca9-97f5-fb41e27ec4fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rmse: 2.5841013025710358\n" ] } ], "source": [ "from capymoa.regressor import AdaptiveRandomForestRegressor\n", "from capymoa.stream.preprocessing import RegressorPipeline\n", "from capymoa.evaluation import RegressionEvaluator\n", "from capymoa.datasets import Fried\n", "\n", "fried_stream = Fried()\n", "\n", "# Creating a transformer\n", "normalisation_transformer = MOATransformer(schema=fried_stream.get_schema(), moa_filter=NormalisationFilter())\n", "\n", "arfreg = AdaptiveRandomForestRegressor(schema=normalisation_transformer.get_schema(), ensemble_size=5)\n", "\n", "# Creating and populating the pipeline\n", "pipeline_arfreg = RegressorPipeline(transformers=[normalisation_transformer],\n", " learner=arfreg)\n", "\n", "# Creating the evaluator\n", "arfreg_evaluator = RegressionEvaluator(schema=fried_stream.get_schema()) \n", "\n", "while fried_stream.has_more_instances():\n", " instance = fried_stream.next_instance()\n", " prediction = pipeline_arfreg.predict(instance)\n", " arfreg_evaluator.update(instance.y_value, prediction)\n", " pipeline_arfreg.train(instance)\n", "\n", "print(f\"rmse: {arfreg_evaluator.rmse()}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 5 }