Anomaly Detection#

This notebook shows some basic usage of CapyMOA for Anomaly Detection tasks.

Algorithms: HalfSpaceTrees, Autoencoder and Online Isolation Forest

Important note: prior to version 0.8.2, a lower anomaly score indicated a higher likelihood of an anomaly. This has since been reversed: a higher anomaly score now indicates a higher likelihood of an anomaly, in line with the standard anomaly detection literature.


More information about CapyMOA can be found at https://www.capymoa.org

Last updated on 10/01/2025

1. Creating simple anomalous data with sklearn#

  • Generating normal examples from Gaussian blobs and injecting a small fraction of uniformly distributed anomalies using sklearn

[1]:
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from capymoa.stream import NumpyStream

# generate normal data points
n_samples = 10000
n_features = 2
n_clusters = 3
X, y = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=n_clusters, random_state=42
)

# generate anomalous data points
n_anomalies = 100  # the anomaly rate is 1%
anomalies = np.random.uniform(low=-10, high=10, size=(n_anomalies, n_features))

# combine the normal data points with anomalies
X = np.vstack([X, anomalies])
y = np.hstack([y, [1] * n_anomalies])  # Label anomalies with 1
y[:n_samples] = 0  # Label normal points with 0

# shuffle the data
idx = np.random.permutation(n_samples + n_anomalies)
X = X[idx]
y = y[idx]

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis")
plt.show()

# feature names and target name, used to create NumpyStream objects in the following sections
feature_names = [f"feature_{i}" for i in range(n_features)]
target_name = "class"
[Output figure: scatter plot of the generated data, showing the three normal clusters and the uniformly scattered anomalies (../_images/notebooks_anomaly_detection_2_0.png)]

2. Unsupervised Anomaly Detection for data streams#

  • Recent research has focused on unsupervised anomaly detection for data streams, since labeled data for training is often difficult to obtain.

  • Before turning to the high-level evaluation functions, we first write a basic test-then-train loop from scratch to evaluate the model’s performance.

  • Note that higher scores indicate a higher likelihood of an anomaly.

[2]:
from capymoa.anomaly import HalfSpaceTrees
from capymoa.evaluation import AnomalyDetectionEvaluator

stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
learner = HalfSpaceTrees(stream_ad.get_schema())
evaluator = AnomalyDetectionEvaluator(stream_ad.get_schema())
while stream_ad.has_more_instances():
    instance = stream_ad.next_instance()
    score = learner.score_instance(instance)
    evaluator.update(instance.y_index, score)
    learner.train(instance)

auc = evaluator.auc()
print(f"AUC: {auc:.2f}")
AUC: 0.94
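
The same scores can also be inspected outside the evaluator. Below is a minimal, hypothetical sketch (not part of the original loop) that replays the stream, cross-checks the AUC with scikit-learn, and turns scores into binary alerts using an arbitrarily chosen threshold of 0.5; in practice the threshold should be tuned, for example to match the expected anomaly rate.

[ ]:
from sklearn.metrics import roc_auc_score

# replay the stream and collect labels and anomaly scores
stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
learner = HalfSpaceTrees(stream_ad.get_schema())
labels, scores = [], []
while stream_ad.has_more_instances():
    instance = stream_ad.next_instance()
    scores.append(learner.score_instance(instance))  # higher = more anomalous
    labels.append(instance.y_index)
    learner.train(instance)

# cross-check the cumulative AUC with scikit-learn
print(f"sklearn AUC: {roc_auc_score(labels, scores):.2f}")

# turn scores into binary alerts with a manually chosen (hypothetical) threshold
alerts = np.asarray(scores) > 0.5
print(f"Flagged {alerts.sum()} of {len(alerts)} instances as anomalous")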

3. High-level evaluation functions#

  • CapyMOA provides prequential_evaluation_anomaly as a high-level function to assess anomaly detectors.

3.1 prequential_evaluation_anomaly#

In this example, we use the prequential_evaluation_anomaly function together with plot_windowed_results to plot the windowed AUC of HalfSpaceTrees on the synthetic data stream.

[3]:
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.anomaly import HalfSpaceTrees
from capymoa.evaluation import prequential_evaluation_anomaly

stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
hst = HalfSpaceTrees(schema=stream_ad.get_schema())

results_hst = prequential_evaluation_anomaly(
    stream=stream_ad, learner=hst, window_size=1000
)

print(f"AUC: {results_hst.auc()}")
display(results_hst.windowed.metrics_per_window())
plot_windowed_results(results_hst, metric="auc", save_only=False)
AUC: 0.9415815
instances auc s_auc Accuracy Kappa Periodical holdout AUC Pos/Neg ratio G-Mean Recall KappaM
0 1000.0 0.849040 0.180372 0.060 -0.001022 0.000000 0.010101 0.215322 0.9 -93.000000
1 2000.0 0.945619 0.196099 0.007 0.000000 0.849040 0.007049 0.000000 1.0 -115.823529
2 3000.0 0.965010 0.215852 0.014 0.000000 0.945619 0.014199 0.000000 1.0 -94.419355
3 4000.0 0.927062 0.200816 0.006 0.000000 0.965010 0.006036 0.000000 1.0 -106.459459
4 5000.0 0.929814 0.185818 0.008 0.000000 0.927062 0.008065 0.000000 1.0 -109.222222
5 6000.0 0.944977 0.206376 0.013 0.000000 0.929814 0.013171 0.000000 1.0 -101.103448
6 7000.0 0.955576 0.207342 0.013 0.000000 0.944977 0.013171 0.000000 1.0 -96.309859
7 8000.0 0.972828 0.221292 0.010 0.000000 0.955576 0.010101 0.000000 1.0 -96.777778
8 9000.0 0.937626 0.216386 0.008 0.000000 0.972828 0.008065 0.000000 1.0 -99.314607
9 10000.0 0.976118 0.219181 0.009 0.000000 0.937626 0.009082 0.000000 1.0 -100.122449
10 10100.0 0.977778 0.222641 0.010 0.000000 0.976118 0.010101 0.000000 1.0 -98.990000
[Output figure: windowed AUC of HalfSpaceTrees over the stream (../_images/notebooks_anomaly_detection_7_2.png)]
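
The table displayed above looks like a pandas DataFrame; assuming metrics_per_window() indeed returns one, the windowed metrics can be sliced and exported directly. A small sketch (the column selection and file name are arbitrary choices):

[ ]:
# assumes metrics_per_window() returns a pandas DataFrame, as the display above suggests
df_hst = results_hst.windowed.metrics_per_window()
print(df_hst[["instances", "auc"]].tail())          # inspect the last few windows
df_hst.to_csv("hst_windowed_metrics.csv", index=False)  # arbitrary file name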

3.2 Autoencoder#

[4]:
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.anomaly import Autoencoder
from capymoa.evaluation import prequential_evaluation_anomaly

stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
ae = Autoencoder(schema=stream_ad.get_schema(), hidden_layer=1)

results_ae = prequential_evaluation_anomaly(
    stream=stream_ad, learner=ae, window_size=1000
)

print(f"AUC: {results_ae.auc()}")
display(results_ae.windowed.metrics_per_window())
plot_windowed_results(results_ae, metric="auc", save_only=False)
AUC: 0.54006
instances auc s_auc Accuracy Kappa Periodical holdout AUC Pos/Neg ratio G-Mean Recall KappaM
0 1000.0 0.477778 0.015538 0.989 -0.001821 0.000000 0.010101 0.000000 0.0 -0.100000
1 2000.0 0.545533 0.028975 0.993 0.000000 0.477778 0.007049 0.000000 0.0 0.176471
2 3000.0 0.557664 0.000238 0.986 0.000000 0.545533 0.014199 0.000000 0.0 -0.354839
3 4000.0 0.615023 0.000008 0.994 0.000000 0.557664 0.006036 0.000000 0.0 0.351351
4 5000.0 0.523438 0.000445 0.991 -0.001781 0.615023 0.008065 0.000000 0.0 0.000000
5 6000.0 0.428727 0.009665 0.986 -0.001861 0.523438 0.013171 0.000000 0.0 -0.448276
6 7000.0 0.545554 0.000071 0.986 -0.001861 0.428727 0.013171 0.000000 0.0 -0.380282
7 8000.0 0.670808 0.061258 0.991 0.180328 0.545554 0.010101 0.316228 0.1 0.111111
8 9000.0 0.597908 0.042223 0.991 -0.001781 0.670808 0.008065 0.000000 0.0 0.089888
9 10000.0 0.588070 0.041752 0.991 0.000000 0.597908 0.009082 0.000000 0.0 0.081633
10 10100.0 0.553131 0.037586 0.990 0.000000 0.588070 0.010101 0.000000 0.0 -0.010000
[Output figure: windowed AUC of the Autoencoder over the stream (../_images/notebooks_anomaly_detection_9_2.png)]
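
The Autoencoder’s AUC is noticeably lower than HalfSpaceTrees on this stream. One knob exposed in the constructor above is hidden_layer; a quick, hypothetical sweep over a few values (all other hyperparameters left at their defaults) could look like the sketch below.

[ ]:
# hypothetical sweep over the hidden_layer argument; the values tried here are arbitrary
for hidden in [1, 2, 4]:
    stream_ad = NumpyStream(
        X,
        y,
        dataset_name="AnomalyDetectionDataset",
        feature_names=feature_names,
        target_name=target_name,
        target_type="categorical",
    )
    ae = Autoencoder(schema=stream_ad.get_schema(), hidden_layer=hidden)
    res = prequential_evaluation_anomaly(stream=stream_ad, learner=ae, window_size=1000)
    print(f"hidden_layer={hidden}: AUC={res.auc():.3f}")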

3.3 Online Isolation Forest#

[5]:
from capymoa.evaluation.visualization import plot_windowed_results
from capymoa.anomaly import OnlineIsolationForest
from capymoa.evaluation import prequential_evaluation_anomaly

stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
oif = OnlineIsolationForest(schema=stream_ad.get_schema(), num_trees=10)

results_oif = prequential_evaluation_anomaly(
    stream=stream_ad, learner=oif, window_size=1000
)

print(f"AUC: {results_oif.auc()}")
display(results_oif.windowed.metrics_per_window())
plot_windowed_results(results_oif, metric="auc", save_only=False)
AUC: 0.643632
instances auc s_auc Accuracy Kappa Periodical holdout AUC Pos/Neg ratio G-Mean Recall KappaM
0 1000.0 0.609091 0.111064 0.959 0.031649 0.000000 0.010101 0.311075 0.1 -3.100000
1 2000.0 0.767228 0.063018 0.993 0.000000 0.609091 0.007049 0.000000 0.0 0.176471
2 3000.0 0.678463 0.060250 0.986 0.000000 0.767228 0.014199 0.000000 0.0 -0.354839
3 4000.0 0.680835 0.083064 0.994 0.000000 0.678463 0.006036 0.000000 0.0 0.351351
4 5000.0 0.632560 0.057094 0.992 0.000000 0.680835 0.008065 0.000000 0.0 0.111111
5 6000.0 0.632492 0.085781 0.987 0.000000 0.632560 0.013171 0.000000 0.0 -0.344828
6 7000.0 0.585457 0.043215 0.987 0.000000 0.632492 0.013171 0.000000 0.0 -0.281690
7 8000.0 0.789848 0.082476 0.990 0.000000 0.585457 0.010101 0.000000 0.0 0.012346
8 9000.0 0.675025 0.069445 0.992 0.000000 0.789848 0.008065 0.000000 0.0 0.191011
9 10000.0 0.649849 0.066430 0.991 0.000000 0.675025 0.009082 0.000000 0.0 0.081633
10 10100.0 0.613182 0.059306 0.990 0.000000 0.649849 0.010101 0.000000 0.0 -0.010000
[Output figure: windowed AUC of Online Isolation Forest over the stream (../_images/notebooks_anomaly_detection_11_2.png)]
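
OnlineIsolationForest can also be dropped into the manual test-then-train loop from Section 2, assuming it exposes the same score_instance/train interface as HalfSpaceTrees. A sketch under that assumption:

[ ]:
from capymoa.anomaly import OnlineIsolationForest
from capymoa.evaluation import AnomalyDetectionEvaluator

# reuse the low-level loop from Section 2 with OnlineIsolationForest
stream_ad = NumpyStream(
    X,
    y,
    dataset_name="AnomalyDetectionDataset",
    feature_names=feature_names,
    target_name=target_name,
    target_type="categorical",
)
oif = OnlineIsolationForest(schema=stream_ad.get_schema(), num_trees=10)
evaluator = AnomalyDetectionEvaluator(stream_ad.get_schema())
while stream_ad.has_more_instances():
    instance = stream_ad.next_instance()
    evaluator.update(instance.y_index, oif.score_instance(instance))
    oif.train(instance)
print(f"AUC: {evaluator.auc():.2f}")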

4. Comparing algorithms#

[6]:
plot_windowed_results(
    results_hst, results_ae, results_oif, metric="auc", save_only=False
)
[Output figure: windowed AUC of HalfSpaceTrees, Autoencoder, and Online Isolation Forest compared (../_images/notebooks_anomaly_detection_13_0.png)]
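
For a compact textual summary alongside the plot, the cumulative AUC of each detector can also be printed side by side using the results objects created above:

[ ]:
# print the cumulative AUC of each detector for a quick side-by-side comparison
for name, res in [
    ("HalfSpaceTrees", results_hst),
    ("Autoencoder", results_ae),
    ("OnlineIsolationForest", results_oif),
]:
    print(f"{name:>22}: AUC = {res.auc():.3f}")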