Datasets#

CapyMOA comes with some datasets ‘out of the box’. Simply import the dataset and start using it; the data will be downloaded automatically if it is not already present in the download directory. You can configure where datasets are downloaded by setting an environment variable (see capymoa.env).

>>> from capymoa.datasets import ElectricityTiny
>>> stream = ElectricityTiny()
>>> stream.next_instance().x
array([0.      , 0.056443, 0.439155, 0.003467, 0.422915, 0.414912])

Alternatively, you may download the datasets all at once with the command line interface provided by capymoa.datasets:

python -m capymoa.datasets --help
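The environment variable read by capymoa.datasets.get_download_dir can also be set programmatically before a dataset is constructed. A minimal sketch (the directory path below is just an example):

```python
import os

# Point CapyMOA's dataset downloads at a custom directory. This must be
# set before the dataset object is constructed, because the download
# directory is resolved at construction time.
os.environ["CAPYMOA_DATASETS_DIR"] = "/tmp/capymoa-data"

# from capymoa.datasets import ElectricityTiny
# stream = ElectricityTiny()  # downloads into /tmp/capymoa-data if missing
```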
class capymoa.datasets.Sensor[source]#

Bases: DownloadARFFGzip

Sensor stream is a classification problem based on indoor sensor data.

  • Number of instances: 2,219,803

  • Number of attributes: 5

  • Number of classes: 54

The stream contains temperature, humidity, light, and sensor voltage readings collected from 54 sensors deployed in the Intel Berkeley Research Lab. The classification objective is to predict the sensor ID.

References:

  1. https://www.cse.fau.edu/~xqzhu/stream.html

class capymoa.datasets.RTG_2abrupt[source]#

Bases: DownloadARFFGzip

RTG_2abrupt is a synthetic classification problem based on the Random Tree generator with 2 abrupt drifts.

  • Number of instances: 100,000

  • Number of attributes: 30

  • Number of classes: 5

  • generators.RandomTreeGenerator -o 0 -u 30 -d 20

This is a snapshot (100k instances with 2 simulated abrupt drifts) of the synthetic generator based on the one proposed by Domingos and Hulten [1], producing concepts that in theory should favour decision tree learners. It constructs a decision tree by choosing attributes at random to split on, and assigning a random class label to each leaf. Once the tree is built, new examples are generated by assigning uniformly distributed random values to the attributes, which then determine the class label via the tree.

References:

  1. Domingos, Pedro, and Geoff Hulten. “Mining high-speed data streams.” In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 71-80. 2000.

See also capymoa.stream.generator.RandomTreeGenerator
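The generation scheme described above can be sketched in pure Python. This is an illustrative sketch only, not the MOA generators.RandomTreeGenerator implementation; the sizes and depth below are arbitrary:

```python
import random

# Sketch of the Random Tree scheme: build a tree by splitting on random
# attributes at random thresholds, give each leaf a random class label,
# then label uniformly random examples by routing them through the tree.
N_ATTRS, N_CLASSES, MAX_DEPTH = 5, 3, 4
rng = random.Random(42)

def build_tree(depth: int = 0):
    """Recursively build a random tree; leaves carry a random class."""
    if depth >= MAX_DEPTH:
        return ("leaf", rng.randrange(N_CLASSES))
    attr, threshold = rng.randrange(N_ATTRS), rng.random()
    return ("split", attr, threshold, build_tree(depth + 1), build_tree(depth + 1))

def classify(tree, x):
    """Route an example through the tree to obtain its class label."""
    if tree[0] == "leaf":
        return tree[1]
    _, attr, threshold, left, right = tree
    return classify(left if x[attr] <= threshold else right, x)

tree = build_tree()
x = [rng.random() for _ in range(N_ATTRS)]  # uniformly random attributes
print(x, "->", classify(tree, x))
```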

class capymoa.datasets.RBFm_100k[source]#

Bases: DownloadARFFGzip

RBFm_100k is a synthetic classification problem based on the Radial Basis Function generator.

  • Number of instances: 100,000

  • Number of attributes: 10

  • generators.RandomRBFGeneratorDrift -s 1.0E-4 -c 5

This is a snapshot (100k instances) of the synthetic RBF (Radial Basis Function) generator, which works as follows: a fixed number of random centroids are generated, each with a random position, a single standard deviation, a class label, and a weight. New examples are generated by selecting a centroid at random, taking the weights into consideration so that centroids with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. The length of the displacement is randomly drawn from a Gaussian distribution with standard deviation determined by the chosen centroid, which also determines the class label of the example. This effectively creates a normally distributed hypersphere of examples surrounding each central point with varying densities. Only numeric attributes are generated.
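The RBF scheme described above can be sketched in pure Python. This is an illustrative sketch, not the MOA generators.RandomRBFGeneratorDrift implementation, and it omits the centroid-drift component (the -s option); all sizes are arbitrary:

```python
import random

# Sketch of the RBF scheme: centroids have a random position, standard
# deviation, class label and weight; examples are offset from a
# weight-sampled centroid in a random direction by a Gaussian distance.
N_ATTRS, N_CLASSES, N_CENTROIDS = 4, 3, 5
rng = random.Random(7)

centroids = [
    {
        "centre": [rng.random() for _ in range(N_ATTRS)],
        "stdev": rng.random(),
        "label": rng.randrange(N_CLASSES),
        "weight": rng.random(),
    }
    for _ in range(N_CENTROIDS)
]

def next_example():
    # Weighted choice: centroids with higher weight are picked more often.
    c = rng.choices(centroids, weights=[c["weight"] for c in centroids])[0]
    direction = [rng.gauss(0.0, 1.0) for _ in range(N_ATTRS)]
    norm = sum(d * d for d in direction) ** 0.5 or 1.0
    # Displacement length drawn with the chosen centroid's stdev.
    length = rng.gauss(0.0, c["stdev"])
    x = [ci + length * d / norm for ci, d in zip(c["centre"], direction)]
    return x, c["label"]

x, y = next_example()
print(x, "->", y)
```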

class capymoa.datasets.Hyper100k[source]#

Bases: DownloadARFFGzip

Hyper100k is a classification problem based on the moving hyperplane generator.

  • Number of instances: 100,000

  • Number of attributes: 10

  • Number of classes: 2

References:

  1. Hulten, Geoff, Laurie Spencer, and Pedro Domingos. “Mining time-changing data streams.” Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001.

class capymoa.datasets.Fried[source]#

Bases: DownloadARFFGzip

Fried is a regression problem based on the Friedman dataset.

  • Number of instances: 40,768

  • Number of attributes: 10

  • Number of targets: 1

This is an artificial dataset containing ten features, only five of which are related to the target value.

References:

  1. Friedman, Jerome H. “Multivariate adaptive regression splines.” The annals of statistics 19, no. 1 (1991): 1-67.
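The target is commonly generated from Friedman's benchmark function, in which only the first five features carry signal. The sketch below uses that function as an assumption; the exact formula and noise level used to produce this particular file are not stated here:

```python
import math
import random

# Hedged sketch of the Friedman benchmark function commonly associated
# with this dataset (not read from the file itself): features 6-10 do
# not influence the target at all.
rng = random.Random(0)

def friedman_target(x, noise_stdev=1.0):
    assert len(x) == 10
    signal = (
        10.0 * math.sin(math.pi * x[0] * x[1])
        + 20.0 * (x[2] - 0.5) ** 2
        + 10.0 * x[3]
        + 5.0 * x[4]
    )
    return signal + rng.gauss(0.0, noise_stdev)

x = [rng.random() for _ in range(10)]
print(friedman_target(x))
```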

class capymoa.datasets.ElectricityTiny[source]#

Bases: DownloadARFFGzip

A truncated version of the Electricity dataset with 1000 instances.

This is a tiny version (1k instances) of the widely used Electricity dataset described by M. Harries. It should only be used for quick tests, not for benchmarking algorithms.

See Electricity for the widely used electricity dataset.

class capymoa.datasets.Electricity[source]#

Bases: DownloadARFFGzip

Electricity is a classification problem based on the Australian New South Wales Electricity Market.

  • Number of instances: 45,312

  • Number of attributes: 8

  • Number of classes: 2 (UP, DOWN)

The Electricity data set was collected from the Australian New South Wales Electricity Market, where prices are not fixed. It was described by M. Harries and analysed by Gama. Prices are affected by the demand and supply of the market itself and are set every five minutes. The data set contains 45,312 instances, where the class label identifies the change of the price (2 possible classes: UP or DOWN) relative to a moving average of the last 24 hours. An important aspect of this data set is that it exhibits temporal dependencies. This version of the dataset has been normalised (also known as elecNormNew) and is the one most commonly used in benchmarks.

References:

  1. https://sourceforge.net/projects/moa-datastream/files/Datasets/Classification/elecNormNew.arff.zip/download/
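The labelling scheme described above can be sketched as follows. This is an illustrative reconstruction, not the original preprocessing code:

```python
from collections import deque

# With one reading every five minutes, a 24-hour moving average spans
# 288 readings; each price is labelled UP or DOWN relative to that
# average over the readings seen so far.
WINDOW = 24 * 60 // 5  # 288 five-minute intervals in 24 hours

def label_prices(prices):
    window = deque(maxlen=WINDOW)
    labels = []
    for price in prices:
        window.append(price)
        moving_avg = sum(window) / len(window)
        labels.append("UP" if price > moving_avg else "DOWN")
    return labels

print(label_prices([1.0, 2.0, 1.0]))  # ['DOWN', 'UP', 'DOWN']
```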

class capymoa.datasets.CovtypeTiny[source]#

Bases: DownloadARFFGzip

A truncated version of the classic Covtype classification problem.

This should only be used for quick tests, not for benchmarking algorithms.

  • Number of instances: 581,383 (30 x 30 meter cells)

  • Number of attributes: 54 (10 continuous, 44 categorical)

  • Number of classes: 7 (forest cover types)

Forest Covertype (or simply covtype) contains the forest cover type for 30 x 30 meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data.

References:

  1. Blackard, Jock. (1998). Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N.

  2. https://archive.ics.uci.edu/ml/datasets/Covertype

See Also:

  • CovtFD - Covtype with simulated feature drifts

  • Covtype - The classic covertype dataset

  • CovtypeNorm - A normalized version of the classic covertype dataset

class capymoa.datasets.CovtypeNorm[source]#

Bases: DownloadARFFGzip

A normalized version of the classic Covtype classification problem.

  • Number of instances: 581,383 (30 x 30 meter cells)

  • Number of attributes: 54 (10 continuous, 44 categorical)

  • Number of classes: 7 (forest cover types)

Forest Covertype (or simply covtype) contains the forest cover type for 30 x 30 meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data.

References:

  1. Blackard, Jock. (1998). Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N.

  2. https://sourceforge.net/projects/moa-datastream/files/Datasets/Classification/covtypeNorm.arff.zip/download/

See Also:

  • CovtFD - Covtype with simulated feature drifts

  • Covtype - The classic covertype dataset

  • CovtypeTiny - A truncated version of the classic covertype dataset

class capymoa.datasets.Covtype[source]#

Bases: DownloadARFFGzip

The classic covertype (/covtype) classification problem

  • Number of instances: 581,383 (30 x 30 meter cells)

  • Number of attributes: 54 (10 continuous, 44 categorical)

  • Number of classes: 7 (forest cover types)

Forest Covertype (or simply covtype) contains the forest cover type for 30 x 30 meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data.

References:

  1. Blackard, Jock. (1998). Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N.

  2. https://archive.ics.uci.edu/ml/datasets/Covertype

See Also:

  • CovtFD - Covtype with simulated feature drifts

  • CovtypeNorm - A normalized version of the classic covertype dataset

  • CovtypeTiny - A truncated version of the classic covertype dataset

class capymoa.datasets.CovtFD[source]#

Bases: DownloadARFFGzip

CovtFD is an adaptation from the classic Covtype classification problem with added feature drifts.

  • Number of instances: 581,383 (30 x 30 meter cells)

  • Number of attributes: 104 (10 continuous, 44 categorical, 50 dummy)

  • Number of classes: 7 (forest cover types)

The dataset describes 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It includes 10 continuous and 44 categorical features, which were augmented by adding 50 dummy continuous features drawn from a Normal distribution with μ = 0 and σ = 1. To simulate drift, 10 of the original continuous features were randomly swapped with 10 (out of the 50) dummy features. This synthetic drift was added twice: once at instance 193,669 and again at instance 387,338.

References:

  1. Gomes, Heitor Murilo, Rodrigo Fernandes de Mello, Bernhard Pfahringer, and Albert Bifet. “Feature scoring using tree-based ensembles for evolving data streams.” In 2019 IEEE International Conference on Big Data (Big Data), pp. 761-769. IEEE, 2019.

  2. Blackard, Jock. (1998). Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N.

  3. https://archive.ics.uci.edu/ml/datasets/Covertype

See Also:

  • Covtype - The classic covertype dataset

  • CovtypeNorm - A normalized version of the classic covertype dataset

  • CovtypeTiny - A truncated version of the classic covertype dataset

capymoa.datasets.get_download_dir(download_dir: str | None = None) → Path[source]#

Get a directory where datasets should be downloaded to.

The download directory is determined by the following steps:

  1. If the download_dir parameter is provided, use that.

  2. If the CAPYMOA_DATASETS_DIR environment variable is set, use that.

  3. Otherwise, use the default download directory: ./data.

Parameters:

download_dir – Override the download directory.

Returns:

The download directory.

class capymoa.datasets.downloader.DownloadableDataset[source]#

Bases: ABC, Stream

__init__(
directory: str = PosixPath('data'),
auto_download: bool = True,
CLI: str | None = None,
schema: str | None = None,
)[source]#
get_path()[source]#
abstract download(working_directory: Path) → Path[source]#

Download the dataset and return the path to the downloaded dataset within the working directory.

Parameters:

working_directory – The directory to download the dataset to.

Returns:

The path to the downloaded dataset within the working directory.

abstract extract(stream_archive: Path) → Path[source]#

Extract the dataset from the archive and return the path to the extracted dataset.

Parameters:

stream_archive – The path to the archive containing the dataset.

Returns:

The path to the extracted dataset.

abstract to_stream(stream: Path)[source]#

Convert the dataset to a MOA stream.

Parameters:

stream – The path to the dataset.

Returns:

A MOA stream.

CLI_help() → str[source]#

Return the CLI help string for the stream.

get_moa_stream() → InstanceStream | None[source]#

Get the MOA stream object if it exists.

get_schema() → Schema[source]#

Return the schema of the stream.

has_more_instances() → bool[source]#

Return True if the stream has more instances to read.

next_instance() → LabeledInstance | RegressionInstance[source]#

Return the next instance in the stream.

Raises:

ValueError – If the machine learning task is neither a regression nor a classification task.

Returns:

A labeled instance or a regression instance, depending on the schema.

restart()[source]#

Restart the stream to read instances from the beginning.
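The contract formed by download, extract and to_stream can be illustrated with a toy mirror of the base class. This is hypothetical code: ToyDownloadableDataset and ToyCSVDataset are not part of capymoa, and the real base class additionally handles schemas, caching and MOA interop.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

class ToyDownloadableDataset(ABC):
    """Toy mirror of the download -> extract -> to_stream pipeline."""

    @abstractmethod
    def download(self, working_directory: Path) -> Path: ...

    @abstractmethod
    def extract(self, stream_archive: Path) -> Path: ...

    @abstractmethod
    def to_stream(self, stream: Path): ...

    def load(self, working_directory: Path):
        archive = self.download(working_directory)
        dataset = self.extract(archive)
        return self.to_stream(dataset)

class ToyCSVDataset(ToyDownloadableDataset):
    def download(self, working_directory: Path) -> Path:
        path = working_directory / "toy.csv"
        path.write_text("1,2\n3,4\n")  # stand-in for an HTTP download
        return path

    def extract(self, stream_archive: Path) -> Path:
        return stream_archive  # nothing to decompress in this toy case

    def to_stream(self, stream: Path):
        for line in stream.read_text().splitlines():
            yield [float(v) for v in line.split(",")]

with tempfile.TemporaryDirectory() as tmp:
    rows = list(ToyCSVDataset().load(Path(tmp)))
print(rows)  # [[1.0, 2.0], [3.0, 4.0]]
```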

class capymoa.datasets.downloader.DownloadARFFGzip[source]#

Bases: DownloadableDataset

download(working_directory: Path) → Path[source]#

Download the dataset and return the path to the downloaded dataset within the working directory.

Parameters:

working_directory – The directory to download the dataset to.

Returns:

The path to the downloaded dataset within the working directory.

extract(stream_archive: Path) → Path[source]#

Extract the dataset from the archive and return the path to the extracted dataset.

Parameters:

stream_archive – The path to the archive containing the dataset.

Returns:

The path to the extracted dataset.

CLI_help() → str[source]#

Return the CLI help string for the stream.

__init__(
directory: str = PosixPath('data'),
auto_download: bool = True,
CLI: str | None = None,
schema: str | None = None,
)[source]#
get_moa_stream() → InstanceStream | None[source]#

Get the MOA stream object if it exists.

get_path()[source]#
get_schema() → Schema[source]#

Return the schema of the stream.

has_more_instances() → bool[source]#

Return True if the stream has more instances to read.

next_instance() → LabeledInstance | RegressionInstance[source]#

Return the next instance in the stream.

Raises:

ValueError – If the machine learning task is neither a regression nor a classification task.

Returns:

A labeled instance or a regression instance, depending on the schema.

restart()[source]#

Restart the stream to read instances from the beginning.

to_stream(stream: Path) → Any[source]#

Convert the dataset to a MOA stream.

Parameters:

stream – The path to the dataset.

Returns:

A MOA stream.