CSVStream

class capymoa.stream.CSVStream

Bases: Stream[_AnyInstance]

Create a CapyMOA datastream from a CSV file.

  • The CSV file must have a header row with feature names.

  • Nominal feature values can be given as strings or as integer indices into the category list.

  • ? represents a missing value.

  • The CSV file is read line by line, so large files can be handled.
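The encoding rules above can be illustrated with a short, standalone sketch. It does not use CapyMOA's internals: `encode_cell` is a hypothetical helper, and the rule that a string match takes precedence over an integer index is an assumption made for the illustration.

```python
import math
from typing import Optional, Sequence


def encode_cell(cell: str, categories: Optional[Sequence[str]] = None) -> float:
    """Encode one CSV cell as a float (hypothetical helper, not part of CapyMOA)."""
    cell = cell.strip()
    if cell == "?":
        return math.nan  # missing value
    if categories is None:
        return float(cell)  # numeric feature
    if cell in categories:
        return float(categories.index(cell))  # nominal value given as a string
    index = int(cell)  # nominal value given as an integer index
    if not 0 <= index < len(categories):
        raise ValueError(f"index {index} is out of range for {list(categories)}")
    return float(index)


feature2 = ["A", "B"]
print(encode_cell("A", feature2))  # 0.0
print(encode_cell("1", feature2))  # 1.0
print(encode_cell("3"))            # 3.0
print(encode_cell("?", feature2))  # nan
```
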

When categories are provided for the target attribute, the stream returns LabeledInstance objects.

>>> from io import StringIO
>>> from capymoa.stream import CSVStream
>>> csv_content = '''feature1,feature2,target
... 1,A,yes
... 2,B,no
... 3,0,0
... 5,1,1
... ?,?,?
... '''
>>> csv_file = StringIO(csv_content)
>>> stream = CSVStream(
...     file=csv_file,
...     target="target",
...     categories={"target": ["yes", "no"], "feature2": ["A", "B"]},
...     name="TestStream"
... )
>>> for instance in stream:
...     print(instance.x, instance.y_index, instance.y_label)
[1. 0.] 0 yes
[2. 1.] 1 no
[3. 0.] 0 yes
[5. 1.] 1 no
[nan nan] -1 None

When no categories are provided for the target attribute, the stream returns RegressionInstance objects.

>>> csv_content = '''target,feature1,feature2
... 0.0,A,1
... 0.5,B,2
... 1.5,0,3
... 2.0,1,4
... ?,?,?
... '''
>>> csv_file = StringIO(csv_content)
>>> stream = CSVStream(
...     file=csv_file,
...     target="target",
...     categories={"feature1": ["A", "B"]},
...     name="TestStream"
... )
>>> for instance in stream:
...     print(instance.x, instance.y_value)
[0. 1.] 0.0
[1. 2.] 0.5
[0. 3.] 1.5
[1. 4.] 2.0
[nan nan] nan

__init__(
    file: Path | str | TextIO,
    target: str,
    categories: Mapping[str, Sequence[str]] | None = None,
    name: str | None = None,
    length: int | None = None,
) → None

Create a CSV stream.

Parameters:
  • file – A path to a CSV file or an open file-like object.

  • target – The name of the target attribute.

  • categories – A mapping from the name of each nominal attribute to its list of category labels. Include the target here for classification; omit it for regression.

  • name – An optional name for the stream. If not provided, the filename is used.

  • length – An optional length of the stream (number of instances). If provided, this enables the Sized interface.
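Because the length has to be supplied by the caller, it may be useful to count the data rows beforehand. The following is a minimal sketch using only the standard library; `count_data_rows` is a hypothetical helper, and skipping blank lines is an assumption that may not match how CSVStream treats them.

```python
import csv
from io import StringIO
from typing import TextIO


def count_data_rows(file: TextIO) -> int:
    """Count CSV records after the header row, then rewind the file."""
    reader = csv.reader(file)
    next(reader, None)  # skip the header row
    n_rows = sum(1 for row in reader if row)  # ignore blank lines
    file.seek(0)  # leave the file ready to be read again
    return n_rows


csv_file = StringIO("feature1,feature2,target\n1,A,yes\n2,B,no\n")
n_rows = count_data_rows(csv_file)
print(n_rows)  # 2
```

The resulting value could then be passed as the length argument alongside file and target.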

__iter__() → Iterator[_AnyInstance]

Get an iterator over the stream.

This will NOT restart the stream if it has already been iterated over. Please use the restart() method to restart the stream.

Yield:

An iterator over the stream.
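The behaviour described above is analogous to iterating over an open Python file object. The sketch below uses a plain StringIO rather than a CSVStream, so it illustrates the semantics only and is not CapyMOA code; seek(0) stands in for restart().

```python
from io import StringIO

lines = StringIO("a\nb\nc\n")

# Consume only the first two lines, then stop early.
first = []
for line in lines:
    first.append(line.strip())
    if len(first) == 2:
        break

# A second loop continues where the first one stopped instead of starting over.
rest = [line.strip() for line in lines]

# Rewinding explicitly is what makes the data available again.
lines.seek(0)
again = [line.strip() for line in lines]

print(first, rest, again)  # ['a', 'b'] ['c'] ['a', 'b', 'c']
```
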

__next__() → _AnyInstance

Get the next instance in the stream.

Returns:

The next instance in the stream.

cli_help() → str

Return a help message.

get_moa_stream() → InstanceStream | None

Get the MOA stream object if it exists.

get_schema() → Schema

Return the schema of the stream.

has_more_instances() → bool

Return True if the stream has more instances to read.

next_instance() → _AnyInstance

Return the next instance in the stream.

Raises:

ValueError – If the machine learning task is neither a regression nor a classification task.

Returns:

A labeled instance or a regression instance, depending on the schema.

restart() → None

Restart the stream to read instances from the beginning.