
feat: direct download csv from datasets api #342

Open
hmgomes wants to merge 1 commit into adaptive-machine-learning:main from hmgomes:datasets-csv-api

Conversation

@hmgomes
Collaborator

@hmgomes hmgomes commented Apr 8, 2026

Add CSV Support to Built-in Datasets API

This PR adds CSV support to the built-in Datasets API while preserving the current default behaviour.

After this change, built-in datasets still download ARFF by default, but can now also download and open CSV files when available:

from capymoa.datasets import Hyper100k

stream_arff = Hyper100k()
stream_csv = Hyper100k(file_type="csv")

Why

CapyMOA already stores CSV source URLs for several datasets in _source_list.py, but the public Datasets API only used the ARFF sources.

This made it harder to:

  • Benchmark CapyMOA datasets with non-MOA workflows
  • Reuse hosted datasets with other tooling
  • Validate CSV dataset assets through the public API

This PR exposes that capability in a simple way without changing the existing ARFF default.

What Changed

  • Added file_type="arff" | "csv" support to the shared dataset download logic
  • Kept ARFF as the default for backward compatibility
  • Made built-in dataset classes use shared URL resolution instead of hardcoding ARFF URLs
  • Routed downloaded dataset files through the appropriate stream loader based on file type
  • Preserved task semantics for CSV-backed datasets by explicitly marking built-in datasets as classification or regression
  • Normalised CSV headers when loading from file so headers with extra whitespace do not break target detection

Behaviour

Default behaviour is unchanged:

from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()

CSV can now be requested explicitly:

from capymoa.datasets import ElectricityTiny, Hyper100k

stream1 = ElectricityTiny(file_type="csv")
stream2 = Hyper100k(file_type="csv")

Notes

  • ARFF-backed datasets continue to use the MOA-backed path
  • CSV-backed datasets use the Python stream path
  • For CSV-backed evaluation, optimise=False may be required because these streams are not MOA-backed
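The header normalisation mentioned under "What Changed" can be sketched as below. This is a minimal illustration with an assumed helper name (`normalised_header`), not the PR's actual code: it strips stray whitespace from CSV column names so that target detection on the Python stream path does not miss a padded header like `" class"`.

```python
import csv
import io


def normalised_header(first_line: str) -> list[str]:
    """Parse a CSV header line and strip surrounding whitespace
    from each column name, so padded headers still match targets."""
    reader = csv.reader(io.StringIO(first_line))
    header = next(reader)
    return [name.strip() for name in header]
```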

Validation

A small test of the CSV-backed built-in datasets, covering representative classification and regression cases.

Examples tested

  • ElectricityTiny(file_type="csv")
  • FriedTiny(file_type="csv")
  • Hyper100k(file_type="csv")

Also ran a small end-to-end evaluation on:

Hyper100k(file_type="csv")

with PassiveAggressiveClassifier using:

prequential_evaluation(..., max_instances=100, optimise=False)

Example Test

from pathlib import Path

from capymoa.classifier import HoeffdingTree
from capymoa.datasets import (
    Bike,
    CovtFD,
    Covtype,
    CovtypeNorm,
    CovtypeTiny,
    Electricity,
    ElectricityTiny,
    Fried,
    FriedTiny,
    Hyper100k,
    RBFm_100k,
    RTG_2abrupt,
    Sensor,
)
from capymoa.evaluation import prequential_evaluation
from capymoa.regressor import FIMTDD


MAX_INSTANCES = 100
DOWNLOAD_DIR = Path("data")

DATASETS = [
    Sensor,
    Hyper100k,
    CovtFD,
    Covtype,
    CovtypeTiny,
    CovtypeNorm,
    RBFm_100k,
    RTG_2abrupt,
    ElectricityTiny,
    Electricity,
    Fried,
    FriedTiny,
    Bike,
]


def build_learner(stream):
    schema = stream.get_schema()
    if schema.is_classification():
        return HoeffdingTree(schema=schema)
    return FIMTDD(schema=schema)


def main():
    DOWNLOAD_DIR.mkdir(exist_ok=True)

    for dataset_cls in DATASETS:
        print(f"\n=== Testing {dataset_cls.__name__} (csv) ===")

        try:
            stream = dataset_cls(directory=DOWNLOAD_DIR, file_type="csv")
            learner = build_learner(stream)

            print(f"path: {stream.path}")
            print(f"task: {'classification' if stream.get_schema().is_classification() else 'regression'}")
            print(f"learner: {learner}")

            results = prequential_evaluation(
                stream=stream,
                learner=learner,
                max_instances=MAX_INSTANCES,
                progress_bar=False,
                optimise=False,
            )

            if stream.get_schema().is_classification():
                print(f"accuracy: {results.cumulative.accuracy():.4f}")
            else:
                print(f"rmse: {results.cumulative.rmse():.4f}")

            print("status: OK")

        except Exception as exc:
            print("status: FAIL")
            print(f"error: {type(exc).__name__}: {exc}")


if __name__ == "__main__":
    main()

@hmgomes hmgomes requested a review from tachyonicClock April 8, 2026 00:20
@hmgomes
Collaborator Author

hmgomes commented Apr 8, 2026

I am not adding a test for this because it would require downloading all of the available CSV files on every run.

@hmgomes
Collaborator Author

hmgomes commented Apr 8, 2026

Outcome of the validation test

The failures below are expected: those datasets simply have no CSV equivalents in their source list.

from _source_list.py

    "Covtype": _Source(
        "https://www.dropbox.com/scl/fi/kwjvr5kn0l0u5l4gd5788/covtype.arff.gz?rlkey=6vlomqdoi3oud26o1ngyjoibr&st=5jvy1ctv&dl=1",
        None,
    ),

=== Testing Sensor (csv) ===
sensor.csv: 15.0MB [00:03, 5.17MB/s]
path: data/sensor.csv
task: classification
learner: HoeffdingTree
accuracy: 18.0000
status: OK

=== Testing Hyper100k (csv) ===
path: data/Hyper100k.csv
task: classification
learner: HoeffdingTree
accuracy: 71.0000
status: OK

=== Testing CovtFD (csv) ===
covtFD.csv: 280MB [00:18, 15.9MB/s]
path: data/covtFD.csv
task: classification
learner: HoeffdingTree
accuracy: 28.0000
status: OK

=== Testing Covtype (csv) ===
status: FAIL
error: ValueError: Dataset Covtype does not provide a CSV download.

=== Testing CovtypeTiny (csv) ===
status: FAIL
error: ValueError: Dataset CovtypeTiny does not provide a CSV download.

=== Testing CovtypeNorm (csv) ===
covtypeNorm.csv: 16.1MB [00:02, 5.85MB/s]
path: data/covtypeNorm.csv
task: classification
learner: HoeffdingTree
accuracy: 57.0000
status: OK

=== Testing RBFm_100k (csv) ===
RBFm_100k.csv: 8.55MB [00:02, 3.20MB/s]
path: data/RBFm_100k.csv
task: classification
learner: HoeffdingTree
accuracy: 38.0000
status: OK

=== Testing RTG_2abrupt (csv) ===
RTG_2abrupt.csv: 25.3MB [00:02, 8.96MB/s]
path: data/RTG_2abrupt.csv
task: classification
learner: HoeffdingTree
accuracy: 88.0000
status: OK

=== Testing ElectricityTiny (csv) ===
electricity_tiny.csv: 24.0kB [00:02, 8.73kB/s]
path: data/electricity_tiny.csv
task: classification
learner: HoeffdingTree
accuracy: 88.0000
status: OK

=== Testing Electricity (csv) ===
electricity.csv: 696kB [00:01, 435kB/s]
path: data/electricity.csv
task: classification
learner: HoeffdingTree
accuracy: 84.0000
status: OK

=== Testing Fried (csv) ===
fried.csv: 904kB [00:02, 349kB/s]
path: data/fried.csv
task: regression
learner: FIMTDD
rmse: 11.3027
status: OK

=== Testing FriedTiny (csv) ===
fried_tiny.csv: 24.0kB [00:01, 15.2kB/s]
path: data/fried_tiny.csv
task: regression
learner: FIMTDD
rmse: 11.3027
status: OK

=== Testing Bike (csv) ===
bike.csv: 144kB [00:01, 99.1kB/s]
path: data/bike.csv
task: regression
learner: FIMTDD
rmse: 70.0564
status: OK
