
feat: direct download csv from datasets api #342

Open
hmgomes wants to merge 1 commit into adaptive-machine-learning:main from hmgomes:datasets-csv-api

Conversation

@hmgomes
Collaborator

@hmgomes hmgomes commented Apr 8, 2026

Add CSV Support to Built-in Datasets API

This PR adds CSV support to the built-in Datasets API while preserving the current default behaviour.

After this change, built-in datasets still download ARFF by default, but can now also download and open CSV files when available:

from capymoa.datasets import Hyper100k

stream_arff = Hyper100k()
stream_csv = Hyper100k(file_type="csv")

Why

CapyMOA already stores CSV source URLs for several datasets in _source_list.py, but the public Datasets API only used the ARFF sources.

This made it harder to:

  • Benchmark CapyMOA datasets with non-MOA workflows
  • Reuse hosted datasets with other tooling
  • Validate CSV dataset assets through the public API

This PR exposes that capability in a simple way without changing the existing ARFF default.

What Changed

  • Added file_type="arff" | "csv" support to the shared dataset download logic
  • Kept ARFF as the default for backward compatibility
  • Made built-in dataset classes use shared URL resolution instead of hardcoding ARFF URLs
  • Routed downloaded dataset files through the appropriate stream loader based on file type
  • Preserved task semantics for CSV-backed datasets by explicitly marking built-in datasets as classification or regression
  • Normalised CSV headers when loading from file so headers with extra whitespace do not break target detection

Behaviour

Default behaviour is unchanged:

from capymoa.datasets import ElectricityTiny

stream = ElectricityTiny()

CSV can now be requested explicitly:

from capymoa.datasets import ElectricityTiny, Hyper100k

stream1 = ElectricityTiny(file_type="csv")
stream2 = Hyper100k(file_type="csv")

Notes

  • ARFF-backed datasets continue to use the MOA-backed path
  • CSV-backed datasets use the Python stream path
  • For CSV-backed evaluation, optimise=False may be required because these streams are not MOA-backed
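The header normalisation mentioned under "What Changed" can be sketched as below. This is a minimal illustration with an assumed helper name (`normalised_header`), not the PR's actual code: it strips stray whitespace from CSV column names so that target detection on the Python stream path does not miss a padded header like `" class"`.

```python
import csv
import io


def normalised_header(first_line: str) -> list[str]:
    """Parse a CSV header line and strip surrounding whitespace
    from each column name, so padded headers still match targets."""
    reader = csv.reader(io.StringIO(first_line))
    header = next(reader)
    return [name.strip() for name in header]
```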

Validation

A small test of the CSV-backed built-in datasets, covering representative classification and regression cases.

Examples tested

  • ElectricityTiny(file_type="csv")
  • FriedTiny(file_type="csv")
  • Hyper100k(file_type="csv")

Also ran a small end-to-end evaluation on:

Hyper100k(file_type="csv")

with PassiveAggressiveClassifier using:

prequential_evaluation(..., max_instances=100, optimise=False)

Example Test

from pathlib import Path

from capymoa.classifier import HoeffdingTree
from capymoa.datasets import (
    Bike,
    CovtFD,
    Covtype,
    CovtypeNorm,
    CovtypeTiny,
    Electricity,
    ElectricityTiny,
    Fried,
    FriedTiny,
    Hyper100k,
    RBFm_100k,
    RTG_2abrupt,
    Sensor,
)
from capymoa.evaluation import prequential_evaluation
from capymoa.regressor import FIMTDD


MAX_INSTANCES = 100
DOWNLOAD_DIR = Path("data")

DATASETS = [
    Sensor,
    Hyper100k,
    CovtFD,
    Covtype,
    CovtypeTiny,
    CovtypeNorm,
    RBFm_100k,
    RTG_2abrupt,
    ElectricityTiny,
    Electricity,
    Fried,
    FriedTiny,
    Bike,
]


def build_learner(stream):
    schema = stream.get_schema()
    if schema.is_classification():
        return HoeffdingTree(schema=schema)
    return FIMTDD(schema=schema)


def main():
    DOWNLOAD_DIR.mkdir(exist_ok=True)

    for dataset_cls in DATASETS:
        print(f"\n=== Testing {dataset_cls.__name__} (csv) ===")

        try:
            stream = dataset_cls(directory=DOWNLOAD_DIR, file_type="csv")
            learner = build_learner(stream)

            print(f"path: {stream.path}")
            print(f"task: {'classification' if stream.get_schema().is_classification() else 'regression'}")
            print(f"learner: {learner}")

            results = prequential_evaluation(
                stream=stream,
                learner=learner,
                max_instances=MAX_INSTANCES,
                progress_bar=False,
                optimise=False,
            )

            if stream.get_schema().is_classification():
                print(f"accuracy: {results.cumulative.accuracy():.4f}")
            else:
                print(f"rmse: {results.cumulative.rmse():.4f}")

            print("status: OK")

        except Exception as exc:
            print("status: FAIL")
            print(f"error: {type(exc).__name__}: {exc}")


if __name__ == "__main__":
    main()

@hmgomes hmgomes requested a review from tachyonicClock April 8, 2026 00:20
@hmgomes
Collaborator Author

hmgomes commented Apr 8, 2026

I am not adding a test for this because it would require downloading all of the available CSV files on every run.

@hmgomes
Collaborator Author

hmgomes commented Apr 8, 2026

Outcome of the validation test

The failures below are expected: those datasets simply have no CSV equivalents in their source list.

from _source_list.py

    "Covtype": _Source(
        "https://www.dropbox.com/scl/fi/kwjvr5kn0l0u5l4gd5788/covtype.arff.gz?rlkey=6vlomqdoi3oud26o1ngyjoibr&st=5jvy1ctv&dl=1",
        None,
    ),

=== Testing Sensor (csv) ===
sensor.csv: 15.0MB [00:03, 5.17MB/s]
path: data/sensor.csv
task: classification
learner: HoeffdingTree
accuracy: 18.0000
status: OK

=== Testing Hyper100k (csv) ===
path: data/Hyper100k.csv
task: classification
learner: HoeffdingTree
accuracy: 71.0000
status: OK

=== Testing CovtFD (csv) ===
covtFD.csv: 280MB [00:18, 15.9MB/s]
path: data/covtFD.csv
task: classification
learner: HoeffdingTree
accuracy: 28.0000
status: OK

=== Testing Covtype (csv) ===
status: FAIL
error: ValueError: Dataset Covtype does not provide a CSV download.

=== Testing CovtypeTiny (csv) ===
status: FAIL
error: ValueError: Dataset CovtypeTiny does not provide a CSV download.

=== Testing CovtypeNorm (csv) ===
covtypeNorm.csv: 16.1MB [00:02, 5.85MB/s]
path: data/covtypeNorm.csv
task: classification
learner: HoeffdingTree
accuracy: 57.0000
status: OK

=== Testing RBFm_100k (csv) ===
RBFm_100k.csv: 8.55MB [00:02, 3.20MB/s]
path: data/RBFm_100k.csv
task: classification
learner: HoeffdingTree
accuracy: 38.0000
status: OK

=== Testing RTG_2abrupt (csv) ===
RTG_2abrupt.csv: 25.3MB [00:02, 8.96MB/s]
path: data/RTG_2abrupt.csv
task: classification
learner: HoeffdingTree
accuracy: 88.0000
status: OK

=== Testing ElectricityTiny (csv) ===
electricity_tiny.csv: 24.0kB [00:02, 8.73kB/s]
path: data/electricity_tiny.csv
task: classification
learner: HoeffdingTree
accuracy: 88.0000
status: OK

=== Testing Electricity (csv) ===
electricity.csv: 696kB [00:01, 435kB/s]
path: data/electricity.csv
task: classification
learner: HoeffdingTree
accuracy: 84.0000
status: OK

=== Testing Fried (csv) ===
fried.csv: 904kB [00:02, 349kB/s]
path: data/fried.csv
task: regression
learner: FIMTDD
rmse: 11.3027
status: OK

=== Testing FriedTiny (csv) ===
fried_tiny.csv: 24.0kB [00:01, 15.2kB/s]
path: data/fried_tiny.csv
task: regression
learner: FIMTDD
rmse: 11.3027
status: OK

=== Testing Bike (csv) ===
bike.csv: 144kB [00:01, 99.1kB/s]
path: data/bike.csv
task: regression
learner: FIMTDD
rmse: 70.0564
status: OK
