diff --git a/.gitignore b/.gitignore
index 338be7fa..cc3d3680 100644
--- a/.gitignore
+++ b/.gitignore
@@ -214,3 +214,4 @@ competitors/wandb/
 
 # Misc
 *playground*.ipynb
+.codex/
diff --git a/README.md b/README.md
index adb94a56..85f07d65 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ Made for biomedical data, Agentomics outperformed human experts and created new
 
 
 How it works
-1) Input is a CSV training dataset + optional data description
+1) Input is a dataset folder with a `train/` split (plus optional `validation/` and `test/` splits) + optional data description
-2) Agentomics autonomously experments with various ML models and strategies
+2) Agentomics autonomously experiments with various ML models and strategies
 3) Output is a trained model ready for inference and a detailed PDF report summarizing the development process and achieved metrics
 
@@ -49,7 +49,7 @@ Agentomics can be run either:
 For more details visit **https://biogemt.github.io/agentomics-ml/**
 
 ## Key Features
-- Generic: Agentomics can crunch any classification and regression datasets in CSV format.
+- Generic: Agentomics can use folder-based inputs for classification and regression tasks.
 - Secure: Agents execute code securely in Docker with read-only mounts to your file system and are only allowed to write in a Docker Volume.
 - Reproducible: Outputs include models, scripts, and conda environments needed to run inference or re-train models with one bash command.
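-- Trustworthy: If you provide a test set, Agentomics fully abstracts LLMs from accessing it, allowing you to rely on programmaticly computed and reported test set metrics.
+- Trustworthy: If you provide a test set, Agentomics fully abstracts LLMs from accessing it, allowing you to rely on programmatically computed and reported test set metrics.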
@@ -61,7 +61,6 @@ For more details visit **https://biogemt.github.io/agentomics-ml/**
 Agentomics is in active development. We welcome any raised Issues and suggestions. You can also [Email Us](mailto:martinekvlastimil95@gmail.com).
 
 Features coming soon:
-- Support for any data type (currently only CSV datasets)
 - Run forking and continuing
 - Better local model support and configuration
 - Remote GPU support for GCP
@@ -81,4 +80,3 @@ bioRxiv (preprint) https://www.biorxiv.org/content/10.64898/2026.01.27.702049v1
 
 MIT. See `LICENSE`.
 
-
diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md
index 6b6a31dd..4a2a11c4 100644
--- a/docs/getting-started/quick-start.md
+++ b/docs/getting-started/quick-start.md
@@ -45,12 +45,18 @@ The agent will prompt you to:
 
 Place your data in `datasets/<your_dataset_name>/`:
 
-```
+```text
 datasets/my_dataset/
-├── train.csv           # Required: training data
-├── validation.csv      # Optional: validation data
-├── test.csv            # Optional: hidden test set
-└── dataset_description.md  # Optional: domain context
+├── train/
+│   ├── input/          # Required: model input files
+│   └── labels.csv      # Required: id,numeric_label
+├── validation/         # Optional
+│   ├── input/
+│   └── labels.csv
+├── test/               # Optional hidden test set
+│   ├── input/
+│   └── labels.csv
+└── dataset_description.md
 ```
 
 See [Preparing Datasets](../user-guide/datasets.md) for details.
diff --git a/docs/how-it-works/evaluation.md b/docs/how-it-works/evaluation.md
index 3862100b..7843fc00 100644
--- a/docs/how-it-works/evaluation.md
+++ b/docs/how-it-works/evaluation.md
@@ -32,7 +32,8 @@ At the end of the run:
 4. Results saved to final report
 
 !!! note
-    Test evaluation only occurs if you provide a `test.csv` file.
+    Test evaluation only occurs if you provide a `test/` split with `input/`
+    and `labels.csv`.
 
 ## Classification Metrics
 
diff --git a/docs/index.md b/docs/index.md
index 9ffe96fa..671516a6 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -31,7 +31,7 @@ Agentomics-ML works like an ML engineer:
 | Feature | Description |
 |---------|-------------|
 | **Any LLM** | Works with OpenAI, OpenRouter, or local models via Ollama |
-| **Any Dataset** | Supports classification or regression datasets in CSV format |
+| **Any Dataset** | Supports folder-based inputs for classification or regression tasks |
 | **Secure Execution** | Docker containers with read-only access to code and isolated execution |
 | **Reproducible** | Outputs include trained models, scripts, and conda environments |
 
diff --git a/docs/reference/workspace-structure.md b/docs/reference/workspace-structure.md
index 1f311aa6..9d228b8c 100644
--- a/docs/reference/workspace-structure.md
+++ b/docs/reference/workspace-structure.md
@@ -4,52 +4,73 @@ How Agentomics-ML organizes files during and after execution.
 
 ## Directory Overview
 
-```
+```text
 agentomics-ml/
 ├── datasets/                 # Raw input datasets
-├── prepared_datasets/        # Prepared training data
-├── prepared_test_sets/       # Prepared test data (hidden)
+├── prepared_datasets/        # Prepared public train/validation data
+├── prepared_test_sets/       # Prepared hidden test data
 ├── workspace/                # Active execution workspace
 │   ├── run/                  # Current run files
 │   ├── best_iteration_snapshot/ # Best iteration snapshot
 │   ├── reports/              # Iteration reports
-│   ├── extras/               # Logs and extra artifacts
-│   └── fallbacks/            # Backup for recovery
+│   └── extras/               # Logs and extra artifacts
 └── outputs/                  # Final results
 ```
 
 ## datasets/
 
-Your raw input datasets:
+Raw datasets use split folders:
 
-```
+```text
 datasets/my_dataset/
-├── train.csv              # Training data (required)
-├── validation.csv         # Validation data (optional)
-├── test.csv               # Test data (optional)
-└── dataset_description.md # Domain info (optional)
-```
+├── train/
+│   ├── input/
+│   ├── extras/             # Optional: supplementary training files
+│   └── labels.csv
+├── validation/             # Optional
+│   ├── input/
+│   ├── extras/             # Optional: supplementary training files
+│   └── labels.csv
+├── test/                   # Optional hidden test set
+│   ├── input/
+│   └── labels.csv
+├── supplementary/          # Optional: supporting/supplementary materials
+├── metadata.json           # Optional if task type is supplied at preparation
+└── dataset_description.md  # Optional domain information
+```
+
+Each `labels.csv` must include `id` and `numeric_label` columns. Only `train`,
+`validation`, and `test` are supported split names. The `input/` structure is
+recorded at preparation time and must match across all splits.
 
 ## prepared_datasets/
 
-After preparation, datasets are formatted for the agent:
+After preparation, public splits are formatted for the agent:
 
-```
+```text
 prepared_datasets/my_dataset/
-├── train.csv              # Processed training data
-├── validation.csv         # Processed validation data
-├── dataset_description.md # Copied/created description
-└── metadata.json          # Task info (type, classes, etc.)
+├── train/
+│   ├── input/
+│   ├── extras/             # If provided
+│   └── labels.csv
+├── validation/
+│   ├── input/
+│   ├── extras/             # If provided
+│   └── labels.csv
+├── supplementary/          # If provided
+├── dataset_description.md
+└── metadata.json
 ```
 
 ## prepared_test_sets/
 
-Test data is separated to ensure it stays hidden:
+Test data is separated to keep it hidden:
 
-```
+```text
 prepared_test_sets/my_dataset/
-├── test.csv               # Test data with labels
-└── test.no_label.csv      # Test data without labels
+└── test/
+    ├── input/
+    └── labels.csv
 ```
 
 The agent never sees files in this directory during training.
@@ -95,18 +116,27 @@ workspace/best_iteration_snapshot/
 
 Updated whenever a new best iteration is achieved.
 
-### workspace/fallbacks/
+### workspace/run/shared/splits/
 
-Recovery backup for split changes:
+Versioned train/validation split folders:
 
-```
-workspace/fallbacks/<agent_id>/
-├── train.csv
-├── validation.csv
-└── split_fingerprint.json
+```text
+workspace/run/shared/splits/
+└── split_0/
+    ├── train/
+    │   ├── input/
+    │   ├── extras/         # Optional
+    │   └── labels.csv
+    └── validation/
+        ├── input/
+        ├── extras/         # Optional
+        └── labels.csv
 ```
 
-Used to restore data if a split change causes issues.
+Each time the agent changes the train/validation split, a new `split_<n>/`
+folder is created. Iteration outputs record which split version they used.
+The `input/` structure must match the original recorded structure across all
+splits. The `extras/` subfolder may be created or modified by the agent.
 
 ### workspace/reports/
 
@@ -115,14 +145,14 @@ Iteration reports are written here during runs. These are copied to
 
 ### workspace/extras/
 
-Logs and auxiliary artifacts (metrics, run logs) are stored here and copied to
+Logs and auxiliary artifacts are stored here and copied to
 `outputs/<agent_id>/extras/`.
 
 ## outputs/
 
 Final results after run completion:
 
-```
+```text
 outputs/<agent_id>/
 ├── best_iteration_snapshot/           # Best iteration artifacts
 │   ├── model_training/
@@ -138,8 +168,12 @@ outputs/<agent_id>/
 │   │   ├── config.json
 │   │   └── splits/
 │   │       └── split_0/
-│   │           ├── train.csv
-│   │           └── validation.csv
+│   │           ├── train/
+│   │           │   ├── input/
+│   │           │   └── labels.csv
+│   │           └── validation/
+│   │               ├── input/
+│   │               └── labels.csv
 │   ├── iteration_0/
 │   ├── iteration_1/
 │   └── ...
@@ -176,7 +210,6 @@ rm -rf outputs/<agent_id>
 ```bash
 rm -rf workspace/run/*
 rm -rf workspace/best_iteration_snapshot/*
-rm -rf workspace/fallbacks/*
 ```
 
 ### Clean Everything
@@ -192,9 +225,9 @@ rm -rf prepared_test_sets/*
 
 In Docker mode, workspace is mounted as a volume:
 
-- Code repository: Read-only
-- Workspace: Read-write
-- Outputs: Read-write
+- Code repository: read-only
+- Workspace: read-write
+- Outputs: read-write
 
 This isolates agent execution from the host system.
 
diff --git a/docs/user-guide/datasets.md b/docs/user-guide/datasets.md
index a2e67b9c..3f242078 100644
--- a/docs/user-guide/datasets.md
+++ b/docs/user-guide/datasets.md
@@ -1,112 +1,154 @@
 # Preparing Datasets
 
-Agentomics-ML works with CSV datasets for classification or regression tasks.
+Agentomics-ML uses folder-based dataset splits. Each split has an `input/`
+folder for model-readable data and a `labels.csv` file for evaluator-readable
+labels.
 
 ## Quick Setup
 
 Create a folder in `datasets/` with your data:
 
-```
+```text
 datasets/my_dataset/
-├── train.csv              # Required
-├── validation.csv         # Optional
-├── test.csv               # Optional
-├── dataset_description.md # Optional
-└── dataset_config.json    # Optional — avoids interactive prompts during dataset preparation
+├── train/
+│   ├── input/              # Required: model input files
+│   ├── extras/             # Optional: supplementary training files
+│   └── labels.csv          # Required: id,numeric_label
+├── validation/             # Optional
+│   ├── input/
+│   ├── extras/             # Optional: supplementary training files
+│   └── labels.csv
+├── test/                   # Optional hidden test set
+│   ├── input/
+│   └── labels.csv
+├── supplementary/          # Optional: supporting/supplementary materials
+├── metadata.json           # Optional if --task-type is provided
+└── dataset_description.md  # Optional domain context
+```
+
+Only `train`, `validation`, and `test` are supported split names.
+
+## Split Requirements
+
+### input/
+
+The `input/` folder can contain any files your generated training and inference
+scripts can read. For tabular datasets, a common layout is:
+
+```text
+train/input/data.csv
 ```
 
-## File Requirements
+Input files should contain stable sample IDs that match the `id` column in `labels.csv`.
 
-### train.csv (Required)
+### labels.csv
 
-Your training data with features and a target column.
+Every labeled split must include `labels.csv` with exactly these columns:
 
 ```csv
-feature1,feature2,feature3,target
-1.2,3.4,5.6,positive
-7.8,9.0,1.2,negative
+id,numeric_label
+sample-1,0
+sample-2,1
 ```
 
-### validation.csv (Optional)
+Requirements:
 
-Separate validation data. If not provided, the agent creates a train/validation split from `train.csv`.
+- `id` is required, non-empty, and unique within the split
+- `numeric_label` is required and numeric
+- Extra columns are not supported
+- Train and validation IDs must not overlap
+- Classification labels should be integer class IDs
 
-### test.csv (Optional)
+### extras/ (Optional, split-level)
 
-Hidden test set for final evaluation. The agent never sees this data during training - it's only used to report final metrics.
+Each split folder (`train/`, `validation/`) may contain an `extras/` subfolder
+with additional training files not used during inference. These can be provided
+by the user in the raw dataset or created by the agent during the data split
+step. The agent can populate `extras/` with files derived from
+supporting/supplementary materials or downloaded from the internet.
 
-### dataset_description.md (Optional)
+The inference script receives only the `input/` folder, so files in `extras/`
+are not available at inference time.
 
-Domain information to help the agent understand your data:
+### supplementary/ (Optional, dataset-level)
 
-```markdown
-# Gene Expression Dataset
+Supporting/supplementary materials (PDFs, papers, helper scripts) can be placed
+in a `supplementary/` folder inside the dataset directory. These are copied
+during preparation and made available read-only to the agent. The agent can read
+these materials and use them to enrich training data (e.g., by copying or
+deriving files into `extras/`), but must not reference `supplementary/` directly
+in training or inference scripts.
 
-This dataset contains RNA-seq expression levels from tumor samples.
+### input/ structure
 
-## Features
-- Columns 1-100: Gene expression values (log2 TPM)
-- Samples are from breast cancer patients
+The structure of `train/input/` (files and subdirectories) is recorded at
+dataset preparation time and is immutable throughout the run. All splits must
+have matching `input/` structures — `validation/input/` and `test/input/` are
+validated against `train/input/` during preparation, and the agent cannot
+modify the `input/` structure during the split step.
 
-## Target
-- `class`: tumor subtype (Basal, Her2, LumA, LumB, Normal)
+### validation/ (Optional)
 
-## Notes
-- Data is already normalized
-- Consider using models that handle high-dimensional data
-```
+If `validation/` is not provided, the agent creates train and validation split
+folders from `train/` during the run.
+
+### test/ (Optional)
+
+The hidden test split is used only for final evaluation. The agent does not get
+access to `prepared_test_sets/` during training.
 
-## Dataset Config File (Optional)
+### metadata.json (Optional)
 
-Add an optional `dataset_config.json` to your dataset folder to avoid interactive prompts during dataset preparation.
+If you do not provide `metadata.json`, pass `--task-type classification` or
+`--task-type regression` during preparation. For classification datasets,
+Agentomics derives class IDs from `labels.csv` if `label_to_scalar` is absent.
 
-**`task_type` is the most important field** — without it you'll be prompted every time you prepare the dataset.
+Example:
 
 ```json
 {
-    "task_type": "classification",
-    "target_col": "label",
-    "positive_class": 1,
-    "negative_class": 0
+  "task_type": "classification",
+  "numeric_label_col": "numeric_label"
 }
 ```
 
-Fields:
+### dataset_description.md (Optional)
 
-- `task_type` (optional): `"classification"` or `"regression"`; if omitted, you will be prompted during dataset preparation.
-- `target_col` (optional): column name to predict; auto-detected if omitted.
-- `positive_class` (optional): value that counts as "positive"; only applicable for some binary classification metrics, auto-detected if omitted.
-- `negative_class` (optional): value that counts as "negative"; only applicable for some binary classification metrics, auto-detected if omitted.
+Domain information can help the agent understand your data:
+
+```markdown
+# Gene Expression Dataset
 
-Include only the fields you need — at minimum just `task_type`. Values from this file take precedence over auto-detection, but CLI flags (`--task-type`, `--target-col`, etc.) override the config file.
+This dataset contains RNA-seq expression levels from tumor samples.
 
-## Target Column Detection
+## Features
+- Input files contain log2 TPM expression values
+- Samples are from breast cancer patients
 
-The target column is resolved in this order:
-1. CLI flag (`--target-col`)
-2. `dataset_config.json` (`target_col` field)
-3. Auto-detection from common names: `class`, `target`, `label`, `y`
-4. Interactive prompt (if running interactively)
+## Target
+- `numeric_label`: tumor subtype encoded as class IDs
 
-If all of the above fail, preparation will raise an error.
+## Notes
+- Data is already normalized
+- Consider models that handle high-dimensional data
+```
 
 ## Manual Dataset Preparation
 
 For more control, run preparation separately:
 
 ```bash
-# Create preparation environment
 conda env create -f envs/environment_prepare.yaml
 conda activate agentomics-prepare-env
 
-# Prepare datasets
-python src/prepare_datasets.py --prepare-all
+python src/prepare_datasets.py --dataset-dir datasets/my_dataset --task-type classification
 ```
 
-### Preparation Options
+To prepare all datasets at once, each dataset folder must include
+`metadata.json`; otherwise, prepare datasets one at a time with `--task-type`.
 
 ```bash
-python src/prepare_datasets.py --help
+python src/prepare_datasets.py --prepare-all
 ```
 
 Key options:
@@ -115,26 +157,31 @@ Key options:
 |--------|-------------|
 | `--dataset-dir` | Specific dataset to prepare |
 | `--task-type` | Specify `classification` or `regression` |
-| `--target-col` | Specify target column name |
-| `--positive-class` | Define positive class for binary classification |
-| `--negative-class` | Define negative class for binary classification |
 
-*Note: already-prepared datasets are skipped on re-runs (preserves `--positive-class`/`--negative-class`). To re-prepare, delete the folder under `prepared_datasets/` (and under `prepared_test_sets/` if a test set was provided) and rerun the preparation script.*
+Running preparation for a single dataset replaces the existing prepared copy
+under `prepared_datasets/` and `prepared_test_sets/`.
+
 
 ## Prepared Dataset Structure
 
 After preparation, datasets are stored in:
 
-```
+```text
 prepared_datasets/my_dataset/
-├── train.csv              # Training data
-├── validation.csv         # Validation data (created if not provided)
-├── dataset_description.md # Copied/created description
-└── metadata.json          # Task type, classes, etc.
+├── train/
+│   ├── input/
+│   └── labels.csv
+├── validation/
+│   ├── input/
+│   └── labels.csv
+├── supplementary/          # If provided
+├── dataset_description.md
+└── metadata.json
 
 prepared_test_sets/my_dataset/
-├── test.csv               # Test data (if provided)
-└── test.no_label.csv      # Test data without labels
+└── test/
+    ├── input/
+    └── labels.csv
 ```
 
 ## Example Datasets
@@ -145,37 +192,21 @@ Download example datasets:
 ./scripts/download_example_dataset.sh --all
 ```
 
-## Data Format Tips
-
-### Classification
-
-- Target column should contain class labels (strings or integers)
-- Binary: `positive`/`negative`, `1`/`0`, `yes`/`no`
-- Multi-class: `class_a`, `class_b`, `class_c`
-- Multi-label classification is not supported (use a single label per row)
-
-### Regression
-
-- Target column should contain numeric values
-- Select `regression` during preparation or pass `--task-type regression`
-
-### Feature Columns
-
-- Numeric features work best
-- Categorical features are supported (encoded automatically)
-- Missing values are handled, but clean data performs better
-
 ## Common Issues
 
-### "Could not detect target column"
+### "Required split folder is missing or incomplete"
+
+Check that `train/input/` exists and that `train/labels.csv` is present.
 
-Solution: Add `--target-col your_column_name` to preparation command, or rename your target column to `class`, `target`, `label`, or `y`.
+### "labels.csv is invalid"
 
-### "Task type required"
+Check that `labels.csv` has `id` and `numeric_label` columns, no duplicate or
+empty IDs, and numeric labels.
 
-Solution (preferred): Add a `dataset_config.json` to your dataset folder with `{"task_type": "classification"}` or `{"task_type": "regression"}`.
+### "metadata.json is required"
 
-Alternative: Pass `--task-type classification` or `--task-type regression` to the preparation command, or run preparation interactively and select when prompted.
+Pass `--task-type classification` or `--task-type regression`, or add a
+`metadata.json` file with `task_type`.
 
 ## Next Steps
 
diff --git a/docs/user-guide/inference.md b/docs/user-guide/inference.md
index d7186668..dd4a3bf1 100644
--- a/docs/user-guide/inference.md
+++ b/docs/user-guide/inference.md
@@ -7,7 +7,7 @@ Use trained models to make predictions on new data with `scripts/inference.sh`.
 ```bash
 ./scripts/inference.sh \
   --agent-dir outputs/<agent_id> \
-  --input /path/to/new_data.csv \
+  --input /path/to/input_folder \
   --output /path/to/predictions.csv
 ```
 
@@ -16,7 +16,7 @@ Use trained models to make predictions on new data with `scripts/inference.sh`.
 | Argument | Description |
 |----------|-------------|
 | `--agent-dir` | Path to completed agent output folder |
-| `--input` | Path to input CSV file (without labels) |
+| `--input` | Path to an input folder without labels |
 | `--output` | Path where predictions will be saved |
 
 ## Optional Arguments
@@ -32,30 +32,37 @@ Use trained models to make predictions on new data with `scripts/inference.sh`.
 ```bash
 ./scripts/inference.sh \
   --agent-dir outputs/enchanted_fixing_reigned \
-  --input new_samples.csv \
+  --input new_samples/input \
   --output predictions.csv
 ```
 
 ## Input Data Format
 
-Your input file should:
+Your input folder should:
 
-- Be a CSV file
-- Have the same feature columns as training data
-- **Not** include the target/label column
+- Match the structure of the training split's `input/` folder
+- Contain the sample IDs needed by the generated `inference.py`
+- Not include `labels.csv` or target labels
+
+For a tabular dataset, the folder can contain a CSV file:
+
+```text
+new_samples/input/
+└── data.csv
+```
 
-Example:
 ```csv
-feature1,feature2,feature3
-1.2,3.4,5.6
-7.8,9.0,1.2
+id,feature1,feature2,feature3
+sample-1,1.2,3.4,5.6
+sample-2,7.8,9.0,1.2
 ```
 
 ## Output Format
 
 The output format is defined by the generated `inference.py` script. For
-classification tasks, the output often includes a `numeric_label` column with
-scores in `[0, 1]`, but you should treat the exact schema as run-specific.
+classification tasks, outputs include `id`, `prediction`, and probability
+columns when probabilities are available. Regression outputs include `id` and
+`prediction`.
 
 ## Docker vs Local Mode
 
@@ -103,7 +110,8 @@ Run `./run.sh` once to build the Docker image, or use `--local` mode.
 
 ### "Column mismatch"
 
-Ensure your input CSV has the same feature columns as the training data (minus the target column).
+Ensure your input folder has the same structure as the training split's `input/`
+folder.
 
 ### "Model file not found"
 
diff --git a/docs/user-guide/outputs.md b/docs/user-guide/outputs.md
index 6b88a3cc..930ddc99 100644
--- a/docs/user-guide/outputs.md
+++ b/docs/user-guide/outputs.md
@@ -55,7 +55,7 @@ The most important directory - contains the best-performing iteration's artifact
 ### Using the Best Model
 
 ```bash
-./scripts/inference.sh --agent-dir outputs/<agent_id> --input data.csv --output predictions.csv
+./scripts/inference.sh --agent-dir outputs/<agent_id> --input data/input --output predictions.csv
 ```
 
 ## Iteration Directories
@@ -105,10 +105,10 @@ During execution, the agent uses a workspace:
 ```
 workspace/
 ├── run/                     # Active run directory
+│   └── shared/splits/       # Versioned train/validation split folders
 ├── best_iteration_snapshot/    # Best iteration snapshot
 ├── reports/                 # Iteration reports
-├── extras/                  # Logs and metrics
-└── fallbacks/               # Backup for recovery
+└── extras/                  # Logs and metrics
 ```
 
 After completion, everything is copied to `outputs/`.
diff --git a/docs/user-guide/running-agent.md b/docs/user-guide/running-agent.md
index a340968a..5e9178c1 100644
--- a/docs/user-guide/running-agent.md
+++ b/docs/user-guide/running-agent.md
@@ -87,7 +87,7 @@ Also supported: `all`
 ```
 
 `--split-allowed-iterations` controls how many early iterations are allowed to resplit
-train/validation (ignored if you provide `validation.csv`). `--exploration-iterations`
+train/validation (ignored if you provide a `validation/` split). `--exploration-iterations`
 controls how long the agent spends on baseline/exploration models.
 
 ### Time Limits
diff --git a/docs/user-guide/training.md b/docs/user-guide/training.md
index 638d108f..82800fe3 100644
--- a/docs/user-guide/training.md
+++ b/docs/user-guide/training.md
@@ -13,8 +13,8 @@ After the agent completes a run, you can re-train the model with new data using
 ```bash
 ./scripts/train.sh \
   --agent-dir outputs/<agent_id> \
-  --train-data /path/to/new_train.csv \
-  --validation-data /path/to/new_validation.csv \
+  --train-data /path/to/train \
+  --validation-data /path/to/validation \
   --artifacts-dir /path/to/output_artifacts
 ```
 
@@ -23,8 +23,8 @@ After the agent completes a run, you can re-train the model with new data using
 | Argument | Description |
 |----------|-------------|
 | `--agent-dir` | Path to completed agent output folder |
-| `--train-data` | Path to new training CSV file |
-| `--validation-data` | Path to new validation CSV file |
+| `--train-data` | Path to a training split folder with `input/` and `labels.csv` |
+| `--validation-data` | Path to a validation split folder with `input/` and `labels.csv` |
 | `--artifacts-dir` | Where to save new training artifacts |
 
 ## Optional Arguments
@@ -41,8 +41,8 @@ After the agent completes a run, you can re-train the model with new data using
 # Re-train using new data
 ./scripts/train.sh \
   --agent-dir outputs/enchanted_fixing_reigned \
-  --train-data datasets/updated_data/train.csv \
-  --validation-data datasets/updated_data/validation.csv \
+  --train-data datasets/updated_data/train \
+  --validation-data datasets/updated_data/validation \
   --artifacts-dir outputs/retrained_model
 ```
 
@@ -57,11 +57,11 @@ The script:
 
 ## Data Format
 
-Your new data files must match the format expected by the agent's training script:
+Your new split folders must match the format expected by the agent's training script:
 
-- Same column names as original training data
-- Same feature encoding/preprocessing expectations
-- Target column with same name and format
+- Same `input/` structure as the original training data
+- `labels.csv` with `id` and `numeric_label`
+- Matching IDs between input files and labels
 
 ## Output
 
diff --git a/run.sh b/run.sh
index 3c6a5a06..77360c48 100755
--- a/run.sh
+++ b/run.sh
@@ -493,7 +493,7 @@ if [ "$LOCAL_MODE" = true ]; then
 
         echo "PDF reports ready at: outputs/${AGENT_ID}/reports/pdf/"
         echo -e "${GREEN}Run finished. Report and files can be found in outputs/${AGENT_ID}${NOCOLOR}"
-        echo -e "${GREEN}To run inference on new data, use ./inference.sh --agent-dir outputs/${AGENT_ID} --input <path_to_input_csv> --output <path_to_output_csv>${NOCOLOR}"
+        echo -e "${GREEN}To run inference on new data, use ./inference.sh --agent-dir outputs/${AGENT_ID} --input <path_to_input_folder> --output <path_to_output_csv>${NOCOLOR}"
     else
         PYTHONPATH="$(pwd)/src" conda run -n agentomics-env python -m runtime.iteration_reports --agent-dir "outputs/${AGENT_ID}"
         warn "Agent didn't produce any valid best iteration snapshot. Exported run artifacts to outputs/${AGENT_ID}."
@@ -812,7 +812,7 @@ else
             write_outputs_readme "${AGENT_ID}"
 
             echo -e "${GREEN}Run finished. Report and files can be found in outputs/${AGENT_ID}${NOCOLOR}"
-            echo -e "${GREEN}To run inference on new data, use ./inference.sh --agent-dir outputs/${AGENT_ID} --input <path_to_input_csv> --output <path_to_output_csv>${NOCOLOR}"
+            echo -e "${GREEN}To run inference on new data, use ./inference.sh --agent-dir outputs/${AGENT_ID} --input <path_to_input_folder> --output <path_to_output_csv>${NOCOLOR}"
         else
             docker run --rm \
               -u "$(id -u):$(id -g)" \
diff --git a/scripts/bash_helpers.sh b/scripts/bash_helpers.sh
index c33532b6..19b5751f 100644
--- a/scripts/bash_helpers.sh
+++ b/scripts/bash_helpers.sh
@@ -187,7 +187,7 @@ outputs/${agent_id}/
 │   └── test_metrics.json           # Metrics on held-out test set
 │
 ├── run/                            # Run working directory
-│   ├── shared/splits/              # Train/validation split CSVs
+│   ├── shared/splits/              # Versioned train/validation split folders
 │   ├── iteration_0/                # Archive of iteration 0
 │   ├── iteration_1/                # Archive of iteration 1
 │   └── ...                         # Additional iterations if present
@@ -214,7 +214,7 @@ outputs/${agent_id}/
 ## Running inference on new data
 
 \`\`\`bash
-./inference.sh --agent-dir outputs/${agent_id} --input <path_to_input_csv> --output <path_to_output_csv>
+./inference.sh --agent-dir outputs/${agent_id} --input <path_to_input_folder> --output <path_to_output_csv>
 \`\`\`
 
 Inference relies on:
diff --git a/scripts/inference.sh b/scripts/inference.sh
index d3c323a2..2620efea 100755
--- a/scripts/inference.sh
+++ b/scripts/inference.sh
@@ -14,7 +14,7 @@ show_help() {
     echo "Usage: $0 --agent-dir <agent_folder_path> --input <input_path> --output <output_path> [--cpu-only] [--local]"
     echo "Options:"
     echo "  --agent-dir   Path to agent folder (required)"
-    echo "  --input       Path to input file (required)"
+    echo "  --input       Path to input folder (required)"
     echo "  --output      Path to output file (required)"
     echo "  --code-path   Path to code files, points to best_iteration_snapshot by default, must be relative to --agent-dir and a child of --agent-dir (optional)"
     echo "  --remove-conda-env   Remove the conda environment after inference (optional)"
@@ -90,7 +90,7 @@ if [[ -z "$OUTPUT_PATH" ]]; then
 fi
 
 [[ -d "$AGENT_DIR" ]] || die "--agent-dir does not exist: $AGENT_DIR"
-[[ -f "$INPUT_PATH" ]] || die "--input does not exist: $INPUT_PATH"
+[[ -d "$INPUT_PATH" ]] || die "--input must be an existing folder: $INPUT_PATH"
 [[ -d "$(dirname "$OUTPUT_PATH")" ]] || die "--output directory does not exist: $(dirname "$OUTPUT_PATH")"
 
 AGENT_NAME=$(basename "$AGENT_DIR")
@@ -168,18 +168,6 @@ if [[ "$DOCKER_MODE" == true ]]; then
             bash -c "conda install -n base mamba -c conda-forge -y && mamba env create -f \"${DESCRIPTOR_PATH_IN_CONTAINER}\" -p \"${CODE_ROOT_IN_CONTAINER}/.conda/envs/${AGENT_NAME}_env\""
     fi
 
-    NORMALIZE_SCRIPT_ABS="$SCRIPT_DIR/../src/datasets/normalize_dataset.py"
-    NORMALIZED_FILENAME=$(docker run --rm \
-        -v "$(dirname "$INPUT_PATH_ABS"):/input_dir" \
-        -v "$NORMALIZE_SCRIPT_ABS:/normalize_dataset.py:ro" \
-        --entrypoint "" \
-        "${AGENTOMICS_IMAGE}" \
-        python /normalize_dataset.py --input "/input_dir/$(basename "$INPUT_PATH_ABS")")
-    if [[ -n "$NORMALIZED_FILENAME" ]]; then
-        trap "rm -f \"$(dirname "$INPUT_PATH_ABS")/$NORMALIZED_FILENAME\"" EXIT
-        INPUT_PATH_ABS="$(dirname "$INPUT_PATH_ABS")/$NORMALIZED_FILENAME"
-    fi
-
     echo "Running inference in Docker..."
     docker run --rm \
         -v "${AGENT_DIR_ABS}/${CODE_PATH}:${CODE_ROOT_IN_CONTAINER}" \
@@ -201,12 +189,6 @@ else
         echo "Conda environment not found at: $ENV_PATH"
         conda env create -f "$DESCRIPTOR_PATH" -p "$ENV_PATH"
     fi
-    NORMALIZE_SCRIPT_ABS="$SCRIPT_DIR/../src/datasets/normalize_dataset.py"
-    NORMALIZED_FILENAME=$(python "$NORMALIZE_SCRIPT_ABS" --input "$INPUT_PATH")
-    if [[ -n "$NORMALIZED_FILENAME" ]]; then
-        trap "rm -f \"$(dirname "$INPUT_PATH")/$NORMALIZED_FILENAME\"" EXIT
-        INPUT_PATH="$(dirname "$INPUT_PATH")/$NORMALIZED_FILENAME"
-    fi
 
     echo "Running inference locally..."
     cd "$INFERENCE_WORKDIR"
diff --git a/scripts/train.sh b/scripts/train.sh
index 45361c3e..3904862e 100755
--- a/scripts/train.sh
+++ b/scripts/train.sh
@@ -9,11 +9,11 @@ DOCKERHUB_USERNAME="biogemt"
 ARGS=()
 
 show_help() {
-    echo "Usage: $0 --agent-dir <agent_folder_path> --train-data <train_data_path> --validation-data <validation_data_path> --artifacts-dir <artifacts_dir_path> [--cpu-only] [--local]"
+    echo "Usage: $0 --agent-dir <agent_folder_path> --train-data <train_split_path> --validation-data <validation_split_path> --artifacts-dir <artifacts_dir_path> [--cpu-only] [--local]"
     echo "Options:"
     echo "  --agent-dir       Path to agent folder (required)"
-    echo "  --train-data      Path to training data CSV file (required)"
-    echo "  --validation-data Path to validation data CSV file (required)"
+    echo "  --train-data      Path to training split folder with input/ and labels.csv (required)"
+    echo "  --validation-data Path to validation split folder with input/ and labels.csv (required)"
     echo "  --artifacts-dir   Path to directory where training artifacts will be saved (required)"
     echo "  --cpu-only        Run without GPU (optional)"
     echo "  --local           Run locally without Docker (optional)"
@@ -96,8 +96,12 @@ TRAIN_PATH="${CODE_ROOT}/model_training/train.py"
 TRAIN_WORKDIR="$(dirname "$TRAIN_PATH")"
 DESCRIPTOR_PATH="${CODE_ROOT}/environment.yml"
 
-[[ -f "$TRAIN_DATA_PATH" ]] || die "--train-data does not exist: $TRAIN_DATA_PATH"
-[[ -f "$VALIDATION_DATA_PATH" ]] || die "--validation-data does not exist: $VALIDATION_DATA_PATH"
+[[ -d "$TRAIN_DATA_PATH" ]] || die "--train-data must be a split folder: $TRAIN_DATA_PATH"
+[[ -d "$VALIDATION_DATA_PATH" ]] || die "--validation-data must be a split folder: $VALIDATION_DATA_PATH"
+[[ -d "$TRAIN_DATA_PATH/input" ]] || die "--train-data must contain an input/ folder: $TRAIN_DATA_PATH"
+[[ -f "$TRAIN_DATA_PATH/labels.csv" ]] || die "--train-data must contain labels.csv: $TRAIN_DATA_PATH"
+[[ -d "$VALIDATION_DATA_PATH/input" ]] || die "--validation-data must contain an input/ folder: $VALIDATION_DATA_PATH"
+[[ -f "$VALIDATION_DATA_PATH/labels.csv" ]] || die "--validation-data must contain labels.csv: $VALIDATION_DATA_PATH"
 [[ -f "$TRAIN_PATH" ]] || die "train.py not found at: $TRAIN_PATH"
 [[ -f "$DESCRIPTOR_PATH" ]] || die "environment.yml not found at: $DESCRIPTOR_PATH"
 
diff --git a/src/agents/prompt_builder.py b/src/agents/prompt_builder.py
index f193b611..5644eca5 100644
--- a/src/agents/prompt_builder.py
+++ b/src/agents/prompt_builder.py
@@ -1,16 +1,21 @@
 import json
+from datasets.data_contract import SUPPLEMENTARY_DIR_NAME, TRAIN_SPLIT, VALIDATION_SPLIT
 
 from runtime.system_resources import check_gpu_availability, get_resources_summary
 from utils.config import Config
 from utils.task_types import TaskTypes
 
+
 def get_system_prompt(config: Config):
-    train_csv_path = config.agent_dataset_dir / "train.csv"
-    validation_csv_path = config.agent_dataset_dir / "validation.csv"
+    train_split_path = config.agent_dataset_dir / TRAIN_SPLIT
+    validation_split_path = config.agent_dataset_dir / VALIDATION_SPLIT
     dataset_knowledge = get_dataset_knowledge(config)
-    dataset_paths = f"Dataset path:\n    {train_csv_path}"
-    if validation_csv_path.exists():
-        dataset_paths += f"\n    Validation path:\n    {validation_csv_path}"
+    dataset_paths = f"Training split path:\n    {train_split_path}"
+    if validation_split_path.exists():
+        dataset_paths += f"\n    Validation split path:\n    {validation_split_path}"
+    supplementary_path = config.agent_dataset_dir / SUPPLEMENTARY_DIR_NAME
+    if supplementary_path.is_dir():
+        dataset_paths += f"\n    Supporting/supplementary materials (read-only):\n    {supplementary_path}"
     
     gpu_available = check_gpu_availability() is not None
 
@@ -41,6 +46,13 @@ def get_system_prompt(config: Config):
 
     Dataset paths:
     {dataset_paths}
+    Split folder contract:
+    - Each split folder contains an input/ folder with model-readable data.
+    - Each labeled split folder contains labels.csv with labels keyed by id.
+    - A split folder may optionally contain an extras/ subfolder with additional training files (not used during inference).
+    - The input/ folder structure is fixed and must not be modified. Only extras/ may be created or changed.
+    - You can populate extras/ with files derived from supporting/supplementary materials or downloaded from the internet.
+    - The inference script receives only the input/ folder, not the full split — never depend on extras/ or labels.csv in inference code.
 
     Dataset knowledge:
     {dataset_knowledge}
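Reviewer note: the contract added to the prompt states that inference receives only the input/ folder, never labels.csv or extras/. One way to make that guarantee concrete is to stage a copy of input/ on its own before running inference. This is a sketch under assumptions: `stage_inference_input` is a hypothetical helper, and the actual runner in this repository may isolate the folder differently.

```python
import shutil
import tempfile
from pathlib import Path


def stage_inference_input(split_path: Path, staging_root: Path) -> Path:
    """Copy only input/ from a split folder into staging_root and return the copy."""
    source = split_path / "input"
    if not source.is_dir():
        raise FileNotFoundError(f"split has no input/ folder: {split_path}")
    staged = staging_root / "input"
    # copytree creates staging_root if needed; labels.csv and extras/ are never copied.
    shutil.copytree(source, staged)
    return staged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        split = Path(tmp) / "validation"
        (split / "input").mkdir(parents=True)
        (split / "input" / "sample.txt").write_text("example", encoding="utf-8")
        (split / "extras").mkdir()
        (split / "labels.csv").write_text("id,label\n1,0\n", encoding="utf-8")
        staged = stage_inference_input(split, Path(tmp) / "staging")
        print(sorted(p.name for p in staged.parent.iterdir()))
```

With a staged copy, an inference script that reaches for labels.csv or extras/ fails immediately during the dry run instead of at final testing.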
diff --git a/src/agents/steps/data_split.py b/src/agents/steps/data_split.py
index 95f43a49..bd7e15ff 100644
--- a/src/agents/steps/data_split.py
+++ b/src/agents/steps/data_split.py
@@ -1,27 +1,43 @@
 from __future__ import annotations
 
+import json
 import os
 import shutil
 import time
 from pathlib import Path
 from typing import cast
 
+import pandas as pd
 from pydantic import Field
 from pydantic.json_schema import SkipJsonSchema
 from pydantic_ai import Agent, ModelRetry, RunContext
-import pandas as pd
 
 from agents.steps.base import AgenticStep, AgenticStepOutput
 from runtime.filesystem import chown_tree_to_root
 from utils.task_types import TaskTypes
-from runtime.read_write_utils import get_last_successful_iteration, load_current_iteration_index, load_iteration_state, update_current_iteration_state
+from runtime.read_write_utils import (
+    get_archived_iterations,
+    get_last_successful_iteration,
+    load_current_iteration_index,
+    load_iteration_state,
+    update_current_iteration_state,
+)
 from runtime.step_outputs import load_step_output
 from run_logging.logging_helpers import log_split_is_allowed
-from datasets.dataset_utils import get_numeric_label_col_from_prepared_dataset
+from datasets.data_contract import (
+    ID_COLUMN_NAME,
+    INPUT_DIR_NAME,
+    LABELS_FILE_NAME,
+    METADATA_FILE_NAME,
+    TRAIN_SPLIT,
+    VALIDATION_SPLIT,
+    validate_input_structure,
+    validate_labels_csv,
+)
 
 class DataSplitOutput(AgenticStepOutput):
-    train_path: str = Field(description="Path to generated train.csv file")
-    val_path: str = Field(description="Path to generated validation.csv file")
+    train_path: str = Field(description="Path to generated train split folder")
+    val_path: str = Field(description="Path to generated validation split folder")
     splitting_strategy: str = Field(description="Detailed description of the splitting strategy used")
     split_changed: bool = Field(
         default=False,
@@ -42,65 +58,107 @@ def _get_latest_split_strategy(self) -> str:
         assert iteration is not None
         return load_step_output(self.config, self.step_id, self.config.iteration_dir(iteration)).splitting_strategy
 
+    def _get_split_strategy(self, split_version: int) -> str:
+        for iteration in reversed(get_archived_iterations(self.config, only_successful=True)):
+            output = load_step_output(self.config, self.step_id, self.config.iteration_dir(iteration))
+            strategy = getattr(output, "splitting_strategy", None)
+            if output is not None and getattr(output, "split_version", 0) == split_version and strategy is not None:
+                return strategy
+        raise ModelRetry(f"No recorded splitting strategy found for split_{split_version}.")
+
+    def _split_version(self, split_dir: Path) -> int | None:
+        if not split_dir.name.startswith("split_"):
+            return None
+        try:
+            return int(split_dir.name.removeprefix("split_"))
+        except ValueError:
+            return None
+
     def _get_latest_split_dir(self) -> Path | None:
-        dirs = [d for d in self.config.splits_dir.iterdir() if d.is_dir()]
-        return max(dirs, key=lambda d: int(d.name.split("_")[1])) if dirs else None
+        dirs = [
+            d
+            for d in self.config.splits_dir.iterdir()
+            if d.is_dir() and self._split_version(d) is not None
+        ]
+        return max(dirs, key=lambda d: self._split_version(d)) if dirs else None
 
     def _get_next_split_dir(self) -> Path:
         latest = self._get_latest_split_dir()
         if latest is None:
             return self.config.splits_dir / "split_0"
-        n = int(latest.name.split("_")[1])
+        n = self._split_version(latest)
+        assert n is not None
         return self.config.splits_dir / f"split_{n + 1}"
 
+    def _load_expected_input_structure(self) -> list[dict] | None:
+        metadata_path = self.config.agent_dataset_dir / METADATA_FILE_NAME
+        if not metadata_path.exists():
+            return None
+        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
+        return metadata.get("input_structure")
+
     def _move_split_to_versioned_dir(self, result: DataSplitOutput) -> DataSplitOutput:
         train_path = Path(result.train_path)
         val_path = Path(result.val_path)
         next_split_dir = self._get_next_split_dir()
         next_split_dir.mkdir(parents=True, exist_ok=True)
-        shutil.move(str(train_path), next_split_dir / "train.csv")
-        shutil.move(str(val_path), next_split_dir / "validation.csv")
-        result.train_path = str(next_split_dir / "train.csv")
-        result.val_path = str(next_split_dir / "validation.csv")
+        shutil.move(str(train_path), next_split_dir / TRAIN_SPLIT)
+        shutil.move(str(val_path), next_split_dir / VALIDATION_SPLIT)
+        result.train_path = str(next_split_dir / TRAIN_SPLIT)
+        result.val_path = str(next_split_dir / VALIDATION_SPLIT)
         result.split_changed = True
         result.split_version = int(next_split_dir.name.removeprefix("split_"))
         return result
 
+    def _validate_split_folder(self, split_path: Path, label: str) -> list[str]:
+        labels_path = split_path / LABELS_FILE_NAME
+        input_path = split_path / INPUT_DIR_NAME
+        if not split_path.is_dir() or not input_path.is_dir() or not labels_path.is_file():
+            raise ModelRetry(f"{label} split folder must contain input/ and labels.csv: {split_path}")
+        try:
+            validate_labels_csv(labels_path)
+        except ValueError as exc:
+            raise ModelRetry(f"{label} labels.csv is invalid. {exc}") from exc
+        return pd.read_csv(labels_path)[ID_COLUMN_NAME].tolist()
+
     def attach_output_validator(self, agent: Agent[dict, AgenticStepOutput]) -> None:
         @agent.output_validator
         async def validate_split_dataset(ctx: RunContext[dict], result: AgenticStepOutput) -> AgenticStepOutput:
             result = cast(DataSplitOutput, result)
-            if not os.path.exists(result.train_path) or not os.path.exists(result.val_path):
-                raise ModelRetry("Split dataset files do not exist.")
-
             train_path = Path(result.train_path)
             val_path = Path(result.val_path)
-            if train_path.name != 'train.csv' or val_path.name != 'validation.csv':
-                for p in (train_path, val_path):
-                    if p.parent == self.config.current_step_dir:
-                        p.unlink(missing_ok=True)
-                raise ModelRetry(f"The files must be called exactly 'train.csv' and 'validation.csv'. Files ({train_path.name} and {val_path.name}) have been deleted.")
-
-            target_col = get_numeric_label_col_from_prepared_dataset(self.config.prepared_dataset_dir)
-            required_cols = [target_col, 'id']
-            train_df, val_df = pd.read_csv(result.train_path), pd.read_csv(result.val_path)
-            for df, path in [(train_df, result.train_path), (val_df, result.val_path)]:
-                missing = [col for col in required_cols if col not in df.columns]
-                if missing:
-                    raise ModelRetry(f"Required columns {missing} missing from {path}.")
-
-            train_ids = set(train_df['id'].dropna().tolist())
-            val_ids = set(val_df['id'].dropna().tolist())
-            if train_ids.intersection(val_ids):
-                raise ModelRetry("Train and validation datasets have overlapping IDs. IDs must be unique across train and validation splits.")
-
-            #The moved files will be absent from created_files field of the output object, however will be in the structured output field
-            if not Path(result.train_path).is_relative_to(self.config.splits_dir):
-                #TODO what if its the explicit split files -> should be moved or copied?
+            if not train_path.exists() or not val_path.exists():
+                raise ModelRetry("Split dataset folders do not exist.")
+            if train_path.name != TRAIN_SPLIT or val_path.name != VALIDATION_SPLIT:
+                raise ModelRetry(
+                    f"The split folders must be called exactly '{TRAIN_SPLIT}' and '{VALIDATION_SPLIT}'. "
+                    f"Received: {train_path.name} and {val_path.name}."
+                )
+            train_ids = set(self._validate_split_folder(train_path, "Train"))
+            val_ids = set(self._validate_split_folder(val_path, "Validation"))
+            overlapping_ids = train_ids & val_ids
+            if overlapping_ids:
+                raise ModelRetry(f"Train and validation labels.csv files have overlapping ids. First overlaps: {list(overlapping_ids)[:20]}")
+
+            expected_structure = self._load_expected_input_structure()
+            if expected_structure is not None:
+                for split_input, label in [(train_path / INPUT_DIR_NAME, "Train"), (val_path / INPUT_DIR_NAME, "Validation")]:
+                    try:
+                        validate_input_structure(split_input, expected_structure, label)
+                    except ValueError as exc:
+                        raise ModelRetry(str(exc)) from exc
+
+            if not train_path.is_relative_to(self.config.splits_dir):
                 result = self._move_split_to_versioned_dir(result)
             else:
-                result.splitting_strategy = self._get_latest_split_strategy()
-                result.split_version = int(Path(result.train_path).parent.name.removeprefix("split_"))
+                split_version = self._split_version(train_path.parent)
+                if split_version is None or val_path.parent != train_path.parent:
+                    raise ModelRetry(
+                        f"Reusable split folders must be inside a versioned split directory like "
+                        f"{self.config.splits_dir / 'split_0'}."
+                    )
+                result.splitting_strategy = self._get_split_strategy(split_version)
+                result.split_version = split_version
             return result
 
     def step_prompt(self) -> str:
@@ -117,28 +175,31 @@ def step_prompt(self) -> str:
 
         if iteration != 0 and latest_split_dir is not None:
             extra_info = f"""
-            Note: An existing split is already available at {latest_split_dir} (train.csv and validation.csv).
+            Note: An existing split is already available at {latest_split_dir} ({TRAIN_SPLIT}/ and {VALIDATION_SPLIT}/ folders).
             If you don't have a reason to change the splitting strategy, return those existing paths immediately and return an empty string for the splitting strategy.
-            If you do create a new split, save 'train.csv' and 'validation.csv' in {current_step_dir}.
+            If you do create a new split, save '{TRAIN_SPLIT}' and '{VALIDATION_SPLIT}' folders in {current_step_dir}.
             """
         else:
-            extra_info = f"Save 'train.csv' and 'validation.csv' in {current_step_dir}."
+            extra_info = f"Save '{TRAIN_SPLIT}' and '{VALIDATION_SPLIT}' folders in {current_step_dir}."
 
-        train_csv_path = self.config.agent_dataset_dir / "train.csv"
+        train_split_path = self.config.agent_dataset_dir / TRAIN_SPLIT
         return f"""
-            Your next task: Split the training dataset ({train_csv_path}) into training and validation sets.
+            Your next task: Split the training dataset ({train_split_path}) into training and validation split folders.
+            Each split folder must follow the dataset contract: an input/ folder with model-readable data and a labels.csv file with labels keyed by id.
+            The input/ folder structure must match the original train/input/ structure exactly — do not add, remove, or rename files inside input/.
+            Each split may also contain an extras/ subfolder with additional training files. You may create extras/ or add files to it — for example by deriving data from supporting/supplementary materials or downloading resources from the internet.
             Ensure the validation split is representative of new unseen data, since it will be used for optimizing choices like architecture, hyperparameters, and training strategies.
             {extra_instructions}
-            Return the absolute paths to the train and validation files.
+            Return the absolute paths to the train and validation split folders.
 
             {extra_info}
             """
 
     def on_iteration_start(self, iteration: int) -> None:
-        if (self.config.agent_dataset_dir / "validation.csv").exists():
+        if (self.config.agent_dataset_dir / VALIDATION_SPLIT).exists():
             split_allowed = False
         else:
-            has_reusable = any(d.is_dir() for d in self.config.splits_dir.iterdir())
+            has_reusable = self._get_latest_split_dir() is not None
             if not has_reusable:
                 split_allowed = True
             elif self.config.split_time_deadline is None:
@@ -153,21 +214,27 @@ def should_be_simulated(self) -> bool:
     def build_simulated_output(self) -> DataSplitOutput:
         latest_split_dir = self._get_latest_split_dir()
         if latest_split_dir is None:
-            if not (self.config.agent_dataset_dir / "validation.csv").exists(): #check for explicit validation files
+            validation_split = self.config.agent_dataset_dir / VALIDATION_SPLIT
+            if not validation_split.exists():
                 raise AssertionError(
                     "Agent did not have a chance to split data. "
-                    "Provide a non-zero split budget or ensure split files are available on disk."
+                    "Provide a non-zero split budget or ensure split folders are available on disk."
                 )
             latest_split_dir = self._get_next_split_dir()
             latest_split_dir.mkdir(parents=True, exist_ok=True)
-            for f in ("train.csv", "validation.csv"):
-                shutil.copy2(self.config.agent_dataset_dir / f, latest_split_dir / f)
+            for split_name in (TRAIN_SPLIT, VALIDATION_SPLIT):
+                shutil.copytree(
+                    self.config.agent_dataset_dir / split_name,
+                    latest_split_dir / split_name,
+                )
             splitting_strategy = ""
         else:
-            splitting_strategy = self._get_latest_split_strategy()
+            split_version = self._split_version(latest_split_dir)
+            assert split_version is not None
+            splitting_strategy = self._get_split_strategy(split_version)
         return DataSplitOutput(
-            train_path=str(latest_split_dir / "train.csv"),
-            val_path=str(latest_split_dir / "validation.csv"),
+            train_path=str(latest_split_dir / TRAIN_SPLIT),
+            val_path=str(latest_split_dir / VALIDATION_SPLIT),
             splitting_strategy=splitting_strategy,
             split_changed=False,
             split_version=int(latest_split_dir.name.removeprefix("split_")),
@@ -181,9 +248,9 @@ def on_step_success(self, output: DataSplitOutput) -> None:
     def on_iteration_fail(self, iteration: int) -> None:
         output = load_step_output(self.config, self.step_id, self.config.current_iteration_dir)
         if output is not None and output.split_changed:
-            split_dir = Path(output.train_path).parent
-            if split_dir.exists():
-                shutil.rmtree(split_dir)
+            split_versioned_dir = Path(output.train_path).parent
+            if split_versioned_dir.exists():
+                shutil.rmtree(split_versioned_dir)
 
     def on_iteration_end(self, iteration: int) -> None:
         iteration_state = load_iteration_state(self.config.current_iteration_dir)
@@ -192,4 +259,3 @@ def on_iteration_end(self, iteration: int) -> None:
             is_allowed=bool(iteration_state["split_allowed_at_start"]),
         )
         #TODO log if split has changed?
-
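Reviewer note: the versioned-split handling in data_split.py now ignores folders whose names do not parse as split_<n> and compares versions numerically. A standalone sketch of that selection logic is shown below; the free functions `split_version` and `latest_split_dir` are illustrative stand-ins for the methods in this change, not the methods themselves.

```python
import tempfile
from pathlib import Path
from typing import Optional


def split_version(name: str) -> Optional[int]:
    """Parse 'split_<n>' into n; return None for any other folder name."""
    if not name.startswith("split_"):
        return None
    try:
        return int(name.removeprefix("split_"))
    except ValueError:
        return None


def latest_split_dir(splits_dir: Path) -> Optional[Path]:
    """Pick the versioned folder with the highest number, ignoring unrelated entries."""
    versioned = []
    for entry in splits_dir.iterdir():
        version = split_version(entry.name)
        if entry.is_dir() and version is not None:
            versioned.append((version, entry))
    if not versioned:
        return None
    return max(versioned, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("split_2", "split_10", "notes", "split_final"):
            (root / name).mkdir()
        # Numeric comparison matters: a plain string sort would rank split_2 above split_10.
        print(latest_split_dir(root).name)
```

Returning None for unparseable names keeps stray folders in the splits directory from raising during selection, which the earlier `int(d.name.split("_")[1])` expression would have done.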
diff --git a/src/agents/steps/iteration_plan.py b/src/agents/steps/iteration_plan.py
index 87f7a0d7..14df9fb0 100644
--- a/src/agents/steps/iteration_plan.py
+++ b/src/agents/steps/iteration_plan.py
@@ -140,7 +140,7 @@ def step_prompt(self) -> str:
         Never refer to existing scripts or previous iteration agent's actions only as 'previous', 'existing', 'current, 'last', etc... Always mention the iteration number of what you're refering to.
         If you're requesting the agent to create specific files or folders, never request anything with the name 'iteration', 'iter', or similar. For example, prefer 'exploration_script.py' over 'exploration_scipt_iter3.py'. Simply refer to the agent's workspace path as 'your workspace'.
 
-        The agent will have access to the train.csv and validation.csv files, all previous iteration files and step outputs, and the dataset_description.md file.
+        The agent will have access to the train/ and validation/ split folders, all previous iteration files and step outputs, and the dataset_description.md file.
         The agent will have access to the following tools: {tools_info}.
         <foundation_models_info>
         The agent will have access to the following foundation models: {foundation_models_info}
@@ -264,8 +264,8 @@ def _build_splitting_info(self) -> str:
 
         return (
             f"{split_status} "
-            "If you choose data splitting needs change, never suggest cross-validations split or any other split "
-            "that would result in more than two files (train.csv and validation.csv). "
+            "If you decide the data splitting needs to change, never suggest cross-validation or any other split "
+            "that would result in anything other than train/ and validation/ split folders. "
             "Keep in mind that using a more representative validation split will result in a better selected "
             "'best iteration model' and therefore a better final hidden test set metrics. "
             "Based on the iteration history, if you suspect the current split is not representative "
diff --git a/src/agents/steps/model_inference.py b/src/agents/steps/model_inference.py
index 530640e3..efd0dd75 100644
--- a/src/agents/steps/model_inference.py
+++ b/src/agents/steps/model_inference.py
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import csv
 import os
 import traceback
 from pathlib import Path
@@ -7,12 +8,20 @@
 import pandas as pd
 from pydantic import Field
 from pydantic_ai import Agent, ModelRetry, RunContext
+from datasets.data_contract import (
+    ID_COLUMN_NAME,
+    LABELS_FILE_NAME,
+    TRAIN_SPLIT,
+)
+
+LEGACY_SPLIT_FILE_REFERENCES = ("train.csv", "validation.csv", "test.csv", ".no_label.csv")
+PREDICTION_COLUMN_NAME = "prediction"
 
 from agents.steps.base import AgenticStep, AgenticStepOutput
 from agents.steps.model_training import ModelTrainingStep
 from runtime.conda_utils import get_shared_environment_path
 from runtime.filesystem import remove_path
-from runtime.inference_runner import compute_metrics, run_inference_on_labeled_data
+from runtime.inference_runner import compute_metrics, run_inference_on_split
 from runtime.read_write_utils import (
     does_file_contain_iteration_pattern,
     does_file_contain_string,
@@ -54,19 +63,19 @@ def step_prompt(self) -> str:
                 f"- 'probability_{class_id}': probability for class {class_id} (float)"
                 for class_id in get_classes_integers(self.config)
             )
-            output_file_description = f"A csv with columns:\n- 'prediction': the predicted class (int)\n{probability_columns}"
+            output_file_description = f"A csv with columns:\n- 'id': the input sample id\n- 'prediction': the predicted class (int)\n{probability_columns}"
         elif self.config.task_type == TaskTypes.REGRESSION:
-            output_file_description = "A csv with a single column 'prediction' containing the predicted continuous values."
+            output_file_description = "A csv with columns 'id' and 'prediction', where 'prediction' contains the predicted continuous value."
         else:
             raise ValueError(f"Unknown task type: {self.config.task_type}. Supported types are {TaskTypes}.")
 
         return f"""
         Your next task: create inference.py file.
         If your model can be accelerated by GPU, implement the code to use GPU.
-        The inference script must produce a prediction for every single input. Don't skip any samples. The 'id' column from the input file must be preserved in the output file.
+        The inference script must produce a prediction for every single input sample. Don't skip any samples. The 'id' values from the input folder data must be preserved in the output file.
         The inference script must use the same architecture as your current trained model from 'train.py' and use the artifacts produced by that script (located at '{model_training.path_to_artifacts_dir}').
         The inference script will be taking the following named arguments:
-        --input (an input file path). This file is of the same format as your training data (except the target column)
+        --input (a split input folder path). This folder has the same structure as train/input and contains no labels.
         --output (the output file path). {output_file_description}
         --artifacts-dir (the folder that contains training artifacts from the training step that are needed to run inference (for example model weights, tokenizers, etc..). The following dir should be used as a default: '{model_training.path_to_artifacts_dir}'. If a different path is provided, your script must adapt to the new source. You can assume the artifact files will always have the same name.
         The script must not accept any other parameters.
@@ -99,89 +108,115 @@ def _validate_inference_file_references(self, inference_file_path: str) -> None:
                 "iteration folder, which will not accessible during final testing. If you want to re-use "
                 "a file from a past iteration, copy it into the current working directory and use its path."
             )
-        contains_training_split_reference = does_file_contain_string(inference_file_path, "train.csv")
-        contains_validation_split_reference = does_file_contain_string(inference_file_path, "validation.csv")
-        if contains_training_split_reference or contains_validation_split_reference:
-            raise ModelRetry(
-                "Inference file contains references to dataset split files ('train.csv' or 'validation.csv' detected), "
-                "which will not be accessible during final testing."
-            )
+        for legacy_ref in LEGACY_SPLIT_FILE_REFERENCES:
+            if does_file_contain_string(inference_file_path, legacy_ref):
+                raise ModelRetry(
+                    f"Inference file contains a legacy split CSV reference ({legacy_ref}), "
+                    "which will not be accessible during final testing."
+                )
+        if does_file_contain_string(inference_file_path, "labels.csv"):
+            raise ModelRetry("Inference file contains references to labels.csv, which will not be accessible during final testing.")
+        for extras_ref in ("/extras", "extras/"):
+            if does_file_contain_string(inference_file_path, extras_ref):
+                raise ModelRetry(
+                    "Inference file references an extras path. "
+                    "The inference script receives only the input/ folder and must not depend on extras/."
+                )
 
     def _validate_dry_run(self) -> None:
         dataset_metadata = load_dataset_metadata(self.config)
         model_training = require_step_output(self.config, ModelTrainingStep.step_id, self.config.current_iteration_dir)
         evaluation_stage = "dry_run"
-        labeled_input_path = self.config.prepared_dataset_dir / "train.csv"
+        split_path = self.config.prepared_dataset_dir / TRAIN_SPLIT
+        labels_path = split_path / LABELS_FILE_NAME
         output_path = self.config.current_step_dir / "eval_predictions_dry_run.csv"
-        inference_result = run_inference_on_labeled_data(
-            evaluation_stage=evaluation_stage,
-            labeled_input_path=labeled_input_path,
+        inference_result = run_inference_on_split(
+            split_path=split_path,
             output_path=output_path,
             conda_env_path=get_shared_environment_path(self.config),
             inference_script_path=self.config.current_step_dir / "inference.py",
             training_artifacts_dir=Path(model_training.path_to_artifacts_dir),
-            label_col=dataset_metadata["numeric_label_col"],
         )
         if inference_result.returncode != 0:
             print("DRY RUN EVAL FAIL during inference:", inference_result.stderr)
             message = collapse_repeated_lines(f"Inference script validation failed: {str(inference_result)}")
             raise ModelRetry(concise_output(message))
         self._validate_dry_run_predictions(
-            input_path=labeled_input_path,
+            labels_path=labels_path,
             predictions_path=output_path,
             inference_result=inference_result,
         )
         self._compute_dry_run_metrics(
             output_path=output_path,
-            labeled_input_path=labeled_input_path,
+            labels_path=labels_path,
             numeric_label_col=dataset_metadata["numeric_label_col"],
             task_type=dataset_metadata["task_type"],
             evaluation_stage=evaluation_stage,
         )
         print("DRY RUN EVAL SUCCESS")
 
-    def _validate_dry_run_predictions(self, input_path: Path, predictions_path: Path, inference_result) -> None:
+    def _validate_dry_run_predictions(self, labels_path: Path, predictions_path: Path, inference_result) -> None:
         if not predictions_path.exists():
             inference_summary = concise_output(collapse_repeated_lines(str(inference_result)))
             raise ModelRetry(f"Inference doesn't produce predictions. Stderr/stdout: {inference_summary}")
-
-        predictions = self._read_predictions(predictions_path)
-        input_frame = pd.read_csv(input_path)
-        expected_rows = len(input_frame)
-        actual_rows = len(predictions)
-        if actual_rows != expected_rows:
-            raise ModelRetry(
-                f"Inference script must produce prediction for each input sample. "
-                f"Input rows: {expected_rows}. Predicted rows: {actual_rows}"
-            )
-
+        expected_ids = pd.read_csv(labels_path, dtype=str, keep_default_na=False)[ID_COLUMN_NAME].str.strip().tolist()
         try:
-            input_set = set(input_frame["id"])
-            prediction_set = set(predictions["id"])
+            self._validate_prediction_ids(predictions_path, expected_ids)
         except Exception as error:
             traceback_summary = concise_output(collapse_repeated_lines(traceback.format_exc()))
-            raise ModelRetry(f"Inference script must produce predictions with the 'id' column. {error}\n{traceback_summary}") from error
+            raise ModelRetry(f"Inference produced faulty predictions csv file. {error}\n{traceback_summary}") from error
 
-        if input_set != prediction_set:
-            missing_ids = list(input_set - prediction_set)[:20]
-            extra_ids = list(prediction_set - input_set)[:20]
-            raise ModelRetry(
+    def _validate_prediction_ids(self, predictions_path: Path, expected_ids: list[str]) -> None:
+        prediction_ids = self._read_prediction_ids(predictions_path)
+
+        if len(prediction_ids) != len(expected_ids):
+            raise ValueError(
+                "Inference script must produce a prediction for each input sample. "
+                f"Input rows: {len(expected_ids)}. Predicted rows: {len(prediction_ids)}"
+            )
+
+        expected_id_set = set(expected_ids)
+        prediction_id_set = set(prediction_ids)
+        if expected_id_set != prediction_id_set:
+            missing = list(expected_id_set - prediction_id_set)[:20]
+            extra = list(prediction_id_set - expected_id_set)[:20]
+            raise ValueError(
                 "Inference script must keep the id column from the input data the same. "
-                f"First (up to 20) missing ids: {missing_ids}. "
-                f"First (up to 20) extra ids: {extra_ids}"
+                f"First (up to 20) missing ids: {missing}. First (up to 20) extra ids: {extra}"
             )
 
-    def _read_predictions(self, predictions_path: Path) -> pd.DataFrame:
-        try:
-            return pd.read_csv(predictions_path)
-        except Exception as error:
-            traceback_summary = concise_output(collapse_repeated_lines(traceback.format_exc()))
-            raise ModelRetry(f"Inference produced faulty predictions csv file. {error}\n{traceback_summary}") from error
+    def _read_prediction_ids(self, predictions_path: Path) -> list[str]:
+        if not predictions_path.is_file():
+            raise ValueError(f"Predictions file does not exist: {predictions_path}")
+
+        ids = []
+        seen_ids = set()
+        with open(predictions_path, newline="", encoding="utf-8") as predictions_file:
+            reader = csv.DictReader(predictions_file)
+            fieldnames = set(reader.fieldnames or [])
+            missing_columns = {ID_COLUMN_NAME, PREDICTION_COLUMN_NAME} - fieldnames
+            if missing_columns:
+                raise ValueError(f"{predictions_path} is missing required columns: {sorted(missing_columns)}")
+
+            for line_number, row in enumerate(reader, start=2):
+                if None in row:
+                    raise ValueError(f"{predictions_path} has too many columns on line {line_number}")
+
+                sample_id = (row.get(ID_COLUMN_NAME) or "").strip()
+                if not sample_id:
+                    raise ValueError(f"{predictions_path} has an empty id on line {line_number}")
+                if sample_id in seen_ids:
+                    raise ValueError(f"{predictions_path} contains duplicate id '{sample_id}' on line {line_number}")
+
+                seen_ids.add(sample_id)
+                ids.append(sample_id)
+
+        return ids
 
     def _compute_dry_run_metrics(
         self,
         output_path: Path,
-        labeled_input_path: Path,
+        labels_path: Path,
         numeric_label_col: str,
         task_type: str,
         evaluation_stage: str,
@@ -189,12 +224,11 @@ def _compute_dry_run_metrics(
         try:
             compute_metrics(
                 results_file=output_path,
-                labeled_input_path=labeled_input_path,
+                labels_path=labels_path,
                 numeric_label_col=numeric_label_col,
                 task_type=task_type,
                 evaluation_stage=evaluation_stage,
             )
-            # TODO should this be in finally?
             remove_path(output_path)
         except Exception:
             message = concise_output(collapse_repeated_lines(f"FAIL DURING DRY RUN METRICS COMPUTATION. {traceback.format_exc()}"))
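Review note: `_validate_prediction_ids` compares ids parsed by `csv.DictReader` (always strings) with ids loaded from `labels.csv`. The sketch below illustrates, under assumptions, why both sides should be read as strings. The in-memory CSV contents and the `prediction` column name are hypothetical and stand in for the real files and `PREDICTION_COLUMN_NAME`.

```python
import csv
import io

import pandas as pd

# Hypothetical file contents with all-numeric ids.
LABELS_CSV = "id,numeric_label\n1,0\n2,1\n"
PREDICTIONS_CSV = "id,prediction\n1,0\n2,1\n"

# pandas infers an integer dtype for an all-numeric id column by default.
inferred_ids = pd.read_csv(io.StringIO(LABELS_CSV))["id"].tolist()

# Forcing dtype=str keeps the ids as text, like csv.DictReader does.
string_ids = pd.read_csv(io.StringIO(LABELS_CSV), dtype=str, keep_default_na=False)["id"].tolist()

prediction_ids = [row["id"] for row in csv.DictReader(io.StringIO(PREDICTIONS_CSV))]

ids_match_inferred = set(inferred_ids) == set(prediction_ids)  # int ids vs. str ids
ids_match_string = set(string_ids) == set(prediction_ids)      # str ids vs. str ids

print(ids_match_inferred, ids_match_string)
```

With inferred dtypes the two id sets differ even though the files agree, so reading `labels.csv` with `dtype=str` keeps the comparison meaningful for datasets with numeric ids.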
diff --git a/src/agents/steps/model_training.py b/src/agents/steps/model_training.py
index 62cdd169..25b1c83d 100644
--- a/src/agents/steps/model_training.py
+++ b/src/agents/steps/model_training.py
@@ -81,8 +81,8 @@ async def validate_training(ctx: RunContext[dict], result: ModelTrainingOutput)
             if does_file_contain_iteration_pattern(result.path_to_train_file):
                 raise ModelRetry(f"Train file ({result.path_to_train_file}) contains path containing a forbidden string 'iteration_' or references an iteration folder, which will not accessible during final testing. If you want to re-use a file from a past iteration, copy it into the current working directory and use its path.")
             created_files_names = self._validate_training_run(
-                train_data_path=ctx.deps["train_csv_path"],
-                valid_data_path=ctx.deps["validation_csv_path"],
+                train_data_path=ctx.deps["train_split_path"],
+                valid_data_path=ctx.deps["validation_split_path"],
                 train_script_path=result.path_to_train_file,
                 model_file_name=Path(result.path_to_model_file).name,
             )
@@ -127,8 +127,8 @@ def step_prompt(self) -> str:
         {reporting_requirement}
         - Save the training script directly as train.py in the current step directory, not inside a nested folder.
         The train script should take the following parameters
-        --train-data (a path to the training data csv)
-        --validation-data (a path to the validation data csv. For example for the purposes of early-stopping. If the training script doesn't need validation data, still include the argument for compatibility and don't use it.)
+        --train-data (a path to the training split folder. This folder contains input/ with model-readable data, labels.csv with labels keyed by id, and optionally extras/ with additional training files.)
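+        --validation-data (a path to the validation split folder, which follows the same contract as --train-data. It can be used, for example, for early stopping. If the training script doesn't need validation data, still accept the argument for compatibility and ignore it.)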
         --artifacts-dir (path to a directory that will be populated by the training script with artifacts needed to use the trained model for predictions (e.g. produced model weights, produced tokenizers, ...). This directory should not contain any other external sources like imported scripts, conda packages, foundation models, etc..)
         The script must not accept any other parameters.
         """
@@ -137,8 +137,8 @@ def build_deps(self, step_started_at: datetime) -> dict[str, object]:
         data_split = require_step_output(self.config, DataSplitStep.step_id, self.config.current_iteration_dir)
         return {
             "start_time": step_started_at,
-            "train_csv_path": data_split.train_path,
-            "validation_csv_path": data_split.val_path,
+            "train_split_path": data_split.train_path,
+            "validation_split_path": data_split.val_path,
         }
 
     def _validate_training_run(self, train_data_path: str, valid_data_path: str, train_script_path: str, model_file_name: str) -> list[str]:
@@ -147,21 +147,13 @@ def _validate_training_run(self, train_data_path: str, valid_data_path: str, tra
         command_prefix = f"cd {run_dir} && conda run -p {conda_path}"
 
         temp_artifacts_dir = run_dir / "temp_retrain_artifacts"
-        temp_train_path = run_dir / "temp_train_subset.csv"
-        temp_valid_path = run_dir / "temp_valid_subset.csv"
 
         try:
-            target_col = get_numeric_label_col_from_prepared_dataset(self.config.prepared_dataset_dir)
-            train_subset = self._get_dataset_subset(train_data_path, target_col)
-            train_subset.to_csv(temp_train_path, index=False)
-            valid_subset = self._get_dataset_subset(valid_data_path, target_col)
-            valid_subset.to_csv(temp_valid_path, index=False)
-
             temp_artifacts_dir.mkdir(parents=True, exist_ok=True)
             command = (
                 f"{command_prefix} python \"{train_script_path}\" "
-                f"--train-data \"{temp_train_path}\" "
-                f"--validation-data \"{temp_valid_path}\" "
+                f"--train-data \"{train_data_path}\" "
+                f"--validation-data \"{valid_data_path}\" "
                 f"--artifacts-dir \"{temp_artifacts_dir}\""
             )
             training_out = subprocess.run(command, shell=True, executable="/bin/bash", capture_output=True)
@@ -211,26 +203,9 @@ def _validate_training_run(self, train_data_path: str, valid_data_path: str, tra
             traceback_msg = concise_output(collapse_repeated_lines(traceback.format_exc()))
             raise ModelRetry(f"Training script validation failed: {traceback_msg}") from error
         finally:
-            for temporary_path in [temp_train_path, temp_valid_path]:
-                if temporary_path.exists():
-                    temporary_path.unlink()
             if temp_artifacts_dir.exists():
                 shutil.rmtree(temp_artifacts_dir)
 
-    def _get_dataset_subset(self, data_path: str, target_col: str) -> pd.DataFrame:
-        dataframe = pd.read_csv(data_path)
-        if self.config.task_type == TaskTypes.CLASSIFICATION:
-            samples_per_label = 100
-            return dataframe.groupby(target_col, group_keys=False).apply(
-                lambda frame: frame.sample(n=min(len(frame), samples_per_label), random_state=42)
-            ).reset_index(drop=True)
-        if self.config.task_type == TaskTypes.REGRESSION:
-            total_samples = min(len(dataframe), 1000)
-            return dataframe.sample(n=total_samples, random_state=42).reset_index(drop=True)
-        raise ValueError(
-            f"Unknown task type: {self.config.task_type}. Supported types are {TaskTypes}."
-        )
-
     def _build_training_retry_message(self, prefix: str, returncode: int, stdout: bytes | str, stderr: bytes | str) -> str:
         message = f"{prefix}: Return code: {returncode}\nStderr: {stderr}, Stdout: {stdout}"
         return concise_output(collapse_repeated_lines(message))
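Review note: the updated step prompt describes a split folder holding `input/` and `labels.csv` keyed by `id`. As a minimal sketch of how a generated `train.py` might consume that contract, the helper below joins features and labels. It assumes tabular input stored as `input/data.csv` (the layout written by `create_datasets.py` in this change); the name `load_split` is hypothetical and not part of the codebase.

```python
from pathlib import Path

import pandas as pd


def load_split(split_dir: Path) -> pd.DataFrame:
    """Join input/data.csv with labels.csv on the id column (hypothetical helper)."""
    split_dir = Path(split_dir)
    # Read ids as strings on both sides so numeric-looking ids still join correctly.
    features = pd.read_csv(split_dir / "input" / "data.csv", dtype={"id": str})
    labels = pd.read_csv(split_dir / "labels.csv", dtype={"id": str})

    # validate="one_to_one" makes pandas raise if an id is duplicated on either side.
    merged = features.merge(labels, on="id", how="inner", validate="one_to_one")
    if len(merged) != len(labels) or len(merged) != len(features):
        raise ValueError(
            f"{split_dir}: ids in input/data.csv and labels.csv do not match "
            f"({len(features)} inputs, {len(labels)} labels, {len(merged)} joined)"
        )
    return merged
```

A training script could call `load_split(args.train_data)` and `load_split(args.validation_data)`; non-tabular inputs would need their own loader.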
diff --git a/src/agents/steps/validation_evaluation.py b/src/agents/steps/validation_evaluation.py
index 0e04cc88..ef41c3ab 100644
--- a/src/agents/steps/validation_evaluation.py
+++ b/src/agents/steps/validation_evaluation.py
@@ -11,7 +11,7 @@
 from agents.steps.model_training import ModelTrainingStep
 from run_logging.logging_helpers import log_iteration_metrics, log_new_best
 from runtime.conda_utils import get_shared_environment_path
-from runtime.inference_runner import compute_metrics, run_inference_on_labeled_data
+from runtime.inference_runner import compute_metrics, run_inference_on_split
 from runtime.read_write_utils import (
     load_best_iteration_snapshot_iteration,
     load_dataset_metadata,
@@ -19,6 +19,7 @@
 from runtime.step_outputs import load_step_output, require_step_output
 from utils.exceptions import AgentScriptFailed, IterationRunFailed
 from utils.metrics import get_higher_is_better_map
+from datasets.data_contract import LABELS_FILE_NAME
 
 
 class ValidationEvaluationOutput(BaseModel):
@@ -64,23 +65,22 @@ def _run_inference_on_all_splits(self) -> dict[str, float]:
         data_split = require_step_output(self.config, DataSplitStep.step_id, self.config.current_iteration_dir)
         for evaluation_stage in ["validation", "train"]:
             print(f"  Running {evaluation_stage} inference...")
-            labeled_input_path = Path(data_split.val_path if evaluation_stage == "validation" else data_split.train_path)
+            split_path = Path(data_split.val_path if evaluation_stage == "validation" else data_split.train_path)
+            labels_path = split_path / LABELS_FILE_NAME
             output_path = self.config.current_step_dir / f"eval_predictions_{evaluation_stage}.csv"
             try:
-                result = run_inference_on_labeled_data(
-                    evaluation_stage=evaluation_stage,
-                    labeled_input_path=labeled_input_path,
+                result = run_inference_on_split(
+                    split_path=split_path,
                     output_path=output_path,
                     conda_env_path=conda_env_path,
                     inference_script_path=inference_script_path,
                     training_artifacts_dir=training_artifacts_dir,
-                    label_col=dataset_metadata["numeric_label_col"],
                 )
                 if result.returncode != 0:
                     raise AgentScriptFailed(f"Inference on {evaluation_stage} failed: {str(result)}")
                 evaluation_metrics = compute_metrics(
                     results_file=output_path,
-                    labeled_input_path=labeled_input_path,
+                    labels_path=labels_path,
                     numeric_label_col=dataset_metadata["numeric_label_col"],
                     task_type=dataset_metadata["task_type"],
                     evaluation_stage=evaluation_stage,
diff --git a/src/datasets/create_datasets.py b/src/datasets/create_datasets.py
index 1b39e9f6..85150c59 100644
--- a/src/datasets/create_datasets.py
+++ b/src/datasets/create_datasets.py
@@ -1,6 +1,7 @@
 from pathlib import Path
 import os
 import json
+import shutil
 import pandas as pd
 import argparse
 
@@ -24,6 +25,69 @@
 }
 
 CLASS_COL = "target"
+TABULAR_INPUT_FILE_NAME = "data.csv"
+SUPPORTED_SPLIT_NAMES = ("train", "validation", "test")
+
+
+def _write_classification_metadata(dataset_dir: Path, label_to_scalar: dict) -> None:
+    metadata = {
+        "task_type": "classification",
+        "numeric_label_col": "numeric_label",
+        "label_to_scalar": {
+            str(label): int(value)
+            for label, value in label_to_scalar.items()
+        },
+    }
+    (dataset_dir / "metadata.json").write_text(json.dumps(metadata, indent=4), encoding="utf-8")
+
+
+def _build_label_to_scalar(train_df: pd.DataFrame) -> dict:
+    labels = sorted(train_df[CLASS_COL].dropna().unique(), key=str)
+    return {label: index for index, label in enumerate(labels)}
+
+
+def _write_folder_split(dataset_dir: Path, split_name: str, df: pd.DataFrame, label_to_scalar: dict) -> None:
+    _remove_split_outputs(dataset_dir, split_name)
+    split_dir = dataset_dir / split_name
+    input_dir = split_dir / "input"
+    input_dir.mkdir(parents=True, exist_ok=True)
+
+    df = df.copy()
+    if "id" not in df.columns:
+        df.insert(0, "id", [f"{split_name}-{index}" for index in range(len(df))])
+    df["id"] = df["id"].astype(str)
+
+    labels = pd.DataFrame({
+        "id": df["id"],
+        "numeric_label": df[CLASS_COL].map(label_to_scalar),
+    })
+    if labels["numeric_label"].isna().any():
+        missing_labels = sorted(
+            df.loc[labels["numeric_label"].isna(), CLASS_COL].dropna().unique(),
+            key=str,
+        )
+        raise ValueError(f"{split_name} contains labels absent from train split: {missing_labels}")
+
+    df.drop(columns=[CLASS_COL]).to_csv(input_dir / TABULAR_INPUT_FILE_NAME, index=False)
+    labels["numeric_label"] = labels["numeric_label"].astype(int)
+    labels.to_csv(split_dir / "labels.csv", index=False)
+
+
+def _remove_split_outputs(dataset_dir: Path, split_name: str) -> None:
+    split_csv = dataset_dir / f"{split_name}.csv"
+    if split_csv.exists():
+        split_csv.unlink()
+    split_dir = dataset_dir / split_name
+    if split_dir.exists():
+        shutil.rmtree(split_dir)
+
+
+def _remove_stale_dataset_outputs(dataset_dir: Path) -> None:
+    legacy_config = dataset_dir / "dataset_config.json"
+    if legacy_config.exists():
+        legacy_config.unlink()
+    for split_name in SUPPORTED_SPLIT_NAMES:
+        _remove_split_outputs(dataset_dir, split_name)
 
 
 def generate_mirbench_files(dataset: str | None = None):
@@ -35,20 +99,24 @@ def generate_mirbench_files(dataset: str | None = None):
         info = MIRBENCH_DATASETS[dataset_name]
         local_dset_path = REPO_PATH / "datasets" / dataset_name
         os.makedirs(local_dset_path, exist_ok=True)
+        _remove_stale_dataset_outputs(local_dset_path)
 
         with open(f"{local_dset_path}/dataset_description.md", "w") as f:
             f.write(info["description"])
 
-        with open(f"{local_dset_path}/dataset_config.json", "w") as f:
-            json.dump({"task_type": "classification"}, f, indent=4)
-
+        split_frames = {}
         for split in info["splits"]:
             download_path = REPO_PATH / ".miRBench"
             os.makedirs(download_path, exist_ok=True)
             mirbench_download_dataset(dataset_name, download_path=download_path / 'miRBench', split=split)
             df = pd.read_csv(download_path / 'miRBench', sep="\t")
             df = df.rename(columns={"label": CLASS_COL})
-            df.to_csv(f"{local_dset_path}/{split}.csv", index=False)
+            split_frames[split] = df
+
+        label_to_scalar = _build_label_to_scalar(split_frames["train"])
+        _write_classification_metadata(local_dset_path, label_to_scalar)
+        for split, df in split_frames.items():
+            _write_folder_split(local_dset_path, split, df, label_to_scalar)
 
         print(f"Downloaded dataset to {local_dset_path}")
 
@@ -68,13 +136,12 @@ def generate_genomic_benchmarks_files(dataset: str | None = None):
 
         local_dset_path = REPO_PATH / "datasets" / dataset_name
         os.makedirs(local_dset_path, exist_ok=True)
+        _remove_stale_dataset_outputs(local_dset_path)
 
         with open(f"{local_dset_path}/dataset_description.md", "w") as f:
             f.write(GENOMIC_BENCHMARKS_DATASETS[dataset_name])
 
-        with open(f"{local_dset_path}/dataset_config.json", "w") as f:
-            json.dump({"task_type": "classification"}, f, indent=4)
-
+        split_frames = {}
         for split in ["test","train"]:
             data = []
             for label_path in (download_path / split).iterdir():
@@ -83,7 +150,12 @@ def generate_genomic_benchmarks_files(dataset: str | None = None):
                     seq = sequence_file.read_text().strip()
                     data.append({"sequence": seq, CLASS_COL: label})
             df = pd.DataFrame(data)
-            df.to_csv(f"{local_dset_path}/{split}.csv", index=False)
+            split_frames[split] = df
+
+        label_to_scalar = _build_label_to_scalar(split_frames["train"])
+        _write_classification_metadata(local_dset_path, label_to_scalar)
+        for split, df in split_frames.items():
+            _write_folder_split(local_dset_path, split, df, label_to_scalar)
         
         print(f"Downloaded dataset to {local_dset_path}")
 
diff --git a/src/datasets/data_contract.py b/src/datasets/data_contract.py
new file mode 100644
index 00000000..3ec9eafc
--- /dev/null
+++ b/src/datasets/data_contract.py
@@ -0,0 +1,87 @@
+from pathlib import Path
+
+import pandas as pd
+
+TRAIN_SPLIT = "train"
+VALIDATION_SPLIT = "validation"
+TEST_SPLIT = "test"
+
+NON_TEST_SPLIT_NAMES = (TRAIN_SPLIT, VALIDATION_SPLIT)
+
+INPUT_DIR_NAME = "input"
+EXTRAS_DIR_NAME = "extras"
+SUPPLEMENTARY_DIR_NAME = "supplementary"
+LABELS_FILE_NAME = "labels.csv"
+ID_COLUMN_NAME = "id"
+NUMERIC_LABEL_COLUMN_NAME = "numeric_label"
+METADATA_FILE_NAME = "metadata.json"
+DATASET_DESCRIPTION_FILE_NAME = "dataset_description.md"
+
+
+def record_input_structure(input_dir: Path) -> list[str]:
+    """Returns sorted relative paths under input_dir. Directories are suffixed with '/' to distinguish them from same-named files."""
+    input_dir = Path(input_dir)
+    return sorted(
+        f"{item.relative_to(input_dir)}/" if item.is_dir() else str(item.relative_to(input_dir))
+        for item in input_dir.rglob("*")
+    )
+
+
+def validate_input_structure(input_dir: Path, expected_structure: list[str], label: str) -> None:
+    actual = record_input_structure(input_dir)
+    if actual == expected_structure:
+        return
+    missing = sorted(set(expected_structure) - set(actual))
+    extra = sorted(set(actual) - set(expected_structure))
+    parts = []
+    if missing:
+        parts.append(f"Missing: {missing[:10]}")
+    if extra:
+        parts.append(f"Extra: {extra[:10]}")
+    raise ValueError(
+        f"{label} input/ structure doesn't match the recorded train/input/ structure. {'; '.join(parts)}"
+    )
+
+
+def validate_labels_csv(labels_path: Path) -> None:
+    """Validates that labels.csv conforms to the dataset contract. Raises ValueError on any violation."""
+    labels_path = Path(labels_path)
+    if not labels_path.is_file():
+        raise ValueError(f"Labels file does not exist: {labels_path}")
+
+    try:
+        df = pd.read_csv(labels_path, dtype=str, keep_default_na=False)
+    except pd.errors.EmptyDataError as error:
+        raise ValueError(f"{labels_path} is empty") from error
+
+    expected_columns = {ID_COLUMN_NAME, NUMERIC_LABEL_COLUMN_NAME}
+    actual_columns = set(df.columns)
+    missing = expected_columns - actual_columns
+    if missing:
+        raise ValueError(f"{labels_path} is missing required columns: {sorted(missing)}")
+    extra = actual_columns - expected_columns
+    if extra:
+        # Duplicate columns in the source file (e.g. id,id,...) surface here as pandas-renamed entries like 'id.1'.
+        raise ValueError(f"{labels_path} has unsupported columns: {sorted(extra)}")
+
+    if df.empty:
+        raise ValueError(f"{labels_path} contains no label rows")
+
+    ids = df[ID_COLUMN_NAME].str.strip()
+    empty_id_mask = ids == ""
+    if empty_id_mask.any():
+        raise ValueError(f"{labels_path} has an empty id on line {int(empty_id_mask.idxmax()) + 2}")
+    dup_id_mask = ids.duplicated()
+    if dup_id_mask.any():
+        raise ValueError(f"{labels_path} contains duplicate id '{ids[dup_id_mask].iloc[0]}'")
+
+    labels = df[NUMERIC_LABEL_COLUMN_NAME].str.strip()
+    empty_label_mask = labels == ""
+    if empty_label_mask.any():
+        raise ValueError(f"{labels_path} has an empty numeric_label on line {int(empty_label_mask.idxmax()) + 2}")
+    non_numeric_mask = pd.to_numeric(labels, errors="coerce").isna()
+    if non_numeric_mask.any():
+        bad_idx = int(non_numeric_mask.idxmax())
+        raise ValueError(
+            f"{labels_path} has non-numeric numeric_label '{labels.iloc[bad_idx]}' on line {bad_idx + 2}"
+        )
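Review note: `record_input_structure` returns a sorted listing in which directories carry a trailing `/`. The sketch below copies that function verbatim from the new `data_contract.py` and runs it on a throwaway tree to show the shape of the result. The file names are hypothetical, and the expected listing assumes POSIX path separators.

```python
import tempfile
from pathlib import Path


def record_input_structure(input_dir: Path) -> list[str]:
    """Copy of the function added in data_contract.py, repeated here so the sketch is self-contained."""
    input_dir = Path(input_dir)
    return sorted(
        f"{item.relative_to(input_dir)}/" if item.is_dir() else str(item.relative_to(input_dir))
        for item in input_dir.rglob("*")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical input/ layout: one tabular file plus a nested image folder.
    (root / "images").mkdir()
    (root / "images" / "a.png").write_bytes(b"")
    (root / "data.csv").write_text("id\n1\n")
    structure = record_input_structure(root)

print(structure)  # on POSIX: ['data.csv', 'images/', 'images/a.png']
```

Because the listing includes file names, `validate_input_structure` treats two `input/` folders as equal only when their relative paths match exactly.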
diff --git a/src/datasets/dataset_utils.py b/src/datasets/dataset_utils.py
index 3903f8b3..f77aced9 100644
--- a/src/datasets/dataset_utils.py
+++ b/src/datasets/dataset_utils.py
@@ -2,104 +2,121 @@
 import csv
 import json
 import shutil
-import sys
-import pandas as pd
 from pathlib import Path
 from typing import List, Dict
-import subprocess
-from rich.console import Console
-from rich.table import Table
-from rich import box
+
+import pandas as pd
+
+from runtime.filesystem import remove_path
 from utils.config import Config
 from utils.task_types import TaskTypes
+from datasets.data_contract import (
+    DATASET_DESCRIPTION_FILE_NAME,
+    INPUT_DIR_NAME,
+    LABELS_FILE_NAME,
+    METADATA_FILE_NAME,
+    NON_TEST_SPLIT_NAMES,
+    NUMERIC_LABEL_COLUMN_NAME,
+    SUPPLEMENTARY_DIR_NAME,
+    TEST_SPLIT,
+    TRAIN_SPLIT,
+    VALIDATION_SPLIT,
+    record_input_structure,
+    validate_input_structure,
+    validate_labels_csv,
+)
+
+
+def copy_path_overwriting_target(source: Path, target: Path) -> None:
+    source = Path(source)
+    target = Path(target)
+    remove_path(target)
+    if source.is_dir():
+        shutil.copytree(source, target, symlinks=False)
+    else:
+        shutil.copy2(source, target)
 
-
-def count_csv_rows(csv_file: str) -> int:
-    """
-    Count rows in a CSV file (excluding header).
-    """
+def has_complete_split(split_path: Path) -> bool:
+    return (
+        split_path.is_dir()
+        and (split_path / INPUT_DIR_NAME).is_dir()
+        and (split_path / LABELS_FILE_NAME).is_file()
+    )
+
+def _count_csv_data_rows(csv_path: Path) -> int:
+    """Counts data rows (excluding header) without validating contents. Returns 0 on any read error."""
     try:
-        with open(csv_file, 'r', encoding='utf-8') as f:
-            reader = csv.reader(f)
-            row_count = sum(1 for _ in reader)
-            return max(0, row_count - 1)  # Subtract 1 for header
+        with open(csv_path, newline="", encoding="utf-8") as f:
+            return max(0, sum(1 for _ in csv.reader(f)) - 1)
     except (FileNotFoundError, IOError, UnicodeDecodeError):
         return 0
-    
+
+
 def get_single_dataset_info(dataset_dir: str, prepared_datasets_dir: str) -> Dict:
     if not dataset_dir.is_dir():
         return None
-        
+
     dataset_name = dataset_dir.name
-    train_file = dataset_dir / "train.csv"
-    test_file = dataset_dir / "test.csv"
-    validation_file = dataset_dir / "validation.csv"
-    
-    # Count rows in raw files
-    train_rows = count_csv_rows(str(train_file)) if train_file.exists() else 0
-    test_rows = count_csv_rows(str(test_file)) if test_file.exists() else 0
-    validation_rows = count_csv_rows(str(validation_file)) if validation_file.exists() else 0
-    
-    # Check if already prepared
+    train_split = dataset_dir / TRAIN_SPLIT
+    validation_split = dataset_dir / VALIDATION_SPLIT
+    test_split = dataset_dir / TEST_SPLIT
     is_prepared = check_dataset_prepared(str(dataset_dir), prepared_datasets_dir)
-    
-    # Check if can be prepared
-    can_prepare = train_file.exists() and train_rows > 0
-    
-    if not train_file.exists():
-        status = "Missing train.csv"
+
+    train_rows = _count_csv_data_rows(train_split / LABELS_FILE_NAME) if has_complete_split(train_split) else 0
+    validation_rows = _count_csv_data_rows(validation_split / LABELS_FILE_NAME) if has_complete_split(validation_split) else 0
+    test_rows = _count_csv_data_rows(test_split / LABELS_FILE_NAME) if has_complete_split(test_split) else 0
+
+    if not train_split.exists():
+        status = "Missing train/"
+    elif not has_complete_split(train_split):
+        status = "Incomplete train/"
     elif train_rows == 0:
-        status = "Empty train.csv"
+        status = "Empty train/labels.csv"
     elif is_prepared:
         status = "Already prepared"
-    elif can_prepare:
-        status = "Ready to prepare"
     else:
-        status = "Cannot prepare"
-        
+        status = "Ready to prepare"
+
+    can_prepare = has_complete_split(train_split) and train_rows > 0
+
     return {
         "name": dataset_name,
         "path": dataset_dir,
         "train_rows": train_rows,
-        "test_rows": test_rows,
         "validation_rows": validation_rows,
+        "test_rows": test_rows,
         "status": status,
         "can_prepare": can_prepare,
         "should_prepare": can_prepare and not is_prepared,
         "is_prepared": is_prepared
     }
 
-def get_single_prepared_dataset_info(prepared_dataset_dir: str, prepared_test_sets_dir: str = None) -> Dict:
+def get_single_prepared_dataset_info(prepared_dataset_dir: Path) -> Dict | None:
     if not prepared_dataset_dir.is_dir():
         return None
 
     dataset_name = prepared_dataset_dir.name
-    train_file = prepared_dataset_dir / "train.csv"
-    validation_file = prepared_dataset_dir / "validation.csv"
-    metadata_file = prepared_dataset_dir / "metadata.json"
-
-    if prepared_test_sets_dir:
-        test_file = Path(prepared_test_sets_dir) / dataset_name / "test.csv"
-    else:
-        test_file = None
+    train_split = prepared_dataset_dir / TRAIN_SPLIT
+    metadata_file = prepared_dataset_dir / METADATA_FILE_NAME
 
-    train_rows = count_csv_rows(str(train_file)) if train_file.exists() else 0
-    test_rows = count_csv_rows(str(test_file)) if (test_file and test_file.exists()) else 0
-    validation_rows = count_csv_rows(str(validation_file)) if validation_file.exists() else 0
-
-    # Test row count from metadata
-    test_rows = 0
+    train_rows = validation_rows = test_rows = 0
+    metadata_error = None
     try:
-        meta = json.loads(metadata_file.read_text())
-        splits = meta.get("splits", {}) if isinstance(meta, dict) else {}
-        test_rows = int(splits.get("test_rows", 0) or 0)
-    except Exception:
-        test_rows = 0
-
-    if not train_file.exists():
-        status = "Missing train.csv"
-    elif train_rows == 0:
-        status = "Empty train.csv"
+        splits = json.loads(metadata_file.read_text()).get("splits", {})
+        train_rows = int(splits.get("train_rows") or 0)
+        validation_rows = int(splits.get("validation_rows") or 0)
+        test_rows = int(splits.get("test_rows") or 0)
+    except FileNotFoundError:
+        metadata_error = "Missing metadata.json"
+    except Exception as exc:
+        metadata_error = f"Invalid metadata.json: {exc}"
+
+    if not train_split.exists():
+        status = "Missing train/"
+    elif not has_complete_split(train_split):
+        status = "Incomplete train/"
+    elif metadata_error:
+        status = metadata_error
     else:
         status = "Prepared"
 
@@ -139,17 +156,8 @@ def get_all_datasets_info(datasets_dir: str, prepared_datasets_dir: str) -> List
     datasets_info.sort(key=lambda x: x["name"])
     return datasets_info
 
-def get_all_prepared_datasets_info(prepared_datasets_dir: str, prepared_test_sets_dir: str = None) -> List[Dict]:
-    """
-    Collect information about all prepared datasets.
-
-    Args:
-        prepared_datasets_dir: Path to prepared datasets directory
-        prepared_test_sets_dir: Path to prepared test sets directory
-
-    Returns:
-        List of dataset information dictionaries
-    """
+def get_all_prepared_datasets_info(prepared_datasets_dir: str) -> List[Dict]:
+    """Collect information about all prepared datasets."""
     prepared_datasets_path = Path(prepared_datasets_dir)
 
     if not prepared_datasets_path.exists():
@@ -158,7 +166,7 @@ def get_all_prepared_datasets_info(prepared_datasets_dir: str, prepared_test_set
     prepared_datasets_info = []
 
     for prepared_dataset_dir in prepared_datasets_path.iterdir():
-        dataset_info = get_single_prepared_dataset_info(prepared_dataset_dir, prepared_test_sets_dir)
+        dataset_info = get_single_prepared_dataset_info(prepared_dataset_dir)
         if(dataset_info):
             prepared_datasets_info.append(dataset_info)
 
@@ -170,38 +178,8 @@ def check_dataset_prepared(dataset_dir: str, prepared_datasets_dir: str) -> bool
     """Check if a dataset is already prepared."""
     dataset_name = Path(dataset_dir).name
     prepared_path = Path(prepared_datasets_dir) / dataset_name
-    metadata_file = prepared_path / "metadata.json"
-    train_file = prepared_path / "train.csv"
-    return metadata_file.exists() and train_file.exists()
-
-def auto_detect_target_col(train_df, interactive=False):
-    """Auto-detect target column"""
-    possible_target_cols = ['class', 'target', 'label', 'y', 'CLASS', 'TARGET', 'LABEL', 'Y']
-    for col in possible_target_cols:
-        if col in train_df.columns:
-            print(f'INFO: Auto-detected target column: {col}')
-            return col
-
-    if interactive and sys.stdin.isatty():
-        console = Console()
-        print(f"\nCould not auto-detect target column. Expected one of {possible_target_cols}")
-        cols = train_df.columns.tolist()
-        table = Table(show_header=False, box=box.ROUNDED, padding=(0, 1))
-        num_cols = min(5, len(cols))
-        for _ in range(num_cols):
-            table.add_column(style="cyan", no_wrap=True)
-        for i in range(0, len(cols), num_cols):
-            table.add_row(*cols[i:i+num_cols])
-        console.print(f"\n[bold]Available columns ({len(cols)} total):[/bold]")
-        console.print(table)
-        console.print()
-        while True:
-            target_col = input("Enter the name of the target/label column: ").strip()
-            if target_col in train_df.columns:
-                return target_col
-            print(f"Column '{target_col}' not found. Please try again.")
-
-    raise ValueError(f"Could not auto-detect target column. Expected one of {possible_target_cols}, but found columns: {train_df.columns.tolist()}. Please specify --target-col explicitly.")
+    metadata_file = prepared_path / METADATA_FILE_NAME
+    return metadata_file.exists() and has_complete_split(prepared_path / TRAIN_SPLIT)
 
 def get_task_type_from_prepared_dataset(prepared_dataset_dir: str) -> str:
     metadata_path = prepared_dataset_dir / "metadata.json"
@@ -222,68 +200,6 @@ def get_classes_integers(config: Config):
     metadata = json.loads(metadata_path.read_text())
     # Sort by numeric value to get consistent ordering
     return sorted(metadata["label_to_scalar"].values())
-        
-def select_task_type(train_df, target_col, interactive=False):
-    if not interactive or not sys.stdin.isatty():
-        if interactive:
-            print(
-                "Dataset preparation requires task type selection, but stdin is not interactive. "
-                "Prepare this dataset with --task-type classification or --task-type regression."
-            )
-        raise ValueError("Task type is required. Pass --task-type classification or --task-type regression.")
-
-    print_target_column_summary(train_df, target_col)
-    Console().print("[bold red]Action needed:[/bold red] Select the task type for this dataset.")
-
-    while True:
-        choice = input("Select task type ([c]lassification/[r]egression): ").strip().lower()
-        if choice in ("c", "class", TaskTypes.CLASSIFICATION):
-            return TaskTypes.CLASSIFICATION
-        if choice in ("r", "reg", TaskTypes.REGRESSION):
-            return TaskTypes.REGRESSION
-        print(f"Please enter '{TaskTypes.CLASSIFICATION}' or '{TaskTypes.REGRESSION}'.")
-
-def print_target_column_summary(train_df, target_col):
-    target_values = train_df[target_col]
-    non_null_values = target_values.dropna()
-    unique_values = non_null_values.unique()
-    unique_preview = _format_unique_values_preview(unique_values)
-
-    console = Console()
-    console.print(
-        "\n[bold]Target column summary[/bold]\n"
-        f"- Column: [cyan]{target_col}[/cyan] ({target_values.dtype})\n"
-        f"- Unique non-missing values: {len(unique_values):,}\n"
-        f"- Unique labels: {unique_preview}"
-    )
-
-def _format_unique_values_preview(values, limit=20):
-    preview = _format_preview_values(values[:limit])
-    if len(values) > limit:
-        return f"{preview}, ..."
-    return preview
-
-def _format_preview_values(values):
-    formatted_values = []
-    for value in values:
-        if hasattr(value, "item"):
-            value = value.item()
-        formatted_values.append(repr(value))
-    return ", ".join(formatted_values)
-
-def validate_single_label_classification(train_df, target_col):
-    target_values = train_df[target_col].dropna()
-    for value in target_values:
-        if isinstance(value, (list, set)):
-            raise ValueError(
-                f"Target column '{target_col}' contains multi-label values (e.g., {value!r}). "
-                "Only single-label classification is supported."
-            )
-        if isinstance(value, str) and value.startswith('[') and value.endswith(']'):
-            raise ValueError(
-                f"Target column '{target_col}' appears to contain multi-label values (e.g., {value!r}). "
-                "Only single-label classification is supported."
-            )
 
 def smart_sort_labels(labels):
     """
@@ -318,189 +234,129 @@ def sort_key(label):
     
     return sorted(labels, key=sort_key)
 
-def get_label_to_number_map(train_df, test_df, target_col, positive_class=None, negative_class=None):
-    unique_labels = train_df[target_col].dropna().unique()
-
-    # Check test set for additional labels
-    if test_df is not None:
-        test_unique_labels = test_df[target_col].dropna().unique()
-        if(set(unique_labels) != set(test_unique_labels)):
-            print("WARNING: Mismatch in unique labels between train and test sets.", 
-                f"Train labels: {unique_labels}, Test labels: {test_unique_labels}")
-            unique_labels = set(unique_labels).union(set(test_unique_labels))
-
-    # Generate label to number mapping
-    if len(unique_labels) == 2 and positive_class and negative_class:
-        # binary classification
-        label_map = {negative_class: 0, positive_class: 1}
-    else:
-        # multiclass: smart semantic sorting
-        sorted_labels = smart_sort_labels(unique_labels)
-        label_map = {lbl: i for i, lbl in enumerate(sorted_labels)}
-
-    print(f"INFO: Label to number mapping: {label_map}. If this is wrong, please provide positive-class and negative-class parameters to the script.")
-
-    return label_map
-
-def add_id_column(train_df, validation_df=None, test_df=None):
+def prepare_dataset(dataset_dir, output_dir, test_sets_output_dir, task_type=None):
     """
-    Add ID column to dataframes.
-    - Train and validation share a common ID range (no overlap)
-    - Test has its own ID range starting from 0
-    - If the dataset already has an 'id' column, it's renamed to 'id_original'
-
-    Args:
-        train_df: Training dataframe
-        validation_df: Validation dataframe (optional)
-        test_df: Test dataframe (optional)
+    Copy a folder-contract dataset into the prepared dataset locations.
 
-    Returns:
-        Tuple of (train_df, validation_df, test_df) with ID columns added
-    """
-    # Rename existing 'id' columns if present
-    if 'id' in train_df.columns and 'id_original' in train_df.columns:
-        raise Exception('Cannot have both "id" and "id_original" columns in training data')
-    if 'id' in train_df.columns:
-        train_df.rename(columns={'id': 'id_original'}, inplace=True)
-
-    if validation_df is not None:
-        if 'id' in validation_df.columns and 'id_original' in validation_df.columns:
-            raise Exception('Cannot have both "id" and "id_original" columns in validation data')
-        if 'id' in validation_df.columns:
-            validation_df.rename(columns={'id': 'id_original'}, inplace=True)
-
-    if test_df is not None:
-        if 'id' in test_df.columns and 'id_original' in test_df.columns:
-            raise Exception('Cannot have both "id" and "id_original" columns in test data')
-        if 'id' in test_df.columns:
-            test_df.rename(columns={'id': 'id_original'}, inplace=True)
-
-    # Train IDs start at 0
-    train_df['id'] = range(len(train_df))
-
-    # Validation IDs continue after training
-    if validation_df is not None:
-        validation_start_id = len(train_df)
-        validation_df['id'] = range(validation_start_id, validation_start_id + len(validation_df))
-
-    # Test IDs start at 0
-    if test_df is not None:
-        test_df['id'] = range(len(test_df))
-
-    return train_df, validation_df, test_df
-
-def load_dataset_config(dataset_dir: Path) -> dict:
-    config_path = dataset_dir / "dataset_config.json"
-    if not config_path.exists():
-        return {}
-    config = json.loads(config_path.read_text())
-    if "task_type" in config and config["task_type"] not in TaskTypes:
-        raise ValueError(
-            f"Invalid task_type '{config['task_type']}' in {config_path}. "
-            f"Must be one of: {TaskTypes}"
-        )
-    return config
-
-def prepare_dataset(dataset_dir, target_col,
-                   positive_class, negative_class, task_type, output_dir, test_sets_output_dir, interactive=False):
-    """
-    Preprocesses dataset files to a format digestable by the agent code. Stores test set files in a separate directory.
-    If target_col is None, it will be auto-detected or prompted for in interactive mode.
-    If task_type is None, it will be prompted for in interactive mode.
-    If positive_class and negative_class are None, they will be auto-detected for binary classification and printed out
-    CLI args take precedence over dataset_config.json, which takes precedence over auto-detection.
+    Expected split layout:
+    <dataset>/<split>/input/ contains arbitrary model-readable input files.
+    <dataset>/<split>/labels.csv contains evaluator-readable labels keyed by id.
     """
     dataset_dir = Path(dataset_dir)
     output_dir = Path(output_dir)
     test_sets_output_dir = Path(test_sets_output_dir)
 
-    config = load_dataset_config(dataset_dir)
-    if target_col is None:
-        target_col = config.get("target_col")
-    if task_type is None:
-        task_type = config.get("task_type")
-    if positive_class is None:
-        positive_class = config.get("positive_class")
-    if negative_class is None:
-        negative_class = config.get("negative_class")
-
-    train = dataset_dir / 'train.csv'
-    test = dataset_dir / 'test.csv' if (dataset_dir / 'test.csv').exists() else None
-    validation = dataset_dir / 'validation.csv' if (dataset_dir / 'validation.csv').exists() else None
-    description = dataset_dir / 'dataset_description.md' if (dataset_dir / 'dataset_description.md').exists() else None
     dataset_name = dataset_dir.name
 
-    train_df = pd.read_csv(train)
-    test_df = pd.read_csv(test) if test else None
-    validation_df = pd.read_csv(validation) if validation else None
-
-    if target_col is None:
-        target_col = auto_detect_target_col(train_df, interactive=interactive)
-    if task_type is None:
-        task_type = select_task_type(train_df, target_col, interactive=interactive)
-
-    if task_type == TaskTypes.CLASSIFICATION:
-        validate_single_label_classification(train_df, target_col)
-        label_map = get_label_to_number_map(
-            train_df=train_df,
-            test_df=test_df,
-            target_col=target_col,
-            positive_class=positive_class,
-            negative_class=negative_class
-        )
-    train_df, validation_df, test_df = add_id_column(train_df, validation_df, test_df)
-
-    dataframes = [('train', train_df)]
-    if test_df is not None:
-        dataframes.append(('test', test_df))
-    if validation_df is not None:
-        dataframes.append(('validation', validation_df))
-    
     out_dir = output_dir / dataset_name
-    out_dir.mkdir(parents=True, exist_ok=True)
     test_out_dir = test_sets_output_dir / dataset_name
-    test_out_dir.mkdir(parents=True, exist_ok=True)
-    
-    # Generate prepared split CSV files with numeric labels, plus a labelless held-out test file.
-    for split_name, df in dataframes:
-        try:
-            if task_type == TaskTypes.CLASSIFICATION:
-                df['numeric_label'] = df[target_col].map(label_map)
-            else:
-                df['numeric_label'] = df[target_col]
-        except KeyError as e:
-            raise KeyError(f"Target column '{target_col}' not found in {split_name} dataset. Available columns: {df.columns}") from e
-
-        target_dir = test_out_dir if split_name == 'test' else out_dir
-        df.drop(columns=[target_col]).to_csv(target_dir / f'{split_name}.csv', index=False)
-        if split_name == 'test':
-            df.drop(columns=[target_col, 'numeric_label']).to_csv(target_dir / 'test.no_label.csv', index=False)
-    
-    # Generate dataset description file
-    if(description is not None):
-        description_content = description.read_text()
-        (out_dir / 'dataset_description.md').write_text(description_content)
+    if dataset_dir.resolve() == out_dir.resolve():
+        raise ValueError("Input dataset directory must be different from the prepared output directory.")
+
+    train_split = dataset_dir / TRAIN_SPLIT
+    if not has_complete_split(train_split):
+        raise FileNotFoundError(f"Required split folder is missing or incomplete: {train_split}")
+
+    split_row_counts = {}
+    source_splits = {}
+    for split_name in NON_TEST_SPLIT_NAMES:
+        source_split = dataset_dir / split_name
+        if source_split.exists():
+            if not has_complete_split(source_split):
+                raise FileNotFoundError(f"Split folder is incomplete: {source_split}")
+            labels_path = source_split / LABELS_FILE_NAME
+            validate_labels_csv(labels_path)
+            split_row_counts[split_name] = len(pd.read_csv(labels_path))
+            source_splits[split_name] = source_split
+
+    source_test_split = dataset_dir / TEST_SPLIT
+    if has_complete_split(source_test_split):
+        test_labels_path = source_test_split / LABELS_FILE_NAME
+        validate_labels_csv(test_labels_path)
+        split_row_counts[TEST_SPLIT] = len(pd.read_csv(test_labels_path))
+        source_splits[TEST_SPLIT] = source_test_split
+    elif source_test_split.exists():
+        raise FileNotFoundError(f"Split folder is incomplete: {source_test_split}")
+
+    description = dataset_dir / DATASET_DESCRIPTION_FILE_NAME
+    metadata_path = dataset_dir / METADATA_FILE_NAME
+    if metadata_path.exists():
+        meta = json.loads(metadata_path.read_text())
     else:
-        print("INFO: No dataset description provided.")
-        (out_dir / 'dataset_description.md').write_text("No dataset description available.")
-    
-     # Generate metadata file
-    meta = {
-        'task_type': task_type,
-        'class_col': target_col,
-        'numeric_label_col': 'numeric_label',
-        # Store split row counts
-        'splits': {
-            'train_rows': len(train_df),
-            'validation_rows': len(validation_df) if validation_df is not None else 0,
-            'test_rows': len(test_df) if test_df is not None else 0,
+        if task_type is None:
+            raise FileNotFoundError(
+                f"{METADATA_FILE_NAME} is required for folder-based datasets unless task_type is provided."
+            )
+        meta = {
+            "task_type": task_type,
+            "numeric_label_col": "numeric_label",
         }
+
+    meta["splits"] = {
+        "train_rows": split_row_counts[TRAIN_SPLIT],
+        "validation_rows": split_row_counts.get(VALIDATION_SPLIT, 0),
+        "test_rows": split_row_counts.get(TEST_SPLIT, 0),
     }
-    if task_type == TaskTypes.CLASSIFICATION:
-        json_safe_label_map = {str(k): int(v) for k, v in label_map.items()}
-        meta['label_to_scalar'] = json_safe_label_map
+    meta.setdefault("numeric_label_col", "numeric_label")
+    if meta.get("task_type") == "classification" and "label_to_scalar" not in meta:
+        public_splits = [split_name for split_name in NON_TEST_SPLIT_NAMES if split_name in source_splits]
+        meta["label_to_scalar"] = derive_class_label_map(dataset_dir, public_splits)
+
+    remove_path(out_dir)
+    out_dir.mkdir(parents=True)
+    remove_path(test_out_dir)
+
+    for split_name in NON_TEST_SPLIT_NAMES:
+        if split_name in source_splits:
+            copy_path_overwriting_target(source_splits[split_name], out_dir / split_name)
+
+    if TEST_SPLIT in source_splits:
+        test_out_dir.mkdir(parents=True)
+        copy_path_overwriting_target(source_splits[TEST_SPLIT], test_out_dir / TEST_SPLIT)
+
+    train_input_dir = out_dir / TRAIN_SPLIT / INPUT_DIR_NAME
+    input_structure = record_input_structure(train_input_dir)
+    meta["input_structure"] = input_structure
+
+    if VALIDATION_SPLIT in source_splits:
+        validate_input_structure(
+            out_dir / VALIDATION_SPLIT / INPUT_DIR_NAME, input_structure, "Validation"
+        )
+    if TEST_SPLIT in source_splits:
+        validate_input_structure(
+            test_out_dir / TEST_SPLIT / INPUT_DIR_NAME, input_structure, "Test"
+        )
 
-    (out_dir / 'metadata.json').write_text(json.dumps(meta, indent=4))
+    supplementary_source = dataset_dir / SUPPLEMENTARY_DIR_NAME
+    if supplementary_source.is_dir():
+        copy_path_overwriting_target(supplementary_source, out_dir / SUPPLEMENTARY_DIR_NAME)
+
+    if description.exists():
+        copy_path_overwriting_target(description, out_dir / DATASET_DESCRIPTION_FILE_NAME)
+    else:
+        print("INFO: No dataset description provided.")
+        (out_dir / DATASET_DESCRIPTION_FILE_NAME).write_text(
+            "No dataset description available.",
+            encoding="utf-8",
+        )
+
+    (out_dir / METADATA_FILE_NAME).write_text(json.dumps(meta, indent=4), encoding="utf-8")
+
+def derive_class_label_map(dataset_dir: Path, split_names) -> dict[str, int]:
+    class_ids = set()
+    for split_name in split_names:
+        labels_file = dataset_dir / split_name / LABELS_FILE_NAME
+        with open(labels_file, newline="", encoding="utf-8") as labels_stream:
+            reader = csv.DictReader(labels_stream)
+            for row in reader:
+                numeric_label = float(row[NUMERIC_LABEL_COLUMN_NAME])
+                if not numeric_label.is_integer():
+                    raise ValueError(
+                        f"{split_name}/labels.csv has non-integer classification label: {numeric_label}"
+                    )
+                class_ids.add(int(numeric_label))
+
+    return {str(class_id): class_id for class_id in sorted(class_ids)}
 
 def setup_nonsensitive_dataset_files_for_agent(prepared_datasets_dir: Path, agent_datasets_dir: Path, dataset_name: str):
     """
@@ -512,17 +368,20 @@ def setup_nonsensitive_dataset_files_for_agent(prepared_datasets_dir: Path, agen
 
     assert target_dataset_dir.is_dir()
 
-    target_files = ['dataset_description.md', 'train.csv', 'validation.csv']
-    for file in target_files:
-        source_file = source_dataset_dir / file
-        target_file = target_dataset_dir / file
-
-        if source_file.exists():
-            if target_file.exists() or target_file.is_symlink():
-                    target_file.unlink()
-
+    target_paths = [
+        DATASET_DESCRIPTION_FILE_NAME,
+        METADATA_FILE_NAME,
+        SUPPLEMENTARY_DIR_NAME,
+        TRAIN_SPLIT,
+        VALIDATION_SPLIT,
+    ]
+    for relative_path in target_paths:
+        source_path = source_dataset_dir / relative_path
+        target_path = target_dataset_dir / relative_path
+
+        if source_path.exists():
             #TODO why was this changed from a symlink to a copy?
             #TODO bug was this: if agent changed the file in it's own workspace folder, it was changing the OG prepared files
-            #TODO make it read-only simlink 
+            #TODO make it read-only symlink
             #TODO check raw data -> prepared data is not a symlink but a hard copy!
-            shutil.copy2(source_file, target_file)
+            copy_path_overwriting_target(source_path, target_path)
diff --git a/src/datasets/datasets_interactive.py b/src/datasets/datasets_interactive.py
index 11ef35b9..21bb8f77 100644
--- a/src/datasets/datasets_interactive.py
+++ b/src/datasets/datasets_interactive.py
@@ -71,7 +71,7 @@ def prepare_all_datasets(datasets_dir: str, prepared_datasets_dir: str, prepared
         console.print("")
         console.print("[blue]To add datasets:[/blue]")
         console.print(f"   1. Create directories in {datasets_dir}/ (e.g. {datasets_dir}/my_dataset/)")
-        console.print("   2. Add train.csv (and optionally test.csv) to each dataset directory")
+        console.print("   2. Add train/input/ and train/labels.csv to each dataset directory")
         console.print("   3. Run preparation again")
         sys.exit(1)
     
@@ -88,7 +88,7 @@ def prepare_all_datasets(datasets_dir: str, prepared_datasets_dir: str, prepared
     
     if not need_preparation:
         if not already_prepared:
-            console.print("[yellow]No datasets can be prepared. Check for missing train.csv files.[/yellow]")
+            console.print("[yellow]No datasets can be prepared. Check for missing or invalid train/ split folders.[/yellow]")
         sys.exit(0)
     
     # Prepare datasets with progress display
@@ -103,13 +103,8 @@ def prepare_all_datasets(datasets_dir: str, prepared_datasets_dir: str, prepared
         try:
             prepare_dataset(
                 dataset_dir=dataset_info['path'],
-                target_col=None, #auto-detected inside
-                positive_class=None, #auto-detected inside
-                negative_class=None, #auto-detected inside
-                task_type=None,
                 output_dir=prepared_datasets_dir,
-                interactive=True,
-                test_sets_output_dir=prepared_test_sets_dir
+                test_sets_output_dir=prepared_test_sets_dir,
             )
             success = True
         except Exception as e:
diff --git a/src/datasets/normalize_dataset.py b/src/datasets/normalize_dataset.py
deleted file mode 100644
index ae3d13b7..00000000
--- a/src/datasets/normalize_dataset.py
+++ /dev/null
@@ -1,44 +0,0 @@
-import argparse
-import csv
-import sys
-from pathlib import Path
-
-
-def normalize_input_dataset(input_path: Path, output_path: Path):
-    """
-    Normalize a csv dataset before entering the train/inference pipeline. Ensures it contains the id column.
-    If 'id' column is already provided, keeps original csv.
-    Uses only stdlib csv to avoid requiring pandas in the calling environment.
-    """
-    with open(input_path, newline='') as fin:
-        reader = csv.reader(fin)
-        header = next(reader, None)
-        if header is None:
-            raise ValueError(f"Input CSV is empty: {input_path}")
-        if 'id' in header:
-            return False  # original had id, no normalization needed
-        rows = list(reader)
-
-    with open(output_path, 'w', newline='') as fout:
-        writer = csv.writer(fout)
-        writer.writerow(['id'] + header)
-        for idx, row in enumerate(rows):
-            writer.writerow([idx] + row)
-
-    print("[Warning] Input CSV has no 'id' column. Sequential IDs (0..N-1) added in a temporary file, used for running inference on. If you need specific IDs, include an 'id' column in the input csv", file=sys.stderr)
-    return True
-
-def main():
-    parser = argparse.ArgumentParser(description="Normalize a CSV dataset for use in the pipeline.")
-    parser.add_argument("--input", required=True, help="Path to the input CSV file")
-    args = parser.parse_args()
-
-    input_path = Path(args.input)
-    output_path = input_path.parent / f"normalized_{input_path.name}"
-
-    normalized = normalize_input_dataset(input_path, output_path)
-    if normalized:
-        print(output_path.name)
-
-if __name__ == "__main__":
-    main()
diff --git a/src/prepare_datasets.py b/src/prepare_datasets.py
index d61471a6..040b87fb 100644
--- a/src/prepare_datasets.py
+++ b/src/prepare_datasets.py
@@ -3,19 +3,16 @@
 from pathlib import Path
 from rich.console import Console
 
-from datasets.dataset_utils import prepare_dataset, check_dataset_prepared
+from datasets.dataset_utils import prepare_dataset
 from datasets.datasets_interactive import prepare_all_datasets
 from utils.task_types import TaskTypes
 
 
 def parse_args():
-    parser = argparse.ArgumentParser(description="Dataset preparation")
+    parser = argparse.ArgumentParser(description="Prepare folder-contract datasets")
     parser.add_argument('--dataset-dir', type=Path, help='Single dataset directory to prepare')
-    parser.add_argument('--prepare-all', action='store_true', help='Prepare all datasets in datasets-dir')
-    parser.add_argument('--target-col', type=str, default=None, help='Target column name (auto-detected if not provided)')
+    parser.add_argument('--prepare-all', action='store_true', help='Prepare all folder-contract datasets in datasets-dir')
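-    parser.add_argument('--task-type', choices=sorted(TaskTypes), default=None, help='Task type (prompted if not provided)')
+    parser.add_argument('--task-type', choices=sorted(TaskTypes), default=None, help='Task type (required if the dataset has no metadata.json)')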
-    parser.add_argument('--positive-class', help='Value used in the label column for a positive class (affects some binary classification metrics). If not provided, numeric labels are assigned based on the label appearance order in the train csv file.', default=None)
-    parser.add_argument('--negative-class', help='Value used in the label column for a negative class (affects some binary classification metrics). If not provided, numeric labels are assigned based on the label appearance order in the train csv file.', default=None)
     parser.add_argument('--datasets-dir', default='./datasets', help='Directory containing raw datasets')
     parser.add_argument('--prepared-datasets-dir', default='./prepared_datasets', help='Output directory for prepared datasets')
     parser.add_argument('--prepared-test-sets-dir', default='./prepared_test_sets', help='Output directory for prepared test sets')
@@ -35,20 +32,14 @@ def main():
 
     if args.prepare_all or not dataset_dir:
         prepare_all_datasets(datasets_dir, prepared_datasets_dir, prepared_test_sets_dir)
-    elif check_dataset_prepared(str(dataset_dir), str(prepared_datasets_dir)):
-        console.print(f'[blue]Dataset "{dataset_dir.name}" already prepared, skipping preparation[/blue]')
     else:
-        console.print(f'[blue]Preparing dataset "{dataset_dir.name}"[/blue]')# for {task_type} task with target column "{target_col}"[/blue]')
+        console.print(f'[blue]Preparing dataset "{dataset_dir.name}"[/blue]')
         try:
             prepare_dataset(
                 dataset_dir=dataset_dir,
-                target_col=args.target_col,
-                positive_class=args.positive_class,
-                negative_class=args.negative_class,
-                task_type=args.task_type,
                 output_dir=prepared_datasets_dir,
                 test_sets_output_dir=prepared_test_sets_dir,
-                interactive=sys.stdin.isatty(),
+                task_type=args.task_type,
             )
             console.print(f"[green]Dataset '{dataset_dir.name}' prepared successfully![/green]")
         except Exception as e:
@@ -56,4 +47,4 @@ def main():
             sys.exit(1)
         
 if __name__ == "__main__":
-    main()
\ No newline at end of file
+    main()
diff --git a/src/run_agent_interactive.py b/src/run_agent_interactive.py
index 088e5b14..66158426 100755
--- a/src/run_agent_interactive.py
+++ b/src/run_agent_interactive.py
@@ -10,6 +10,7 @@
 from runtime.read_write_utils import get_next_iteration_index, load_config_from_run_dir
 from utils.config import Config
 from datasets.dataset_utils import get_all_prepared_datasets_info, get_task_type_from_prepared_dataset
+from datasets.data_contract import VALIDATION_SPLIT
 from datasets.datasets_interactive import interactive_dataset_selection, print_datasets_table
 from run_logging.env_utils import are_wandb_vars_available
 from utils.metrics import get_classification_metrics_names, get_regression_metrics_names, resolve_val_metric
@@ -258,7 +259,7 @@ def main():
     val_metric = resolve_val_metric(task_type, args.val_metric)
 
     split_allowed_iterations = args.split_allowed_iterations
-    if (args.prepared_datasets_dir / dataset / "validation.csv").exists():
+    if (args.prepared_datasets_dir / dataset / VALIDATION_SPLIT).exists():
         split_allowed_iterations = 0
 
     asyncio.run(
diff --git a/src/run_logging/test_evaluation.py b/src/run_logging/test_evaluation.py
index 69e1c21f..53e29fce 100644
--- a/src/run_logging/test_evaluation.py
+++ b/src/run_logging/test_evaluation.py
@@ -12,6 +12,7 @@
 from runtime.inference_runner import compute_metrics, run_inference_script
 from runtime.read_write_utils import load_config_from_run_dir, load_dataset_metadata
 from utils.config import Config
+from datasets.data_contract import INPUT_DIR_NAME, LABELS_FILE_NAME, TEST_SPLIT
 from utils.exceptions import AgentScriptFailed
 from utils.metrics import get_task_to_metrics_names
 from utils.printing_utils import print_best_iteration_metrics
@@ -39,12 +40,15 @@ def run_test_evaluation(workspace_dir, prepared_test_sets_dir: Path):
         inference_script_path = best_iteration_snapshot_dir / "model_inference" / "inference.py"
         output_path = config.best_iteration_snapshot_dir / "eval_predictions_test.csv"
         remove_path(output_path)
-        prepared_test_set_dir = prepared_test_sets_dir / config.dataset
-        test_input_path = prepared_test_set_dir / "test.no_label.csv"
-        labeled_test_path = prepared_test_set_dir / "test.csv"
-        if not test_input_path.exists() or not labeled_test_path.exists():
+        test_split_path = prepared_test_sets_dir / config.dataset / TEST_SPLIT
+        if not test_split_path.exists():
+            print(f"No prepared test split found at {test_split_path}. Skipping final test evaluation.")
+            return
+        test_input_path = test_split_path / INPUT_DIR_NAME
+        labels_path = test_split_path / LABELS_FILE_NAME
+        if not test_input_path.is_dir() or not labels_path.is_file():
             raise FileNotFoundError(
-                f"Expected both test.no_label.csv and test.csv in {prepared_test_set_dir} for final test evaluation."
+                f"Expected input/ and labels.csv in {test_split_path} for final test evaluation."
             )
         result = run_inference_script(
             env_path=get_best_iteration_snapshot_environment_path(config),
@@ -58,7 +62,7 @@ def run_test_evaluation(workspace_dir, prepared_test_sets_dir: Path):
             raise AgentScriptFailed(f"Inference on test failed: {str(result)}")
         metrics = compute_metrics(
             results_file=output_path,
-            labeled_input_path=labeled_test_path,
+            labels_path=labels_path,
             numeric_label_col=dataset_metadata["numeric_label_col"],
             task_type=dataset_metadata["task_type"],
             evaluation_stage="test",
diff --git a/src/runtime/generate_final_reports.py b/src/runtime/generate_final_reports.py
index 80dfdb47..b9599383 100644
--- a/src/runtime/generate_final_reports.py
+++ b/src/runtime/generate_final_reports.py
@@ -36,6 +36,8 @@
 )
 from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
 
+from datasets.data_contract import LABELS_FILE_NAME, TEST_SPLIT, TRAIN_SPLIT, VALIDATION_SPLIT
+
 
 # Data models
 @dataclass(frozen=True)
@@ -278,15 +280,23 @@ def load_prepared_dataset_meta(prepared_datasets_dir: Path, dataset_name: str) -
     )
 
 # Iteration discovery / inputs
-def _find_split_csv(config: Config, split_name: str, fallback_dir: Path) -> Optional[Path]:
-    if config.splits_dir.exists():
-        for split_dir in sorted(config.splits_dir.iterdir()):
-            candidate = split_dir / f"{split_name}.csv"
-            if candidate.exists():
-                return candidate
-    p = fallback_dir / config.dataset / f"{split_name}.csv"
+def _find_split_labels(
+    config: Config,
+    split_name: str,
+    fallback_dir: Path,
+    split_version: int | None = None,
+) -> Optional[Path]:
+    if split_version is not None:
+        candidate = config.splits_dir / f"split_{split_version}" / split_name / LABELS_FILE_NAME
+        if candidate.exists():
+            return candidate
+    p = fallback_dir / config.dataset / split_name / LABELS_FILE_NAME
     return p if p.exists() else None
 
+def _get_iteration_split_version(config: Config, iter_dir: Path) -> int | None:
+    data_split_output = load_step_output(config, "data_split", iter_dir)
+    return getattr(data_split_output, "split_version", None)
+
 def gather_iteration_inputs(
     config: Config,
     prepared_datasets: Path,
@@ -301,9 +311,10 @@ def gather_iteration_inputs(
     if not report_md.exists():
         report_md = None
 
-    train_csv = _find_split_csv(config, "train", prepared_datasets)
-    val_csv = _find_split_csv(config, "validation", prepared_datasets)
-    test_csv = _find_split_csv(config, "test", prepared_tests)
+    split_version = _get_iteration_split_version(config, iter_dir)
+    train_csv = _find_split_labels(config, TRAIN_SPLIT, prepared_datasets, split_version=split_version)
+    val_csv = _find_split_labels(config, VALIDATION_SPLIT, prepared_datasets, split_version=split_version)
+    test_csv = _find_split_labels(config, TEST_SPLIT, prepared_tests)
 
     validation_evaluation_dir = iter_dir / ValidationEvaluationStep.step_id
     train_preds = validation_evaluation_dir / "eval_predictions_train.csv"
diff --git a/src/runtime/inference_runner.py b/src/runtime/inference_runner.py
index 83462177..ff39bccf 100644
--- a/src/runtime/inference_runner.py
+++ b/src/runtime/inference_runner.py
@@ -3,11 +3,10 @@
 import subprocess
 from pathlib import Path
 
-import pandas as pd
-
 from runtime.evaluate_result import get_metrics
 from runtime.filesystem import remove_path
 from utils.exceptions import AgentScriptFailed
+from datasets.data_contract import INPUT_DIR_NAME
 
 
 def run_inference_script(
@@ -30,49 +29,36 @@ def run_inference_script(
         cwd=script_path.parent,
         check=check,
     )
-#TODO should the temp files be in a specific folder?
-#TODO is this writing in the split folder? That would fail due to permissions
-# Only for runs that need labelless files generated on the fly.
-def run_inference_on_labeled_data(evaluation_stage: str, labeled_input_path: Path, output_path: Path, conda_env_path: Path, inference_script_path: Path, training_artifacts_dir: Path, label_col: str):
-    """Run inference script after creating a temporary labelless input."""
-    inference_input_path = output_path.parent / f"{evaluation_stage}.no_label.csv"
-    try:
-        remove_path(inference_input_path)
-        _create_labelless_file(
-            labeled_input_path=labeled_input_path,
-            target_path=inference_input_path,
-            label_col=label_col,
-        )
-        remove_path(output_path)
-        return run_inference_script(
-            env_path=conda_env_path,
-            script_path=inference_script_path,
-            input_path=inference_input_path,
-            output_path=output_path,
-            artifacts_dir=training_artifacts_dir,
-        )
-    finally:
-        remove_path(inference_input_path)
+def run_inference_on_split(
+    split_path: Path,
+    output_path: Path,
+    conda_env_path: Path,
+    inference_script_path: Path,
+    training_artifacts_dir: Path,
+):
+    input_path = split_path / INPUT_DIR_NAME
+    if not input_path.is_dir():
+        raise FileNotFoundError(f"Split input folder not found at {input_path}")
+    remove_path(output_path)
+    return run_inference_script(
+        env_path=conda_env_path,
+        script_path=inference_script_path,
+        input_path=input_path,
+        output_path=output_path,
+        artifacts_dir=training_artifacts_dir,
+    )
 
-def compute_metrics(results_file: Path, labeled_input_path: Path, numeric_label_col: str, task_type: str, evaluation_stage: str) -> dict | None:
-    """Compute metrics from inference output. Returns None if labeled_input does not exist."""
-    if not labeled_input_path.exists():
+def compute_metrics(results_file: Path, labels_path: Path, numeric_label_col: str, task_type: str, evaluation_stage: str) -> dict | None:
+    """Compute metrics from inference output. Returns None if labels_path does not exist."""
+    if not labels_path.exists():
         return None
     try:
         return get_metrics(
             results_file=results_file,
-            test_file=labeled_input_path,
+            test_file=labels_path,
             numeric_label_col=numeric_label_col,
             task_type=task_type,
         )
     except Exception as error:
         #TODO needs a prediction-specific exception?
         raise AgentScriptFailed(f"Metrics computation failed for {evaluation_stage}.") from error
-
-def _create_labelless_file(labeled_input_path: Path, target_path: Path, label_col: str):
-    if not labeled_input_path.exists():
-        raise FileNotFoundError(f"Labeled input file not found at {labeled_input_path} during the inference stage")
-    labeled_df = pd.read_csv(labeled_input_path)
-    if label_col not in labeled_df.columns:
-        raise ValueError(f"No {label_col} column found in {labeled_input_path}.")
-    labeled_df.drop(columns=[label_col]).to_csv(target_path, index=False)
diff --git a/src/runtime/read_write_utils.py b/src/runtime/read_write_utils.py
index 9b650129..d0923383 100644
--- a/src/runtime/read_write_utils.py
+++ b/src/runtime/read_write_utils.py
@@ -18,7 +18,6 @@
 def initialize_run_directories(config: Config) -> None:
     config.markdown_reports_dir.mkdir(parents=True, exist_ok=True)
     config.pdf_reports_dir.mkdir(parents=True, exist_ok=True)
-    config.fallbacks_dir.mkdir(parents=True, exist_ok=True)
     config.extras_dir.mkdir(parents=True, exist_ok=True)
     config.run_dir.mkdir(parents=True, exist_ok=True)
     config.shared_dir.mkdir(parents=True, exist_ok=True)
diff --git a/src/runtime/stealth_test_evaluation.py b/src/runtime/stealth_test_evaluation.py
index 4123c9b7..9dcfe038 100644
--- a/src/runtime/stealth_test_evaluation.py
+++ b/src/runtime/stealth_test_evaluation.py
@@ -21,6 +21,7 @@
 from runtime.inference_runner import run_inference_script
 from runtime.read_write_utils import get_archived_iterations, load_config_from_run_dir_and_reroot, load_dataset_metadata
 from utils.config import Config
+from datasets.data_contract import INPUT_DIR_NAME, LABELS_FILE_NAME, TEST_SPLIT
 
 ITERATION_TEST_ENVS_DIRNAME = "_iteration_test_envs"
 STEALTH_TEST_METRICS_FILENAME = "stealth_test_metrics.json"
@@ -30,12 +31,12 @@
 def evaluate_stealth_test_history(agent_dir: Path, prepared_test_sets_dir: Path) -> None:
     config = load_config_from_run_dir_and_reroot(agent_dir / Config.RUN_DIRNAME)
     metadata = load_dataset_metadata(config)
-    prepared_test_set_dir = prepared_test_sets_dir / config.dataset
-    test_input_path = prepared_test_set_dir / "test.no_label.csv"
-    labeled_test_path = prepared_test_set_dir / "test.csv"
-    if not test_input_path.exists() or not labeled_test_path.exists():
+    test_split_path = prepared_test_sets_dir / config.dataset / TEST_SPLIT
+    test_input_path = test_split_path / INPUT_DIR_NAME
+    labels_path = test_split_path / LABELS_FILE_NAME
+    if not test_input_path.is_dir() or not labels_path.is_file():
         raise FileNotFoundError(
-            f"Expected both test.no_label.csv and test.csv in {prepared_test_set_dir} for iteration test evaluation."
+            f"Expected input/ and labels.csv in {test_split_path} for iteration test evaluation."
         )
 
     run = resume_wandb_run(config, dir=config.extras_dir / "iteration_test_logs")
@@ -46,7 +47,7 @@ def evaluate_stealth_test_history(agent_dir: Path, prepared_test_sets_dir: Path)
             config=config,
             metadata=metadata,
             test_input_path=test_input_path,
-            labeled_test_path=labeled_test_path,
+            labels_path=labels_path,
         )
     finally:
         if run is not None:
@@ -63,7 +64,7 @@ def _evaluate_iterations(
     config: Config,
     metadata: dict[str, object],
     test_input_path: Path,
-    labeled_test_path: Path,
+    labels_path: Path,
 ) -> list[dict]:
     iteration_dirs = [
         (iteration, config.iteration_dir(iteration))
@@ -82,7 +83,7 @@ def _evaluate_iterations(
                 config=config,
                 metadata=metadata,
                 test_input_path=test_input_path,
-                labeled_test_path=labeled_test_path,
+                labels_path=labels_path,
                 temp_dir=temp_dir_path,
             )
             results.append(result)
@@ -102,7 +103,7 @@ def _evaluate_single_iteration(
     config: Config,
     metadata: dict[str, object],
     test_input_path: Path,
-    labeled_test_path: Path,
+    labels_path: Path,
     temp_dir: Path,
 ) -> dict:
     #TODO should use the step id by importing it
@@ -124,7 +125,7 @@ def _evaluate_single_iteration(
         )
         metrics = get_metrics(
             results_file=output_path,
-            test_file=labeled_test_path,
+            test_file=labels_path,
             numeric_label_col=metadata["numeric_label_col"],
             task_type=metadata["task_type"],
         )
diff --git a/src/utils/config.py b/src/utils/config.py
index 89d2d5cb..2f25b6db 100644
--- a/src/utils/config.py
+++ b/src/utils/config.py
@@ -96,11 +96,6 @@ class Config:
     def run_dir(self) -> Path:
         return Path(self.workspace_dir) / self.RUN_DIRNAME
 
-
-    @property
-    def fallbacks_dir(self) -> Path:
-        return Path(self.workspace_dir) / "fallbacks"
-
     @property
     def reports_dir(self) -> Path:
         return Path(self.workspace_dir) / "reports"
diff --git a/task.md b/task.md
new file mode 100644
index 00000000..90bcb544
--- /dev/null
+++ b/task.md
@@ -0,0 +1,15 @@
+Task: Add any data type as input
+
+Goal is to make the input super-generic (now just csv)
+It will need to be folder-based  
+- each dataset will need 3 folders - train, test, extras (+ optionally valid folder that always follows the same structure as train)
+- train and test folders need to contain the labels file (id + label). The path to these folders will be a requirement of train/inference scripts. They can have different content - inference script can require a different folder structure than the train script
+- extras can contain anything (pdfs, scripts, ...) and can be read-accessed (or copied, etc.) by the agent at any time, however the extras folder cannot be an input to any of the scripts. This folder should live outside of the dataset folder, but still be accessible (read-only) by the agent. This folder will be mounted separately to the docker container, and should be remembered in some metadata file, to allow forked runs to mount it the same way without the user re-specifying the location.
+- allow the agent to enhance the TRAIN folder with new files - probably during the split step - it would re-define the structure of that folder, which should be captured in metadata and the same structure needs to be present on training/re-training the model. These new files can come from anywhere. 
+
+This might mean we lose some validators, e.g. we currently validate that train.csv doesn't overlap with valid.csv -> probably impossible to check that in a generic way
+We can drop csv-specific code to keep the codebase generic and single-sourced
+
+
+train/valid have subfolders - input (visible to training and inference) and extras (visible to training). This way extras can contain files visible to the training script and input can define the interface needed for the inference script. Track the structures in a metadata file so we can enforce or check them on demand. The input structure cannot change during the run (it stays the same from the beginning to allow test-inference compatibility). This also means the train/valid and test input structures must be the same at all times and should be checked in the validators etc.
+
diff --git a/test/run_all_tests.py b/test/run_all_tests.py
index 8b7a4cac..3ed358a8 100644
--- a/test/run_all_tests.py
+++ b/test/run_all_tests.py
@@ -1,5 +1,10 @@
 import unittest
 import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
 
 if __name__ == "__main__":
     print("="*20, "Running all tests", "="*20)
@@ -8,4 +13,4 @@
     suite = loader.discover('test', pattern="test_*.py")
     runner = unittest.TextTestRunner(verbosity=2)
     result = runner.run(suite)
-    sys.exit(0 if result.wasSuccessful() else 1)
\ No newline at end of file
+    sys.exit(0 if result.wasSuccessful() else 1)
diff --git a/test/test_agent_permissions.py b/test/test_agent_permissions.py
index 7675c58e..bb58f2e5 100644
--- a/test/test_agent_permissions.py
+++ b/test/test_agent_permissions.py
@@ -42,7 +42,7 @@ def test_protection_test_dataset(self):
         result = self.bash_tool.function(f"touch {self.config.agent_dataset_dir}/test_write.txt 2>&1")
         self.assertTrue("Permission denied" or "Command failed" in result, "Agent should not write to datasets directory")
         
-        result = self.bash_tool.function(f"head -5 {self.config.agent_dataset_dir}/{self.config.dataset}/test.csv 2>&1")
+        result = self.bash_tool.function(f"ls {self.config.agent_dataset_dir}/test/input 2>&1")
         self.assertTrue(
             "Permission denied" in result or "Command failed" in result or "No such file" in result,
             "Agent should not read test data content"
@@ -55,14 +55,17 @@ def test_agent_dataset_access_permissions(self):
         self.assertNotIn("Permission denied", result, "Agent should access its dataset directory")
         self.assertNotIn("Command failed", result, "ls command should succeed")
 
-        result = self.bash_tool.function(f"head -5 {self.config.agent_dataset_dir}/train.csv 2>&1")
+        result = self.bash_tool.function(f"ls {self.config.agent_dataset_dir}/train/input 2>&1")
         self.assertNotIn("Permission denied", result, "Agent should read its dataset content")
-        self.assertNotIn("Command failed", result, "head command should succeed")
+        self.assertNotIn("Command failed", result, "ls command should succeed")
     
     def test_agent_access_to_datasets(self):
         """Test that agent can access all files in prepared_datasets and not the ones in prepared_test_sets."""
         for file in self.config.prepared_dataset_dir.iterdir():
-            result = self.bash_tool.function(f"head -5 {self.config.prepared_dataset_dir}/{file.name} 2>&1")
+            if file.is_dir():
+                result = self.bash_tool.function(f"ls {self.config.prepared_dataset_dir}/{file.name} 2>&1")
+            else:
+                result = self.bash_tool.function(f"head -5 {self.config.prepared_dataset_dir}/{file.name} 2>&1")
             self.assertNotIn("Permission denied", result, f"Agent should access the {file.name} file in prepared_datasets")
 
         test_set_dir = self.prepared_test_sets_dir / self.config.dataset
diff --git a/test/test_best_iteration_snapshot_and_splits.py b/test/test_best_iteration_snapshot_and_splits.py
index 1d60fbea..0bb4e0b2 100644
--- a/test/test_best_iteration_snapshot_and_splits.py
+++ b/test/test_best_iteration_snapshot_and_splits.py
@@ -2,6 +2,7 @@
 import shutil
 import sys
 import tempfile
+import types
 import unittest
 from pathlib import Path
 from unittest.mock import Mock, patch
@@ -25,6 +26,12 @@
 from utils.config import Config
 
 
+def _write_split_folder(split_path: Path, row_id: str = "row-1") -> None:
+    (split_path / "input").mkdir(parents=True, exist_ok=True)
+    (split_path / "input" / "examples.txt").write_text(f"{row_id}\n", encoding="utf-8")
+    (split_path / "labels.csv").write_text(f"id,numeric_label\n{row_id},0\n", encoding="utf-8")
+
+
 def _write_step_output(iteration_dir: Path, step_id: str, output) -> None:
     """Write a step output file to an iteration directory, mimicking the archived layout."""
     payload = output.model_dump() if hasattr(output, "model_dump") else output
@@ -40,6 +47,43 @@ def _write_step_output(iteration_dir: Path, step_id: str, output) -> None:
     )
 
 
+def _report_generation_dependency_stubs() -> dict[str, types.ModuleType]:
+    matplotlib_stub = types.ModuleType("matplotlib")
+    matplotlib_stub.use = lambda *args, **kwargs: None
+    pyplot_stub = types.ModuleType("matplotlib.pyplot")
+
+    reportlab_stub = types.ModuleType("reportlab")
+    reportlab_lib_stub = types.ModuleType("reportlab.lib")
+    colors_stub = types.ModuleType("reportlab.lib.colors")
+    pagesizes_stub = types.ModuleType("reportlab.lib.pagesizes")
+    pagesizes_stub.A4 = (595, 842)
+    units_stub = types.ModuleType("reportlab.lib.units")
+    units_stub.cm = 28.35
+    platypus_stub = types.ModuleType("reportlab.platypus")
+    styles_stub = types.ModuleType("reportlab.lib.styles")
+
+    class DummyReportlabObject:
+        def __init__(self, *args, **kwargs):
+            pass
+
+    for name in ("SimpleDocTemplate", "Paragraph", "Spacer", "Table", "TableStyle", "PageBreak", "Image"):
+        setattr(platypus_stub, name, DummyReportlabObject)
+    styles_stub.getSampleStyleSheet = lambda: {}
+    styles_stub.ParagraphStyle = DummyReportlabObject
+
+    return {
+        "matplotlib": matplotlib_stub,
+        "matplotlib.pyplot": pyplot_stub,
+        "reportlab": reportlab_stub,
+        "reportlab.lib": reportlab_lib_stub,
+        "reportlab.lib.colors": colors_stub,
+        "reportlab.lib.pagesizes": pagesizes_stub,
+        "reportlab.lib.units": units_stub,
+        "reportlab.platypus": platypus_stub,
+        "reportlab.lib.styles": styles_stub,
+    }
+
+
 class TestBestIterationSnapshot(unittest.TestCase):
     """update_best_iteration_snapshot must publish, clear, or skip based on is_new_best and split_changed."""
 
@@ -83,8 +127,8 @@ def _create_iteration_with_validation(
             json.dumps({"iteration": iteration}), encoding="utf-8",
         )
         _write_step_output(iteration_dir, "data_split", DataSplitOutput(
-            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train.csv"),
-            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation.csv"),
+            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train"),
+            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation"),
             splitting_strategy="test",
             split_changed=split_changed,
         ))
@@ -173,8 +217,8 @@ def _setup_best_iteration(self, best_metrics: dict, split_version: int = 0) -> N
         best_dir = self.config.iteration_dir(0)
         best_dir.mkdir(parents=True, exist_ok=True)
         _write_step_output(best_dir, "data_split", DataSplitOutput(
-            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train.csv"),
-            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation.csv"),
+            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train"),
+            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation"),
             splitting_strategy="test", split_changed=False,
         ))
         _write_step_output(best_dir, "validation_evaluation", ValidationEvaluationOutput(
@@ -189,8 +233,8 @@ def _setup_best_iteration(self, best_metrics: dict, split_version: int = 0) -> N
 
     def _set_current_split_version(self, split_version: int) -> None:
         _write_step_output(self.config.current_iteration_dir, "data_split", DataSplitOutput(
-            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train.csv"),
-            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation.csv"),
+            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train"),
+            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation"),
             splitting_strategy="test", split_changed=split_version > 0,
             split_version=split_version,
         ))
@@ -261,16 +305,14 @@ def _make_config(self, agent_id: str) -> Config:
         initialize_run_directories(config)
         save_config(config)
         config.agent_dataset_dir.mkdir(parents=True, exist_ok=True)
-        (config.agent_dataset_dir / "train.csv").write_text(
-            "id,feature,numeric_label\n1,a,0\n2,b,1\n", encoding="utf-8",
-        )
+        _write_split_folder(config.agent_dataset_dir / "train", row_id="train-row")
         return config
 
     def _create_split_dir(self, version: int) -> Path:
         split_dir = self.config.splits_dir / f"split_{version}"
         split_dir.mkdir(parents=True, exist_ok=True)
-        (split_dir / "train.csv").write_text("id,feature,numeric_label\n1,a,0\n", encoding="utf-8")
-        (split_dir / "validation.csv").write_text("id,feature,numeric_label\n2,b,1\n", encoding="utf-8")
+        _write_split_folder(split_dir / "train", row_id="train-row")
+        _write_split_folder(split_dir / "validation", row_id="validation-row")
         return split_dir
 
     def _archive_iteration_with_split(self, iteration: int, split_version: int, status: str = "success") -> None:
@@ -281,10 +323,11 @@ def _archive_iteration_with_split(self, iteration: int, split_version: int, stat
             json.dumps({"status": status, "started_at": 100.0, "ended_at": 110.0}), encoding="utf-8",
         )
         _write_step_output(iteration_dir, "data_split", DataSplitOutput(
-            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train.csv"),
-            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation.csv"),
+            train_path=str(self.config.splits_dir / f"split_{split_version}" / "train"),
+            val_path=str(self.config.splits_dir / f"split_{split_version}" / "validation"),
             splitting_strategy=f"strategy for split {split_version}",
             split_changed=False,
+            split_version=split_version,
         ))
 
     def _make_step_for_iteration(self, iteration: int) -> DataSplitStep:
@@ -306,10 +349,15 @@ def test_split_blocked_after_budget_exhausted(self):
 
         self.assertTrue(step.should_be_simulated())
 
+    def test_non_versioned_split_dir_does_not_count_as_reusable(self):
+        (self.config.splits_dir / "nested").mkdir(parents=True)
+        step = self._make_step_for_iteration(iteration=1)
+        step.on_iteration_start(iteration=1)
+
+        self.assertFalse(step.should_be_simulated())
+
     def test_split_blocked_when_explicit_validation_exists(self):
-        (self.config.agent_dataset_dir / "validation.csv").write_text(
-            "id,feature,numeric_label\n3,c,0\n", encoding="utf-8",
-        )
+        _write_split_folder(self.config.agent_dataset_dir / "validation", row_id="validation-row")
         step = self._make_step_for_iteration(iteration=0)
         step.on_iteration_start(iteration=0)
 
@@ -329,6 +377,72 @@ def test_simulated_output_copies_latest_split(self):
         self.assertIn("split_0", output.train_path)
         self.assertEqual(output.splitting_strategy, "strategy for split 0")
 
+    def test_split_strategy_is_loaded_for_requested_version(self):
+        self._archive_iteration_with_split(iteration=0, split_version=0)
+        self._archive_iteration_with_split(iteration=1, split_version=1)
+        step = self._make_step_for_iteration(iteration=2)
+
+        self.assertEqual("strategy for split 0", step._get_split_strategy(0))
+
+
+class TestFinalReportSplitLabels(unittest.TestCase):
+    def setUp(self):
+        self.temp_dir = tempfile.TemporaryDirectory()
+        self.root = Path(self.temp_dir.name)
+        self.config = Config(
+            agent_id="report_agent",
+            model_name="test-model",
+            iteration_plan_model_name="test-iteration-plan-model",
+            dataset="toy",
+            tags=[],
+            val_metric="ACC",
+            workspace_dir=str(self.root / "workspace"),
+            prepared_datasets_dir=str(self.root / "prepared_datasets"),
+            user_prompt="test",
+            task_type="classification",
+        )
+        initialize_run_directories(self.config)
+        save_config(self.config)
+
+    def tearDown(self):
+        self.temp_dir.cleanup()
+
+    def test_iteration_inputs_use_iteration_split_version_labels(self):
+        with patch.dict(sys.modules, _report_generation_dependency_stubs()):
+            sys.modules.pop("runtime.generate_final_reports", None)
+            from runtime.generate_final_reports import DatasetMeta, gather_iteration_inputs
+            sys.modules.pop("runtime.generate_final_reports", None)
+
+        _write_split_folder(self.config.splits_dir / "split_0" / "train", row_id="old-train")
+        _write_split_folder(self.config.splits_dir / "split_0" / "validation", row_id="old-validation")
+        _write_split_folder(self.config.splits_dir / "split_1" / "train", row_id="new-train")
+        _write_split_folder(self.config.splits_dir / "split_1" / "validation", row_id="new-validation")
+
+        iteration_dir = self.config.iteration_dir(0)
+        iteration_dir.mkdir(parents=True, exist_ok=True)
+        _write_step_output(iteration_dir, "data_split", DataSplitOutput(
+            train_path=str(self.config.splits_dir / "split_1" / "train"),
+            val_path=str(self.config.splits_dir / "split_1" / "validation"),
+            splitting_strategy="new split",
+            split_changed=True,
+            split_version=1,
+        ))
+
+        inputs = gather_iteration_inputs(
+            config=self.config,
+            prepared_datasets=self.root / "prepared_datasets",
+            prepared_tests=self.root / "prepared_tests",
+            dataset_meta=DatasetMeta(task_type="classification", numeric_label_col="numeric_label"),
+            iteration=0,
+        )
+
+        labels_by_split = {split.split_name: split.labeled_csv for split in inputs.splits}
+        self.assertEqual(self.config.splits_dir / "split_1" / "train" / "labels.csv", labels_by_split["train"])
+        self.assertEqual(
+            self.config.splits_dir / "split_1" / "validation" / "labels.csv",
+            labels_by_split["validation"],
+        )
+
 
 if __name__ == "__main__":
     unittest.main()
diff --git a/test/test_dataset_extras_and_input_structure.py b/test/test_dataset_extras_and_input_structure.py
new file mode 100644
index 00000000..6d9ddca7
--- /dev/null
+++ b/test/test_dataset_extras_and_input_structure.py
@@ -0,0 +1,325 @@
+import json
+import sys
+import types
+import unittest
+from pathlib import Path
+from tempfile import TemporaryDirectory
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_PATH = REPO_ROOT / "src"
+if str(SRC_PATH) not in sys.path:
+    sys.path.insert(0, str(SRC_PATH))
+
+metrics_stub = types.ModuleType("utils.metrics")
+metrics_stub.get_classification_metrics_functions = lambda: {}
+metrics_stub.get_higher_is_better_map = lambda: {}
+metrics_stub.get_regression_metrics_functions = lambda: {}
+metrics_stub.get_task_to_metrics_names = lambda: {}
+sys.modules.setdefault("utils.metrics", metrics_stub)
+
+logging_helpers_stub = types.ModuleType("run_logging.logging_helpers")
+logging_helpers_stub.is_wandb_active = lambda: False
+logging_helpers_stub.log_test_inference_duration = lambda *args, **kwargs: None
+sys.modules.setdefault("run_logging.logging_helpers", logging_helpers_stub)
+sys.modules.setdefault("wandb", types.SimpleNamespace(log=lambda *args, **kwargs: None))
+
+from datasets.dataset_utils import prepare_dataset, setup_nonsensitive_dataset_files_for_agent
+from datasets.data_contract import record_input_structure, validate_input_structure
+
+DATASET_NAME = "extras_test_dataset"
+
+
+def write_split(dataset_dir: Path, split_name: str, rows: int = 2, input_files: dict | None = None) -> Path:
+    split_dir = dataset_dir / split_name
+    input_dir = split_dir / "input"
+    input_dir.mkdir(parents=True, exist_ok=True)
+    if input_files is None:
+        (input_dir / "data.csv").write_text(
+            "id,feature\n" + "\n".join(f"{split_name}-{i},val{i}" for i in range(rows)),
+            encoding="utf-8",
+        )
+    else:
+        for filename, content in input_files.items():
+            filepath = input_dir / filename
+            filepath.parent.mkdir(parents=True, exist_ok=True)
+            filepath.write_text(content, encoding="utf-8")
+    labels = ["id,numeric_label"]
+    labels.extend(f"{split_name}-{i},{i % 2}" for i in range(rows))
+    (split_dir / "labels.csv").write_text("\n".join(labels) + "\n", encoding="utf-8")
+    return split_dir
+
+
+class RecordInputStructureTest(unittest.TestCase):
+    def test_captures_files_and_dirs(self):
+        with TemporaryDirectory() as tmp:
+            input_dir = Path(tmp) / "input"
+            input_dir.mkdir()
+            (input_dir / "data.csv").write_text("a,b\n1,2\n")
+            subdir = input_dir / "images"
+            subdir.mkdir()
+            (subdir / "img1.png").write_text("fake")
+
+            structure = record_input_structure(input_dir)
+
+            self.assertIn("data.csv", structure)
+            self.assertIn("images/", structure)
+            self.assertIn("images/img1.png", structure)
+
+    def test_distinguishes_file_from_empty_dir_with_same_name(self):
+        with TemporaryDirectory() as tmp:
+            file_root = Path(tmp) / "file_variant"
+            file_root.mkdir()
+            (file_root / "data").write_text("hello")
+
+            dir_root = Path(tmp) / "dir_variant"
+            dir_root.mkdir()
+            (dir_root / "data").mkdir()
+
+            expected = record_input_structure(file_root)
+            with self.assertRaises(ValueError):
+                validate_input_structure(dir_root, expected, "Validation")
+
+    def test_empty_dir_returns_empty_list(self):
+        with TemporaryDirectory() as tmp:
+            input_dir = Path(tmp) / "input"
+            input_dir.mkdir()
+            self.assertEqual([], record_input_structure(input_dir))
+
+
+class ValidateInputStructureTest(unittest.TestCase):
+    def test_accepts_matching_structure(self):
+        with TemporaryDirectory() as tmp:
+            input_dir = Path(tmp) / "input"
+            input_dir.mkdir()
+            (input_dir / "data.csv").write_text("a,b\n1,2\n")
+
+            expected = record_input_structure(input_dir)
+            validate_input_structure(input_dir, expected, "Test")
+
+    def test_rejects_missing_file(self):
+        with TemporaryDirectory() as tmp:
+            input_dir = Path(tmp) / "input"
+            input_dir.mkdir()
+            (input_dir / "data.csv").write_text("content")
+            (input_dir / "extra.csv").write_text("content")
+            expected = record_input_structure(input_dir)
+
+            (input_dir / "extra.csv").unlink()
+            with self.assertRaises(ValueError) as ctx:
+                validate_input_structure(input_dir, expected, "Validation")
+            self.assertIn("Missing", str(ctx.exception))
+
+    def test_rejects_extra_file(self):
+        with TemporaryDirectory() as tmp:
+            input_dir = Path(tmp) / "input"
+            input_dir.mkdir()
+            (input_dir / "data.csv").write_text("content")
+            expected = record_input_structure(input_dir)
+
+            (input_dir / "unexpected.csv").write_text("content")
+            with self.assertRaises(ValueError) as ctx:
+                validate_input_structure(input_dir, expected, "Train")
+            self.assertIn("Extra", str(ctx.exception))
+
+
+class PrepareDatasetExtrasTest(unittest.TestCase):
+    def test_works_without_extras(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train")
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            self.assertTrue((root / "prepared" / DATASET_NAME / "train" / "input").is_dir())
+            self.assertFalse((root / "prepared" / DATASET_NAME / "extras").exists())
+
+    def test_copies_split_level_extras(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train")
+            split_extras = raw / "train" / "extras"
+            split_extras.mkdir()
+            (split_extras / "augmented.csv").write_text("augmented data")
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            prepared_split_extras = root / "prepared" / DATASET_NAME / "train" / "extras"
+            self.assertTrue(prepared_split_extras.is_dir())
+            self.assertTrue((prepared_split_extras / "augmented.csv").is_file())
+
+    def test_records_input_structure_in_metadata(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train")
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            metadata = json.loads((root / "prepared" / DATASET_NAME / "metadata.json").read_text())
+            self.assertIn("input_structure", metadata)
+            self.assertIn("data.csv", metadata["input_structure"])
+
+    def test_rejects_mismatched_validation_input_structure(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train", input_files={"data.csv": "id,a\n1,x\n2,y\n"})
+            write_split(raw, "validation", input_files={"different.csv": "id,b\n1,x\n2,y\n"})
+
+            with self.assertRaises(ValueError) as ctx:
+                prepare_dataset(
+                    dataset_dir=raw,
+                    output_dir=root / "prepared",
+                    test_sets_output_dir=root / "prepared_tests",
+                    task_type="classification",
+                )
+            self.assertIn("Validation", str(ctx.exception))
+
+    def test_rejects_mismatched_test_input_structure(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train", input_files={"data.csv": "id,a\n1,x\n2,y\n"})
+            write_split(raw, "test", input_files={"wrong.csv": "id,b\n1,x\n2,y\n"})
+
+            with self.assertRaises(ValueError) as ctx:
+                prepare_dataset(
+                    dataset_dir=raw,
+                    output_dir=root / "prepared",
+                    test_sets_output_dir=root / "prepared_tests",
+                    task_type="classification",
+                )
+            self.assertIn("Test", str(ctx.exception))
+
+    def test_accepts_matching_input_structure_across_splits(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            input_files = {"data.csv": "placeholder"}
+            write_split(raw, "train", input_files=input_files)
+            write_split(raw, "validation", input_files=input_files)
+            write_split(raw, "test", input_files=input_files)
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            metadata = json.loads((root / "prepared" / DATASET_NAME / "metadata.json").read_text())
+            self.assertIn("input_structure", metadata)
+
+
+class PrepareDatasetSupplementaryTest(unittest.TestCase):
+    def test_copies_supplementary_folder(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train")
+            supp = raw / "supplementary"
+            supp.mkdir()
+            (supp / "paper.pdf").write_text("fake pdf")
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            prepared_supp = root / "prepared" / DATASET_NAME / "supplementary"
+            self.assertTrue(prepared_supp.is_dir())
+            self.assertTrue((prepared_supp / "paper.pdf").is_file())
+
+    def test_works_without_supplementary(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw = root / "raw" / DATASET_NAME
+            raw.mkdir(parents=True)
+            write_split(raw, "train")
+
+            prepare_dataset(
+                dataset_dir=raw,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            self.assertFalse((root / "prepared" / DATASET_NAME / "supplementary").exists())
+
+
+class SetupNonsensitiveSupplementaryTest(unittest.TestCase):
+    def test_copies_supplementary_to_agent_workspace(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            prepared_dir = root / "prepared"
+            dataset_dir = prepared_dir / DATASET_NAME
+            dataset_dir.mkdir(parents=True)
+            write_split(dataset_dir, "train")
+            write_split(dataset_dir, "validation")
+            (dataset_dir / "metadata.json").write_text("{}", encoding="utf-8")
+            (dataset_dir / "dataset_description.md").write_text("desc", encoding="utf-8")
+            supp = dataset_dir / "supplementary"
+            supp.mkdir()
+            (supp / "notes.txt").write_text("some notes")
+
+            agent_dir = root / "agent_datasets"
+            setup_nonsensitive_dataset_files_for_agent(
+                prepared_datasets_dir=prepared_dir,
+                agent_datasets_dir=agent_dir,
+                dataset_name=DATASET_NAME,
+            )
+
+            agent_supp = agent_dir / DATASET_NAME / "supplementary"
+            self.assertTrue(agent_supp.is_dir())
+            self.assertTrue((agent_supp / "notes.txt").is_file())
+
+    def test_does_not_copy_dataset_level_extras(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            prepared_dir = root / "prepared"
+            dataset_dir = prepared_dir / DATASET_NAME
+            dataset_dir.mkdir(parents=True)
+            write_split(dataset_dir, "train")
+            write_split(dataset_dir, "validation")
+            (dataset_dir / "metadata.json").write_text("{}", encoding="utf-8")
+            (dataset_dir / "dataset_description.md").write_text("desc", encoding="utf-8")
+            (dataset_dir / "extras").mkdir()  # dataset-level extras must not reach the agent
+            agent_dir = root / "agent_datasets"
+            setup_nonsensitive_dataset_files_for_agent(
+                prepared_datasets_dir=prepared_dir,
+                agent_datasets_dir=agent_dir,
+                dataset_name=DATASET_NAME,
+            )
+
+            self.assertFalse((agent_dir / DATASET_NAME / "extras").exists())
+            self.assertTrue((agent_dir / DATASET_NAME / "train" / "input").is_dir())
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/test/test_dataset_folder_split_contracts.py b/test/test_dataset_folder_split_contracts.py
new file mode 100644
index 00000000..d5af8e8a
--- /dev/null
+++ b/test/test_dataset_folder_split_contracts.py
@@ -0,0 +1,253 @@
+import json
+import shutil
+import sys
+import types
+import unittest
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from types import SimpleNamespace
+
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_PATH = REPO_ROOT / "src"
+if str(SRC_PATH) not in sys.path:
+    sys.path.insert(0, str(SRC_PATH))
+
+metrics_stub = types.ModuleType("utils.metrics")
+metrics_stub.get_classification_metrics_functions = lambda: {}
+metrics_stub.get_higher_is_better_map = lambda: {}
+metrics_stub.get_regression_metrics_functions = lambda: {}
+metrics_stub.get_task_to_metrics_names = lambda: {}
+sys.modules.setdefault("utils.metrics", metrics_stub)
+
+logging_helpers_stub = types.ModuleType("run_logging.logging_helpers")
+logging_helpers_stub.is_wandb_active = lambda: False
+logging_helpers_stub.log_test_inference_duration = lambda *args, **kwargs: None
+sys.modules.setdefault("run_logging.logging_helpers", logging_helpers_stub)
+sys.modules.setdefault("wandb", types.SimpleNamespace(log=lambda *args, **kwargs: None))
+
+from datasets.dataset_utils import (
+    check_dataset_prepared,
+    get_single_dataset_info,
+    get_single_prepared_dataset_info,
+    prepare_dataset,
+    setup_nonsensitive_dataset_files_for_agent,
+)
+from datasets.data_contract import validate_labels_csv
+DATASET_NAME = "folder_contract_dataset"
+
+
+def write_split(dataset_dir: Path, split_name: str, rows: int = 2) -> Path:
+    split_dir = dataset_dir / split_name
+    input_dir = split_dir / "input"
+    input_dir.mkdir(parents=True)
+    (input_dir / "examples.txt").write_text(
+        "\n".join(f"{split_name}-example-{index}" for index in range(rows)),
+        encoding="utf-8",
+    )
+    labels = ["id,numeric_label"]
+    labels.extend(f"{split_name}-{index},{index % 2}" for index in range(rows))
+    (split_dir / "labels.csv").write_text("\n".join(labels) + "\n", encoding="utf-8")
+    return split_dir
+
+
+def write_dataset_metadata(dataset_dir: Path) -> None:
+    metadata = {
+        "task_type": "classification",
+        "numeric_label_col": "numeric_label",
+        "splits": {
+            "train_rows": 2,
+            "validation_rows": 2,
+            "test_rows": 2,
+        },
+    }
+    (dataset_dir / "metadata.json").write_text(json.dumps(metadata), encoding="utf-8")
+    (dataset_dir / "dataset_description.md").write_text("temporary dataset", encoding="utf-8")
+
+
+def create_prepared_folder_dataset(prepared_datasets_dir: Path) -> Path:
+    dataset_dir = prepared_datasets_dir / DATASET_NAME
+    dataset_dir.mkdir(parents=True)
+    write_dataset_metadata(dataset_dir)
+    for split_name in ["train", "validation", "test"]:
+        write_split(dataset_dir, split_name)
+    return dataset_dir
+
+
+def create_config(tmpdir: Path) -> SimpleNamespace:
+    agent_id = "agent-under-test"
+    return SimpleNamespace(
+        agent_id=agent_id,
+        runs_dir=tmpdir / "runs",
+        split_allowed_iterations=1,
+        split_time_deadline=None,
+        can_iteration_split_now_cached=lambda iteration: True,
+    )
+
+
+class DatasetFolderSplitContractTest(unittest.TestCase):
+    def test_get_single_dataset_info_accepts_raw_folder_split_contract(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw_dataset_dir = root / "raw" / DATASET_NAME
+            raw_dataset_dir.mkdir(parents=True)
+            write_split(raw_dataset_dir, "train")
+
+            info = get_single_dataset_info(raw_dataset_dir, root / "prepared")
+
+            self.assertTrue(info["can_prepare"])
+            self.assertTrue(info["should_prepare"])
+            self.assertEqual(2, info["train_rows"])
+
+    def test_check_dataset_prepared_accepts_folder_only_split_contract(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            prepared_datasets_dir = root / "prepared"
+            dataset_dir = create_prepared_folder_dataset(prepared_datasets_dir)
+
+            self.assertTrue(check_dataset_prepared(dataset_dir, prepared_datasets_dir))
+
+    def test_check_dataset_prepared_rejects_legacy_csv_only_contract(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            prepared_datasets_dir = root / "prepared"
+            dataset_dir = prepared_datasets_dir / DATASET_NAME
+            dataset_dir.mkdir(parents=True)
+            write_dataset_metadata(dataset_dir)
+            (dataset_dir / "train.csv").write_text(
+                "id,feature,numeric_label\nrow-1,A,0\n",
+                encoding="utf-8",
+            )
+
+            self.assertFalse(check_dataset_prepared(dataset_dir, prepared_datasets_dir))
+
+    def test_get_single_prepared_dataset_info_counts_folder_labels(self):
+        with TemporaryDirectory() as tmp:
+            dataset_dir = create_prepared_folder_dataset(Path(tmp) / "prepared")
+
+            info = get_single_prepared_dataset_info(dataset_dir)
+
+            self.assertEqual("Prepared", info["status"])
+            self.assertEqual(2, info["train_rows"])
+            self.assertEqual(2, info["validation_rows"])
+            self.assertEqual(2, info["test_rows"])
+
+    def test_prepare_dataset_rejects_invalid_labels_csv(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw_dataset_dir = root / "raw" / DATASET_NAME
+            raw_dataset_dir.mkdir(parents=True)
+            write_split(raw_dataset_dir, "train")
+            (raw_dataset_dir / "train" / "labels.csv").write_text(
+                "id,numeric_label\ntrain-0,not-a-number\n",
+                encoding="utf-8",
+            )
+
+            with self.assertRaises(ValueError):
+                prepare_dataset(
+                    dataset_dir=raw_dataset_dir,
+                    output_dir=root / "prepared",
+                    test_sets_output_dir=root / "prepared_tests",
+                    task_type="classification",
+                )
+
+    def test_labels_csv_rejects_extra_columns(self):
+        with TemporaryDirectory() as tmp:
+            labels_path = Path(tmp) / "labels.csv"
+            labels_path.write_text(
+                "id,numeric_label,extra\nsample-1,0,unused\n",
+                encoding="utf-8",
+            )
+
+            with self.assertRaises(ValueError):
+                validate_labels_csv(labels_path)
+
+    def test_prepare_dataset_derives_class_labels_from_folder_labels(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw_dataset_dir = root / "raw" / DATASET_NAME
+            raw_dataset_dir.mkdir(parents=True)
+            write_split(raw_dataset_dir, "train")
+
+            prepare_dataset(
+                dataset_dir=raw_dataset_dir,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            metadata = json.loads((root / "prepared" / DATASET_NAME / "metadata.json").read_text())
+            self.assertEqual({"0": 0, "1": 1}, metadata["label_to_scalar"])
+
+    def test_prepare_dataset_derives_class_labels_without_hidden_test_labels(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw_dataset_dir = root / "raw" / DATASET_NAME
+            raw_dataset_dir.mkdir(parents=True)
+            write_split(raw_dataset_dir, "train")
+            write_split(raw_dataset_dir, "test")
+            (raw_dataset_dir / "test" / "labels.csv").write_text(
+                "id,numeric_label\ntest-0,7\n",
+                encoding="utf-8",
+            )
+
+            prepare_dataset(
+                dataset_dir=raw_dataset_dir,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            metadata = json.loads((root / "prepared" / DATASET_NAME / "metadata.json").read_text())
+            self.assertEqual({"0": 0, "1": 1}, metadata["label_to_scalar"])
+
+    def test_prepare_dataset_removes_stale_prepared_splits(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            raw_dataset_dir = root / "raw" / DATASET_NAME
+            raw_dataset_dir.mkdir(parents=True)
+            write_split(raw_dataset_dir, "train")
+
+            stale_prepared_dir = root / "prepared" / DATASET_NAME
+            stale_prepared_dir.mkdir(parents=True)
+            write_split(stale_prepared_dir, "validation")
+
+            stale_test_dir = root / "prepared_tests" / DATASET_NAME
+            stale_test_dir.mkdir(parents=True)
+            write_split(stale_test_dir, "test")
+
+            prepare_dataset(
+                dataset_dir=raw_dataset_dir,
+                output_dir=root / "prepared",
+                test_sets_output_dir=root / "prepared_tests",
+                task_type="classification",
+            )
+
+            self.assertTrue((root / "prepared" / DATASET_NAME / "train").is_dir())
+            self.assertFalse((root / "prepared" / DATASET_NAME / "validation").exists())
+            self.assertFalse((root / "prepared_tests" / DATASET_NAME).exists())
+
+    def test_setup_nonsensitive_dataset_files_copies_public_split_folders_only(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            prepared_datasets_dir = root / "prepared"
+            agent_datasets_dir = root / "agent_datasets"
+            create_prepared_folder_dataset(prepared_datasets_dir)
+
+            setup_nonsensitive_dataset_files_for_agent(
+                prepared_datasets_dir=prepared_datasets_dir,
+                agent_datasets_dir=agent_datasets_dir,
+                dataset_name=DATASET_NAME,
+            )
+
+            agent_dataset_dir = agent_datasets_dir / DATASET_NAME
+            self.assertTrue((agent_dataset_dir / "train" / "input").is_dir())
+            self.assertTrue((agent_dataset_dir / "train" / "labels.csv").is_file())
+            self.assertTrue((agent_dataset_dir / "validation" / "input").is_dir())
+            self.assertTrue((agent_dataset_dir / "validation" / "labels.csv").is_file())
+            self.assertFalse((agent_dataset_dir / "test").exists())
+            self.assertFalse((agent_dataset_dir / "train.csv").exists())
+            self.assertFalse((agent_dataset_dir / "validation.csv").exists())
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/test/test_evaluate_log_run_folder_contract.py b/test/test_evaluate_log_run_folder_contract.py
new file mode 100644
index 00000000..97cd1e42
--- /dev/null
+++ b/test/test_evaluate_log_run_folder_contract.py
@@ -0,0 +1,144 @@
+import sys
+import types
+import unittest
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from types import SimpleNamespace
+from unittest.mock import patch
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+SRC_PATH = REPO_ROOT / "src"
+if str(SRC_PATH) not in sys.path:
+    sys.path.insert(0, str(SRC_PATH))
+
+
+def _stub(name, **attrs):
+    mod = types.ModuleType(name)
+    for k, v in attrs.items():
+        setattr(mod, k, v)
+    sys.modules.setdefault(name, mod)
+    return mod
+
+
+wandb_errors = _stub("wandb.errors", AuthenticationError=Exception, UsageError=Exception)
+wandb_mod = _stub("wandb", log=lambda *a, **k: None, run=None, login=lambda *a, **k: True)
+wandb_mod.errors = wandb_errors
+_stub("utils.metrics", get_task_to_metrics_names=lambda: {})
+_stub("run_logging.logging_helpers", is_wandb_active=lambda: False, log_test_inference_duration=lambda *a, **k: None)
+_stub("run_logging.wandb_setup", resume_wandb_run=lambda *a, **k: None)
+_stub("utils.printing_utils", print_best_iteration_metrics=lambda *a, **k: None)
+_stub("runtime.conda_utils", get_best_iteration_snapshot_environment_path=lambda *a: None)
+_stub("runtime.evaluate_result", get_metrics=lambda **k: {})
+_stub("runtime.inference_runner", compute_metrics=lambda **k: {}, run_inference_script=lambda **k: None)
+_stub("runtime.filesystem", remove_path=lambda *a: None)
+_stub("runtime.read_write_utils", load_config_from_run_dir=lambda *a, **k: None, load_dataset_metadata=lambda *a: {})
+_stub("utils.exceptions", AgentScriptFailed=Exception)
+
+from run_logging.test_evaluation import run_test_evaluation  # noqa: E402
+
+
+METADATA = {"numeric_label_col": "numeric_label", "task_type": "classification"}
+
+
+def _make_config(tmp: Path, dataset_name: str = "test_dataset") -> SimpleNamespace:
+    snapshot_dir = tmp / "snapshot"
+    snapshot_dir.mkdir(parents=True)
+    return SimpleNamespace(
+        dataset=dataset_name,
+        best_iteration_snapshot_dir=snapshot_dir,
+        agent_id="agent-under-test",
+        task_type="classification",
+        val_metric="ACC",
+    )
+
+
+def write_split(parent: Path, split_name: str, rows: int = 2) -> None:
+    split_dir = parent / split_name
+    (split_dir / "input").mkdir(parents=True)
+    (split_dir / "input" / "examples.txt").write_text("example\n", encoding="utf-8")
+    labels = ["id,numeric_label"] + [f"{split_name}-{i},{i % 2}" for i in range(rows)]
+    (split_dir / "labels.csv").write_text("\n".join(labels) + "\n", encoding="utf-8")
+
+
+class TestRunTestEvaluationFolderContract(unittest.TestCase):
+    def _run_with_mocks(self, workspace_dir, prepared_test_sets_dir, config, inference_result=None):
+        if inference_result is None:
+            inference_result = SimpleNamespace(returncode=0, stderr=b"", stdout=b"")
+        with (
+            patch("run_logging.test_evaluation.load_config_from_run_dir", return_value=config),
+            patch("run_logging.test_evaluation.resume_wandb_run", return_value=None),
+            patch("run_logging.test_evaluation.load_dataset_metadata", return_value=METADATA),
+            patch("run_logging.test_evaluation.is_wandb_active", return_value=False),
+            patch("run_logging.test_evaluation.run_inference_script", return_value=inference_result) as mock_infer,
+            patch("run_logging.test_evaluation.compute_metrics", return_value={"ACC": 1.0}),
+            patch("run_logging.test_evaluation.log_test_inference_duration", return_value=None),
+            patch("run_logging.test_evaluation.print_best_iteration_metrics", return_value=None),
+        ):
+            run_test_evaluation(workspace_dir, prepared_test_sets_dir)
+            return mock_infer
+
+    def test_skips_when_test_split_absent(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            mock_infer = self._run_with_mocks(root, root / "prepared_test_sets", _make_config(root))
+            mock_infer.assert_not_called()
+
+    def test_skips_inference_when_test_split_exists_but_has_no_input_folder(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            config = _make_config(root)
+            test_split_dir = root / "prepared_test_sets" / config.dataset / "test"
+            test_split_dir.mkdir(parents=True)
+            # Malformed split: labels.csv is present but the input/ folder is missing.
+            (test_split_dir / "labels.csv").write_text("id,numeric_label\nr-0,0\n", encoding="utf-8")
+
+            # run_test_evaluation catches the FileNotFoundError internally so the run does not crash;
+            # assert that inference was never called instead of expecting an exception.
+            mock_infer = self._run_with_mocks(root, root / "prepared_test_sets", config)
+            mock_infer.assert_not_called()
+
+    def test_calls_inference_with_input_folder_not_csv(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            config = _make_config(root)
+            prepared_test_sets_dir = root / "prepared_test_sets"
+            write_split(prepared_test_sets_dir / config.dataset, "test")
+
+            mock_infer = self._run_with_mocks(root, prepared_test_sets_dir, config)
+
+            mock_infer.assert_called_once()
+            input_path = mock_infer.call_args.kwargs["input_path"]
+            self.assertEqual("input", Path(input_path).name, f"Expected folder input path, got: {input_path}")
+            self.assertNotIn(".csv", str(input_path))
+            self.assertNotIn(".no_label", str(input_path))
+
+    def test_compute_metrics_receives_labels_csv_path(self):
+        with TemporaryDirectory() as tmp:
+            root = Path(tmp)
+            config = _make_config(root)
+            prepared_test_sets_dir = root / "prepared_test_sets"
+            write_split(prepared_test_sets_dir / config.dataset, "test")
+
+            metric_calls = []
+            with (
+                patch("run_logging.test_evaluation.load_config_from_run_dir", return_value=config),
+                patch("run_logging.test_evaluation.resume_wandb_run", return_value=None),
+                patch("run_logging.test_evaluation.load_dataset_metadata", return_value=METADATA),
+                patch("run_logging.test_evaluation.is_wandb_active", return_value=False),
+                patch("run_logging.test_evaluation.run_inference_script",
+                      return_value=SimpleNamespace(returncode=0, stderr=b"", stdout=b"")),
+                patch("run_logging.test_evaluation.compute_metrics",
+                      side_effect=lambda **kw: metric_calls.append(kw) or {"ACC": 1.0}),
+                patch("run_logging.test_evaluation.log_test_inference_duration", return_value=None),
+                patch("run_logging.test_evaluation.print_best_iteration_metrics", return_value=None),
+            ):
+                run_test_evaluation(root, prepared_test_sets_dir)
+
+            self.assertEqual(1, len(metric_calls))
+            labels_path = metric_calls[0]["labels_path"]
+            self.assertTrue(str(labels_path).endswith("labels.csv"), f"Expected labels.csv, got: {labels_path}")
+            self.assertNotIn("input", str(labels_path))
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/test/test_runtime_lifecycle.py b/test/test_runtime_lifecycle.py
index c82d0995..81717a64 100644
--- a/test/test_runtime_lifecycle.py
+++ b/test/test_runtime_lifecycle.py
@@ -73,8 +73,8 @@ def _save_and_archive_step(self, step_id: str, output) -> None:
 
     def test_pydantic_output_is_deserialized_to_original_type(self):
         split_output = DataSplitOutput(
-            train_path="/tmp/train.csv",
-            val_path="/tmp/val.csv",
+            train_path="/tmp/train",
+            val_path="/tmp/validation",
             splitting_strategy="stratified 80/20",
             split_changed=True,
         )
@@ -96,8 +96,8 @@ def test_plain_dict_output_survives_roundtrip(self):
 
     def test_save_creates_current_step_dir_for_skipped_step_flow(self):
         split_output = DataSplitOutput(
-            train_path="/tmp/train.csv",
-            val_path="/tmp/val.csv",
+            train_path="/tmp/train",
+            val_path="/tmp/validation",
             splitting_strategy="reused previous split",
             split_changed=False,
         )