Copyright (C) 2025 ETH Zurich, Switzerland. SPDX-License-Identifier: Apache-2.0. See LICENSE file at the root of the repository for details.
This directory contains scripts for processing raw datasets into formats suitable for efficient training and evaluation.
We provide an example script showing how to convert TUEG into a bipolar montage; its output is then used by our pre-training scripts. Make sure to change all occurrences of #CHANGEME. The main processing scripts follow a pattern very similar to the one explained for process_raw_eeg.py below.
Prerequisites:
- You must have the raw TUEG dataset downloaded from their official sources.
Command-Line Usage: The script takes the path to the raw data and an output path as arguments.
```
python make_datasets/make_tueg_bipolar.py --in_path /path/to/raw_data --hdf5_file_path /path/to/processed_data
```

- `--in_path`: The absolute path to the root directory of the downloaded raw EDF files for the TUEG dataset.
- `--hdf5_file_path`: The directory where the script will save the processed `.h5` files.
Before the datasets can be converted to HDF5, the raw EDF files must first be preprocessed, windowed, and saved as intermediate
pickle files. The process_raw_eeg.py script handles this entire pipeline.
Key Processing Steps
- Loads EDF files: Reads the raw EEG data.
- Selects and orders channels: Standardizes all recordings to a 21‑channel layout.
- Applies filters: A band‑pass filter (0.1–75.0 Hz) and a 60 Hz notch filter are applied.
- Resamples data: All data is resampled to 256 Hz.
- Creates bipolar montage: Re‑references the signals to the standard TCP bipolar montage.
- Windows data: Segments the continuous signal into 5‑second windows.
- Generates labels: Applies labels based on the dataset and specified mode:
  - tuab: Assigns a single file‑level label (normal/abnormal) to all windows from that file.
  - tusl: Assigns a segment‑level label (`bckg`, `seiz`, `slow`) based on the majority label found in the 5‑second window.
  - tuar: Assigns segment‑level artifact labels based on the chosen `--mode` (e.g., a single 0/1 for `Binary` or a per‑channel array for `MultiLabel`).
- Saves segments: Each 5‑second window (data + label) is saved as a separate `.pkl` file.
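The montage and windowing steps above can be sketched with plain NumPy. This is a minimal illustration, not the script's actual implementation: the channel pairs below are a small subset of the TCP montage, and the function names are invented for the example.

```python
import numpy as np

SFREQ = 256      # target sampling rate (Hz) after resampling
WINDOW_SEC = 5   # window length used by the pipeline

# A few illustrative TCP pairs (anode, cathode); the full montage has more.
PAIRS = [("FP1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1")]

def to_bipolar(data, ch_names):
    """Re-reference (n_channels, n_samples) data to the bipolar pairs."""
    idx = {name: i for i, name in enumerate(ch_names)}
    return np.stack([data[idx[a]] - data[idx[b]] for a, b in PAIRS])

def window(data, sfreq=SFREQ, win_sec=WINDOW_SEC):
    """Cut (n_channels, n_samples) into non-overlapping 5 s windows.

    Returns (n_windows, n_channels, win_sec * sfreq); any trailing
    partial window is dropped.
    """
    win = win_sec * sfreq
    n = data.shape[1] // win
    return data[:, : n * win].reshape(data.shape[0], n, win).transpose(1, 0, 2)

# 12 s of fake 5-channel data -> 2 full 5 s windows of 4 bipolar channels
fake = np.random.randn(5, 12 * SFREQ)
segs = window(to_bipolar(fake, ["FP1", "F7", "T3", "T5", "O1"]))
```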
How to Use the Script

The script is designed to be run from the command line and requires you to specify which dataset you are processing.
Prerequisites
- You must have the raw TUH datasets (TUAB, TUSL, or TUAR) downloaded from their official sources.
- The required Python packages, including `mne`, `numpy`, `pandas`, and `tqdm`, must be installed.
Command‑Line Usage
```
python make_datasets/process_raw_eeg.py <dataset_name> --root_dir /path/to/raw_data --output_dir /processed_eeg [options]
```

Arguments

- `<dataset_name>`: The dataset to process. Must be one of `tuab`, `tusl`, or `tuar`.
- `--root_dir`: The absolute path to the root directory of the downloaded raw EDF files for the specified dataset.
- `--output_dir`: The directory where the script will save the processed `.pkl` files. The script will automatically create `train`, `val`, and `test` subdirectories inside a `processed` folder here.
- `--processes`: (Optional) The number of CPU cores to use for parallel processing (default: `24`).
- `--mode`: (Optional) Specifies the labeling mode for the TUAR dataset only. Must be one of `Binary`, `MultiBinary`, or `MultiLabel` (default: `Binary`).
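Putting the argument descriptions together, the on-disk layout after a run should look roughly like this (folder names inferred from the `--output_dir` and `--prepath` descriptions; verify against your own run):

```
/processed_eeg/
└── TUAB_data/
    └── processed/
        ├── train/   # one .pkl per 5-second window
        ├── val/
        └── test/
```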
Examples

- To process the TUAB dataset:

  ```
  python make_datasets/process_raw_eeg.py tuab --root_dir /eeg_data/TUAB/edf --output_dir /processed_eeg
  ```

- To process the TUAR dataset using the MultiLabel mode:

  ```
  python make_datasets/process_raw_eeg.py tuar --root_dir /eeg_data/TUAR/edf --output_dir /processed_eeg --mode MultiLabel
  ```
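For intuition, here is a small NumPy sketch of how the label shapes can differ between TUAR modes. The annotation values are made up for the example, and the exact encoding produced by process_raw_eeg.py may differ:

```python
import numpy as np

N_CHANNELS = 21  # channel count after standardization

# Hypothetical per-channel artifact annotations for one 5-second window:
# 1 = artifact on that channel, 0 = clean.
per_channel = np.array([0] * 19 + [1, 0])

# Binary mode: a single 0/1 label for the whole window (artifact anywhere).
binary_label = int(per_channel.any())

# MultiLabel mode: the full per-channel array is kept as the label.
multilabel = per_channel  # shape (21,)
```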
After you have generated the processed .pkl files using the script above, the make_hdf5.py script is used to bundle these
thousands of small pickle files into large, efficient HDF5 (.h5) files. This is the final step before training.
How to Use the Script

The script is run from the command line and takes the path to your processed data as an argument.
Command‑Line Usage
```
python make_datasets/make_hdf5.py --prepath /processed_eeg [options]
```

Arguments

- `--prepath` (Required): The absolute path to the directory containing your dataset folders (e.g., `/processed_eeg/`). This directory should contain the `TUAB_data`, `TUSL_data`, etc., folders created by `process_raw_eeg.py`, so this should mirror the `--output_dir` used previously.
- `--dataset` (Optional): Which dataset to process. Choices: `TUAR`, `TUSL`, `TUAB`, `All` (default: `All`).
- `--remove_pkl` (Optional): If included, this flag will delete the `processed` directory (containing all the intermediate `.pkl` files) for a dataset after its `.h5` files are successfully created.
Examples
- To convert all datasets found in `/processed_eeg` and remove the `.pkl` files after conversion:

  ```
  python make_datasets/make_hdf5.py --prepath /processed_eeg --dataset All --remove_pkl
  ```

- To convert only the TUAB dataset and keep the original `.pkl` files:

  ```
  python make_datasets/make_hdf5.py --prepath /processed_eeg --dataset TUAB
  ```
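Conceptually, the bundling step collects many small `(data, label)` pickles into one HDF5 file. The sketch below, using `h5py`, shows one way this could look; the function, dataset names (`data`, `labels`), and the assumed pickle layout are illustrative, not the actual structure written by make_hdf5.py:

```python
import pickle
from pathlib import Path

import h5py
import numpy as np

def bundle_split(split_dir: Path, out_file: Path) -> int:
    """Pack every (window, label) pickle in split_dir into one HDF5 file.

    Assumes each .pkl holds a (np.ndarray, label) tuple; returns the
    number of windows written.
    """
    data, labels = [], []
    for p in sorted(split_dir.glob("*.pkl")):
        with p.open("rb") as f:
            d, lbl = pickle.load(f)
        data.append(d)
        labels.append(lbl)
    with h5py.File(out_file, "w") as h5:
        # One large, compressed dataset instead of thousands of tiny files.
        h5.create_dataset("data", data=np.stack(data), compression="gzip")
        h5.create_dataset("labels", data=np.array(labels))
    return len(data)
```

Reading windows back from a single `.h5` file during training avoids the filesystem overhead of opening thousands of individual pickles.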