Overview of DP-2Stage. In stage 1, the pre-trained LLM is fine-tuned on the respective pseudo data. Subsequently, in stage 2, the model from stage 1 undergoes further fine-tuning using the real private data.
This repository contains the implementation for DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation (Published at TMLR 2025).
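The two-stage schedule can be illustrated with a toy sketch. This is not the repository's training code: it swaps the LLM for a one-parameter linear model, and it assumes that stage 1 is ordinary (non-private) training on pseudo data while stage 2 uses DP-SGD-style updates (per-example gradient clipping plus Gaussian noise) on the private data. All function names, data, and hyperparameters below are hypothetical, and privacy accounting is omitted, so the sketch by itself provides no formal guarantee.

```python
import random


def clip(grad, max_norm):
    """Scale a scalar gradient so that its magnitude is at most max_norm."""
    norm = abs(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)


def sgd_epoch(w, data, lr):
    """Stage 1 (assumed non-private): plain SGD on pseudo data."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
        w -= lr * grad
    return w


def dp_sgd_epoch(w, data, lr, max_norm, noise_multiplier, batch_size, rng):
    """Stage 2 (assumed DP-SGD style): clip per-example gradients, add noise."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Clip each example's gradient before summing, bounding its influence.
        clipped_sum = sum(clip(2 * (w * x - y) * x, max_norm) for x, y in batch)
        # Gaussian noise scaled to the clipping bound.
        noise = rng.gauss(0.0, noise_multiplier * max_norm)
        w -= lr * (clipped_sum + noise) / len(batch)
    return w


rng = random.Random(0)
# Hypothetical toy data: pseudo data has the right format but random targets,
# while the "private" data follows y = 3x.
pseudo = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
private = [(x, 3.0 * x) for x in [rng.uniform(-1, 1) for _ in range(200)]]

w = 0.0
w = sgd_epoch(w, pseudo, lr=0.05)                      # stage 1
w = dp_sgd_epoch(w, private, lr=0.1, max_norm=1.0,     # stage 2
                 noise_multiplier=1.0, batch_size=20, rng=rng)
print(f"weight after both stages: {w:.3f}")
```

In the actual project these roles would be played by the pre-trained LLM and a DP training library rather than hand-written updates.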
Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, and Mario Fritz
Contact: Tejumade Afonja ([email protected])
This implementation is based on PyTorch (tested with version 2.5.1). Please refer to requirements.txt for the other required packages and their versions.
- Set the PYTHONPATH:
  export PYTHONPATH=$PWD
- Create and activate the project environment:
  conda create -n dp2stage python=3.9
  conda activate dp2stage
- Install dependencies:
  pip install -r requirements.txt
To download the Adult dataset:
python download_dataset.py -name adult --train_subset 30932 --valid_subset 1000 --split_by_seed --seed 1000 --use_full_dataset
For the Airline dataset:
- Create a Kaggle account.
- Generate an API key and save it to ~/.kaggle/kaggle.json.
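Before running the Airline download, it can help to confirm that the credentials file is in place. The following is a minimal sketch, not part of this repository: the helper name is hypothetical, and it assumes the classic kaggle.json layout with "username" and "key" fields.

```python
import json
import os
import stat


def check_kaggle_credentials(path=os.path.expanduser("~/.kaggle/kaggle.json")):
    """Return a list of problems; an empty list means the file looks usable."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as f:
            creds = json.load(f)
    except (OSError, json.JSONDecodeError) as err:
        return [f"could not read {path}: {err}"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    # Assumed layout: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    # The file holds a secret, so group/other access is flagged (POSIX only).
    if os.name == "posix" and stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append("file is accessible to other users; consider chmod 600")
    return problems


for problem in check_kaggle_credentials():
    print("Kaggle credentials:", problem)
```

If nothing is printed, the file exists, parses as JSON, and has both expected fields.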
python download_dataset.py -name airline --train_subset 103904 --valid_subset 1000 --split_by_seed --seed 1000 --use_full_dataset
To generate independent uniform pseudo data for the Adult dataset:
python utils/create_independent_uniform_pseudo_data.py \
--dataset_path ./data/adult/k1000/train.csv \
--seed 1000 \
--output_dir ./data/adult-uniform/k1000 \
--output_name train \
--n_synth_samples 30932
To run an experiment, edit the configuration in the corresponding script, then execute it.
For example, to run the out-distribution pseudo data experiment for the Adult dataset, modify the configuration scripts linked in scripts/run/adult/ood.sh, then execute:
bash scripts/run/adult/ood.sh
To evaluate the synthetic data, modify the path to the generated data in scripts/tabular_metrics.py, and then run:
bash scripts/tabular_metrics.sh
To run the SmartNoise baseline models:
- It is recommended to create a separate environment due to dependency conflicts:
  conda create -n smartnoise python=3.9
  conda activate smartnoise
- Install the baseline dependencies:
  pip install -r requirements.txt
  pip install smartnoise-synth
Alternatively, you can revert the opacus version to match this project's requirements by reinstalling dependencies from requirements.txt.
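To see whether installing the baselines changed any pinned package (such as opacus), the pins can be compared against what is currently installed. This is a minimal sketch with hypothetical helper names; it assumes requirements.txt uses exact `name==version` pins and ignores any other kind of specifier.

```python
import os
import re
from importlib import metadata


def parse_pins(requirements_text):
    """Collect exact pins of the form 'name==version'; skip everything else."""
    pins = {}
    for line in requirements_text.splitlines():
        # Drop comments and environment markers before matching.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        match = re.fullmatch(r"([A-Za-z0-9_.\-]+)\s*==\s*(\S+)", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup):
    """Return {name: (pinned, installed)} where the two versions differ."""
    return {
        name: (pinned, lookup(name))
        for name, pinned in pins.items()
        if lookup(name) != pinned
    }


if os.path.exists("requirements.txt"):
    with open("requirements.txt") as f:
        pins = parse_pins(f.read())
    for name, (pinned, installed) in find_mismatches(pins, installed_version).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

Running this from the repository root in the active environment lists only the packages that differ from their pins.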
To run the baselines, modify scripts/baseline.sh as needed, and then run:
bash scripts/baselines.sh
@article{afonja2025dpstage,
title={{DP}-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators},
author={Tejumade Afonja and Hui-Po Wang and Raouf Kerkouche and Mario Fritz},
journal={Transactions on Machine Learning Research},
year={2025},
url={https://openreview.net/forum?id=6nBIweDYzZ},
}