
DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators


Figure: Overview of DP-2Stage. In stage 1, the pre-trained LLM is fine-tuned on the respective pseudo data; in stage 2, the model from stage 1 is further fine-tuned on the real private data.

This repository contains the implementation for DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation (Published at TMLR 2025).
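Like other LLM-based tabular generators, the approach fine-tunes a language model on table rows rendered as text sequences. The exact serialization template is defined in this repository's code; the snippet below is only a hypothetical illustration of the idea, using Adult-style columns:

    # Hypothetical row serialization for an LLM-based tabular generator.
    # The template is illustrative, not this repository's exact format.
    row = {"age": 39, "workclass": "Private", "education": "Bachelors", "income": "<=50K"}
    text = ", ".join(f"{col} is {val}" for col, val in row.items())
    print(text)  # age is 39, workclass is Private, education is Bachelors, income is <=50K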

Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, and Mario Fritz

Contact: Tejumade Afonja ([email protected])

Requirements

This implementation is based on PyTorch (tested with version 2.5.1). Please refer to requirements.txt for the other required packages and their versions.

Environment Setup

  1. Set the PYTHONPATH:

    export PYTHONPATH=$PWD
  2. Create and activate the project environment:

    conda create -n dp2stage python=3.9
    conda activate dp2stage
  3. Install dependencies:

    pip install -r requirements.txt
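To confirm the environment matches the tested setup, a quick check of the PyTorch install (nothing repo-specific):

    # Verify the installed PyTorch version (this project was tested with 2.5.1).
    import torch
    print(torch.__version__)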

Data Preparation

Download the Dataset

To download the Adult dataset:

python download_dataset.py -name adult --train_subset 30932 --valid_subset 1000 --split_by_seed --seed 1000 --use_full_dataset
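As an optional sanity check, the downloaded split can be inspected with pandas. The path below assumes the default output location (the same one referenced in the pseudo-data step further down):

    # Hypothetical sanity check of the downloaded Adult training split.
    import pandas as pd
    df = pd.read_csv("./data/adult/k1000/train.csv")
    print(df.shape)  # expect 30932 rows, matching --train_subset 30932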

For the Airline dataset:

  1. Create a Kaggle account.
  2. Generate an API key and save it to ~/.kaggle/kaggle.json.
python download_dataset.py -name airline --train_subset 103904 --valid_subset 1000 --split_by_seed --seed 1000 --use_full_dataset
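The Kaggle credentials file is a small JSON file containing your API username and key. If you prefer to create it programmatically, a minimal sketch (fill in your own credentials):

    # Write ~/.kaggle/kaggle.json using the standard Kaggle API credential format.
    import json, os
    kaggle_dir = os.path.expanduser("~/.kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)
    path = os.path.join(kaggle_dir, "kaggle.json")
    with open(path, "w") as f:
        json.dump({"username": "YOUR_USERNAME", "key": "YOUR_API_KEY"}, f)
    os.chmod(path, 0o600)  # Kaggle requires the file to be unreadable by other users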

Create Pseudo Data (Stage 1 Training)

To generate independent uniform pseudo data for the Adult dataset:

python utils/create_independent_uniform_pseudo_data.py \
    --dataset_path ./data/adult/k1000/train.csv \
    --seed 1000 \
    --output_dir ./data/adult-uniform/k1000 \
    --output_name train \
    --n_synth_samples 30932
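Conceptually, "independent uniform" pseudo data samples each column independently and uniformly, discarding all inter-column correlations. The script above implements this for the repository's pipeline; the sketch below only illustrates the idea and is not its exact implementation:

    # Conceptual sketch: sample each column independently and uniformly over its
    # observed values. Illustrative only; see
    # utils/create_independent_uniform_pseudo_data.py for the real implementation.
    import os
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1000)
    real = pd.read_csv("./data/adult/k1000/train.csv")
    pseudo = pd.DataFrame({
        col: rng.choice(real[col].unique(), size=30932) for col in real.columns
    })
    os.makedirs("./data/adult-uniform/k1000", exist_ok=True)
    pseudo.to_csv("./data/adult-uniform/k1000/train.csv", index=False)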

Training and Sampling

Run Training and Sampling

Edit the necessary configuration in the scripts, then execute them. For example, to run the out-of-distribution (OOD) pseudo-data experiment for the Adult dataset, modify the configuration scripts referenced in scripts/run/adult/ood.sh, then execute:

bash scripts/run/adult/ood.sh
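For intuition on what stage 2 involves: differentially private fine-tuning is typically done with DP-SGD, e.g. via Opacus (a dependency of this project). The toy sketch below shows the Opacus mechanics on a stand-in model; it is not the repository's training code, and all hyperparameters are illustrative:

    # Minimal, hypothetical DP-SGD sketch with Opacus; a stand-in linear model
    # replaces the fine-tuned LLM from stage 1. Illustrative only.
    import torch
    from opacus import PrivacyEngine

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)

    privacy_engine = PrivacyEngine()
    model, optimizer, loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=loader,
        noise_multiplier=1.0,   # illustrative noise level
        max_grad_norm=1.0,      # per-sample gradient clipping bound
    )
    for x, y in loader:
        optimizer.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        optimizer.step()
    print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")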

Evaluate Synthetic Data

To evaluate the synthetic data, modify the path to the generated data in scripts/tabular_metrics.py, and then run:

bash scripts/tabular_metrics.sh
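As an illustration of the kind of metric such an evaluation can compute, the sketch below measures per-column total variation distance between real and synthetic categorical marginals. It is a generic example, not necessarily what scripts/tabular_metrics.py computes; the synthetic-data path is a placeholder:

    # Generic example metric: per-column total variation distance (TVD) between
    # real and synthetic categorical marginals. Lower is better.
    import pandas as pd

    def column_tvd(real_col: pd.Series, synth_col: pd.Series) -> float:
        p = real_col.value_counts(normalize=True)
        q = synth_col.value_counts(normalize=True)
        support = p.index.union(q.index)
        return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

    real = pd.read_csv("./data/adult/k1000/train.csv")
    synth = pd.read_csv("path/to/synthetic.csv")  # placeholder: your generated data
    for col in real.select_dtypes(include="object").columns:
        print(col, round(column_tvd(real[col], synth[col]), 4))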

Baselines

To run the SmartNoise baseline models:

  1. It is recommended to create a separate environment due to dependency conflicts:

    conda create -n smartnoise python=3.9
    conda activate smartnoise
  2. Install the baseline dependencies:

    pip install -r requirements.txt
    pip install smartnoise-synth

Alternatively, to return to this project's setup afterwards, revert the opacus version by reinstalling the dependencies from requirements.txt.

To run the baselines, modify scripts/baselines.sh as needed, then run:

bash scripts/baselines.sh
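For reference, the SmartNoise synthesizers can also be driven directly from Python. A hedged sketch following the smartnoise-synth quickstart; the synthesizer name, epsilon, and budget split are illustrative, not this project's settings:

    # Hedged smartnoise-synth usage sketch (quickstart-style); values are illustrative.
    import pandas as pd
    from snsynth import Synthesizer

    df = pd.read_csv("./data/adult/k1000/train.csv")
    synth = Synthesizer.create("mst", epsilon=1.0, verbose=True)
    sample = synth.fit_sample(df, preprocessor_eps=0.5)  # part of the budget preprocesses columns
    print(sample.head())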

Citation

@article{
   afonja2025dpstage,
   title={{DP}-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators},
   author={Tejumade Afonja and Hui-Po Wang and Raouf Kerkouche and Mario Fritz},
   journal={Transactions on Machine Learning Research},
   year={2025},
   url={https://openreview.net/forum?id=6nBIweDYzZ},
}
