ADAL: AI-Generated Text Detection using Adversarial Learning

AI-generated text detection algorithm based on Adaptive Learning and Adversarial Training

Shushanta Pudasaini · TU Dublin PhD Research · September 2023 – September 2027

Overview

ADAL is an adversarially trained AI-generated text detector based on the RADAR framework (Hu et al., NeurIPS 2023), extended to the RAID benchmark with multi-generator training and a multi-evasion attack pool. The system trains a detector (RoBERTa-large) and a paraphraser (T5-base) in an adversarial game: the paraphraser learns to rewrite AI-generated text so it evades detection, while the detector learns to remain robust against those rewrites. The result is a detector that generalises across 11 AI generators and maintains high AUROC under five distinct evasion attacks.

Best result: macro AUROC 0.9951 across all 11 RAID generators (validation set), robust to all attack types.

Background

The proliferation of large language models has made AI-generated text increasingly difficult to distinguish from human writing. Existing detectors tend to be brittle — they perform well on clean text but degrade significantly when the text is lightly paraphrased or subjected to simple character-level perturbations. ADAL addresses this by training the detector adversarially: rather than optimising purely for clean-text detection, the detector is hardened against a pool of evasion strategies applied during training.

The approach draws on two complementary resources:

RADAR (Robust AI-Text Detection via Adversarial Learning) — the adversarial training framework using Clipped PPO with Entropy Penalty (cppo-ep) to train the paraphraser against the detector.
RAID (Robust AI-text Detection benchmark) — a large-scale dataset covering 11 generators, 4 decoding strategies, and 12 attack types across 11 domains.

Architecture

RAID train split (attack='none')
        │
        ▼
   ┌────────────┐      ┌─────────────────────────────────┐
   │  xm (AI)   │─────▶│  Gσ — Paraphraser (T5-base)     │──▶ xp_ppo
   └────────────┘      │  ramsrigouthamg/t5_paraphraser  │
                       └─────────────────────────────────┘
                                        │
                              PPO reward R(xp, φ)
                                        │
   ┌────────────┐      ┌─────────────────────────────────┐
   │  xh (human)│─────▶│  Dϕ — Detector (RoBERTa-large)  │──▶ AUROC
   │  xm (AI)   │─────▶│  roberta-large                  │
   │  xp_ppo    │─────▶│  (trained via reweighted        │
   │  xp_det_k  │─────▶│   logistic loss)                │
   └────────────┘      └─────────────────────────────────┘

The detector is trained on human text, original AI text, T5-paraphrased AI text, and four deterministic attack variants simultaneously. The paraphraser is updated via PPO to maximise the reward signal (the detector's human-probability score assigned to its paraphrases).
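The reward described above can be sketched in a few lines. This is a minimal illustration, assuming a two-class detector head whose logits are ordered [AI, human]; the function name `human_probability_reward` and the label order are assumptions, not taken from the training script.

```python
import math

def human_probability_reward(logits, human_index=1):
    """Return the softmax probability of the 'human' class for one paraphrase.

    `logits` holds the two detector logits for a single text. `human_index`
    is an assumed label order (hypothetical; check the real detector config).
    """
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[human_index] / sum(exps)

# A paraphrase the detector mistakes for human text earns a reward close to 1.
print(human_probability_reward([-2.0, 2.0]))
```

Under this reading, the paraphraser is rewarded exactly when the detector is fooled, which is what makes the two models adversaries.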

Dataset

All experiments use the RAID dataset (liamdugan/raid, ACL 2024).

Split	Human texts	AI generators	AI texts per generator	Attack filter
Train	~13,364	11	~26,000–53,000	`attack='none'` (clean only)
Val (internal 10%)	~1,364	11	~2,600–5,300	same

AI generators: chatgpt, gpt-3, gpt-4, gpt-2, cohere, cohere-chat, llama-chat, mistral, mistral-chat, mpt, mpt-chat
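
The clean-only filter and the internal 10% hold-out can be illustrated on plain Python records. This is a sketch under assumptions: the field names `model`, `attack`, and `generation` are believed to follow the RAID schema, the helper `clean_train_val_split` is hypothetical, and the real script may split differently (for example, stratified by generator).

```python
import random

def clean_train_val_split(records, val_ratio=0.1, seed=0):
    """Keep rows with attack == 'none', then hold out `val_ratio` for validation."""
    clean = [r for r in records if r["attack"] == "none"]
    rng = random.Random(seed)                # fixed seed so the split is reproducible
    rng.shuffle(clean)
    n_val = round(len(clean) * val_ratio)
    return clean[n_val:], clean[:n_val]

# Toy records standing in for RAID rows (illustrative data only).
toy = [{"model": "human" if i % 2 else "gpt-4",
        "attack": "none" if i % 5 else "homoglyph",
        "generation": f"text {i}"} for i in range(100)]

train, val = clean_train_val_split(toy)
print(len(train), len(val))
```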

Attack Pool

Training uses a two-track multi-evasion attack pool:

Track	Attack	Description
A — Learnable	T5 paraphrase	PPO-trained; provides the reward signal to the paraphraser
B — Deterministic	Synonym replacement	WordNet POS-aware; 20% token replacement rate
B — Deterministic	Homoglyphs	ASCII → Unicode substitution; 10% character rate
B — Deterministic	Article deletion	Removes a / an / the; 50% drop rate
B — Deterministic	Misspelling	QWERTY adjacency typos; 8% character rate
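
Three of the Track B attacks are simple enough to sketch with the standard library. The code below is an illustration only: the substitution tables are tiny hypothetical samples, the function names are invented for this example, and the rates are applied per eligible token or character, which may differ from how the training script counts them. Synonym replacement is omitted because it needs WordNet.

```python
import random

# Small illustrative tables; real attack tables are likely much larger.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}
QWERTY_NEIGHBOURS = {"a": "qwsz", "e": "wsdr", "o": "iklp", "t": "rfgy", "n": "bhjm"}
ARTICLES = {"a", "an", "the"}

def delete_articles(text, rate=0.5, rng=random):
    """Drop each article (a / an / the) with probability `rate`."""
    kept = [w for w in text.split()
            if not (w.lower() in ARTICLES and rng.random() < rate)]
    return " ".join(kept)

def substitute_chars(text, table, rate, rng=random):
    """Replace each character that has an entry in `table` with probability `rate`."""
    out = []
    for ch in text:
        if ch in table and rng.random() < rate:
            out.append(rng.choice(table[ch]))
        else:
            out.append(ch)
    return "".join(out)

def homoglyph_attack(text, rate=0.10, rng=random):
    """Swap ASCII letters for look-alike Unicode characters."""
    return substitute_chars(text, HOMOGLYPHS, rate, rng)

def misspelling_attack(text, rate=0.08, rng=random):
    """Swap letters for a neighbouring key on a QWERTY keyboard."""
    return substitute_chars(text, QWERTY_NEIGHBOURS, rate, rng)

rng = random.Random(0)
sample = "the cat sat on a mat near the open door"
print(delete_articles(sample, rng=rng))
print(misspelling_attack(sample, rng=rng))
```

Passing a seeded `random.Random` keeps the perturbations reproducible, which helps when the same attacked texts must be regenerated for evaluation.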

The detector's training loss is extended across all five attack types:

L_D(φ) = L_human + λ·(L_xm + L_t5_para + L_synonym + L_homoglyphs + L_article + L_misspelling)
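
As a rough sketch of how this sum could be computed, the example below treats every term as a mean negative log-likelihood over the detector's human-probability outputs. The helper names, the value `lam=0.5`, and the omission of label smoothing are all assumptions made for illustration; the value of λ is not stated here.

```python
import math

def mean_nll(probs_human, target_is_human, eps=1e-7):
    """Mean negative log-likelihood of the correct class over one batch."""
    total = 0.0
    for p in probs_human:
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        total += -math.log(p if target_is_human else 1.0 - p)
    return total / len(probs_human)

def detector_loss(human_probs, ai_probs_by_source, lam=0.5):
    """L_D = L_human + lam * (sum of the per-source AI losses).

    `ai_probs_by_source` maps a source name (xm, t5_para, synonym, ...) to the
    detector's human-probabilities on that batch. `lam=0.5` is a placeholder.
    """
    l_human = mean_nll(human_probs, target_is_human=True)
    l_ai = sum(mean_nll(p, target_is_human=False)
               for p in ai_probs_by_source.values())
    return l_human + lam * l_ai

batches = {"xm": [0.1, 0.2], "t5_para": [0.3], "homoglyphs": [0.05]}
print(detector_loss([0.9, 0.8], batches))
```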

Training Details

Hyperparameter	Value
Paraphraser	`ramsrigouthamg/t5_paraphraser` (T5-base, ~220M)
Detector	`roberta-large` (355M)
Paraphraser LR	2e-5
Detector LR	3e-6
PPO epsilon (clip)	0.2
PPO epochs per buffer	8
PPO buffer size	64
KL coefficient	0.001
Label smoothing α	0.15
Detector update frequency	every 2 outer steps
AUROC freeze threshold	0.995
Max outer steps	200
Early stopping patience	25 steps
Generation	`do_sample=True`, `top_k=50`, `temperature=1.0`
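
One way the scheduling hyperparameters above could interact is sketched below. The semantics are assumed, not confirmed by the training script: here the detector update is skipped whenever validation AUROC is already at or above the freeze threshold, and early stopping counts steps without a new best AUROC. The function `run_schedule` is hypothetical.

```python
def run_schedule(val_aurocs, update_every=2, freeze_at=0.995,
                 max_steps=200, patience=25):
    """Walk a list of validation AUROCs and report what the outer loop would do.

    Returns (steps_run, detector_updates, best_auroc).
    """
    best, since_best, updates, steps_run = float("-inf"), 0, 0, 0
    for step, auroc in enumerate(val_aurocs[:max_steps], start=1):
        steps_run = step
        # Assumed semantics: freeze the detector while it is above the
        # threshold so the paraphraser has a chance to catch up.
        if step % update_every == 0 and auroc < freeze_at:
            updates += 1
        if auroc > best:
            best, since_best = auroc, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                        # early stopping
    return steps_run, updates, best

# A flat AUROC curve stops after the patience window is exhausted.
print(run_schedule([0.9] * 30))
```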

Training stability fixes applied:

NanSafeLogitsProcessor — intercepts NaN/inf logits before torch.multinomial inside every generate() call
PPO log-ratio clamped to [-5, 5] to prevent exp() overflow
Advantage clamped to [-3, 3]
NaN gradients zeroed after PPO backward before optimizer.step()
compute_logprobs output clamped at −100 and passed through nan_to_num
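
The numeric clamps in this list can be shown on a single-sample PPO term. This is a plain-Python sketch, not the training code: the function names are hypothetical, NaN is mapped to 0.0 to mirror the default of `torch.nan_to_num`, and the entropy and KL terms are left out.

```python
import math

def safe_logprob(x, floor=-100.0):
    """Clamp a log-probability at `floor` and replace NaN."""
    if math.isnan(x):
        return 0.0                           # assumed: same default as torch.nan_to_num
    return max(x, floor)

def clipped_ppo_term(new_logprob, old_logprob, advantage,
                     clip_eps=0.2, ratio_clamp=5.0, adv_clamp=3.0):
    """Clipped PPO surrogate loss for one sample, with the clamps listed above."""
    log_ratio = safe_logprob(new_logprob) - safe_logprob(old_logprob)
    log_ratio = min(max(log_ratio, -ratio_clamp), ratio_clamp)   # keeps exp() finite
    ratio = math.exp(log_ratio)
    adv = min(max(advantage, -adv_clamp), adv_clamp)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # PPO maximises min(ratio * A, clipped_ratio * A); return it negated as a loss.
    return -min(ratio * adv, clipped * adv)

# An extreme log-ratio and advantage still produce a bounded loss.
print(clipped_ppo_term(-1.0, -50.0, 10.0))
```

Clamping the log-ratio before exponentiating matters because a difference of a few hundred nats would otherwise overflow and propagate inf or NaN into the gradients.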

Results

Per-Generator AUROC (validation set, best checkpoint)

Generator	AUROC
gpt-4	0.9995
llama-chat	0.9994
mistral-chat	0.9994
chatgpt	0.9991
mpt-chat	0.9991
gpt-3	0.9982
gpt-2	0.9954
cohere-chat	0.9934
mistral	0.9913
mpt	0.9865
cohere	0.9852
Macro average	0.9951
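
For readers who want to check the arithmetic, the macro average is the unweighted mean of the per-generator values. The sketch below also includes a pairwise AUROC helper for intuition; it assumes higher scores mean "more likely AI" and is not the evaluation code used for this table.

```python
def auroc(scores_ai, scores_human):
    """Probability that a random AI text outscores a random human text.

    Ties count as half; this is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            wins += 1.0 if a > h else 0.5 if a == h else 0.0
    return wins / (len(scores_ai) * len(scores_human))

# Per-generator values copied from the table above.
per_generator = {
    "gpt-4": 0.9995, "llama-chat": 0.9994, "mistral-chat": 0.9994,
    "chatgpt": 0.9991, "mpt-chat": 0.9991, "gpt-3": 0.9982,
    "gpt-2": 0.9954, "cohere-chat": 0.9934, "mistral": 0.9913,
    "mpt": 0.9865, "cohere": 0.9852,
}
macro = sum(per_generator.values()) / len(per_generator)
print(round(macro, 4))   # 0.9951
```

Because the average is unweighted, generators with fewer samples count as much as larger ones.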

Per-Attack AUROC (robustness evaluation)

Attack	AUROC
T5 paraphrase	1.0000
Misspelling	0.9996
Homoglyphs	0.9994
No attack	0.9994
Article deletion	0.9993
Synonym replacement	0.9990

No attack causes a meaningful drop relative to the no-attack baseline (0.9994): the largest decrease is 0.0004, under synonym replacement.

Repository Structure

adal/
├── fourth_adal_train.py           # Main adversarial training script
├── submit_to_raid_leaderboard.py  # RAID leaderboard submission script
├── radar_multievasion/
│   ├── best_detector/             # Saved RoBERTa-large detector checkpoint
│   ├── best_paraphraser/          # Saved T5-base paraphraser checkpoint
│   ├── per_generator_auroc.tsv    # Per-generator AUROC at best checkpoint
│   └── per_attack_auroc.tsv       # Per-attack AUROC at best checkpoint
└── FOURTH_ADAL.log                # Full training log

Installation

pip install raid-bench torch transformers scikit-learn nltk
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger_eng'); nltk.download('punkt_tab')"

Training

CUDA_VISIBLE_DEVICES=0 python fourth_adal_train.py > FOURTH_ADAL.log 2>&1

Key config variables at the top of fourth_adal_train.py:

NUM_HUMAN_SAMPLES     = None   # None = all available (~13,364)
SAMPLES_PER_GENERATOR = None   # None = all available (~26k–53k per generator)
VAL_SPLIT_RATIO       = 0.1    # 10% held out for validation
MAX_OUTER_STEPS       = 200
PATIENCE              = 25     # early stopping

References

@inproceedings{hu2023radar,
  title     = {RADAR: Robust AI-Text Detection via Adversarial Learning},
  author    = {Hu, Xiaomeng and Chen, Pin-Yu and Ho, Tsung-Yi},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2023}
}

@inproceedings{dugan2024raid,
  title     = {RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors},
  author    = {Dugan, Liam and Hwang, Alyssa and Trhlik, Filip and Ippolito, Daphne and Callison-Burch, Chris},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2024}
}

Author

Shushanta Pudasaini
PhD Researcher, Technological University Dublin
Supervisors: Dr. Marisa Llorens Salvador · Dr. Luis Miralles-Pechuán · Dr. David Lillis
