Open-source benchmarking framework for evaluating and comparing Adversarial Domain Generation Algorithms (DGAs) in a single, unified environment. It assesses each model along three dimensions: lexical characteristics, detection evasion against deep-learning classifiers, and computational cost (training and generation time).
This is the public artifact accompanying the DIMVA 2026 poster "The Simpler, the Stealthier: A Framework for Evaluating Adversarial Domain Generation Algorithm Models" (see Citation).
The framework is modular, with three decoupled layers coordinated by a central
orchestrator (core/framework.py):
- Generation layer: the adversarial DGA models DeepDGA, CharBot, Deception, and MaskDGA. Two control models provide reference baselines: malicious_dga (real AGDs from DGArchive) and benign_domains (legitimate domains from the Tranco list).
- Detection layer: two character-level classifiers trained to flag algorithmically generated domains, an LSTM (Woodbridge et al.) and a CNN.
- Analysis layer (core/analysis/): computes the following for every model and control group:
  - Lexical statistics: Shannon entropy, vowel ratio, consonant ratio, digit ratio, unique-character ratio, maximum consecutive consonants, and domain length.
  - Detection statistics: the evasion rate (fraction of generated domains classified as benign by each detector).
  - Timing: training, generation, and inference times.
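
The lexical and detection statistics listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in core/analysis/: the function names are hypothetical, ratios are assumed to be taken over the full string length (dots included), and only a, e, i, o, u are treated as vowels. The framework's exact definitions may differ.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: 'y' is counted as a consonant


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def max_consecutive_consonants(s: str) -> int:
    """Length of the longest run of consonant letters."""
    longest = run = 0
    for ch in s:
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # vowels, digits, dots, and hyphens break the run
    return longest


def lexical_stats(domain: str) -> dict:
    """Per-domain lexical features (hypothetical key names)."""
    s = domain.lower()
    n = len(s)
    if n == 0:
        raise ValueError("empty domain")
    letters = [ch for ch in s if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    return {
        "length": n,
        "entropy": shannon_entropy(s),
        "vowel_ratio": vowels / n,
        "consonant_ratio": (len(letters) - vowels) / n,
        "digit_ratio": sum(ch.isdigit() for ch in s) / n,
        "unique_char_ratio": len(set(s)) / n,
        "max_consecutive_consonants": max_consecutive_consonants(s),
    }


def evasion_rate(predicted_benign: list[bool]) -> float:
    """Fraction of generated domains that a detector classified as benign."""
    if not predicted_benign:
        raise ValueError("no predictions")
    return sum(predicted_benign) / len(predicted_benign)
```

Per-model statistics would then be aggregates (for example, means) of these per-domain values over the generated samples.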
New models and detectors are added by subclassing the abstract base classes in
core/adversarial_model.py and core/detector.py.
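
As a rough illustration of the subclassing pattern, the sketch below defines a stand-in abstract base class and a toy model that extends it. The class names and the fit/generate method names are assumptions made for this example; check core/adversarial_model.py for the real interface before writing a model.

```python
import random
from abc import ABC, abstractmethod


class AdversarialModelSketch(ABC):
    """Hypothetical stand-in for the base class in core/adversarial_model.py."""

    @abstractmethod
    def fit(self, domains: list[str]) -> None:
        """Learn from benign training domains."""

    @abstractmethod
    def generate(self, n: int) -> list[str]:
        """Return n generated domains."""


class UniformCharModel(AdversarialModelSketch):
    """Toy model: samples characters uniformly from the training alphabet."""

    def __init__(self, seed: int = 42, tld: str = "com"):
        self._rng = random.Random(seed)  # private RNG keeps runs reproducible
        self._tld = tld
        self._alphabet: list[str] = []
        self._lengths: list[int] = []

    def fit(self, domains: list[str]) -> None:
        # Use the first label of each domain as a rough second-level domain.
        labels = [d.split(".")[0].lower() for d in domains if d]
        self._alphabet = sorted({ch for label in labels for ch in label})
        self._lengths = [len(label) for label in labels if label]
        if not self._alphabet:
            raise ValueError("no usable training domains")

    def generate(self, n: int) -> list[str]:
        if not self._alphabet:
            raise RuntimeError("call fit() before generate()")
        out = []
        for _ in range(n):
            length = self._rng.choice(self._lengths)
            sld = "".join(self._rng.choice(self._alphabet) for _ in range(length))
            out.append(f"{sld}.{self._tld}")
        return out
```

A detector would follow the same pattern against the base class in core/detector.py.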
core/               Orchestrator, base classes, dataset splits, analysis pipelines
  analysis/         Statistical, detection, and time evaluation
models/             Adversarial DGA models + benign/malicious control models
detectors/          LSTM and CNN detectors
main.py             End-to-end example run
requirements.txt    Python dependencies
- Python 3.9–3.12
- Dependencies (pinned in requirements.txt): numpy==1.26.4, tensorflow==2.18.0, tldextract==5.3.1
pip install -r requirements.txt

The datasets are not bundled with this repository. You must obtain them from
their original sources and place them under dataset/ as described below.
- Tranco (benign domains): download a list from tranco-list.eu and save it as dataset/top-1m.csv. The expected format is the standard Tranco CSV (rank,domain); the domain is read from the second column. The file must contain at least 372,000 rows for the split below (D1 and D2 use rows 0 to 371,999; D3 uses the remainder).
  Example (dataset/top-1m.csv):
      1,google.com
      2,gtld-servers.net
      3,googleapis.com
- DGArchive (malicious AGDs): one CSV per malware family, saved as dataset/dgarchive/<family>_dga.csv. These files contain one domain per line. The domain is read from the last comma-separated column, so an optional leading "date," prefix is also supported.
  Example (dataset/dgarchive/dyre_dga.csv):
      a000139310b8754d96d02c8bf12955c63f.hk
      a00029889b4d3d8d9476fc4bd38683d500.tk
      a0002b50845121aad3fca5367e8eab4ef0.hk
  The example run (main.py) uses four representative families as malicious control baselines: dyre_dga.csv, suppobox_dga.csv, qakbot_dga.csv, and rovnix_dga.csv. The detectors are trained on an equal-per-family draw across all family CSVs present in dataset/dgarchive/, so add as many families as you need to reproduce the detector training set.
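
The two file formats described above can be read with a few lines of standard-library Python. The sketch below is illustrative only; the function names are hypothetical and this is not the framework's loader. Both functions take any iterable of text lines, so they work on an open file handle as well as on an in-memory list.

```python
import csv
from typing import Iterable


def parse_tranco(lines: Iterable[str]) -> list[str]:
    """Read 'rank,domain' rows and return the domain (second column)."""
    return [row[1].strip() for row in csv.reader(lines) if len(row) >= 2]


def parse_dgarchive(lines: Iterable[str]) -> list[str]:
    """Read one domain per line, taking the last comma-separated column.

    This accepts both 'domain' and 'date,domain' rows.
    """
    domains = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            domains.append(line.rsplit(",", 1)[-1].strip())
    return domains
```

To load a real file, pass an open handle, for example `with open("dataset/top-1m.csv", newline="", encoding="utf-8") as fh: benign = parse_tranco(fh)`.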
core/data_splits.py is the single source of truth for the disjoint
D1/D2/D3 partitioning (row intervals, half-open):
| Split | Tranco rows (benign) | DGArchive rows (per family) | Role |
|---|---|---|---|
| D1 | [0, 256000) | not used | Training the adversarial models |
| D2 | [256000, 372000) | [0, 50000) | Training the detectors |
| D3 | [372000, end) | [50000, end) | Control-group baselines |
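
Because the intervals are half-open, they map directly onto Python slices, and adjacent splits share no rows. The sketch below illustrates this with the row boundaries from the table; the constant and function names are hypothetical and are not the ones defined in core/data_splits.py.

```python
# Row intervals from the table above, as (start, stop) pairs.
# stop=None means "to the end of the file".
TRANCO_SPLITS = {
    "D1": (0, 256_000),
    "D2": (256_000, 372_000),
    "D3": (372_000, None),
}
DGARCHIVE_SPLITS = {
    "D2": (0, 50_000),
    "D3": (50_000, None),
}


def take_split(rows: list, interval: tuple) -> list:
    """Return rows[start:stop]; the stop index itself is excluded."""
    start, stop = interval
    return rows[start:stop]


def split_all(rows: list, splits: dict) -> dict:
    """Apply every interval in `splits` to the same list of rows."""
    return {name: take_split(rows, interval) for name, interval in splits.items()}
```

With a 400,000-row Tranco list, for instance, D1 receives 256,000 rows, D2 receives 116,000, and D3 receives the remaining 28,000.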
python3 main.py

The example pipeline (seeded with SEED = 42 for reproducibility):
- Instantiates and fits every model and the two detectors.
- Generates 100,000 domains per model and saves them under my_eval_workspace/samples/ (pre-generated sample files are reused if present).
- Runs the statistical, time, and detection analyses.
Note: no pre-trained weights are shipped. On the first run, the detectors and the learning-based models (DeepDGA, MaskDGA) are trained from scratch and their weights are cached under each component's weights/ directory, so subsequent runs skip training. Training DeepDGA and MaskDGA is computationally expensive; CharBot and Deception are near-instant.
All results are written under my_eval_workspace/:
my_eval_workspace/
samples/
<Model>_samples.txt Generated domains
analysis/
statistical/<Model>_statistical_eval.json Lexical statistics
time/<Model>_time_eval.json Training/generation times
time/<Detector>_time_eval.json Training/inference times
detection/<detector>/<model>_evasion.json Evasion rate per model
Each analysis JSON is stamped with run metadata (timestamp, seed, library
versions, and dataset fingerprints) for reproducibility. Analyses can be run on
full domains (SLD.TLD) or on the second-level domain only via
Framework.set_analysis_mode(sld_only=True).
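
For readers who want to see what such a metadata stamp might look like, here is a minimal sketch. The JSON layout, key names, and the use of SHA-256 for dataset fingerprints are assumptions made for illustration; the framework's actual schema and hashing scheme may differ.

```python
import hashlib
import platform
from datetime import datetime, timezone
from importlib import metadata
from typing import Optional


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of a dataset's raw bytes (assumed fingerprint scheme)."""
    return hashlib.sha256(data).hexdigest()


def library_version(name: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def stamp(results: dict, seed: int, datasets: dict[str, bytes]) -> dict:
    """Wrap analysis results with run metadata (hypothetical layout)."""
    libs = ("numpy", "tensorflow", "tldextract")
    return {
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "seed": seed,
            "versions": {
                "python": platform.python_version(),
                **{lib: library_version(lib) for lib in libs},
            },
            "dataset_fingerprints": {
                name: fingerprint(blob) for name, blob in datasets.items()
            },
        },
        "results": results,
    }
```

The returned dictionary contains only strings, numbers, and None values in its metadata, so it can be written directly with json.dump.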
Distributed under the GNU Affero General Public License v3.0 (AGPL-3.0). See
LICENSE.
Citation: TBD