Curriculum-Guided Decentralized Synthetic Data Generation with History-Aware Classifier-Free Guidance
This project explores a method for diversifying instruction datasets for LLMs. The core idea is to iteratively generate augmented instructions, identify dense (semantically similar) regions in the generated data using embeddings, and then use these dense regions as negative prompts for Classifier-Free Guidance (CFG) to encourage the generation of novel, diverse instructions.
```
cohere-aya-expedition/
├── controller.py        # Manages embeddings, vector DB, and density selection
├── dria.py              # Orchestrates the DRIA process (bootstrap & guided generation)
├── generate_dataset.py  # Script to generate datasets using DRIA and a baseline
├── inst2gen.py          # Converts augmented instructions into a question-answer dataset format
├── lm_eval.py           # Wrapper for lm-evaluation-harness
├── metrics.py           # Computes diversity metrics for generated datasets
├── node.py              # LLMNode class for text generation (standard & CFG)
├── prompts.py           # Stores various prompt templates
├── requirements.txt     # Python dependencies
├── run_model.py         # Handles model loading and fine-tuning (SFT)
└── test_pipe.py         # Example pipeline for data generation, fine-tuning, and evaluation
```
`node.py` (`LLMNode`): Encapsulates an LLM, providing methods for standard text generation and for generation with Classifier-Free Guidance (CFG) using positive and negative prompts.

`controller.py` (`Controller`, `E5Embedder`, `VectorDB`):
- `E5Embedder`: Generates text embeddings.
- `VectorDB`: Stores and searches text embeddings.
- `Controller`: Uses the `VectorDB` to `select()` texts from dense regions in the embedding space. These selected texts are used as negative prompts.
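The repository's `select()` implementation is not reproduced here, but density selection over embeddings typically amounts to scoring each stored text by its similarity to its nearest neighbors and returning the highest-scoring ones. A minimal sketch, assuming `sentence-transformers` with an E5 checkpoint; the class shape, `add()` helper, and `select()` signature are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of dense-region selection, assuming sentence-transformers
# with an E5 model. The real Controller/VectorDB implementation may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

class Controller:
    def __init__(self, model_name: str = "intfloat/e5-base-v2"):
        self.embedder = SentenceTransformer(model_name)
        self.texts: list[str] = []

    def add(self, texts: list[str]) -> None:
        self.texts.extend(texts)

    def select(self, k: int = 5, n_dense: int = 3) -> list[str]:
        """Return the n_dense texts sitting in the densest embedding regions,
        scored by mean cosine similarity to their k nearest neighbors."""
        emb = self.embedder.encode(
            [f"query: {t}" for t in self.texts],  # E5 models expect a task prefix
            normalize_embeddings=True,
        )
        sims = emb @ emb.T                 # cosine similarity (embeddings normalized)
        np.fill_diagonal(sims, -np.inf)    # ignore self-similarity
        topk = np.sort(sims, axis=1)[:, -k:]
        density = topk.mean(axis=1)        # higher = denser neighborhood
        densest = np.argsort(density)[-n_dense:]
        return [self.texts[i] for i in densest]
```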
`dria.py` (`Dria`): The main orchestrator.
- Initializes multiple `LLMNode` instances.
- Bootstrap Phase: Generates initial augmented instructions from a base instruction.
- Guided Generation Phase (iterative):
  - Uses the `Controller` to `select()` dense instructions.
  - Each `LLMNode` generates a new augmentation using its previous output (or the base instruction) as a positive prompt and the selected dense instructions as negative prompts (CFG).
  - If the new augmentation is still too similar to existing ones, the guidance scale is increased and generation is retried (see the CFG sketch below).
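HuggingFace transformers supports CFG for text generation natively (since v4.32) via the `guidance_scale` and `negative_prompt_ids` arguments to `generate()`, which is one plausible way `node.py` realizes the positive/negative-prompt generation described above. A hedged sketch; the class and method names are illustrative:

```python
# Sketch of an LLMNode-style wrapper with CFG, assuming HuggingFace
# transformers >= 4.32. Names and defaults are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMNode:
    def __init__(self, model_name: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
        self.device = device

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Standard sampling, no guidance."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        out = self.model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        return self.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                     skip_special_tokens=True)

    def generate_cfg(self, positive: str, negatives: list[str],
                     guidance_scale: float = 1.5, max_new_tokens: int = 256) -> str:
        """CFG: steer logits toward the positive prompt and away from the
        concatenated negative (dense-region) prompts."""
        pos = self.tokenizer(positive, return_tensors="pt").to(self.device)
        neg = self.tokenizer("\n".join(negatives), return_tensors="pt").to(self.device)
        out = self.model.generate(
            **pos,
            negative_prompt_ids=neg["input_ids"],
            guidance_scale=guidance_scale,  # > 1.0 pushes harder away from negatives
            max_new_tokens=max_new_tokens,
            do_sample=True,
        )
        return self.tokenizer.decode(out[0, pos["input_ids"].shape[1]:],
                                     skip_special_tokens=True)
```

Under this shape, the escalation step described above is simply a retry loop that calls `generate_cfg()` again with a larger `guidance_scale` until the output clears the similarity threshold.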
`generate_dataset.py`: Uses `Dria` to create instruction datasets (`dria` and `bl`, the baseline) and computes diversity metrics.

`inst2gen.py`: Processes the augmented instructions from `generate_dataset.py` and uses an `LLMNode` to convert them into a final question-answer pair dataset format.

`metrics.py`: Calculates semantic diversity, MST diversity, and convex hull area from embeddings.

`run_model.py` & `test_pipe.py`: Support fine-tuning a model on the generated datasets and evaluating it using `lm-evaluation-harness`.
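The exact formulations in `metrics.py` are not shown in this README; the sketch below uses common definitions for the three names above, which are assumptions: mean pairwise cosine distance, total edge weight of the minimum spanning tree over the pairwise-distance graph, and the area of the convex hull of a 2-D PCA projection.

```python
# Sketch of three embedding-diversity metrics; formulations are assumed, and
# metrics.py may define them differently.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

def diversity_metrics(embeddings: np.ndarray) -> dict:
    n = len(embeddings)
    dists = squareform(pdist(embeddings, metric="cosine"))
    # Semantic diversity: mean pairwise cosine distance.
    semantic = dists[np.triu_indices(n, k=1)].mean()
    # MST diversity: total MST edge weight (larger = points more spread out).
    mst_div = minimum_spanning_tree(dists).sum()
    # Convex hull area: project to 2-D first, since hull volume is degenerate
    # in high dimensions for small n.
    pts_2d = PCA(n_components=2).fit_transform(embeddings)
    hull_area = ConvexHull(pts_2d).volume  # in 2-D, .volume is the area
    return {"semantic": semantic, "mst": mst_div, "hull_area": hull_area}
```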
- Dataset Generation (`generate_dataset.py`):
  - The `Dria` class is used to generate a set of augmented instructions via the iterative guided process.
  - A baseline dataset is also generated (likely without the guided diversification).
  - Diversity metrics are computed for the generated instruction embeddings.
- Instruction to Q/A Conversion (`inst2gen.py`):
  - The augmented instructions are further processed into a final dataset, typically a question-answer format suitable for supervised fine-tuning (a sketch follows this list).
- Model Fine-tuning & Evaluation (
test_pipe.py):- The generated dataset is used to fine-tune a base LLM (e.g.,
microsoft/Phi-4-mini-instruct). - The fine-tuned model is evaluated on downstream tasks (e.g.,
tinyGSM8k) usinglm-evaluation-harness.
- The generated dataset is used to fine-tune a base LLM (e.g.,
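Step 2's conversion amounts to answering each augmented instruction and storing the pair. A minimal sketch, reusing the hypothetical `LLMNode.generate()` from the CFG sketch above; the prompt wording and output path are illustrative (the repo's actual templates live in `prompts.py`):

```python
# Sketch of the instruction -> Q/A conversion step. The template and file
# layout are assumptions; inst2gen.py may structure records differently.
import json

ANSWER_TEMPLATE = (
    "Answer the following instruction clearly and concisely.\n\n"
    "Instruction: {instruction}\n\nAnswer:"
)

def instructions_to_qa(node, instructions, out_path="qa_dataset.jsonl"):
    """Write one {"question", "answer"} JSON record per augmented instruction."""
    with open(out_path, "w") as f:
        for inst in instructions:
            answer = node.generate(ANSWER_TEMPLATE.format(instruction=inst))
            f.write(json.dumps({"question": inst, "answer": answer}) + "\n")
```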
- Clone the repository.
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- Ensure `lm-evaluation-harness` is cloned and accessible (as suggested by `lm_eval.py`).
- Execute `generate_dataset.py` to create the instruction datasets.
- Execute `inst2gen.py` to convert these instructions into the final dataset format.
- Execute `test_pipe.py` to run an example fine-tuning and evaluation pipeline (modify paths and model names in `test_pipe.py` as needed). A sketch of such a pipeline follows.
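For orientation, here is a hedged sketch of what a `test_pipe.py`-style run could look like, assuming TRL's `SFTTrainer` for fine-tuning and lm-eval's Python API (v0.4+) for evaluation; the file paths, column mapping, and hyperparameters are all assumptions, not the repo's actual configuration:

```python
# Sketch of a fine-tune-then-evaluate pipeline. Assumes TRL's SFTTrainer and
# lm-evaluation-harness's Python API; paths and hyperparameters are illustrative.
import lm_eval
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Flatten the Q/A records into the single "text" column SFTTrainer uses by
# default (column names follow the inst2gen sketch above).
dataset = load_dataset("json", data_files="qa_dataset.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": f"Question: {ex['question']}\nAnswer: {ex['answer']}"}
)

trainer = SFTTrainer(
    model="microsoft/Phi-4-mini-instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="phi4-mini-dria-sft", num_train_epochs=1),
)
trainer.train()
trainer.save_model("phi4-mini-dria-sft")

# Evaluate the fine-tuned checkpoint on a downstream task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=phi4-mini-dria-sft",
    tasks=["tinyGSM8k"],
)
print(results["results"])
```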