An application for generating high-quality SFT and alignment datasets from domain-specific text files using LLMs. Supports single-turn and multi-turn QA generation, ORPO-style preference data, real-time progress logs, and flexible configuration of LLM endpoints. Ideal for fine-tuning and alignment of custom language models.

namantiwari2002/DataAugmenToolkit

📚 Data-Augmentation Toolkit

A Streamlit-based UI that lets you:

  • Upload single- or multi-turn SFT / Alignment datasets (.jsonl or .csv)
  • Validate the file before launching the heavy pipeline
  • Run the data-augmentation pipeline with live logs & progress bar
  • Download the generated JSONL in one click
  • Check your LLM endpoint instantly via a Health-Check button
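The Health-Check step in the list above could look roughly like the sketch below. It is not the toolkit's actual implementation: it assumes the endpoint is OpenAI-compatible and lists its models at `GET {base_url}/models`, and the function name `check_endpoint` and its `opener` parameter are hypothetical, introduced only so the check can be exercised without a live server.

```python
import json
import urllib.error
import urllib.request


def check_endpoint(base_url, api_key, model_name,
                   opener=urllib.request.urlopen, timeout=10):
    """Return (ok, message) for an assumed OpenAI-compatible endpoint."""
    # Assumption: the server exposes a model listing at <base_url>/models.
    url = base_url.rstrip("/") + "/models"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    try:
        with opener(request, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, ValueError, TimeoutError) as exc:
        # HTTPError is a URLError; JSON decoding errors are ValueErrors.
        return False, f"Endpoint unreachable or invalid response: {exc}"

    # Collect the model ids, tolerating unexpected payload shapes.
    entries = payload.get("data", []) if isinstance(payload, dict) else []
    model_ids = {e.get("id") for e in entries if isinstance(e, dict)}
    if model_name not in model_ids:
        return False, f"Connected, but model '{model_name}' is not listed"
    return True, "Endpoint reachable and model available"
```

Returning a `(ok, message)` pair rather than raising keeps the caller simple: a UI can show the message next to a green or red indicator without its own error handling.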

Demo UI


✨ Features

| Area | What it does |
| --- | --- |
| LLM Connection | Enter model_name, api_key, base_url & press Health-Check to verify connectivity. |
| Generation Mode | Choose between single/multi-turn SFT or Alignment pipelines. |
| Threading | Adjustable worker slider (1-16) controls concurrent requests. |
| File Validation | Early checks for broken JSONL, malformed CSV, or wrong extensions with descriptive errors. |
| Live Feedback | Real-time tqdm progress + log stream in the main pane. |
| Output | Final JSONL is offered for download; CSV deliberately omitted to keep training format consistent. |
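To illustrate the File Validation row, here is a minimal sketch of what early checks for wrong extensions, broken JSONL, and malformed CSV might involve. The function name `validate_upload`, the error wording, and the rule that every CSV row must match the header width are assumptions made for this example; the toolkit's real checks may differ.

```python
import csv
import io
import json
from pathlib import Path

ALLOWED_EXTENSIONS = {".jsonl", ".csv"}


def validate_upload(filename, text):
    """Return a list of error strings; an empty list means no problems found."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return [f"Unsupported extension '{suffix}': expected .jsonl or .csv"]
    if not text.strip():
        return ["File is empty"]

    errors = []
    if suffix == ".jsonl":
        # Every non-blank line must parse as a JSON object.
        for line_no, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"Line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                errors.append(f"Line {line_no}: expected a JSON object")
    else:
        # Assumed rule: every CSV row has as many fields as the header.
        reader = csv.reader(io.StringIO(text))
        header_width = None
        for row in reader:
            if not row:
                continue
            if header_width is None:
                header_width = len(row)
            elif len(row) != header_width:
                errors.append(
                    f"Line {reader.line_num}: expected {header_width} "
                    f"columns, found {len(row)}"
                )
    return errors
```

Collecting all errors instead of stopping at the first one lets the UI report every problem in a single pass, which matters when the alternative is discovering them one by one after launching a long pipeline run.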

🛠 Quick Start

# 1. Clone & enter the repo
git clone https://github.com/namantiwari2002/DataAugmenToolkit.git
cd DataAugmenToolkit

# 2. Create env & install deps
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# 3. Run the Streamlit app
streamlit run app.py
