Wake Word Dataset Generation and Augmentation Scripts 🎙️

This repository contains Python scripts and accompanying shell scripts for creating, augmenting, and normalizing datasets for Wake Word (WW) detection models.

Scripts Overview

File Name	Type	Description
`record_dataset.py`	Python Script	An interactive tool for manually recording a custom wake word dataset. It uses the Silero VAD (Voice Activity Detection) model to detect voice activity, helping the user record positive wake word samples, negative phrases, and ambient background noise.
`ovos_ww_synth.py`	Python Script	The core script for synthetic data generation. It generates wake word audio samples using multiple Text-to-Speech (TTS) engines (e.g., Edge, Google, Piper) and optionally incorporates voice conversion (VC) to simulate multiple speakers.
`ovos_ww_synth.sh`	Shell Script	A driver script that simplifies the execution of `ovos_ww_synth.py`. It is designed to handle the synthesis process for one or more wake words in a specified language, including logging and concurrent job management.
`vc_ww_synth.sh`	Shell Script	A specific example script utilizing the `chatterbox_bulk_tts` tool to perform Voice Converted (VC) TTS synthesis of a wake word, leveraging a dataset of voice references.
`augment_voices.sh`	Shell Script	A utility script that uses the `chatterbox_bulk_vc` tool to revoice an existing dataset. This is used to augment a synthetic or recorded dataset by converting the audio to sound like new, random speakers.
`augment.py`	Python Script	A dedicated script for acoustic dataset augmentation. It applies real-world audio transformations (noise mixing, reverb, pitch shifting, and speed perturbation) to preprocessed audio to increase model robustness.
`adversarial_samples.py`	Python Script	A tool for generating adversarial text samples (hard negatives) that are phonetically similar to the target wake word. It combines grapheme augmentation (single-grapheme edits) with, potentially, a Large Language Model (LLM) to create confusable words intended for later TTS synthesis.
`gen_adversarial_words.sh`	Shell Script	An execution script that calls `adversarial_samples.py` to generate a list of adversarial words for a specific wake word and saves the output to a text file.
`normalize_txt.sh`	Shell Script	A utility script for post-processing text files. It converts all text to lowercase, then sorts and deduplicates the lines in place, commonly used for cleaning up word lists like those generated by `gen_adversarial_words.sh`.
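
The text side of the pipeline described above (single-grapheme edits followed by lowercasing, sorting, and deduplication) can be sketched roughly as follows. This is an illustration only, not the code in `adversarial_samples.py` or `normalize_txt.sh`: the function names `single_grapheme_edits` and `normalize_lines` are hypothetical, each character is treated as one grapheme, the alphabet is assumed to be lowercase ASCII, and skipping blank lines is an added assumption.

```python
import string


def single_grapheme_edits(word, alphabet=string.ascii_lowercase):
    """Return every string one single-character edit away from `word`.

    Simplification: one character is treated as one grapheme, and the
    alphabet is lowercase ASCII. The real script may differ on both points.
    """
    word = word.lower()
    edits = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        # Insertion: add one character at position i.
        for ch in alphabet:
            edits.add(head + ch + tail)
        if tail:
            # Deletion: drop the character at position i.
            edits.add(head + tail[1:])
            # Substitution: replace the character at position i.
            for ch in alphabet:
                edits.add(head + ch + tail[1:])
    edits.discard(word)  # the wake word itself is not a negative sample
    edits.discard("")    # an empty string cannot be synthesized
    return edits


def normalize_lines(lines):
    """Lowercase, deduplicate, and sort lines.

    Blank lines are skipped, which is an assumption of this sketch.
    """
    return sorted({line.strip().lower() for line in lines if line.strip()})


if __name__ == "__main__":
    # "jarvis" is only an example wake word, not one taken from the repository.
    candidates = normalize_lines(single_grapheme_edits("jarvis"))
    print(len(candidates), candidates[:5])
```

Edits of this kind produce many non-words, so a filtering step (for example the LLM mentioned above, or a pronunciation dictionary) would likely be needed before sending the list to TTS.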

Credits

This work was made possible by a generous grant from the NGI0 Commons Fund.

This project was funded through the NGI0 Commons Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101135429. Additional funding is made available by the Swiss State Secretariat for Education, Research and Innovation (SERI).

TigreGotico/synthetic_dataset_generator
