SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

arXiv: https://arxiv.org/abs/2509.18603

🟣 SynSonic is a framework that uses a text-to-audio (T2A) ControlNet to generate synthetic strongly labeled audio data, improving the performance of sound event detection (SED) models.

Pipeline

1. Generate single-event audio clips using T2A ControlNet

  • Install EzAudio-ControlNet
  • Prepare trimmed reference audio samples (following the soundbank used in DCASE SED)
  • Generate audio variants using EzAudio-ControlNet:
from api.ezaudio import EzAudio_ControlNet
import torch
import soundfile as sf

# Load the energy-conditioned ControlNet model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
controlnet = EzAudio_ControlNet(model_name='energy', device=device)

prompt = 'dog barking'
audio_path = 'real_dog_barking.wav'  # Path to reference audio

# Generate a variant guided by the reference clip and the text prompt
sr, audio = controlnet.generate_audio(prompt, audio_path=audio_path)
sf.write(f"gen_{prompt}.wav", audio, samplerate=sr)
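
To generate at scale, the same call can be looped over the soundbank. A minimal sketch, reusing controlnet and sf from the snippet above and assuming one folder of trimmed clips per event class (the paths, label-to-prompt mapping, and variant count are placeholders):

import os
import glob

soundbank_root = 'soundbank/foreground'  # assumption: one subfolder per event class
output_root = 'generated'
variants_per_clip = 4                    # assumption: variants generated per reference clip

os.makedirs(output_root, exist_ok=True)
for class_dir in sorted(glob.glob(os.path.join(soundbank_root, '*'))):
    label = os.path.basename(class_dir)          # e.g. 'Dog_bark'
    prompt = label.replace('_', ' ').lower()     # simple label-to-prompt mapping (assumption)
    for ref_path in sorted(glob.glob(os.path.join(class_dir, '*.wav'))):
        for i in range(variants_per_clip):
            sr, audio = controlnet.generate_audio(prompt, audio_path=ref_path)
            stem = os.path.splitext(os.path.basename(ref_path))[0]
            sf.write(os.path.join(output_root, f'{label}_{stem}_var{i}.wav'),
                     audio, samplerate=sr)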

2. Filter the generated audio clips

  • Set up AudioSet-finetuned Dasheng and LAION-CLAP
  • Use Dasheng to compute per-class logits for each generated clip
  • Use CLAP to compute text–audio similarity between the prompt and each clip
  • Rank samples separately by logits and by similarity
  • Re-rank samples using a weighted score:
    • score = w1 * r1 + w2 * r2, where r1 is the rank from the logits and r2 is the rank from the similarity
  • Select the top k% of samples based on the final score (see the sketch after this list)
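
Once the Dasheng logits and CLAP similarities are computed, the rank fusion and top-k% selection reduce to a few lines of NumPy. A minimal sketch; the weights, cutoff, and example scores below are placeholders:

import numpy as np

def filter_by_weighted_rank(logit_scores, clap_sims, w1=0.5, w2=0.5, keep_pct=50.0):
    """Re-rank clips by a weighted sum of two rank lists and keep the top k%.

    logit_scores: per-clip target-class logits from Dasheng (higher = better)
    clap_sims:    per-clip text-audio similarities from CLAP (higher = better)
    """
    logit_scores = np.asarray(logit_scores)
    clap_sims = np.asarray(clap_sims)
    # Rank 0 = best clip under each criterion.
    r1 = np.argsort(np.argsort(-logit_scores))
    r2 = np.argsort(np.argsort(-clap_sims))
    score = w1 * r1 + w2 * r2            # lower combined rank = better
    k = int(len(score) * keep_pct / 100)
    return np.argsort(score)[:k]         # indices of the retained clips

# Example: keep the top 50% of 6 candidate clips.
keep_idx = filter_by_weighted_rank([2.1, 0.3, 1.7, 0.9, 2.5, 1.2],
                                   [0.45, 0.10, 0.38, 0.22, 0.51, 0.30])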

3. Synthesize strongly labeled audio mixtures
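
The mixing code is not shown here, but DCASE-style strongly labeled soundscapes are commonly synthesized with Scaper, which places single-event clips over backgrounds and records each event's onset, offset, and label. A minimal sketch under that assumption; the folder paths and parameter ranges are placeholders:

import scaper

fg_path = 'generated'             # filtered single-event clips (assumption)
bg_path = 'soundbank/background'  # background recordings (assumption)

sc = scaper.Scaper(duration=10.0, fg_path=fg_path, bg_path=bg_path)
sc.ref_db = -50

# One background plus one foreground event with randomized onset and SNR.
sc.add_background(label=('choose', []),
                  source_file=('choose', []),
                  source_time=('const', 0))
sc.add_event(label=('choose', []),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),
             event_duration=('uniform', 0.5, 2.0),
             snr=('uniform', 6, 30),
             pitch_shift=None,
             time_stretch=None)

# The JAMS/txt annotations record onset/offset/label, i.e. the strong labels.
sc.generate(audio_path='mix_0.wav', jams_path='mix_0.jams', txt_path='mix_0.txt')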

4. Train and evaluate models using FDY-SED

Reference

If you find the code useful for your research, please consider citing:

@article{hai2025synsonic,
  title={SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering},
  author={Hai, Jiarui and Elhilali, Mounya},
  journal={arXiv preprint arXiv:2509.18603},
  year={2025}
}
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
