- [Update Aug. 27, 2025] We present a new variant of MeanAudio: MeanAudio-L-Full, a 480M latent flow transformer achieving SOTA performance on both single-step and multi-step audio generation. Try it out at our 🤗 Hugging Face Space!
- [Update Aug. 17, 2025] We present MeanAudio-S-Full: a 120M latent flow transformer trained with the MeanFlow objective on ~10,000 hours of audio data sourced from AudioCaps, AudioSet, WavCaps, VGGSound, MusicCaps, and LP-MusicCaps.

MeanAudio is a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. It can synthesize realistic sound in a single step, achieving a real-time factor (RTF) of 0.013 on a single NVIDIA 3090 GPU, that is, roughly 13 ms of compute per second of generated audio. It also demonstrates strong performance in multi-step generation.
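To make the single-step versus multi-step distinction concrete, here is a minimal sketch of sampling with an average-velocity field, the quantity a MeanFlow model is trained to predict. It is not the repository's sampler: `toy_mean_velocity` is a hypothetical stand-in for the network, the latent is a plain list of floats, and the convention that noise sits at t = 1 and data at t = 0 is an assumption taken from the MeanFlow formulation.

```python
from typing import Callable, List

Latent = List[float]
# u(z_t, r, t): predicted average velocity over the interval [r, t].
MeanVelocity = Callable[[Latent, float, float], Latent]


def sample(u: MeanVelocity, noise: Latent, num_steps: int = 1) -> Latent:
    """Move from t=1 (noise) to t=0 (data) in `num_steps` equal jumps.

    Each jump applies z_r = z_t - (t - r) * u(z_t, r, t).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    z = list(noise)
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        r = 1.0 - (i + 1) / num_steps
        velocity = u(z, r, t)
        z = [z_k - (t - r) * v_k for z_k, v_k in zip(z, velocity)]
    return z


# Toy setup (hypothetical values): a straight path z_t = (1 - t) * x + t * noise
# has constant velocity (noise - x), so its average velocity over any
# interval is that same constant.
TARGET = [0.5, -1.0, 2.0]
NOISE = [1.5, 0.25, -0.75]


def toy_mean_velocity(z: Latent, r: float, t: float) -> Latent:
    return [n - x for n, x in zip(NOISE, TARGET)]


one_step = sample(toy_mean_velocity, NOISE, num_steps=1)
four_steps = sample(toy_mean_velocity, NOISE, num_steps=4)
```

With this exact toy field, one jump and four jumps both land on `TARGET`. A learned network only approximates the average velocity, so the number of steps can change the result.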
1. Create a new conda environment:
conda create -n meanaudio python=3.11 -y
conda activate meanaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
2. Install with pip:
git clone https://github.com/xiquan-li/MeanAudio.git
cd MeanAudio
pip install -e .
To generate audio with our pre-trained model, simply run:
python demo.py --prompt 'your prompt' --num_steps 1
This automatically downloads the pre-trained checkpoints from Hugging Face and generates audio from your prompt.
By default, it uses meanaudio-s-full.
The output audio is written to MeanAudio/output/, and the checkpoints are stored in MeanAudio/weights/.
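If you want to try several prompts in a row, the command above can be wrapped in a small helper. The following is a sketch only: it uses just the two flags shown above (`--prompt` and `--num_steps`), the helper name `build_demo_command` and the example prompts are hypothetical, and it prints the commands rather than running them.

```python
import shlex
import sys
from typing import List


def build_demo_command(prompt: str, num_steps: int = 1) -> List[str]:
    """Return the argv for one demo.py run, using only the two documented flags."""
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    return [sys.executable, "demo.py", "--prompt", prompt, "--num_steps", str(num_steps)]


# Hypothetical example prompts.
prompts = ["a dog barking in the distance", "rain falling on a tin roof"]
commands = [build_demo_command(p, num_steps=1) for p in prompts]

for cmd in commands:
    # Print a copy-pasteable shell line. To actually generate audio, call
    # subprocess.run(cmd, check=True) from inside the MeanAudio directory.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when a prompt contains apostrophes or other special characters.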
Alternatively, you can manually download the pre-trained models from this folder and put them into MeanAudio/weights/. You can then use scripts/meanflow/infer_meanflow.sh and scripts/flowmatching/infer_flowmatching.sh to generate audio with the pre-trained models.
| Model Name | Size | Dataset | Objective | Pre-trained | Link |
|---|---|---|---|---|---|
| MeanAudio-S-AC | 120M | AudioCaps | Mean Flow | FluxAudio-S-Full | Here |
| FluxAudio-S-Full | 120M | All | Flow Matching | - | Here |
| MeanAudio-S-Full | 120M | All | Mean Flow | - | Here |
| MeanAudio-L-Full | 480M | All | Mean Flow | - | Here |
Before training, make sure that all files from here are placed in MeanAudio/weights.
We first extract VAE latents and text encoder embeddings to enable fast and efficient training; scripts/extract_audio_latents.sh provides a detailed guide. The pipeline has two steps: (a) partition the audio into 10 s clips; (b) extract latents and embeddings into npz files.
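As a rough illustration of step (a), the sketch below computes the sample ranges for consecutive 10 s clips. It is not the logic of scripts/extract_audio_latents.sh: the function name, the 16 kHz sample rate in the example, and the default choice to drop a trailing clip shorter than 10 s are all assumptions made for illustration.

```python
from typing import List, Tuple


def clip_boundaries(
    num_samples: int,
    sample_rate: int,
    clip_seconds: float = 10.0,
    keep_remainder: bool = False,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for consecutive fixed-length clips."""
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    # Full-length clips only: stop once fewer than clip_len samples remain.
    bounds = [
        (start, start + clip_len)
        for start in range(0, num_samples - clip_len + 1, clip_len)
    ]
    remainder_start = len(bounds) * clip_len
    if keep_remainder and remainder_start < num_samples:
        # Assumption: optionally keep the final, shorter clip as-is.
        bounds.append((remainder_start, num_samples))
    return bounds


# 25 s of audio at a hypothetical 16 kHz sample rate.
bounds = clip_boundaries(num_samples=25 * 16000, sample_rate=16000)
```

Each `(start, end)` pair can then be used to slice a waveform array before latents and embeddings are extracted for that clip.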
To avoid the laborious data pre-processing step, we have uploaded an extracted version of AudioCaps. Feel free to download it from this link, unzip it and put it under MeanAudio/data/. Then you can directly jump to the second step. 😊
However, if you want to train the model on datasets other than AudioCaps, you still need to run scripts/extract_audio_latents.sh for feature extraction.
Remember to adjust config/data/t5_clap.yaml for correct metadata paths.
We rely on av-benchmark for validation & evaluation. Please install it before training.
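Before launching a training run, it can help to confirm that the files mentioned above are in place. The check below is a hypothetical convenience script, not part of the repository; it only tests that the three paths named in this section (weights, data, and config/data/t5_clap.yaml, relative to the MeanAudio directory) exist and that the directories are not empty.

```python
from pathlib import Path
from typing import List, Sequence

# Paths named in this README, relative to the MeanAudio repository root.
EXPECTED = ["weights", "data", "config/data/t5_clap.yaml"]


def missing_paths(repo_root: str, expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return the expected paths that are absent or are empty directories."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.exists():
            missing.append(rel)
        elif path.is_dir() and not any(path.iterdir()):
            # An empty directory means nothing was downloaded or unzipped yet.
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All expected paths are present.")
```

This does not check that the metadata paths inside config/data/t5_clap.yaml are correct; those still need to be reviewed by hand.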
Use the script below to train a MeanAudio model. By default, this will initialize the flow transformer from the pretrained ckpt fluxaudio_fm.pth and do MeanFlow fine-tuning.
bash scripts/meanflow/train_meanflow.sh
Use the script below to train a Flux-style transformer using the conditional flow matching objective:
bash scripts/flowmatching/train_flowmatching.sh
The obtained model can serve as a strong initialization for the mixed-flow fine-tuning.
Use the script below to run evaluation. Before doing so, please install av-benchmark for metrics calculation. You can specify num_steps and ckpt_path to evaluate different models with different numbers of sampling steps.
bash scripts/meanflow/eval_meanflow.sh
@article{li2025meanaudio,
title={MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows},
author={Li, Xiquan and Liu, Junxi and Liang, Yuzhe and Niu, Zhikang and Chen, Wenxi and Chen, Xie},
journal={arXiv preprint arXiv:2508.06098},
year={2025}
}
Many thanks to:
- MMAudio for the MMDiT code and training & inference structure
- MeanFlow-pytorch and MeanFlow-official for the mean flow implementation
- Make-An-Audio 2 for the BigVGAN vocoder and the VAE
- av-benchmark for benchmarking results
