MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

News 🔥

[Update Aug. 27, 2025] We present a new variant of MeanAudio: MeanAudio-L-Full, a 480M latent flow transformer achieving SOTA performance on both single-step and multi-step audio generation. Try it out at our 🤗 Hugging Face space!
[Update Aug. 17, 2025] We present MeanAudio-S-Full: a 120M latent flow transformer trained with the MeanFlow objective on ~10,000 hours of audio data sourced from AudioCaps, AudioSet, WavCaps, VGGSound, MusicCaps, and LP-MusicCaps.

Overview

MeanAudio is a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. It can synthesize realistic sound in a single step, achieving a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090 GPU. It also performs strongly in multi-step generation.
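To make the RTF figure concrete: RTF is conventionally the time spent generating divided by the duration of the generated audio, so values below 1 are faster than real time. The sketch below applies that definition to the reported 0.013; the 10-second clip length is an assumption chosen for illustration, not a benchmark setting taken from this README.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical example: a 10-second clip at the reported RTF of 0.013.
clip_seconds = 10.0
reported_rtf = 0.013
generation_seconds = reported_rtf * clip_seconds  # roughly 0.13 s of compute
speedup = 1.0 / reported_rtf                      # roughly 77x faster than real time
```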

Environment Setup

1. Create a new conda environment:

conda create -n meanaudio python=3.11 -y
conda activate meanaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade

2. Clone the repository and install with pip:

git clone https://github.com/xiquan-li/MeanAudio.git

cd MeanAudio
pip install -e .
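After the two steps above, a quick sanity check can confirm that the PyTorch packages are visible to the new environment. This is a minimal sketch and not part of the repository; `check_environment` is a hypothetical helper name, and the CUDA query runs only when `torch` is actually installed.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
    # CUDA availability can only be queried when torch is installed.
    if importlib.util.find_spec("torch") is not None:
        import torch
        print("cuda available:", torch.cuda.is_available())
```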

Quick Start

To generate audio with our pre-trained model, simply run:

python demo.py --prompt 'your prompt' --num_steps 1

This automatically downloads the pre-trained checkpoints from Hugging Face and generates audio according to your prompt. By default, it uses meanaudio-s-full. The output audio is saved to MeanAudio/output/, and the checkpoints to MeanAudio/weights/.

Alternatively, you can manually download the pre-trained models from this folder and put them into MeanAudio/weights/. You can then use scripts/meanflow/infer_meanflow.sh and scripts/flowmatching/infer_flowmatching.sh to generate audio with the pre-trained models.
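To generate audio for several prompts, the demo.py command shown earlier can be wrapped in a small script. The sketch below uses only the two flags that appear in this README (`--prompt` and `--num_steps`); the helper names, the example prompts, and the assumption that it runs from the directory containing the cloned `MeanAudio` folder are illustrative.

```python
import subprocess
import sys


def build_demo_command(prompt: str, num_steps: int = 1) -> list[str]:
    """Build the demo.py invocation using only the flags shown in the README."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [sys.executable, "demo.py", "--prompt", prompt, "--num_steps", str(num_steps)]


def generate_all(prompts, num_steps=1, repo_dir="MeanAudio"):
    """Run demo.py once per prompt from inside the repository directory."""
    for prompt in prompts:
        subprocess.run(build_demo_command(prompt, num_steps), cwd=repo_dir, check=True)


# Example with hypothetical prompts; uncomment inside a working install:
# generate_all(["rain falling on a tin roof", "a dog barking in the distance"])
```

Each call starts a fresh process, so the model is reloaded for every prompt; for large batches the inference scripts mentioned above are likely the better fit.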

Variants

Model Name	Size	Dataset	Objective	Pre-trained	Link
MeanAudio-S-AC	120M	AudioCaps	Mean Flow	FluxAudio-S-Full	Here
FluxAudio-S-Full	120M	All $^*$	Flow Matching	-	Here
MeanAudio-S-Full	120M	All $^*$	Mean Flow	-	Here
MeanAudio-L-Full	480M	All $^*$	Mean Flow	-	Here

$^*$: All denotes AudioCaps + WavCaps + AudioSet + VGGSound + LP-MusicCaps-MC + LP-MusicCaps-MTT, forming approximately 3M audio-text pairs (about 10,000 hours of audio).
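As a rough consistency check on those two figures, the average clip length they imply can be worked out directly. Both numbers are approximate in the source, so the result is only indicative.

```python
hours = 10_000     # approximate total audio duration from the note above
pairs = 3_000_000  # approximate number of audio-text pairs

avg_seconds_per_pair = hours * 3600 / pairs
print(avg_seconds_per_pair)  # 12.0
```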

Training

Before training, make sure that all files from here are placed in MeanAudio/weights.

1. Latent & Text Feature Extraction:

We first extract VAE latents and text encoder embeddings to enable fast and efficient training; scripts/extract_audio_latents.sh provides a detailed guide. The pipeline has two steps: (a) partition the audio into 10 s clips; (b) extract latents and embeddings into npz files.
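The partitioning step can be pictured as computing fixed-length sample ranges over each recording. The function below is a minimal sketch under stated assumptions, not the logic of scripts/extract_audio_latents.sh: the function name is hypothetical, and whether the repository drops or keeps the trailing partial clip is not stated here, so both options are exposed.

```python
def clip_boundaries(num_samples: int, sample_rate: int, clip_seconds: float = 10.0,
                    keep_remainder: bool = False) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length clips."""
    clip_len = int(round(clip_seconds * sample_rate))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    bounds = [(s, s + clip_len) for s in range(0, num_samples - clip_len + 1, clip_len)]
    # Handling of the trailing partial clip is an assumption of this sketch.
    if keep_remainder and num_samples % clip_len:
        start = (num_samples // clip_len) * clip_len
        bounds.append((start, num_samples))
    return bounds


# A 25 s recording at a hypothetical 16 kHz sample rate yields two full 10 s clips.
print(clip_boundaries(25 * 16000, 16000))
```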

To avoid the laborious data pre-processing step, we have uploaded an extracted version of AudioCaps. Feel free to download it from this link, unzip it and put it under MeanAudio/data/. Then you can directly jump to the second step. 😊

However, if you want to train the model on other datasets besides AudioCaps, you should still run scripts/extract_audio_latents.sh to do feature extraction. Remember to adjust config/data/t5_clap.yaml for correct metadata paths.
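When preparing a new dataset, it can help to check which arrays an extracted npz file actually holds. An `.npz` file is a zip archive with one `.npy` member per array, so the names can be listed with the standard library alone. The helper name below is hypothetical, and no particular array names are assumed for MeanAudio's files.

```python
import zipfile


def list_npz_arrays(path: str) -> list[str]:
    """List array names stored in an .npz file (a zip archive of .npy members)."""
    with zipfile.ZipFile(path) as archive:
        return [name[:-4] for name in archive.namelist() if name.endswith(".npy")]
```

With NumPy installed, `numpy.load(path).files` returns the same names and also gives access to the arrays themselves.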

2. Install Validation Packages:

We rely on av-benchmark for validation and evaluation. Please install it before training.

3. Train with MeanFlow objective:

Use the script below to train a MeanAudio model. By default, this initializes the flow transformer from the pretrained checkpoint fluxaudio_fm.pth and performs MeanFlow fine-tuning.

bash scripts/meanflow/train_meanflow.sh
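For readers new to the objective: in the MeanFlow formulation, the network predicts an average velocity $u$ over a time interval $[r, t]$ rather than the instantaneous velocity $v$ used in flow matching. The equations below follow the notation of the MeanFlow paper as a reference sketch; MeanAudio's exact conventions and loss weighting may differ.

```latex
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
```

The second relation (the MeanFlow identity) provides the regression target during training. Because $u$ already summarizes the whole interval, a single update $z_0 = z_1 - u(z_1, 0, 1)$ maps noise to data, which is what makes one-step generation possible.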

4. (Optional) Pre-training with Standard Flow Matching:

Use the script below to train a Flux-style transformer using the conditional flow matching objective:

bash scripts/flowmatching/train_flowmatching.sh

The obtained model can serve as a strong initialization for the mixed-flow fine-tuning.
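As a reminder of what the conditional flow matching objective regresses, the sketch below builds one training pair on plain Python lists. It assumes the common linear path between a data latent `x` and Gaussian noise `eps` with velocity target `eps - x`; the path, time sampling, and conditioning actually used by train_flowmatching.sh are not specified in this README and may differ.

```python
import random


def cfm_training_pair(x, eps, t):
    """Linear interpolant z_t = (1 - t) * x + t * eps and its velocity target eps - x."""
    z_t = [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]
    target = [ei - xi for xi, ei in zip(x, eps)]
    return z_t, target


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


x = [0.5, -1.0]                             # stand-in for a data latent
eps = [random.gauss(0.0, 1.0) for _ in x]   # Gaussian noise sample
t = random.random()                         # time drawn uniformly from [0, 1)
z_t, target = cfm_training_pair(x, eps, t)
loss = cfm_loss([0.0, 0.0], target)         # loss of a model that always predicts zero
```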

Evaluation

Use the script below to run evaluation. Before doing so, install av-benchmark for metric calculation. You can specify num_steps and ckpt_path to evaluate different models with different numbers of sampling steps.

bash scripts/meanflow/eval_meanflow.sh
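To illustrate what the number of sampling steps controls, the sketch below runs an average-velocity sampler on a one-dimensional toy problem. It assumes the MeanFlow convention that t=1 is noise and t=0 is data; `toy_avg_velocity` is a hand-written field with a known answer, not a trained MeanAudio model, and the repository's sampler may be organized differently.

```python
def sample(avg_velocity, z, num_steps=1):
    """Integrate from t=1 (noise) to t=0 (data) using an average-velocity field."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0, ..., 0.0
    for t, r in zip(times[:-1], times[1:]):
        # Jump from time t to the earlier time r in a single update.
        z = z - (t - r) * avg_velocity(z, r, t)
    return z


# Toy field whose trajectories are straight lines ending at TARGET.
TARGET = 2.0


def toy_avg_velocity(z, r, t):
    return (z - TARGET) / t


one_step = sample(toy_avg_velocity, -0.7, num_steps=1)
four_steps = sample(toy_avg_velocity, -0.7, num_steps=4)
```

With an exact average-velocity field, one step and several steps reach the same endpoint; with a learned model the field is only approximate, which is why results can vary with the number of steps.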

Citation

@article{li2025meanaudio,
  title={MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows},
  author={Li, Xiquan and Liu, Junxi and Liang, Yuzhe and Niu, Zhikang and Chen, Wenxi and Chen, Xie},
  journal={arXiv preprint arXiv:2508.06098},
  year={2025}
}

Acknowledgement

Many thanks to:

MMAudio for the MMDiT code and training & inference structure
MeanFlow-pytorch and MeanFlow-official for the mean flow implementation
Make-An-Audio 2 for the BigVGAN vocoder and the VAE
av-benchmark for benchmarking results
