The general purpose of this repository is to support real-time generation with open-source TTS (text-to-speech) models across common device architectures using the GGML tensor library. Rapid STT (speech-to-text), embedding generation, and LLM generation are already well supported on GGML (via whisper.cpp and llama.cpp respectively). As such, this repo seeks to complement those functionalities with a similarly optimized and portable TTS library.
In this endeavor, macOS and Metal support will be treated as the primary platform; functionality will initially be developed for macOS and later extended to other operating systems.
Warning! TTS.cpp should currently be treated as a proof of concept and is subject to further development. Existing functionality has not been tested outside of a macOS environment.
| Models | CPU | Metal Acceleration | Quantization | GGUF files |
|---|---|---|---|---|
| Parler TTS Mini | ✓ | ✓ | ✓ | here |
| Parler TTS Large | ✓ | ✓ | ✓ | here |
| Kokoro | ✓ | ✗ | ✓ | here |
| Dia | ✓ | ✓ | ✓ | here |
| Orpheus | ✓ | ✗ | ✗ | here |
Additional model support will be added based primarily on open-source model performance in both the old TTS model arena and the new TTS model arena, as well as the availability of those models' architectures and checkpoints.
| Planned Functionality | OS X | Linux | Windows |
|---|---|---|---|
| Basic CPU Generation | ✓ | ✓ | ✗ |
| Metal Acceleration | ✓ | _ | _ |
| CUDA support | _ | ✗ | ✗ |
| Quantization | ✓* | ✗ | ✗ |
| Layer Offloading | ✗ | ✗ | ✗ |
| Server Support | ✓ | ✓ | ✗ |
| Vulkan Support | _ | ✗ | ✗ |
| Kompute Support | _ | ✗ | ✗ |
| Streaming Audio | ✗ | ✗ | ✗ |
* Currently only the generative model supports quantization.
WARNING! This library is currently only supported on OS X.
- Local GGUF format model file (see py-gguf for information on how to convert the Hugging Face models to GGUF)
- C++17 and C17
  - XCode Command Line Tools (via `xcode-select --install`) should suffice for OS X
- CMake (>=3.14)
- GGML pulled locally (see the example below)
  - this can be accomplished via `git clone -b support-for-tts [email protected]:mmwillet/ggml.git`
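As a quick sanity check, the commands below (a sketch assuming a macOS shell; the clone destination is left up to you) verify the toolchain requirements and fetch the patched GGML branch:

```bash
# Confirm the Command Line Tools and a sufficiently new CMake are available
xcode-select -p   # prints the Command Line Tools path if they are installed
cmake --version   # should report 3.14 or newer

# Pull the patched GGML branch required by TTS.cpp
git clone -b support-for-tts [email protected]:mmwillet/ggml.git
```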
The local GGML library includes several required patches to the main branch of GGML (making the current TTS ggml branch out of date with modern GGML). Specifically, these patches include major modifications to the convolutional transposition operation as well as several new GGML operations implemented for TTS-specific purposes, including `ggml_reciprocal`, `ggml_round`, `ggml_mod`, `ggml_cumsum`, and STFT and iSTFT operations.
We are currently working on upstreaming some of these operations in order to deprecate this patch requirement going forward.
Assuming the above requirements are met, the library and the basic CLI example can be built by running the following commands in the repository's base directory:
```bash
cmake -B build
cmake --build build --config Release
```
The CLI executable and other executables will be in the `./build` directory (e.g. `./build/cli`), and the compiled library will be in `./build/src` (currently it is named parler, after the first supported model).
If you wish to install TTS.cpp with espeak-ng phonemization support, first install espeak-ng. Depending on your installation method, the path of the installed library will vary. Upon identifying the installation path of espeak-ng (it should contain `./lib`, `./bin`, `./include`, and `./share` directories), you can compile TTS.cpp with espeak phonemization support by running the following in the repository's base directory:
```bash
export ESPEAK_INSTALL_DIR=/absolute/path/to/espeak-ng/dir
cmake -B build
cmake --build build --config Release
```
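If espeak-ng was installed with Homebrew on macOS (an assumption; adjust the lookup to your package manager), the install prefix can be queried directly rather than located by hand:

```bash
# Assumes a Homebrew installation of the espeak-ng formula
export ESPEAK_INSTALL_DIR="$(brew --prefix espeak-ng)"
cmake -B build
cmake --build build --config Release
```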
On Linux, you don't need to manually download or `export` anything. Our build system will automatically detect the development packages installed on your machine:
```bash
# Change `apt` and the package names to match your distro
sudo apt install build-essential cmake                        # Minimum requirements
sudo apt install git libespeak-ng-dev libsdl2-dev pkg-config  # Optional requirements
cmake -B build
cmake --build build --config Release
```
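To confirm that the optional development packages are discoverable before configuring, you can query pkg-config (assuming your distro ships the usual `espeak-ng` and `sdl2` pkg-config files):

```bash
# Each command prints compile and link flags if the package is visible to pkg-config
pkg-config --cflags --libs espeak-ng
pkg-config --cflags --libs sdl2
```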
See the CLI example readme for more details on its general usage.
See the quantization CLI readme for more details on its general usage and behavior. Please note that quantization and lower-precision conversion are currently only supported for Parler TTS models.
Given that the central goal of this library is to support real time speech generation on OS X, generation speed has only been rigorously tested in that environment with supported models (i.e. Parler Mini version 1.0).
With the introduction of Metal acceleration support for the DAC audio decoder model, text-to-speech generation is nearly possible in real time on a standard Apple M1 Max with ~3GB of memory overhead. The best real-time factor for accelerated models is currently 1.112033; that is, for every second of generated audio, the accelerated models require approximately 1.112033 seconds of generation time (with Q5_0 quantization applied to the generative model). For the latest stats from the performance battery, see the readme therein.
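For example, at that real-time factor a 30-second clip would take roughly 30 × 1.112 ≈ 33.4 seconds to generate under the same conditions.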