TTS.cpp

Roadmap / Modified GGML

Purpose and Goals

The general purpose of this repository is to support real time generation with open source TTS (text to speech) models across common device architectures using the GGML tensor library. Rapid STT (speech to text), embedding generation, and LLM generation are already well supported on GGML (via whisper.cpp and llama.cpp respectively). As such, this repo seeks to complement those capabilities with a similarly optimized and portable TTS library.

In this endeavor, macOS and Metal support will be treated as the primary platform; functionality will initially be developed for macOS and later extended to other operating systems.

Supported Functionality

Warning! Currently TTS.cpp should be treated as a proof of concept and is subject to further development. Existing functionality has not been tested outside of a macOS environment.

Model Support

| Models           | CPU | Metal Acceleration | Quantization | GGUF files |
|------------------|-----|--------------------|--------------|------------|
| Parler TTS Mini  |     |                    |              | here       |
| Parler TTS Large |     |                    |              | here       |
| Kokoro           |     |                    |              | here       |
| Dia              |     |                    |              | here       |
| Orpheus          |     |                    |              | here       |

Additional model support will initially be added based on open source model performance in both the old TTS model arena and the new TTS model arena, as well as the availability of those models' architectures and checkpoints.

Functionality

| Planned Functionality | macOS | Linux | Windows |
|-----------------------|-------|-------|---------|
| Basic CPU Generation  |       |       |         |
| Metal Acceleration    |       | _     | _       |
| CUDA support          | _     |       |         |
| Quantization          | *     |       |         |
| Layer Offloading      |       |       |         |
| Server Support        |       |       |         |
| Vulkan Support        | _     |       |         |
| Kompute Support       | _     |       |         |
| Streaming Audio       |       |       |         |

* Currently only the generative model supports these.

Installation

WARNING! This library is currently only supported on macOS.

Requirements:

  • A local GGUF format model file (see py-gguf for information on how to convert Hugging Face models to GGUF).
  • C++17 and C17
    • Xcode Command Line Tools (via xcode-select --install) should suffice on macOS
  • CMake (>=3.14)
  • GGML pulled locally (see the sketch after this list)
    • this can be accomplished via git clone -b support-for-tts git@github.com:mmwillet/ggml.git
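
As a rough setup sketch (it assumes the patched GGML checkout only needs to be available locally; consult the build configuration for where the build expects to find it), obtaining the sources might look like:

# clone TTS.cpp itself
git clone https://github.com/mmwillet/TTS.cpp.git
cd TTS.cpp
# clone the TTS-patched GGML branch (the location relative to the repository is an assumption)
git clone -b support-for-tts git@github.com:mmwillet/ggml.git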

GGML Patch

The local GGML library includes several required patches to the main branch of GGML (making the current TTS ggml branch out of date with modern GGML). Specifically, these patches include major modifications to the convolutional transposition operation as well as several new GGML operations implemented for TTS-specific purposes, including ggml_reciprocal, ggml_round, ggml_mod, ggml_cumsum, and STFT and iSTFT operations.

We are currently working on upstreaming some of these operations in order to deprecate this patch requirement going forward.

Build:

Assuming that the above requirements are met, the library and the basic CLI example can be built by running the following commands in the repository's base directory:

cmake -B build                                           
cmake --build build --config Release

The CLI executable and other executables will be in the ./build directory (e.g. ./build/cli), and the compiled library will be in ./build/src (currently it is named parler, as that is the only supported model).

If you wish to install TTS.cpp with espeak-ng phonemization support, first install espeak-ng. Depending on your installation method, the path of the installed library will vary. Upon identifying the installation path to espeak-ng (it should contain ./lib, ./bin, ./include, and ./share directories), you can compile TTS.cpp with espeak phonemization support by running the following in the repository's base directory:

export ESPEAK_INSTALL_DIR=/absolute/path/to/espeak-ng/dir
cmake -B build
cmake --build build --config Release
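
For example, on macOS with a Homebrew-installed espeak-ng (an assumption; your package manager and installation prefix may differ), the prefix can be resolved with brew --prefix:

# Homebrew example: install espeak-ng and point the build at its installation prefix
brew install espeak-ng
export ESPEAK_INSTALL_DIR="$(brew --prefix espeak-ng)"
cmake -B build
cmake --build build --config Release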

On Linux, you don't need to manually download or export anything. Our build system will automatically detect the development packages installed on your machine:

# Change `apt` and the package names to match your distro
sudo apt install build-essential cmake # Minimum requirements
sudo apt install git libespeak-ng-dev libsdl2-dev pkg-config # Optional requirements
cmake -B build
cmake --build build --config Release

Usage

See the CLI example readme for more details on its general usage.
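
As an illustrative sketch only (the flag names below are assumptions; the CLI readme and the executable's own help output are authoritative), a basic generation call might look like:

# hypothetical invocation; actual flag names may differ
./build/cli --model-path /path/to/model.gguf --prompt "Hello from TTS.cpp." --save-path ./output.wav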

Quantization and Lower Precision Models

See the quantization CLI readme for more details on its general usage and behavior. Please note that quantization and lower precision conversion are currently only supported for Parler TTS models.
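
For illustration (the executable name and arguments below are assumptions; the quantization CLI readme documents the actual interface), producing a Q5_0 copy of a Parler TTS model might look like:

# hypothetical invocation; check the quantization CLI readme for the real arguments
./build/quantize --model-path ./parler-tts-mini.gguf --quantized-model-path ./parler-tts-mini-q5_0.gguf --quantized-type Q5_0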

Performance

Given that the central goal of this library is to support real time speech generation on macOS, generation speed has only been rigorously tested in that environment with supported models (i.e. Parler TTS Mini version 1.0).

With the introduction of Metal acceleration support for the DAC audio decoder model, text to speech generation is nearly real time on a standard Apple M1 Max with ~3GB of memory overhead. The best real time factor for accelerated models is currently 1.112033; that is, for every second of generated audio, the accelerated models require approximately 1.11 seconds of generation time (with Q5_0 quantization applied to the generative model). For the latest stats from the performance battery, see the readme therein.
