+++
disableToc = false
title = "🗣 Text to audio (TTS)"
weight = 11
url = "/features/text-to-audio/"
+++
## API Compatibility

The LocalAI TTS API is compatible with the [OpenAI TTS API](https://platform.openai.com/docs/guides/text-to-speech) and the [ElevenLabs](https://api.elevenlabs.io/docs) API.
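Because of the OpenAI compatibility, you can also call the OpenAI-style speech route. A minimal sketch, assuming the `/v1/audio/speech` route is enabled and mirrors OpenAI's request shape (the `voice` value here is a placeholder):

```bash
# Hedged sketch: OpenAI-compatible speech route.
# "voice" is a placeholder; models may ignore or remap it.
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello world",
    "voice": "alloy"
  }' --output speech.wav
```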
## LocalAI API

The `/tts` endpoint can also be used to generate speech from text.

## Usage

Input: `input`, `model`

For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the text to synthesize in the request body:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'
```
Returns an `audio/wav` file.
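Since the response body is the audio itself, you can write it straight to a file instead of printing it to the terminal:

```bash
# Save the generated audio to disk
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}' --output hello.wav
```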
## Backends

### 🐸 Coqui

Required: do not use `LocalAI` images ending with the `-core` tag. Python dependencies are required in order to use this backend.

Coqui works without any configuration. To test it, you can run the following curl command:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "coqui",
  "model": "tts_models/en/ljspeech/glow-tts",
  "input": "Hello, this is a test!"
}'
```
You can use the environment variable `COQUI_LANGUAGE` to set the language used by the Coqui backend.
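For example, when starting LocalAI with Docker (a hedged sketch; adjust the image tag to the non-`-core` image you actually use):

```bash
# Hedged sketch: set COQUI_LANGUAGE when starting the container.
docker run -e COQUI_LANGUAGE=en -p 8080:8080 localai/localai:latest
```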
You can also use config files to configure TTS models (see the "Using config files" section below).
### Bark

[Bark](https://github.com/suno-ai/bark) generates audio from text prompts.

This is an extra backend: it is already available in the container images, so there is nothing to set up.

#### Model setup

There is nothing to be done for the model setup. The models are downloaded automatically the first time you use the backend, so you can start using Bark right away.
#### Usage

Use the `tts` endpoint by specifying the `bark` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!"
}' | aplay
```
To specify one of the [voice presets](https://github.com/suno-ai/bark#-voice-presets) (see also the [full list](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c)), use the `model` parameter:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!",
  "model": "v2/en_speaker_4"
}' | aplay
```
### Piper

To install the `piper` audio models manually:

- Download voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the `.tar.gz` archives (containing the `.onnx` and `.json` files) inside the `models` directory (a download sketch follows this list)
- Run the curl command further below to test that the model is working
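A hedged sketch of the download step; the exact release asset name is an assumption derived from the voice used in the test request below:

```bash
# Hedged sketch: fetch and extract one Piper voice into the models directory.
# The asset name is assumed from the voice used in the test request below.
cd models
wget https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-it-riccardo_fasol-x-low.tar.gz
tar -xzf voice-it-riccardo_fasol-x-low.tar.gz  # yields the .onnx and .json files
```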
To use the TTS endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the `piper` backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay
```
Note:

- `aplay` is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.
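If you prefer not to pass `backend` and the file name in every request, the generic config-file mechanism described in the "Using config files" section below should also work for Piper. A hedged, untested sketch:

```yaml
# Hedged sketch: expose a Piper voice under a friendly model name,
# following the config-file pattern shown later on this page.
name: riccardo
backend: piper
parameters:
  model: it-riccardo_fasol-x-low.onnx
```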
### Transformers-musicgen

LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
```bash
curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
  }' | aplay
```
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
### VibeVoice

[VibeVoice-Realtime](https://github.com/microsoft/VibeVoice) is a real-time text-to-speech model that generates natural-sounding speech with voice cloning capabilities.

#### Setup

Install the `vibevoice` model from the model gallery, or run `local-ai run models install vibevoice`.

#### Usage

Use the `tts` endpoint by specifying the `vibevoice` model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```
#### Voice cloning

VibeVoice supports voice cloning through voice preset files. You can configure a model with a specific voice:
```yaml
name: vibevoice
backend: vibevoice
parameters:
  model: microsoft/VibeVoice-Realtime-0.5B
tts:
  voice: "Frank" # or use audio_path to specify a .pt file path
  # Available English voices: Carter, Davis, Emma, Frank, Grace, Mike
```
Then you can use the model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```
### Pocket TTS

[Pocket TTS](https://github.com/kyutai-labs/pocket-tts) is a lightweight text-to-speech model designed to run efficiently on CPUs. It supports voice cloning through HuggingFace voice URLs or local audio files.

#### Setup

Install the `pocket-tts` model from the model gallery, or run `local-ai run models install pocket-tts`.

#### Usage

Use the `tts` endpoint by specifying the `pocket-tts` model:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```
#### Voice cloning

Pocket TTS supports voice cloning through built-in voice names, HuggingFace URLs, or local audio files. You can configure a model with a specific voice:
```yaml
name: pocket-tts
backend: pocket-tts
tts:
  voice: "azelma" # Built-in voice name
  # Or use a HuggingFace URL: "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
  # Or use a local file path: "path/to/voice.wav"
  # Available built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma
```
You can also pre-load a default voice for faster first generation:

```yaml
name: pocket-tts
backend: pocket-tts
options:
  - "default_voice:azelma" # Pre-load this voice when the model loads
```
Then you can use the model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```
### Vall-E-X

[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.

#### Setup

The backend will automatically download the files required to run the model.

This is an extra backend: it is already available in the container images, so there is nothing to set up. If you are building manually, you need to install Vall-E-X first.

#### Usage

Use the `tts` endpoint by specifying the `vall-e-x` backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "vall-e-x",
  "input": "Hello!"
}' | aplay
```
#### Voice cloning

In order to use the voice cloning capabilities, you must create a `YAML` configuration file to set up a model:
```yaml
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned
    # relative to the models directory
    # Max 15s
    audio_path: "audio-sample.wav"
```
Then you can specify the model name in the requests:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello!"
}' | aplay
```
## Using config files

You can also use a `config-file` to specify TTS models and their parameters.

In the following example we define a custom config to load the `xtts_v2` model, and specify a voice and language.
```yaml
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
  voice: Ana Florence
```
With this config, you can now use the following curl command to generate a text-to-speech audio file:

```bash
curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
  }' | aplay
```
## Response format

To provide compatibility with the OpenAI API's `response_format` parameter, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.

Warning regarding a change in behaviour: previously, the parameter was ignored and a wav file was always returned, which could lead to codec errors later in the integration (such as trying to decode the wav data as mp3, the default format used by OpenAI).

The formats supported via ffmpeg are `wav`, `mp3`, `aac`, `flac` and `opus`, defaulting to `wav` if an unknown format or no format is provided.
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts",
  "response_format": "mp3"
}'
```
If a `response_format` other than `wav` is requested and ffmpeg is not available, the call will fail.