+++
disableToc = false
title = "🗣 Text to audio (TTS)"
weight = 11
url = "/features/text-to-audio/"
+++

## API Compatibility

The LocalAI TTS API is compatible with the [OpenAI TTS API](https://platform.openai.com/docs/guides/text-to-speech) and the [ElevenLabs](https://api.elevenlabs.io/docs) API.
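
For example, OpenAI-style clients can talk to the compatible route. A minimal sketch, assuming the OpenAI-compatible route is mounted at `/v1/audio/speech` and that a model named `tts` is configured (the voice name is illustrative):

```bash
# Hedged sketch: an OpenAI-style request against the compatible route.
# Assumes /v1/audio/speech is exposed and a model named "tts" is installed.
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts",
    "input": "Hello world",
    "voice": "alloy"
  }' --output speech.wav
```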

## LocalAI API

LocalAI also exposes its own `/tts` endpoint, which can be used to generate speech from text.

## Usage

Request fields: `input` (the text to synthesize) and `model` (the TTS model to use).

For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the text in the request body:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}'
```

Returns an `audio/wav` file.
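
To save the result to disk and play it back (`aplay` is used throughout this page, but any audio player works):

```bash
# Save the generated speech to a file, then play it
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts"
}' --output hello.wav
aplay hello.wav
```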


## Backends

### 🐸 Coqui

Required: don't use `LocalAI` images ending with the `-core` tag, as Python dependencies are required in order to use this backend.

Coqui works without any configuration; to test it, you can run the following curl command:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "coqui",
  "model": "tts_models/en/ljspeech/glow-tts",
  "input": "Hello, this is a test!"
}'
```

You can use the `COQUI_LANGUAGE` environment variable to set the language used by the Coqui backend.
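
For example, when starting LocalAI with Docker (the image tag below is illustrative):

```bash
# Hedged sketch: set the Coqui language for the whole instance via the environment
docker run -p 8080:8080 -e COQUI_LANGUAGE=en localai/localai:latest
```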

You can also use config files to configure tts models (see section below on how to use config files).

### Bark

[Bark](https://github.com/suno-ai/bark) allows you to generate audio from text prompts.

This is an extra backend: it is already available in the container images, so there is nothing to do for the setup.

#### Model setup

There is nothing to do for the model setup: you can start using Bark right away, and the models will be downloaded automatically the first time you use the backend.

#### Usage

Use the `tts` endpoint by specifying the `bark` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!"
}' | aplay
```

To specify a voice from the [Bark voice presets](https://github.com/suno-ai/bark#-voice-presets) (see also https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c), use the `model` parameter:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "bark",
  "input": "Hello!",
  "model": "v2/en_speaker_4"
}' | aplay
```

### Piper

To install the `piper` audio models manually:

- Download voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the `.tar.gz` archives (each contains an `.onnx` model and a `.json` config) into the `models` directory, as in the sketch after this list
- Run the curl command further below to test that the model is working
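
A hedged sketch of the manual install, assuming the release asset follows the `voice-<name>.tar.gz` naming used on that release page (check the actual asset name there):

```bash
# Download and extract an Italian Piper voice into the models directory
cd models
curl -LO https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-it-riccardo_fasol-x-low.tar.gz
tar -xzf voice-it-riccardo_fasol-x-low.tar.gz  # yields the .onnx and .json files
```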

To use the tts endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the `piper` backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model":"it-riccardo_fasol-x-low.onnx",
  "backend": "piper",
  "input": "Ciao, sono Ettore"
}' | aplay
```

Note:

- `aplay` is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.
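
If you are building from source, a minimal sketch of enabling that flag (assuming the repository's standard `make` workflow):

```bash
# Build LocalAI from source with TTS support enabled
make GO_TAGS=tts build
```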

### Transformers-musicgen

LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:

```bash
curl --request POST \
  --url http://localhost:8080/tts \
  --header 'Content-Type: application/json' \
  --data '{
    "backend": "transformers-musicgen",
    "model": "facebook/musicgen-medium",
    "input": "Cello Rave"
  }' | aplay
```

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

### VibeVoice

[VibeVoice-Realtime](https://github.com/microsoft/VibeVoice) is a real-time text-to-speech model that generates natural-sounding speech with voice cloning capabilities.

#### Setup

Install the `vibevoice` model from the model gallery, or run `local-ai models install vibevoice`.

#### Usage

Use the `/tts` endpoint by specifying the `vibevoice` model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```

#### Voice cloning

VibeVoice supports voice cloning through voice preset files. You can configure a model with a specific voice:

```yaml
name: vibevoice
backend: vibevoice
parameters:
  model: microsoft/VibeVoice-Realtime-0.5B
tts:
  voice: "Frank"  # or use audio_path to specify a .pt file path
  # Available English voices: Carter, Davis, Emma, Frank, Grace, Mike
```
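
Alternatively, a hypothetical variant that points at a preset file directly (the model name `vibevoice-cloned` and the `.pt` path are illustrative):

```yaml
name: vibevoice-cloned
backend: vibevoice
parameters:
  model: microsoft/VibeVoice-Realtime-0.5B
tts:
  # Hypothetical preset path, as mentioned in the comment above
  audio_path: "voices/my-speaker.pt"
```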

Then you can use the model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "vibevoice",
  "input": "Hello!"
}' | aplay
```

### Pocket TTS

[Pocket TTS](https://github.com/kyutai-labs/pocket-tts) is a lightweight text-to-speech model designed to run efficiently on CPUs. It supports voice cloning through HuggingFace voice URLs or local audio files.

#### Setup

Install the `pocket-tts` model from the model gallery, or run `local-ai models install pocket-tts`.

#### Usage

Use the `/tts` endpoint by specifying the `pocket-tts` model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```

#### Voice cloning

Pocket TTS supports voice cloning through built-in voice names, HuggingFace URLs, or local audio files. You can configure a model with a specific voice:

```yaml
name: pocket-tts
backend: pocket-tts
tts:
  voice: "azelma"  # Built-in voice name
  # Or use HuggingFace URL: "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
  # Or use local file path: "path/to/voice.wav"
  # Available built-in voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma
```
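
For instance, a variant that clones from the HuggingFace-hosted clip mentioned in the comment above (the model name `pocket-tts-cloned` is illustrative):

```yaml
name: pocket-tts-cloned
backend: pocket-tts
tts:
  # Clone from a HuggingFace-hosted reference clip
  voice: "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
```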

You can also pre-load a default voice for faster first generation:

```yaml
name: pocket-tts
backend: pocket-tts
options:
  - "default_voice:azelma"  # Pre-load this voice when model loads
```

Then you can use the model:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "pocket-tts",
  "input": "Hello world, this is a test."
}' | aplay
```

### Vall-E-X

[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.

#### Setup

The backend will automatically download the required files in order to run the model.

This is an extra backend: it is already available in the container images, so there is nothing to do for the setup. If you are building manually, you need to install VALL-E-X first.

#### Usage

Use the `/tts` endpoint by specifying the `vall-e-x` backend:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "backend": "vall-e-x",
  "input": "Hello!"
}' | aplay
```

#### Voice cloning

To use the voice cloning capabilities, you must create a YAML configuration file to set up a model:

```yaml
name: cloned-voice
backend: vall-e-x
parameters:
  model: "cloned-voice"
tts:
  vall-e:
    # The path to the audio file to be cloned,
    # relative to the models directory.
    # Max 15s
    audio_path: "audio-sample.wav"
```
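
Since the reference clip must be at most 15 seconds, you can trim a longer recording with ffmpeg before placing it in the models directory (the input file name and the mono downmix are illustrative):

```bash
# Keep only the first 15 seconds and write a mono wav into the models directory
ffmpeg -i long-recording.mp3 -t 15 -ac 1 models/audio-sample.wav
```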

Then you can specify the model name in the requests:

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "model": "cloned-voice",
  "input": "Hello!"
}' | aplay
```

## Using config files

You can also use a config file to specify TTS models and their parameters.

In the following example we define a custom config to load the `xtts_v2` model and specify a voice and language.

```yaml
name: xtts_v2
backend: coqui
parameters:
  language: fr
  model: tts_models/multilingual/multi-dataset/xtts_v2

tts:
  voice: Ana Florence
```

With this config, you can now use the following curl command to generate a text-to-speech audio file:
```bash
curl -L http://localhost:8080/tts \
  -H "Content-Type: application/json" \
  -d '{
    "model": "xtts_v2",
    "input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
  }' | aplay
```

## Response format

To provide compatibility with the OpenAI API's `response_format` parameter, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.

Note that this is a change in behaviour: previously, the parameter was ignored and a wav file was always returned, which could cause codec errors later in the integration (for example, trying to decode an mp3 file from what is actually a wav; mp3 is the default format used by OpenAI).

The formats supported via ffmpeg are `wav`, `mp3`, `aac`, `flac`, and `opus`, defaulting to `wav` if an unknown format, or no format, is provided.

```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
  "input": "Hello world",
  "model": "tts",
  "response_format": "mp3"
}'
```

If a `response_format` other than `wav` is requested and ffmpeg is not available, the call will fail.
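
Before relying on `response_format`, you can verify that ffmpeg is available in your environment:

```bash
# Verify ffmpeg is installed and on the PATH (prints the version banner)
ffmpeg -version | head -n 1
```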