# BlueMagpie-TTS — Usage

BlueMagpie-TTS is a text-to-speech (TTS) model that synthesizes natural speech
from text. It supports three scenarios:

- **Plain synthesis** — read the text aloud.
- **Voice cloning** — mimic the timbre of a reference clip.
- **Speaker selection** — control the timbre with a prepared speaker vector.

It also supports **streaming output** for applications that play audio while it is still being synthesized.

🔊 **Try it online:** [BlueMagpie-TTS Demo (Hugging Face Space)](https://huggingface.co/spaces/voidful/BlueMagpie-TTS-Demo)

## Install

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e .
```

The install pulls in the [`barbet`](https://github.com/OpenFormosa/Barbet)
package (the text-semantic language model) from GitHub. The acoustic modules are
vendored in `bluemagpie/_vendor/` (sourced from
[VoxCPM](https://github.com/OpenBMB/VoxCPM), Apache-2.0) and need no separate
install. To save synthesized audio, also install `soundfile`:

```bash
pip install soundfile
```

## Load the model

### From Hugging Face

```python
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer straight from tokenizer.json (works on transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

### From a local directory

```python
import os
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = "checkpoints/bluemagpie"
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

- `device` may be `"cuda"`, `"mps"`, or `"cpu"` (auto-selected if omitted).
- Always use `training=False` for inference.

## Basic synthesis: text to speech

`generate` returns a speech waveform (`torch.Tensor`); pair it with `soundfile`
to write a `.wav`. The output sample rate is `model.sample_rate` (48 kHz).

```python
import soundfile as sf

audio = model.generate(target_text="今天天氣真好。", cfg_value=2.0)
sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate)
```
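
If `soundfile` cannot be installed, the standard library can write the file too. The following is a minimal sketch, not part of BlueMagpie: `write_wav_pcm16` is a hypothetical helper, and it assumes the waveform is mono with float samples in `[-1.0, 1.0]` (values outside that range are clipped).

```python
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))           # clip out-of-range samples
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)                          # mono
        wf.setsampwidth(2)                          # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
```

With a loaded model this could be called as `write_wav_pcm16("output.wav", audio.squeeze().cpu().tolist(), model.sample_rate)`.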

## Voice cloning: mimic a reference speaker

There are two ways to clone a voice.

**A. Speaker vector (`speaker_centroid`)** — extract a vector from the reference
audio, then synthesize (no transcript needed):

```bash
pip install -e ".[clone]"   # extraction needs speechbrain (ECAPA-TDNN)
python scripts/extract_speaker_centroid.py --audio reference.wav --out my_voice.pt
# more clips of the same speaker -> cleaner centroid: --audio a.wav b.wav c.wav
```

```python
import torch

centroid = torch.load("my_voice.pt", weights_only=True)   # [192] speaker vector
audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=centroid,
    cfg_value=2.8,
)

# or extract it in-process:
from bluemagpie import extract_speaker_centroid
centroid = extract_speaker_centroid("reference.wav")      # [192]
```
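
The hint that more clips give a cleaner centroid can be illustrated with a small sketch. It assumes a centroid is the mean of L2-normalized per-clip embeddings, renormalized to unit length, which is a common recipe for speaker embeddings. Whether `extract_speaker_centroid.py` does exactly this is not stated here, and `mean_centroid` is a hypothetical name. Plain lists stand in for the `[192]` tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_centroid(embeddings):
    """Average unit-length embeddings, then renormalize (assumed recipe)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    unit = [l2_normalize(e) for e in embeddings]
    dim = len(unit[0])
    if any(len(u) != dim for u in unit):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(u[i] for u in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


# Two toy 2-d "clips" of different loudness: normalizing first keeps the
# louder clip from dominating the average.
centroid = mean_centroid([[2.0, 0.0], [0.0, 5.0]])
```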

**B. Reference clip (`reference_wav_path`)** — pass a reference clip directly:

```python
audio = model.generate(
    target_text="今天天氣真好。",
    reference_wav_path="reference.wav",
    cfg_value=2.8,
)
```

## Speaker selection: control timbre with a speaker vector

The model bundles a **multi-speaker table** at `checkpoints/speaker_centroids.pt`,
currently holding two speakers:

| speaker id | description | suggested `cfg_value` |
| --- | --- | --- |
| `hung_yi_lee` | Prof. Hung-yi Lee's speaker vector (used with his authorization; the official best params are tuned for this speaker) | 2.0–2.8 |
| `female_voice` | a generic female voice | 2.0–2.8 |

The table has the format `{"speaker_ids": [...], "centroids": tensor[N, 192], "dim": 192}`.
Load it with `torch.load`, **pick a speaker's `[192]` vector by id**, and pass it as
`speaker_centroid`:

```python
import os
import torch

table = torch.load(
    os.path.join(model_dir, "checkpoints", "speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
print(table["speaker_ids"])          # ['hung_yi_lee', 'female_voice']

# switch speaker by changing this line ("hung_yi_lee" or "female_voice")
speaker_id = "female_voice"
speaker_centroid = table["centroids"][table["speaker_ids"].index(speaker_id)]   # [192]

audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=speaker_centroid,   # or your own authorized speaker vector
    cfg_value=2.0,
)
```
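
When the speaker id comes from user input, the lookup can be wrapped so that an unknown id fails with a readable message. This is a sketch only: `pick_centroid` is a hypothetical helper, and the demo table uses short lists in place of the real `tensor[N, 192]`. The function relies only on the `speaker_ids` and `centroids` keys described above.

```python
def pick_centroid(table, speaker_id):
    """Return the centroid row for `speaker_id`, or raise a descriptive KeyError."""
    ids = list(table["speaker_ids"])
    if speaker_id not in ids:
        raise KeyError(f"unknown speaker {speaker_id!r}; available: {ids}")
    return table["centroids"][ids.index(speaker_id)]


# Stand-in table with 2-d vectors; the real table holds [192] vectors.
demo_table = {
    "speaker_ids": ["hung_yi_lee", "female_voice"],
    "centroids": [[0.1, 0.2], [0.3, 0.4]],
    "dim": 2,
}
vector = pick_centroid(demo_table, "female_voice")
```

Because the helper only indexes `table["centroids"]`, it should work unchanged with the tensor loaded by `torch.load`.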

If you only have the model id (haven't `snapshot_download`-ed the whole model yet),
grab just the table:

```python
from huggingface_hub import hf_hub_download
import torch

path = hf_hub_download("OpenFormosa/BlueMagpie-TTS", "checkpoints/speaker_centroids.pt")
table = torch.load(path, map_location="cpu", weights_only=True)
```

> To add more speakers, extract your own (authorized) `[192]` vector with
> `extract_speaker_centroid` from the *Voice cloning* section above — it's passed the
> exact same way. The earlier single-speaker file
> `checkpoints/hung_yi_lee_speaker_centroids.pt` (same format) is still available.

## Streaming output

To play audio while it is still being synthesized, use `generate_streaming`. It is a
generator that yields audio chunks one at a time:

```python
chunks = []
for chunk in model.generate_streaming(target_text="今天天氣真好。"):
    chunks.append(chunk)
    # play or write each chunk in real time here
```

> Note: automatic retry (`retry_badcase`) is not supported in streaming mode.
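
For streaming, the latency that matters most is usually the time until the first chunk arrives. The sketch below measures it for any chunk generator. `consume_stream` is a hypothetical helper, and the demo uses a fake generator rather than the real model.

```python
import time


def consume_stream(chunks, on_chunk=None):
    """Drain a chunk generator; return (all chunks, seconds until the first chunk)."""
    start = time.perf_counter()
    first_chunk_seconds = None
    collected = []
    for chunk in chunks:
        if first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        collected.append(chunk)
        if on_chunk is not None:
            on_chunk(chunk)          # e.g. hand the chunk to an audio player
    return collected, first_chunk_seconds


def fake_stream():
    """Stand-in for model.generate_streaming(...)."""
    yield "chunk-0"
    yield "chunk-1"


chunks, latency = consume_stream(fake_stream())
```

With a loaded model, `consume_stream(model.generate_streaming(target_text=...))` would be the equivalent call.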

## Four input modes

The model supports four input combinations through the same `generate` interface:

| Mode | Parameters | Use |
|---|---|---|
| Plain synthesis | `target_text` | Read the text aloud |
| Continuation | `target_text`, `prompt_text`, `prompt_wav_path` | Continue from an existing clip and its text |
| Reference clip | `target_text`, `reference_wav_path` | Mimic the reference speaker's timbre |
| Speaker vector | `target_text`, `speaker_centroid` | Clone a voice from a speaker vector |
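
The table can be read as a small decision rule on which arguments are set. The sketch below encodes it under one assumption that this document does not confirm: that the conditioning modes are mutually exclusive. `input_mode` is a hypothetical helper, not part of the package.

```python
def input_mode(prompt_text="", prompt_wav_path="", reference_wav_path="",
               speaker_centroid=None):
    """Name the input mode implied by the arguments (modes assumed exclusive)."""
    has_prompt = bool(prompt_text) or bool(prompt_wav_path)
    if has_prompt and not (prompt_text and prompt_wav_path):
        raise ValueError("continuation needs both prompt_text and prompt_wav_path")
    selected = [
        name
        for name, active in (
            ("continuation", has_prompt),
            ("reference clip", bool(reference_wav_path)),
            ("speaker vector", speaker_centroid is not None),
        )
        if active
    ]
    if len(selected) > 1:
        raise ValueError(f"conflicting modes: {selected}")
    return selected[0] if selected else "plain synthesis"
```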

## Common `generate` parameters

| Parameter | Default | Description |
|---|---|---|
| `target_text` | (required) | The text to synthesize |
| `prompt_text` | `""` | Prompt text, paired with `prompt_wav_path` for continuation |
| `prompt_wav_path` | `""` | Prompt audio path, for continuation |
| `reference_wav_path` | `""` | Reference audio path, for voice cloning |
| `speaker_centroid` | `None` | Speaker vector, to select a timbre |
| `cfg_value` | `2.0` | Guidance strength; higher follows the condition more closely but can sound less natural |
| `inference_timesteps` | `10` | Sampling steps; more usually means better quality and slower speed |
| `min_len` / `max_len` | `2` / `2000` | Lower / upper bound on output length |
| `retry_badcase` | `False` | Auto-retry on detected bad output (unsupported in streaming) |

## Batch serving engine (multi-request acceleration)

To serve many synthesis requests at once for higher throughput, use the built-in
batch engine `BlueMagpieEngine`. It does **continuous batching**: requests are
decoded together as a batch, new requests can join mid-decode, and they do not
interfere with one another.

Highlights:

- **No extra dependencies** — torch only; no vLLM, flash-attn, etc.
- **Cross-device** — one code path on CUDA, Apple Silicon (MPS), and CPU.
  CUDA-only optimizations are auto-detected and enabled, and skipped elsewhere.
- **Numerically identical to single-call `generate`** at batch=1 (`model.generate`
  is always the reference).
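
To see why continuous batching raises throughput, consider a toy scheduler in which each request needs a fixed number of decode steps. This is a simplified simulation written for illustration, not the engine's code, and both function names are hypothetical.

```python
from collections import deque


def simulate_continuous_batching(lengths, max_num_seqs):
    """Return {request_id: step at which it finished} for a toy scheduler."""
    if any(n < 1 for n in lengths):
        raise ValueError("every request needs at least one step")
    waiting = deque(enumerate(lengths))
    running = {}                      # request_id -> remaining decode steps
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests as soon as a slot is free (the key idea).
        while waiting and len(running) < max_num_seqs:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1                     # one batched decode step for all running
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished_at[rid] = step
                del running[rid]
    return finished_at


def static_batching_steps(lengths, max_num_seqs):
    """Steps needed if each batch must fully finish before the next starts."""
    return sum(max(lengths[i:i + max_num_seqs])
               for i in range(0, len(lengths), max_num_seqs))


finished = simulate_continuous_batching([3, 1, 2], max_num_seqs=2)
print(max(finished.values()), static_batching_steps([3, 1, 2], 2))  # prints: 3 5
```

In this toy case the third request takes over the slot freed by the short second request, so all three finish in 3 steps instead of 5.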

### Basic usage

```python
import soundfile as sf
from bluemagpie.serving import BlueMagpieEngine, EngineConfig, Request

# load `model` and `tokenizer` as shown above (from_local)
engine = BlueMagpieEngine(model, EngineConfig(max_num_seqs=16))

engine.add_request(Request(target_text="今天天氣真好。", seed=0))
engine.add_request(Request(target_text="第二句話。", reference_wav_path="speaker.wav"))

for out in engine.run():            # returned in request-id (submission) order
    # out.audio: 48 kHz waveform (when an AudioVAE is attached); out.latents: [T, p, d]
    sf.write(f"output_{out.request_id}.wav", out.audio.numpy(), out.sample_rate)
```

`Request` supports the same four input modes as `generate` (plain, continuation,
reference clip, speaker vector) via the fields `target_text`, `prompt_text`,
`prompt_wav_path`, `reference_wav_path`, `speaker_centroid`, `cfg_value`,
`inference_timesteps`, etc. Each request may set a `seed`, which makes its output
independent of how many neighbours share the batch and of admission order.
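
One way to obtain this kind of independence is to give every request its own random generator instead of drawing from a shared one. The sketch below shows the principle with the standard library. How the engine implements it is not described here, so treat this as an illustration of the idea only.

```python
import random


def noise_for(seed, n):
    """Draw n Gaussian samples from a generator private to one request."""
    rng = random.Random(seed)         # not shared with any other request
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def run_batch(seeds, n=4):
    """Toy batch: each request's noise depends only on its own seed."""
    return {seed: noise_for(seed, n) for seed in seeds}


alone = run_batch([0])
crowded = run_batch([7, 0, 3])        # same request, different neighbours and order
assert alone[0] == crowded[0]
```

In PyTorch the analogous building block is a per-request `torch.Generator` seeded with `manual_seed`.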

### Streaming

`engine.stream()` is a generator that yields a chunk per request per step:

```python
for chunk in engine.stream():
    # chunk.request_id, chunk.latents, chunk.audio, chunk.finished
    play_or_write(chunk)   # placeholder: your own playback or file-writing code
```

> Plain synthesis, reference-clip, and speaker-vector modes stream audio
> (`chunk.audio`); prompt-audio continuation streams `latents` only — use `run()`
> when you need its audio.
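
Because chunks from different requests are interleaved, a consumer typically groups them by `request_id`. The following sketch does that with a stand-in chunk type. `FakeChunk` and `demux` are hypothetical, and only the field names `request_id`, `audio`, and `finished` come from the example above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeChunk:
    """Stand-in for the objects yielded by engine.stream()."""
    request_id: int
    audio: Any
    finished: bool = False


def demux(chunks):
    """Group interleaved chunks per request; also report which requests finished."""
    buffers = defaultdict(list)
    done = set()
    for chunk in chunks:
        if chunk.audio is not None:   # continuation chunks may carry no audio
            buffers[chunk.request_id].append(chunk.audio)
        if chunk.finished:
            done.add(chunk.request_id)
    return dict(buffers), done


stream = [
    FakeChunk(0, "a0"), FakeChunk(1, "b0"),
    FakeChunk(0, "a1", finished=True), FakeChunk(1, "b1"),
]
buffers, done = demux(stream)
```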

### Configuration

Common `EngineConfig` parameters:

| Parameter | Default | Description |
|---|---|---|
| `max_num_seqs` | `16` | Max concurrent requests batched together |
| `max_model_len` | `2048` | Max length per sequence (prompt + generated) |
| `inference_timesteps` | `9` | Sampling steps |
| `cfg_value` | `2.8` | Guidance strength |
| `enforce_eager` | `True` | Keep the path numerically identical to single-call `generate` |
| `compile` | `False` | Enable `torch.compile` (CUDA only; auto-skipped elsewhere) |

> See [`src/bluemagpie/serving/DESIGN.md`](src/bluemagpie/serving/DESIGN.md) for the
> engine's design, trade-offs, and known limitations.

### Why not just use vLLM?

People often expect "wrap it in vLLM and it gets fast", but for BlueMagpie that
does not work, for two reasons:

1. **The real compute bottleneck is the diffusion decoder, not the language
   model.** Per generated audio unit the DiT (LocDiT / CFM diffusion decoder) is
   called ~16–18 times (sampling steps × the unconditional/conditional CFG
   pair), while the language models (Barbet, RALM) run once each. vLLM is a
   *text language-model* inference framework — it does not touch the diffusion
   decoder at all, so even moving the LMs onto vLLM leaves the dominant compute
   running eagerly and barely moves end-to-end latency.
2. **vLLM does not support Barbet's hybrid architecture.** Barbet (the
   text-semantic LM) is a Mamba2 + attention hybrid, and vLLM (as well as
   nano-vllm and vllm-omni) ships no implementation of it, so you would have
   to implement a first-class hybrid model yourself (large effort, CUDA-only).
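
The call count in the first point is simple arithmetic. The sketch below assumes exactly one DiT call per sampling step for each half of the CFG pair, with no fusing of the two passes. The function name is hypothetical.

```python
def dit_calls_per_unit(inference_timesteps, cfg_passes=2):
    """DiT forward passes per generated audio unit (assumed: steps x CFG passes)."""
    return inference_timesteps * cfg_passes


# 8 and 9 sampling steps reproduce the "~16-18" range quoted above;
# 9 is the EngineConfig default listed in the configuration table.
print(dit_calls_per_unit(8), dit_calls_per_unit(9))  # prints: 16 18
```

Against one call each for Barbet and RALM, the decoder accounts for the large majority of forward passes per audio unit under this assumption.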

So this engine **borrows vLLM's architectural techniques without depending on its
CUDA kernels**:

- **Continuous batching** of many requests (the main throughput win), sharing
  batched compute across requests.
- A **padded KV cache + SDPA + masks** instead of vLLM's PagedAttention /
  FlashAttention — trading peak speed and memory efficiency for cross-device,
  zero-dependency portability.
- Barbet's Mamba state handled with a **pure-PyTorch single-step recurrence**, no
  fused kernel required.
- Optional `compile=True` uses `torch.compile` (which captures CUDA graphs
  internally) to accelerate the **DiT and LocEnc** — the actual hot path, and
  exactly what wrapping in vLLM would *not* do for you.
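
The padded-cache idea can be sketched without any framework: sequences of different lengths are padded to a common length, and a boolean mask marks which positions are real. This is an illustration of the general technique, not the engine's implementation, and `padding_mask` is a hypothetical name.

```python
def padding_mask(lengths):
    """Return one row per sequence; True marks a real (non-padding) position."""
    if not lengths:
        return []
    max_len = max(lengths)
    return [[pos < n for pos in range(max_len)] for n in lengths]


mask = padding_mask([3, 1])
print(mask)  # prints: [[True, True, True], [True, False, False]]
```

This matches the convention of PyTorch's `scaled_dot_product_attention`, where `True` in a boolean `attn_mask` means the position takes part in attention.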

> In short: we don't aim to beat vLLM on a single op; we use vLLM-class **batch
> scheduling** plus **DiT-bottleneck optimization** to raise overall throughput
> with no extra dependencies, across CUDA / MPS / CPU.

## Apple Silicon MLX acceleration (optional)

On Apple Silicon (M-series), a native **MLX** path runs inference directly on the
Apple GPU (Metal, unified memory) — typically faster than PyTorch's MPS backend.
It is an optional extra; the core package stays torch-only:

```bash
pip install -e ".[mlx]"
```

```python
import soundfile as sf
from bluemagpie import BlueMagpieModel
from bluemagpie.mlx import BlueMagpieMLX, mlx_generate

# `model_dir` and `tokenizer` are created as in "Load the model" above.
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cpu")
mlx_model = BlueMagpieMLX(model)          # converts the weights once

audio = mlx_generate(model, mlx_model, "今天天氣真好。", seed=0)   # 48 kHz waveform
sf.write("output.wav", audio.numpy(), model.sample_rate)
```

- The whole inference path (Barbet, RALM, LocEnc, LocDiT/CFM, the **AudioVAE
  decoder**, the AR loop) is re-implemented in MLX and numerically parity-checked,
  module by module — generation can run torch-free (only tokenization and
  reference-wav encoding stay in torch).
- Decode uses cached single-step kernels (it advances one position per step, not a
  full re-run).
- `mlx_generate` supports the same four input modes as `generate`.
- On the real 7.75 GB model: end-to-end **RTF 0.77** (faster than real time) —
  ~**1.45×** over torch-MPS and ~**3.27×** over torch-CPU (fp32,
  `scripts/bench_rtf.py`). See [`src/bluemagpie/mlx/DESIGN.md`](src/bluemagpie/mlx/DESIGN.md).
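
For reference, the real-time factor (RTF) is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time. The sketch below uses that standard definition. It is not taken from `scripts/bench_rtf.py`, and the numbers in the example are made up to land on 0.77.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Illustrative only: 10 s of 48 kHz audio produced in 7.7 s of compute.
rtf = real_time_factor(7.7, num_samples=480_000, sample_rate=48_000)
```

If the quoted speedups are ratios of RTFs (an assumption), the implied baselines would be roughly 0.77 × 1.45 ≈ 1.12 for torch-MPS and 0.77 × 3.27 ≈ 2.52 for torch-CPU.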

## Notes

- The examples load the tokenizer from `tokenizer.json` and pass it to
  `from_local`, which is stable on transformers 5.x. (`from_local`'s automatic
  tokenizer loading can fail on 5.x — see Troubleshooting.)
- A GPU is optional: set `device="cpu"` (slower, but short utterances take only
  tens of seconds). Output is 48 kHz mono.
- The bundled `hung_yi_lee` speaker vector is authorized for example use. For any
  other speaker or voice cloning, use only reference audio or speaker vectors you
  are authorized to use.
- Keep speaker-vector tables and synthesized audio private; do not distribute
  them without authorization.

## Troubleshooting

**Tokenizer loading on newer transformers (5.x).** The examples load the
tokenizer explicitly from `tokenizer.json`, so they work on transformers 5.x with
no extra steps (the model only uses the tokenizer's `encode`).

If you instead rely on `from_local`'s automatic tokenizer loading (passing no
`tokenizer`), transformers 5.x may fail while parsing `tokenizer_config.json`
with `TypeError: ..._patch_mistral_regex() got multiple values for keyword
argument 'fix_mistral_regex'`, or appear to load but raise `ValueError: No
tokenizer attached to BlueMagpieModel` when you call `generate()`. Use the
explicit loading shown above instead.