# BlueMagpie-TTS — Usage
BlueMagpie-TTS is a text-to-speech (TTS) model that synthesizes natural speech
from text. It supports three scenarios:
- **Plain synthesis** — read the text aloud.
- **Voice cloning** — mimic the timbre of a reference clip.
- **Speaker selection** — control the timbre with a prepared speaker vector.
It also supports **streaming output** for synthesize-while-you-play applications.
🔊 **Try it online:** [BlueMagpie-TTS Demo (Hugging Face Space)](https://huggingface.co/spaces/voidful/BlueMagpie-TTS-Demo)
## Install
```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e .
```
The install pulls in the [`barbet`](https://github.com/OpenFormosa/Barbet)
package (the text-semantic language model) from GitHub. The acoustic modules are
vendored in `bluemagpie/_vendor/` (sourced from
[VoxCPM](https://github.com/OpenBMB/VoxCPM), Apache-2.0) and need no separate
install. To save synthesized audio, also install `soundfile`:
```bash
pip install soundfile
```
## Load the model
### From Hugging Face
```python
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer straight from tokenizer.json (works on transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```
### From a local directory
```python
import os
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel
model_dir = "checkpoints/bluemagpie"
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```
- `device` may be `"cuda"`, `"mps"`, or `"cpu"` (auto-selected if omitted).
- Always use `training=False` for inference.
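
If you would rather pass `device` explicitly than rely on auto-selection, a helper along these lines can pick one. This is a sketch and not part of BlueMagpie: `pick_device` is a hypothetical name, and the CUDA, then MPS, then CPU priority is an assumption rather than the library's documented behaviour.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    # Fall back to CPU when torch cannot be found at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on very old PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The result can then be passed as `device=pick_device()` to `from_local`.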
## Basic synthesis: text to speech
`generate` returns a speech waveform (`torch.Tensor`); pair it with `soundfile`
to write a `.wav`. The output sample rate is `model.sample_rate` (48 kHz).
```python
import soundfile as sf
audio = model.generate(target_text="今天天氣真好。", cfg_value=2.0)
sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate)
```
## Voice cloning: mimic a reference speaker
There are two ways to clone a voice.
**A. Speaker vector (`speaker_centroid`)** — extract a vector from the reference
audio, then synthesize (no transcript needed):
```bash
pip install -e ".[clone]" # extraction needs speechbrain (ECAPA-TDNN)
python scripts/extract_speaker_centroid.py --audio reference.wav --out my_voice.pt
# more clips of the same speaker -> cleaner centroid: --audio a.wav b.wav c.wav
```
```python
import torch
centroid = torch.load("my_voice.pt", weights_only=True) # [192] speaker vector
audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=centroid,
    cfg_value=2.8,
)
# or extract it in-process:
from bluemagpie import extract_speaker_centroid
centroid = extract_speaker_centroid("reference.wav") # [192]
```
**B. Reference clip (`reference_wav_path`)** — pass a reference clip directly:
```python
audio = model.generate(
    target_text="今天天氣真好。",
    reference_wav_path="reference.wav",
    cfg_value=2.8,
)
```
## Speaker selection: control timbre with a speaker vector
The model bundles a **multi-speaker table** at `checkpoints/speaker_centroids.pt`,
currently holding two speakers:
| speaker id | description | suggested `cfg_value` |
| --- | --- | --- |
| `hung_yi_lee` | Prof. Hung-yi Lee's speaker vector (used with his authorization; the official best params are tuned for this speaker) | 2.0–2.8 |
| `female_voice` | a generic female voice | 2.0–2.8 |
The table has the format `{"speaker_ids": [...], "centroids": tensor[N, 192], "dim": 192}`.
Load it with `torch.load`, **pick a speaker's `[192]` vector by id**, and pass it as
`speaker_centroid`:
```python
import os
import torch
table = torch.load(
    os.path.join(model_dir, "checkpoints", "speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
print(table["speaker_ids"]) # ['hung_yi_lee', 'female_voice']
# switch speaker by changing this line ("hung_yi_lee" or "female_voice")
speaker_id = "female_voice"
speaker_centroid = table["centroids"][table["speaker_ids"].index(speaker_id)] # [192]
audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=speaker_centroid,  # or your own authorized speaker vector
    cfg_value=2.0,
)
```
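
The lookup above raises a bare `ValueError` from `list.index` when the id is misspelled. A small wrapper can report the available ids instead. This is only a sketch: `select_centroid` is a hypothetical helper, and the demo table uses 3-dimensional plain lists as a stand-in for the real `[N, 192]` tensor.

```python
def select_centroid(table: dict, speaker_id: str):
    """Return the centroid row for speaker_id from a speaker table."""
    ids = list(table["speaker_ids"])
    if speaker_id not in ids:
        raise KeyError(f"unknown speaker {speaker_id!r}; available: {ids}")
    return table["centroids"][ids.index(speaker_id)]


# Stand-in table: same keys as the documented format, tiny fake vectors.
demo_table = {
    "speaker_ids": ["hung_yi_lee", "female_voice"],
    "centroids": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "dim": 3,
}
print(select_centroid(demo_table, "female_voice"))  # [0.4, 0.5, 0.6]
```

With the real table loaded by `torch.load`, the same call returns the `[192]` row to pass as `speaker_centroid`.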
If you only have the model id (haven't `snapshot_download`-ed the whole model yet),
grab just the table:
```python
import torch
from huggingface_hub import hf_hub_download
path = hf_hub_download("OpenFormosa/BlueMagpie-TTS", "checkpoints/speaker_centroids.pt")
table = torch.load(path, map_location="cpu", weights_only=True)
```
> To add more speakers, extract your own (authorized) `[192]` vector with
> `extract_speaker_centroid` from the *Voice cloning* section above — it's passed the
> exact same way. The earlier single-speaker file
> `checkpoints/hung_yi_lee_speaker_centroids.pt` (same format) is still available.
## Streaming output
When you need to play while synthesizing, use `generate_streaming`. It is a
generator that yields audio chunks one at a time:
```python
chunks = []
for chunk in model.generate_streaming(target_text="今天天氣真好。"):
    chunks.append(chunk)
    # play or write each chunk in real time here
```
> Note: automatic retry (`retry_badcase`) is not supported in streaming mode.
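
One way to "write each chunk in real time" is to append to a WAV file as chunks arrive, using only the standard library. The sketch below is not BlueMagpie code: `write_chunks_to_wav` is a hypothetical helper, and it assumes each chunk is a flat sequence of float samples in the range -1 to 1, which the source does not specify.

```python
import struct
import wave
from typing import Iterable, Sequence


def write_chunks_to_wav(chunks: Iterable[Sequence[float]], path: str,
                        sample_rate: int = 48000) -> int:
    """Append float chunks to a mono 16-bit WAV as they arrive; return samples written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total
```

If the streamed chunks are `torch.Tensor` objects, something like `(c.squeeze().cpu().tolist() for c in model.generate_streaming(target_text=...))` could feed this helper, assuming each chunk squeezes down to one dimension.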
## Four input modes
The model supports four input combinations through the same `generate` interface:
| Mode | Parameters | Use |
|---|---|---|
| Plain synthesis | `target_text` | Read the text aloud |
| Continuation | `target_text`, `prompt_text`, `prompt_wav_path` | Continue from an existing clip and its text |
| Reference clip | `target_text`, `reference_wav_path` | Mimic the reference speaker's timbre |
| Speaker vector | `target_text`, `speaker_centroid` | Clone a voice from a speaker vector |
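
To make the table concrete, the following sketch names the row that a given set of keyword arguments corresponds to. `input_mode` is a hypothetical helper, not part of the package. The source does not say whether the modes can be combined, so combinations outside the table are flagged rather than interpreted.

```python
def input_mode(prompt_text: str = "", prompt_wav_path: str = "",
               reference_wav_path: str = "", speaker_centroid=None) -> str:
    """Name the table row that a set of generate() keyword arguments matches."""
    # Continuation needs both the prompt text and the prompt audio.
    if bool(prompt_text) != bool(prompt_wav_path):
        return "incomplete continuation (needs prompt_text and prompt_wav_path)"
    selected = [name for name, active in (
        ("continuation", bool(prompt_text)),
        ("reference clip", bool(reference_wav_path)),
        ("speaker vector", speaker_centroid is not None),
    ) if active]
    if not selected:
        return "plain synthesis"
    if len(selected) == 1:
        return selected[0]
    # The table lists no combined rows, so report these instead of guessing.
    return "unlisted combination: " + " + ".join(selected)


print(input_mode(reference_wav_path="reference.wav"))  # reference clip
```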
## Common `generate` parameters
| Parameter | Default | Description |
|---|---|---|
| `target_text` | (required) | The text to synthesize |
| `prompt_text` | `""` | Prompt text, paired with `prompt_wav_path` for continuation |
| `prompt_wav_path` | `""` | Prompt audio path, for continuation |
| `reference_wav_path` | `""` | Reference audio path, for voice cloning |
| `speaker_centroid` | `None` | Speaker vector, to select a timbre |
| `cfg_value` | `2.0` | Guidance strength; higher follows the condition more closely but can sound less natural |
| `inference_timesteps` | `10` | Sampling steps; more usually means better quality and slower speed |
| `min_len` / `max_len` | `2` / `2000` | Lower / upper bound on output length |
| `retry_badcase` | `False` | Auto-retry on detected bad output (unsupported in streaming) |
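
Because `cfg_value` trades condition-following against naturalness, it can help to render the same sentence at a few settings and compare by ear. The helper below is a sketch with a hypothetical name (`sweep_cfg`). It relies only on `generate` accepting the keyword arguments listed in the table, and the default values are picked from the 2.0 to 2.8 range suggested for the bundled speakers.

```python
def sweep_cfg(model, text: str, cfg_values=(2.0, 2.4, 2.8), **fixed):
    """Call model.generate once per cfg_value and return {cfg_value: audio}."""
    results = {}
    for cfg in cfg_values:
        # Extra keyword arguments (e.g. inference_timesteps) stay fixed across the sweep.
        results[cfg] = model.generate(target_text=text, cfg_value=cfg, **fixed)
    return results
```

For example, `sweep_cfg(model, "今天天氣真好。", inference_timesteps=10)` would return one waveform per guidance strength, each of which can be written out with `soundfile` as shown earlier.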
## Batch serving engine (multi-request acceleration)
To serve many synthesis requests at once for higher throughput, use the built-in
batch engine `BlueMagpieEngine`. It does **continuous batching**: requests are
decoded together as a batch, new requests can join mid-decode, and they do not
interfere with one another.
Highlights:
- **No extra dependencies** — torch only; no vLLM, flash-attn, etc.
- **Cross-device** — one code path on CUDA, Apple Silicon (MPS), and CPU.
CUDA-only optimizations are auto-detected and enabled, and skipped elsewhere.
- **Numerically identical to single-call `generate`** at batch=1 (`model.generate`
is always the reference).
### Basic usage
```python
import soundfile as sf
from bluemagpie.serving import BlueMagpieEngine, EngineConfig, Request
# load `model` and `tokenizer` as shown above (from_local)
engine = BlueMagpieEngine(model, EngineConfig(max_num_seqs=16))
engine.add_request(Request(target_text="今天天氣真好。", seed=0))
engine.add_request(Request(target_text="第二句話。", reference_wav_path="speaker.wav"))
for out in engine.run():  # returned in request-id (submission) order
    # out.audio: 48 kHz waveform (when an AudioVAE is attached); out.latents: [T, p, d]
    sf.write(f"output_{out.request_id}.wav", out.audio.numpy(), out.sample_rate)
```
`Request` supports the same four input modes as `generate` (plain, continuation,
reference clip, speaker vector) via the fields `target_text`, `prompt_text`,
`prompt_wav_path`, `reference_wav_path`, `speaker_centroid`, `cfg_value`,
`inference_timesteps`, etc. Each request may set a `seed`, which makes its output
independent of how many neighbours share the batch and of admission order.
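
A common pattern is to queue a list of texts with one seed per request and then drain the engine. The sketch below uses only `add_request`, `run`, and the `target_text` and `seed` fields shown above. `submit_texts` and its `make_request` parameter are hypothetical names, and in real use `make_request` would be the `Request` class.

```python
def submit_texts(engine, make_request, texts, base_seed: int = 0):
    """Queue one request per text with a distinct seed, then run the batch to completion."""
    for offset, text in enumerate(texts):
        engine.add_request(make_request(target_text=text, seed=base_seed + offset))
    # run() is documented to return outputs in submission order.
    return list(engine.run())
```

A call such as `outputs = submit_texts(engine, Request, ["今天天氣真好。", "第二句話。"])` would then give one output per text, in the order submitted.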
### Streaming
`engine.stream()` is a generator that yields a chunk per request per step:
```python
for chunk in engine.stream():
    # chunk.request_id, chunk.latents, chunk.audio, chunk.finished
    play_or_write(chunk)
```
> Plain synthesis, reference-clip, and speaker-vector modes stream audio
> (`chunk.audio`); prompt-audio continuation streams `latents` only — use `run()`
> when you need its audio.
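
Since `engine.stream()` interleaves chunks from different requests, a consumer usually has to regroup them by `request_id`. The following sketch shows one way to do that with the chunk fields named above. `collect_stream` is a hypothetical helper, and treating `chunk.audio` as `None` when a mode streams latents only is an assumption.

```python
from collections import defaultdict


def collect_stream(chunks):
    """Group interleaved chunks by request_id; yield (request_id, [audio, ...]) once finished."""
    pending = defaultdict(list)
    for chunk in chunks:
        # Assumed: audio is None for modes that stream latents only.
        if chunk.audio is not None:
            pending[chunk.request_id].append(chunk.audio)
        if chunk.finished:
            yield chunk.request_id, pending.pop(chunk.request_id, [])
```

It would be used as `for request_id, parts in collect_stream(engine.stream()): ...`, with each `parts` list holding that request's audio chunks in arrival order.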
### Configuration
Common `EngineConfig` parameters:
| Parameter | Default | Description |
|---|---|---|
| `max_num_seqs` | `16` | Max concurrent requests batched together |
| `max_model_len` | `2048` | Max length per sequence (prompt + generated) |
| `inference_timesteps` | `9` | Sampling steps |
| `cfg_value` | `2.8` | Guidance strength |
| `enforce_eager` | `True` | Keep the path numerically identical to single-call `generate` |
| `compile` | `False` | Enable `torch.compile` (CUDA only; auto-skipped elsewhere) |
> See [`src/bluemagpie/serving/DESIGN.md`](src/bluemagpie/serving/DESIGN.md) for the
> engine's design, trade-offs, and known limitations.
### Why not just use vLLM?
People often expect "wrap it in vLLM and it gets fast", but for BlueMagpie that
does not work, for two reasons:
1. **The real compute bottleneck is the diffusion decoder, not the language
model.** Per generated audio unit the DiT (LocDiT / CFM diffusion decoder) is
called ~16–18 times (sampling steps × the unconditional/conditional CFG
pair), while the language models (Barbet, RALM) run once each. vLLM is a
*text language-model* inference framework — it does not touch the diffusion
decoder at all, so even moving the LMs onto vLLM leaves the dominant compute
running eagerly and barely moves end-to-end latency.
2. **vLLM does not support Barbet's hybrid architecture.** Barbet (the
text-semantic LM) is a Mamba2 + attention hybrid, and vLLM (as well as
nano-vllm and vllm-omni) has zero support for such a hybrid TSLM — you'd have
to implement a first-class hybrid model yourself (large effort, CUDA-only).
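
The call count in point 1 is just the product it describes. As a worked sketch, assuming the conditional and unconditional CFG passes count as two separate DiT calls (the function name is hypothetical):

```python
def dit_calls_per_unit(sampling_steps: int, cfg_passes: int = 2) -> int:
    """DiT forward passes per generated audio unit: sampling steps times CFG passes."""
    return sampling_steps * cfg_passes


for steps in (8, 9):
    print(steps, dit_calls_per_unit(steps))  # 8 -> 16, 9 -> 18
```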
So this engine **borrows vLLM's architectural techniques without depending on its
CUDA kernels**:
- **Continuous batching** of many requests (the main throughput win), sharing
batched compute across requests.
- A **padded KV cache + SDPA + masks** instead of vLLM's PagedAttention /
FlashAttention — trading peak speed and memory efficiency for cross-device,
zero-dependency portability.
- Barbet's Mamba state handled with a **pure-PyTorch single-step recurrence**, no
fused kernel required.
- Optional `compile=True` uses `torch.compile` (which captures CUDA graphs
internally) to accelerate the **DiT and LocEnc** — the actual hot path, and
exactly what wrapping in vLLM would *not* do for you.
> In short: we don't aim to beat vLLM on a single op; we use vLLM-class **batch
> scheduling** plus **DiT-bottleneck optimization** to raise overall throughput
> with no extra dependencies, across CUDA / MPS / CPU.
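
To illustrate the "padded cache plus masks" idea in isolation: when sequences of different lengths share one padded batch, a boolean mask marks which positions are real. The snippet below is a generic sketch in plain Python, not the engine's code, and `padding_mask` is a hypothetical name.

```python
def padding_mask(lengths, max_len=None):
    """Return a [batch][max_len] boolean mask: True for real positions, False for padding."""
    max_len = max(lengths) if max_len is None else max_len
    return [[pos < n for pos in range(max_len)] for n in lengths]


mask = padding_mask([3, 5])
print(mask[0])  # [True, True, True, False, False]
print(mask[1])  # [True, True, True, True, True]
```

In PyTorch, a mask like this would typically be converted to a tensor and passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` in a boolean mask means the position takes part in attention.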
## Apple Silicon MLX acceleration (optional)
On Apple Silicon (M-series), a native **MLX** path runs inference directly on the
Apple GPU (Metal, unified memory) — typically faster than PyTorch's MPS backend.
It is an optional extra; the core package stays torch-only:
```bash
pip install -e ".[mlx]"  # quoted so zsh (the macOS default shell) does not expand the brackets
```
```python
import soundfile as sf
from bluemagpie import BlueMagpieModel
from bluemagpie.mlx import BlueMagpieMLX, mlx_generate
# model_dir and tokenizer are set up as in "Load the model" above
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cpu")
mlx_model = BlueMagpieMLX(model) # converts the weights once
audio = mlx_generate(model, mlx_model, "今天天氣真好。", seed=0) # 48 kHz waveform
sf.write("output.wav", audio.numpy(), model.sample_rate)
```
- The whole inference path (Barbet, RALM, LocEnc, LocDiT/CFM, the **AudioVAE
decoder**, the AR loop) is re-implemented in MLX and numerically parity-checked,
module by module — generation can run torch-free (only tokenization and
reference-wav encoding stay in torch).
- Decode uses cached single-step kernels (it advances one position per step, not a
full re-run).
- `mlx_generate` supports the same four input modes as `generate`.
- On the real 7.75 GB model: end-to-end **RTF 0.77** (faster than real time) —
~**1.45×** over torch-MPS and ~**3.27×** over torch-CPU (fp32,
`scripts/bench_rtf.py`). See [`src/bluemagpie/mlx/DESIGN.md`](src/bluemagpie/mlx/DESIGN.md).
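
For readers unfamiliar with the metric, the real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below uses that common definition. Whether `scripts/bench_rtf.py` measures it in exactly this way is not stated here, and the numbers in the example are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 48000) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Illustrative: 7.7 s of compute for 10 s of 48 kHz audio.
print(round(real_time_factor(7.7, 480_000), 2))  # 0.77
```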
## Notes
- The examples load the tokenizer from `tokenizer.json` and pass it to
`from_local`, which is stable on transformers 5.x. (`from_local`'s automatic
tokenizer loading can fail on 5.x — see Troubleshooting.)
- A GPU is optional: set `device="cpu"` (slower, but short utterances take only
tens of seconds). Output is 48 kHz mono.
- The bundled `hung_yi_lee` speaker vector is authorized for example use. For any
other speaker or voice cloning, use only reference audio or speaker vectors you
are authorized to use.
- Keep speaker-vector tables and synthesized audio private; do not distribute
them without authorization.
## Troubleshooting
**Tokenizer loading on newer transformers (5.x).** The examples load the
tokenizer explicitly from `tokenizer.json`, so they work on transformers 5.x with
no extra steps (the model only uses the tokenizer's `encode`).
If you instead rely on `from_local`'s automatic tokenizer loading (passing no
`tokenizer`), transformers 5.x may fail while parsing `tokenizer_config.json`
with `TypeError: ..._patch_mistral_regex() got multiple values for keyword
argument 'fix_mistral_regex'`, or appear to load but raise `ValueError: No
tokenizer attached to BlueMagpieModel` when you call `generate()`. Use the
explicit loading shown above instead.
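
If a script has to run on both older and newer transformers releases, it can check the installed major version before deciding how to load the tokenizer. This is a sketch using only the standard library, and `transformers_major` is a hypothetical helper name.

```python
from importlib import metadata


def transformers_major():
    """Return the installed transformers major version, or None if it is not installed."""
    try:
        version = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])


major = transformers_major()
if major is not None and major >= 5:
    print("transformers 5.x or newer: pass an explicit tokenizer to from_local")
```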