BlueMagpie-TTS / USAGE.md
voidful's picture
Docs: document bundled multi-speaker centroid table (hung_yi_lee + female_voice)
6fbb52c verified
|
Raw
History Blame Contribute Delete
14.5 kB
# BlueMagpie-TTS — Usage
BlueMagpie-TTS is a text-to-speech (TTS) model that synthesizes natural speech
from text. It supports three scenarios:
- **Plain synthesis** — read the text aloud.
- **Voice cloning** — mimic the timbre of a reference clip.
- **Speaker selection** — control the timbre with a prepared speaker vector.
It also supports **streaming output** for synthesize-while-you-play applications.
🔊 **Try it online:** [BlueMagpie-TTS Demo (Hugging Face Space)](https://huggingface.co/spaces/voidful/BlueMagpie-TTS-Demo)
## Install
```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e .
```
The install pulls in the [`barbet`](https://github.com/OpenFormosa/Barbet)
package (the text-semantic language model) from GitHub. The acoustic modules are
vendored in `bluemagpie/_vendor/` (sourced from
[VoxCPM](https://github.com/OpenBMB/VoxCPM), Apache-2.0) and need no separate
install. To save synthesized audio, also install `soundfile`:
```bash
pip install soundfile
```
## Load the model
### From Hugging Face
```python
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer straight from tokenizer.json (works on transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```
### From a local directory
```python
import os
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel
model_dir = "checkpoints/bluemagpie"
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```
- `device` may be `"cuda"`, `"mps"`, or `"cpu"` (auto-selected if omitted).
- Always use `training=False` for inference.
## Basic synthesis: text to speech
`generate` returns a speech waveform (`torch.Tensor`); pair it with `soundfile`
to write a `.wav`. The output sample rate is `model.sample_rate` (48 kHz).
```python
import soundfile as sf
audio = model.generate(target_text="今天天氣真好。", cfg_value=2.0)
sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate)
```
## Voice cloning: mimic a reference speaker
Two ways.
**A. Speaker vector (`speaker_centroid`)** — extract a vector from the reference
audio, then synthesize (no transcript needed):
```bash
pip install -e ".[clone]" # extraction needs speechbrain (ECAPA-TDNN)
python scripts/extract_speaker_centroid.py --audio reference.wav --out my_voice.pt
# more clips of the same speaker -> cleaner centroid: --audio a.wav b.wav c.wav
```
```python
import torch
centroid = torch.load("my_voice.pt", weights_only=True) # [192] speaker vector
audio = model.generate(
target_text="今天天氣真好。",
speaker_centroid=centroid,
cfg_value=2.8,
)
# or extract it in-process:
from bluemagpie import extract_speaker_centroid
centroid = extract_speaker_centroid("reference.wav") # [192]
```
**B. Reference clip (`reference_wav_path`)** — pass a reference clip directly:
```python
audio = model.generate(
target_text="今天天氣真好。",
reference_wav_path="reference.wav",
cfg_value=2.8,
)
```
## Speaker selection: control timbre with a speaker vector
The model bundles a **multi-speaker table** at `checkpoints/speaker_centroids.pt`,
currently holding two speakers:
| speaker id | description | suggested `cfg_value` |
| --- | --- | --- |
| `hung_yi_lee` | Prof. Hung-yi Lee's speaker vector (used with his authorization; the official best params are tuned for this speaker) | 2.0–2.8 |
| `female_voice` | a generic female voice | 2.0–2.8 |
The table has the format `{"speaker_ids": [...], "centroids": tensor[N, 192], "dim": 192}`.
Load it with `torch.load`, **pick a speaker's `[192]` vector by id**, and pass it as
`speaker_centroid`:
```python
import os
import torch
table = torch.load(
os.path.join(model_dir, "checkpoints", "speaker_centroids.pt"),
map_location="cpu",
weights_only=True,
)
print(table["speaker_ids"]) # ['hung_yi_lee', 'female_voice']
# switch speaker by changing this line ("hung_yi_lee" or "female_voice")
speaker_id = "female_voice"
speaker_centroid = table["centroids"][table["speaker_ids"].index(speaker_id)] # [192]
audio = model.generate(
target_text="今天天氣真好。",
speaker_centroid=speaker_centroid, # or your own authorized speaker vector
cfg_value=2.0,
)
```
If you only have the model id (haven't `snapshot_download`-ed the whole model yet),
grab just the table:
```python
from huggingface_hub import hf_hub_download
path = hf_hub_download("OpenFormosa/BlueMagpie-TTS", "checkpoints/speaker_centroids.pt")
table = torch.load(path, map_location="cpu", weights_only=True)
```
> To add more speakers, extract your own (authorized) `[192]` vector with
> `extract_speaker_centroid` from the *Voice cloning* section above — it's passed the
> exact same way. The earlier single-speaker file
> `checkpoints/hung_yi_lee_speaker_centroids.pt` (same format) is still available.
## Streaming output
When you need to play while synthesizing, use `generate_streaming`. It is a
generator that yields audio chunks one at a time:
```python
chunks = []
for chunk in model.generate_streaming(target_text="今天天氣真好。"):
chunks.append(chunk)
# play or write each chunk in real time here
```
> Note: automatic retry (`retry_badcase`) is not supported in streaming mode.
## Four input modes
The model supports four input combinations through the same `generate` interface:
| Mode | Parameters | Use |
|---|---|---|
| Plain synthesis | `target_text` | Read the text aloud |
| Continuation | `target_text`, `prompt_text`, `prompt_wav_path` | Continue from an existing clip and its text |
| Reference clip | `target_text`, `reference_wav_path` | Mimic the reference speaker's timbre |
| Speaker vector | `target_text`, `speaker_centroid` | Clone a voice from a speaker vector |
## Common `generate` parameters
| Parameter | Default | Description |
|---|---|---|
| `target_text` | (required) | The text to synthesize |
| `prompt_text` | `""` | Prompt text, paired with `prompt_wav_path` for continuation |
| `prompt_wav_path` | `""` | Prompt audio path, for continuation |
| `reference_wav_path` | `""` | Reference audio path, for voice cloning |
| `speaker_centroid` | `None` | Speaker vector, to select a timbre |
| `cfg_value` | `2.0` | Guidance strength; higher follows the condition more closely but can sound less natural |
| `inference_timesteps` | `10` | Sampling steps; more usually means better quality and slower speed |
| `min_len` / `max_len` | `2` / `2000` | Lower / upper bound on output length |
| `retry_badcase` | `False` | Auto-retry on detected bad output (unsupported in streaming) |
## Batch serving engine (multi-request acceleration)
To serve many synthesis requests at once for higher throughput, use the built-in
batch engine `BlueMagpieEngine`. It does **continuous batching**: requests are
decoded together as a batch, new requests can join mid-decode, and they do not
interfere with one another.
Highlights:
- **No extra dependencies** — torch only; no vLLM, flash-attn, etc.
- **Cross-device** — one code path on CUDA, Apple Silicon (MPS), and CPU.
CUDA-only optimizations are auto-detected and enabled, and skipped elsewhere.
- **Numerically identical to single-call `generate`** at batch=1 (`model.generate`
is always the reference).
### Basic usage
```python
import soundfile as sf
from bluemagpie.serving import BlueMagpieEngine, EngineConfig, Request
# load `model` and `tokenizer` as shown above (from_local)
engine = BlueMagpieEngine(model, EngineConfig(max_num_seqs=16))
engine.add_request(Request(target_text="今天天氣真好。", seed=0))
engine.add_request(Request(target_text="第二句話。", reference_wav_path="speaker.wav"))
for out in engine.run(): # returned in request-id (submission) order
# out.audio: 48 kHz waveform (when an AudioVAE is attached); out.latents: [T, p, d]
sf.write(f"output_{out.request_id}.wav", out.audio.numpy(), out.sample_rate)
```
`Request` supports the same four input modes as `generate` (plain, continuation,
reference clip, speaker vector) via the fields `target_text`, `prompt_text`,
`prompt_wav_path`, `reference_wav_path`, `speaker_centroid`, `cfg_value`,
`inference_timesteps`, etc. Each request may set a `seed`, which makes its output
independent of how many neighbours share the batch and of admission order.
### Streaming
`engine.stream()` is a generator that yields a chunk per request per step:
```python
for chunk in engine.stream():
# chunk.request_id, chunk.latents, chunk.audio, chunk.finished
play_or_write(chunk)
```
> Plain synthesis, reference-clip, and speaker-vector modes stream audio
> (`chunk.audio`); prompt-audio continuation streams `latents` only — use `run()`
> when you need its audio.
### Configuration
Common `EngineConfig` parameters:
| Parameter | Default | Description |
|---|---|---|
| `max_num_seqs` | `16` | Max concurrent requests batched together |
| `max_model_len` | `2048` | Max length per sequence (prompt + generated) |
| `inference_timesteps` | `9` | Sampling steps |
| `cfg_value` | `2.8` | Guidance strength |
| `enforce_eager` | `True` | Keep the path numerically identical to single-call `generate` |
| `compile` | `False` | Enable `torch.compile` (CUDA only; auto-skipped elsewhere) |
> See [`src/bluemagpie/serving/DESIGN.md`](src/bluemagpie/serving/DESIGN.md) for the
> engine's design, trade-offs, and known limitations.
### Why not just use vLLM?
People often expect "wrap it in vLLM and it gets fast", but for BlueMagpie that
does not work, for two reasons:
1. **The real compute bottleneck is the diffusion decoder, not the language
model.** Per generated audio unit the DiT (LocDiT / CFM diffusion decoder) is
called ~16–18 times (sampling steps × the unconditional/conditional CFG
pair), while the language models (Barbet, RALM) run once each. vLLM is a
*text language-model* inference framework — it does not touch the diffusion
decoder at all, so even moving the LMs onto vLLM leaves the dominant compute
running eagerly and barely moves end-to-end latency.
2. **vLLM does not support Barbet's hybrid architecture.** Barbet (the
text-semantic LM) is a Mamba2 + attention hybrid, and vLLM (as well as
nano-vllm and vllm-omni) has zero support for such a hybrid TSLM — you'd have
to implement a first-class hybrid model yourself (large effort, CUDA-only).
So this engine **borrows vLLM's architectural techniques without depending on its
CUDA kernels**:
- **Continuous batching** of many requests (the main throughput win), sharing
batched compute across requests.
- A **padded KV cache + SDPA + masks** instead of vLLM's PagedAttention /
FlashAttention — trading peak speed and memory efficiency for cross-device,
zero-dependency portability.
- Barbet's Mamba state handled with a **pure-PyTorch single-step recurrence**, no
fused kernel required.
- Optional `compile=True` uses `torch.compile` (which captures CUDA graphs
internally) to accelerate the **DiT and LocEnc** — the actual hot path, and
exactly what wrapping in vLLM would *not* do for you.
> In short: we don't aim to beat vLLM on a single op; we use vLLM-class **batch
> scheduling** plus **DiT-bottleneck optimization** to raise overall throughput
> with no extra dependencies, across CUDA / MPS / CPU.
## Apple Silicon MLX acceleration (optional)
On Apple Silicon (M-series), a native **MLX** path runs inference directly on the
Apple GPU (Metal, unified memory) — typically faster than PyTorch's MPS backend.
It is an optional extra; the core package stays torch-only:
```bash
pip install -e .[mlx]
```
```python
import soundfile as sf
from bluemagpie import BlueMagpieModel
from bluemagpie.mlx import BlueMagpieMLX, mlx_generate
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, device="cpu")
mlx_model = BlueMagpieMLX(model) # converts the weights once
audio = mlx_generate(model, mlx_model, "今天天氣真好。", seed=0) # 48 kHz waveform
sf.write("output.wav", audio.numpy(), model.sample_rate)
```
- The whole inference path (Barbet, RALM, LocEnc, LocDiT/CFM, the **AudioVAE
decoder**, the AR loop) is re-implemented in MLX and numerically parity-checked,
module by module — generation can run torch-free (only tokenization and
reference-wav encoding stay in torch).
- Decode uses cached single-step kernels (it advances one position per step, not a
full re-run).
- `mlx_generate` supports the same four input modes as `generate`.
- On the real 7.75 GB model: end-to-end **RTF 0.77** (faster than real time) —
~**1.45×** over torch-MPS and ~**3.27×** over torch-CPU (fp32,
`scripts/bench_rtf.py`). See [`src/bluemagpie/mlx/DESIGN.md`](src/bluemagpie/mlx/DESIGN.md).
## Notes
- The examples load the tokenizer from `tokenizer.json` and pass it to
`from_local`, which is stable on transformers 5.x. (`from_local`'s automatic
tokenizer loading can fail on 5.x — see Troubleshooting.)
- A GPU is optional: set `device="cpu"` (slower, but short utterances take only
tens of seconds). Output is 48 kHz mono.
- The bundled `hung_yi_lee` speaker vector is authorized for example use. For any
other speaker or voice cloning, use only reference audio or speaker vectors you
are authorized to use.
- Keep speaker-vector tables and synthesized audio private; do not distribute
them without authorization.
## Troubleshooting
**Tokenizer loading on newer transformers (5.x).** The examples load the
tokenizer explicitly from `tokenizer.json`, so they work on transformers 5.x with
no extra steps (the model only uses the tokenizer's `encode`).
If you instead rely on `from_local`'s automatic tokenizer loading (passing no
`tokenizer`), transformers 5.x may fail while parsing `tokenizer_config.json`
with `TypeError: ..._patch_mistral_regex() got multiple values for keyword
argument 'fix_mistral_regex'`, or appear to load but raise `ValueError: No
tokenizer attached to BlueMagpieModel` when you call `generate()`. Use the
explicit loading shown above instead.