File size: 14,545 Bytes

# BlueMagpie-TTS — Usage

BlueMagpie-TTS is a text-to-speech (TTS) model that synthesizes natural speech
from text. It supports three scenarios:

- **Plain synthesis** — read the text aloud.
- **Voice cloning** — mimic the timbre of a reference clip.
- **Speaker selection** — control the timbre with a prepared speaker vector.

It also supports **streaming output** for synthesize-while-you-play applications.

🔊 **Try it online:** [BlueMagpie-TTS Demo (Hugging Face Space)](https://huggingface.co/spaces/voidful/BlueMagpie-TTS-Demo)

## Install

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e .
```

The install pulls in the [`barbet`](https://github.com/OpenFormosa/Barbet)
package (the text-semantic language model) from GitHub. The acoustic modules are
vendored in `bluemagpie/_vendor/` (sourced from
[VoxCPM](https://github.com/OpenBMB/VoxCPM), Apache-2.0) and need no separate
install. To save synthesized audio, also install `soundfile`:

```bash
pip install soundfile
```

## Load the model

### From Hugging Face

```python
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer straight from tokenizer.json (works on transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

### From a local directory

```python
import os
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = "checkpoints/bluemagpie"
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

- `device` may be `"cuda"`, `"mps"`, or `"cpu"` (auto-selected if omitted).
- Always use `training=False` for inference.

## Basic synthesis: text to speech

`generate` returns a speech waveform (`torch.Tensor`); pair it with `soundfile`
to write a `.wav`. The output sample rate is `model.sample_rate` (48 kHz).

```python
import soundfile as sf

audio = model.generate(target_text="今天天氣真好。", cfg_value=2.0)
sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate)
```

## Voice cloning: mimic a reference speaker

Two ways.

**A. Speaker vector (`speaker_centroid`)** — extract a vector from the reference
audio, then synthesize (no transcript needed):

```bash
pip install -e ".[clone]"   # extraction needs speechbrain (ECAPA-TDNN)
python scripts/extract_speaker_centroid.py --audio reference.wav --out my_voice.pt
# more clips of the same speaker -> cleaner centroid: --audio a.wav b.wav c.wav
```

```python
import torch

centroid = torch.load("my_voice.pt", weights_only=True)   # [192] speaker vector
audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=centroid,
    cfg_value=2.8,
)

# or extract it in-process:
from bluemagpie import extract_speaker_centroid
centroid = extract_speaker_centroid("reference.wav")      # [192]
```

**B. Reference clip (`reference_wav_path`)** — pass a reference clip directly:

```python
audio = model.generate(
    target_text="今天天氣真好。",
    reference_wav_path="reference.wav",
    cfg_value=2.8,
)
```

## Speaker selection: control timbre with a speaker vector

The model bundles a **multi-speaker table** at `checkpoints/speaker_centroids.pt`,
currently holding two speakers:

| speaker id | description | suggested `cfg_value` |
| --- | --- | --- |
| `hung_yi_lee` | Prof. Hung-yi Lee's speaker vector (used with his authorization; the official best params are tuned for this speaker) | 2.0–2.8 |
| `female_voice` | a generic female voice | 2.0–2.8 |

The table has the format `{"speaker_ids": [...], "centroids": tensor[N, 192], "dim": 192}`.
Load it with `torch.load`, **pick a speaker's `[192]` vector by id**, and pass it as
`speaker_centroid`:

```python
import os
import torch

table = torch.load(
    os.path.join(model_dir, "checkpoints", "speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
print(table["speaker_ids"])          # ['hung_yi_lee', 'female_voice']

# switch speaker by changing this line ("hung_yi_lee" or "female_voice")
speaker_id = "female_voice"
speaker_centroid = table["centroids"][table["speaker_ids"].index(speaker_id)]   # [192]

audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=speaker_centroid,   # or your own authorized speaker vector
    cfg_value=2.0,
)
```

If you only have the model id (haven't `snapshot_download`-ed the whole model yet),
grab just the table:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download("OpenFormosa/BlueMagpie-TTS", "checkpoints/speaker_centroids.pt")
table = torch.load(path, map_location="cpu", weights_only=True)
```

> To add more speakers, extract your own (authorized) `[192]` vector with
> `extract_speaker_centroid` from the *Voice cloning* section above — it's passed the
> exact same way. The earlier single-speaker file
> `checkpoints/hung_yi_lee_speaker_centroids.pt` (same format) is still available.

## Streaming output

When you need to play while synthesizing, use `generate_streaming`. It is a
generator that yields audio chunks one at a time:

```python
chunks = []
for chunk in model.generate_streaming(target_text="今天天氣真好。"):
    chunks.append(chunk)
    # play or write each chunk in real time here
```

> Note: automatic retry (`retry_badcase`) is not supported in streaming mode.

## Four input modes

The model supports four input combinations through the same `generate` interface:

| Mode | Parameters | Use |
|---|---|---|
| Plain synthesis | `target_text` | Read the text aloud |
| Continuation | `target_text`, `prompt_text`, `prompt_wav_path` | Continue from an existing clip and its text |
| Reference clip | `target_text`, `reference_wav_path` | Mimic the reference speaker's timbre |
| Speaker vector | `target_text`, `speaker_centroid` | Clone a voice from a speaker vector |

## Common `generate` parameters

| Parameter | Default | Description |
|---|---|---|
| `target_text` | (required) | The text to synthesize |
| `prompt_text` | `""` | Prompt text, paired with `prompt_wav_path` for continuation |
| `prompt_wav_path` | `""` | Prompt audio path, for continuation |
| `reference_wav_path` | `""` | Reference audio path, for voice cloning |
| `speaker_centroid` | `None` | Speaker vector, to select a timbre |
| `cfg_value` | `2.0` | Guidance strength; higher follows the condition more closely but can sound less natural |
| `inference_timesteps` | `10` | Sampling steps; more usually means better quality and slower speed |
| `min_len` / `max_len` | `2` / `2000` | Lower / upper bound on output length |
| `retry_badcase` | `False` | Auto-retry on detected bad output (unsupported in streaming) |

## Batch serving engine (multi-request acceleration)

To serve many synthesis requests at once for higher throughput, use the built-in
batch engine `BlueMagpieEngine`. It does **continuous batching**: requests are
decoded together as a batch, new requests can join mid-decode, and they do not
interfere with one another.

Highlights:

- **No extra dependencies** — torch only; no vLLM, flash-attn, etc.
- **Cross-device** — one code path on CUDA, Apple Silicon (MPS), and CPU.
  CUDA-only optimizations are auto-detected and enabled, and skipped elsewhere.
- **Numerically identical to single-call `generate`** at batch=1 (`model.generate`
  is always the reference).

### Basic usage

```python
import soundfile as sf
from bluemagpie.serving import BlueMagpieEngine, EngineConfig, Request

# load `model` and `tokenizer` as shown above (from_local)
engine = BlueMagpieEngine(model, EngineConfig(max_num_seqs=16))

engine.add_request(Request(target_text="今天天氣真好。", seed=0))
engine.add_request(Request(target_text="第二句話。", reference_wav_path="speaker.wav"))

for out in engine.run():            # returned in request-id (submission) order
    # out.audio: 48 kHz waveform (when an AudioVAE is attached); out.latents: [T, p, d]
    sf.write(f"output_{out.request_id}.wav", out.audio.numpy(), out.sample_rate)
```

`Request` supports the same four input modes as `generate` (plain, continuation,
reference clip, speaker vector) via the fields `target_text`, `prompt_text`,
`prompt_wav_path`, `reference_wav_path`, `speaker_centroid`, `cfg_value`,
`inference_timesteps`, etc. Each request may set a `seed`, which makes its output
independent of how many neighbours share the batch and of admission order.

### Streaming

`engine.stream()` is a generator that yields a chunk per request per step:

```python
for chunk in engine.stream():
    # chunk.request_id, chunk.latents, chunk.audio, chunk.finished
    play_or_write(chunk)
```

> Plain synthesis, reference-clip, and speaker-vector modes stream audio
> (`chunk.audio`); prompt-audio continuation streams `latents` only — use `run()`
> when you need its audio.

### Configuration

Common `EngineConfig` parameters:

| Parameter | Default | Description |
|---|---|---|
| `max_num_seqs` | `16` | Max concurrent requests batched together |
| `max_model_len` | `2048` | Max length per sequence (prompt + generated) |
| `inference_timesteps` | `9` | Sampling steps |
| `cfg_value` | `2.8` | Guidance strength |
| `enforce_eager` | `True` | Keep the path numerically identical to single-call `generate` |
| `compile` | `False` | Enable `torch.compile` (CUDA only; auto-skipped elsewhere) |

> See [`src/bluemagpie/serving/DESIGN.md`](src/bluemagpie/serving/DESIGN.md) for the
> engine's design, trade-offs, and known limitations.

### Why not just use vLLM?

People often expect "wrap it in vLLM and it gets fast", but for BlueMagpie that
does not work, for two reasons:

1. **The real compute bottleneck is the diffusion decoder, not the language
   model.** Per generated audio unit the DiT (LocDiT / CFM diffusion decoder) is
   called ~16–18 times (sampling steps × the unconditional/conditional CFG
   pair), while the language models (Barbet, RALM) run once each. vLLM is a
   *text language-model* inference framework — it does not touch the diffusion
   decoder at all, so even moving the LMs onto vLLM leaves the dominant compute
   running eagerly and barely moves end-to-end latency.
2. **vLLM does not support Barbet's hybrid architecture.** Barbet (the
   text-semantic LM) is a Mamba2 + attention hybrid, and vLLM (as well as
   nano-vllm and vllm-omni) has zero support for such a hybrid TSLM — you'd have
   to implement a first-class hybrid model yourself (large effort, CUDA-only).

So this engine **borrows vLLM's architectural techniques without depending on its
CUDA kernels**:

- **Continuous batching** of many requests (the main throughput win), sharing
  batched compute across requests.
- A **padded KV cache + SDPA + masks** instead of vLLM's PagedAttention /
  FlashAttention — trading peak speed and memory efficiency for cross-device,
  zero-dependency portability.
- Barbet's Mamba state handled with a **pure-PyTorch single-step recurrence**, no
  fused kernel required.
- Optional `compile=True` uses `torch.compile` (which captures CUDA graphs
  internally) to accelerate the **DiT and LocEnc** — the actual hot path, and
  exactly what wrapping in vLLM would *not* do for you.

> In short: we don't aim to beat vLLM on a single op; we use vLLM-class **batch
> scheduling** plus **DiT-bottleneck optimization** to raise overall throughput
> with no extra dependencies, across CUDA / MPS / CPU.

## Apple Silicon MLX acceleration (optional)

On Apple Silicon (M-series), a native **MLX** path runs inference directly on the
Apple GPU (Metal, unified memory) — typically faster than PyTorch's MPS backend.
It is an optional extra; the core package stays torch-only:

```bash
pip install -e .[mlx]
```

```python
import soundfile as sf
from bluemagpie import BlueMagpieModel
from bluemagpie.mlx import BlueMagpieMLX, mlx_generate

model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, device="cpu")
mlx_model = BlueMagpieMLX(model)          # converts the weights once

audio = mlx_generate(model, mlx_model, "今天天氣真好。", seed=0)   # 48 kHz waveform
sf.write("output.wav", audio.numpy(), model.sample_rate)
```

- The whole inference path (Barbet, RALM, LocEnc, LocDiT/CFM, the **AudioVAE
  decoder**, the AR loop) is re-implemented in MLX and numerically parity-checked,
  module by module — generation can run torch-free (only tokenization and
  reference-wav encoding stay in torch).
- Decode uses cached single-step kernels (it advances one position per step, not a
  full re-run).
- `mlx_generate` supports the same four input modes as `generate`.
- On the real 7.75 GB model: end-to-end **RTF 0.77** (faster than real time) —
  ~**1.45×** over torch-MPS and ~**3.27×** over torch-CPU (fp32,
  `scripts/bench_rtf.py`). See [`src/bluemagpie/mlx/DESIGN.md`](src/bluemagpie/mlx/DESIGN.md).

## Notes

- The examples load the tokenizer from `tokenizer.json` and pass it to
  `from_local`, which is stable on transformers 5.x. (`from_local`'s automatic
  tokenizer loading can fail on 5.x — see Troubleshooting.)
- A GPU is optional: set `device="cpu"` (slower, but short utterances take only
  tens of seconds). Output is 48 kHz mono.
- The bundled `hung_yi_lee` speaker vector is authorized for example use. For any
  other speaker or voice cloning, use only reference audio or speaker vectors you
  are authorized to use.
- Keep speaker-vector tables and synthesized audio private; do not distribute
  them without authorization.

## Troubleshooting

**Tokenizer loading on newer transformers (5.x).** The examples load the
tokenizer explicitly from `tokenizer.json`, so they work on transformers 5.x with
no extra steps (the model only uses the tokenizer's `encode`).

If you instead rely on `from_local`'s automatic tokenizer loading (passing no
`tokenizer`), transformers 5.x may fail while parsing `tokenizer_config.json`
with `TypeError: ..._patch_mistral_regex() got multiple values for keyword
argument 'fix_mistral_regex'`, or appear to load but raise `ValueError: No
tokenizer attached to BlueMagpieModel` when you call `generate()`. Use the
explicit loading shown above instead.