| # BlueMagpie-TTS — Usage |
|
|
| BlueMagpie-TTS is a text-to-speech (TTS) model that synthesizes natural speech |
| from text. It supports three scenarios: |
|
|
| - **Plain synthesis** — read the text aloud. |
| - **Voice cloning** — mimic the timbre of a reference clip. |
| - **Speaker selection** — control the timbre with a prepared speaker vector. |
|
|
| It also supports **streaming output** for synthesize-while-you-play applications. |
|
|
| 🔊 **Try it online:** [BlueMagpie-TTS Demo (Hugging Face Space)](https://huggingface.co/spaces/voidful/BlueMagpie-TTS-Demo) |
|
|
| ## Install |
|
|
| ```bash |
| git clone https://github.com/OpenFormosa/BlueMagpie-TTS |
| cd BlueMagpie-TTS |
| pip install -e . |
| ``` |
|
|
| The install pulls in the [`barbet`](https://github.com/OpenFormosa/Barbet) |
| package (the text-semantic language model) from GitHub. The acoustic modules are |
| vendored in `bluemagpie/_vendor/` (sourced from |
| [VoxCPM](https://github.com/OpenBMB/VoxCPM), Apache-2.0) and need no separate |
| install. To save synthesized audio, also install `soundfile`: |
|
|
| ```bash |
| pip install soundfile |
| ``` |
|
|
| ## Load the model |
|
|
| ### From Hugging Face |
|
|
| ```python |
| import os |
| from huggingface_hub import snapshot_download |
| from transformers import PreTrainedTokenizerFast |
| from bluemagpie import BlueMagpieModel |
| |
| model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True) |
| # Load the tokenizer straight from tokenizer.json (works on transformers 5.x). |
| tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json")) |
| model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda") |
| ``` |
|
|
| ### From a local directory |
|
|
| ```python |
| import os |
| from transformers import PreTrainedTokenizerFast |
| from bluemagpie import BlueMagpieModel |
| |
| model_dir = "checkpoints/bluemagpie" |
| tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json")) |
| model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda") |
| ``` |
|
|
| - `device` may be `"cuda"`, `"mps"`, or `"cpu"` (auto-selected if omitted). |
| - Always use `training=False` for inference. |
|
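If you would rather choose the device explicitly (for logging, or to fail early on a machine without an accelerator), a helper along these lines can be used. This is a sketch: `pick_device` is a hypothetical name and not part of `bluemagpie`, and the library's own auto-selection may order its checks differently.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring an accelerator when present."""
    try:
        import torch
    except ImportError:
        # Without torch there is no accelerator backend to query.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is only present on builds with Apple's Metal backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed as `device=` to `from_local`.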
|
| ## Basic synthesis: text to speech |
|
|
| `generate` returns a speech waveform (`torch.Tensor`); pair it with `soundfile` |
| to write a `.wav`. The output sample rate is `model.sample_rate` (48 kHz). |
|
|
| ```python |
| import soundfile as sf |
| |
| audio = model.generate(target_text="今天天氣真好。", cfg_value=2.0) |
| sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate) |
| ``` |
|
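If `soundfile` is not available, the standard-library `wave` module can write the same data as 16-bit PCM. The sketch below assumes the waveform holds floats in the range [-1.0, 1.0] (the range `soundfile` expects for float input); `write_wav_int16` is a hypothetical helper, not part of `bluemagpie`.

```python
import array
import sys
import wave


def write_wav_int16(path, samples, sample_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    pcm = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the assumed range
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm)

# With a model output:
# write_wav_int16("output.wav", audio.squeeze().cpu().tolist(), model.sample_rate)
```

Looping over samples in Python is slow for long clips, so treat this as a fallback rather than a replacement for `soundfile`.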
|
| ## Voice cloning: mimic a reference speaker |
|
|
| There are two ways to clone a voice. |
|
|
| **A. Speaker vector (`speaker_centroid`)** — extract a vector from the reference |
| audio, then synthesize (no transcript needed): |
| |
| ```bash |
| pip install -e ".[clone]" # extraction needs speechbrain (ECAPA-TDNN) |
| python scripts/extract_speaker_centroid.py --audio reference.wav --out my_voice.pt |
| # more clips of the same speaker -> cleaner centroid: --audio a.wav b.wav c.wav |
| ``` |
| |
| ```python |
| import torch |
| |
| centroid = torch.load("my_voice.pt", weights_only=True) # [192] speaker vector |
| audio = model.generate( |
|     target_text="今天天氣真好。", |
|     speaker_centroid=centroid, |
|     cfg_value=2.8, |
| ) |
| |
| # alternatively, extract the vector in-process instead of loading my_voice.pt: |
| from bluemagpie import extract_speaker_centroid |
| centroid = extract_speaker_centroid("reference.wav") # [192] |
| ``` |
| |
| **B. Reference clip (`reference_wav_path`)** — pass a reference clip directly: |
| |
| ```python |
| audio = model.generate( |
|     target_text="今天天氣真好。", |
|     reference_wav_path="reference.wav", |
|     cfg_value=2.8, |
| ) |
| ``` |
| |
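Method A notes that several clips of the same speaker give a cleaner centroid. How `extract_speaker_centroid.py` combines clips is not documented here; one common approach, sketched below as an assumption, is to L2-normalize each per-clip embedding, average them, and normalize the result. The helper names are hypothetical, and toy 3-dimensional vectors stand in for the real `[192]` embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def average_centroid(embeddings):
    """Average per-clip embeddings into one unit-length speaker vector."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    unit = [l2_normalize(e) for e in embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalize(mean)


# Two toy embeddings; the centroid lies halfway between their directions.
clips = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(average_centroid(clips))
```

Averaging reduces the influence of noise or unusual prosody in any single clip, which is why more reference audio tends to help.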
| ## Speaker selection: control timbre with a speaker vector |
| |
| The model bundles a **multi-speaker table** at `checkpoints/speaker_centroids.pt`, |
| currently holding two speakers: |
| |
| | speaker id | description | suggested `cfg_value` | |
| | --- | --- | --- | |
| | `hung_yi_lee` | Prof. Hung-yi Lee's speaker vector (used with his authorization; the official best params are tuned for this speaker) | 2.0–2.8 | |
| | `female_voice` | a generic female voice | 2.0–2.8 | |
| |
| The table has the format `{"speaker_ids": [...], "centroids": tensor[N, 192], "dim": 192}`. |
| Load it with `torch.load`, **pick a speaker's `[192]` vector by id**, and pass it as |
| `speaker_centroid`: |
| |
| ```python |
| import os |
| import torch |
| |
| table = torch.load( |
|     os.path.join(model_dir, "checkpoints", "speaker_centroids.pt"), |
|     map_location="cpu", |
|     weights_only=True, |
| ) |
| print(table["speaker_ids"]) # ['hung_yi_lee', 'female_voice'] |
| |
| # switch speaker by changing this line ("hung_yi_lee" or "female_voice") |
| speaker_id = "female_voice" |
| speaker_centroid = table["centroids"][table["speaker_ids"].index(speaker_id)] # [192] |
| |
| audio = model.generate( |
|     target_text="今天天氣真好。", |
|     speaker_centroid=speaker_centroid,  # or your own authorized speaker vector |
|     cfg_value=2.0, |
| ) |
| ``` |
| |
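The lookup above raises a bare `ValueError` from `list.index` when the id is misspelled. A small wrapper can report the available ids instead. This is a sketch: `select_speaker` is a hypothetical helper, and the demo table uses 2-dimensional rows in place of the real `[192]` vectors.

```python
def select_speaker(table, speaker_id):
    """Return the centroid row for `speaker_id`, or raise a readable error."""
    ids = list(table["speaker_ids"])
    if speaker_id not in ids:
        raise KeyError(f"unknown speaker {speaker_id!r}; available: {ids}")
    return table["centroids"][ids.index(speaker_id)]


# Stand-in table with the same keys as speaker_centroids.pt.
demo_table = {
    "speaker_ids": ["hung_yi_lee", "female_voice"],
    "centroids": [[0.1, 0.2], [0.3, 0.4]],
    "dim": 2,
}
print(select_speaker(demo_table, "female_voice"))  # [0.3, 0.4]
```

The same function works on the loaded table, because indexing a `[N, 192]` tensor by row returns that speaker's `[192]` vector.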
| If you only have the model id and have not yet downloaded the full snapshot with |
| `snapshot_download`, fetch just the table: |
| |
| ```python |
| import torch |
| from huggingface_hub import hf_hub_download |
| |
| path = hf_hub_download("OpenFormosa/BlueMagpie-TTS", "checkpoints/speaker_centroids.pt") |
| table = torch.load(path, map_location="cpu", weights_only=True) |
| ``` |
| |
| > To add more speakers, extract your own (authorized) `[192]` vector with |
| > `extract_speaker_centroid` from the *Voice cloning* section above — it's passed the |
| > exact same way. The earlier single-speaker file |
| > `checkpoints/hung_yi_lee_speaker_centroids.pt` (same format) is still available. |
| |
| ## Streaming output |
| |
| To play audio while it is still being synthesized, use `generate_streaming`. It is a |
| generator that yields audio chunks one at a time: |
| |
| ```python |
| chunks = [] |
| for chunk in model.generate_streaming(target_text="今天天氣真好。"): |
|     chunks.append(chunk) |
|     # play or write each chunk in real time here |
| ``` |
| |
| > Note: automatic retry (`retry_badcase`) is not supported in streaming mode. |
| |
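To keep playback from stalling synthesis, one option is to run the generator in a background thread and hand chunks to the player through a bounded queue. The sketch below uses only the standard library; `pump`, `play_stream`, and `fake_stream` are hypothetical names, and `fake_stream` stands in for `model.generate_streaming(...)`.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def pump(chunk_iter, out_queue):
    """Producer: push each chunk to the queue as soon as it is generated."""
    try:
        for chunk in chunk_iter:
            out_queue.put(chunk)
    finally:
        out_queue.put(_DONE)  # always unblock the consumer


def play_stream(chunk_iter, play, max_buffered=8):
    """Consume chunks from a background producer and hand each to `play`."""
    q = queue.Queue(maxsize=max_buffered)
    worker = threading.Thread(target=pump, args=(chunk_iter, q), daemon=True)
    worker.start()
    count = 0
    while True:
        item = q.get()
        if item is _DONE:
            break
        play(item)
        count += 1
    worker.join()
    return count


def fake_stream():
    # Stand-in for model.generate_streaming(target_text=...)
    for i in range(3):
        yield [float(i)] * 4


played = []
print(play_stream(fake_stream(), played.append))  # 3
```

The bounded queue applies back-pressure: if playback falls behind by `max_buffered` chunks, the producer waits instead of buffering without limit.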
| ## Four input modes |
| |
| The model supports four input combinations through the same `generate` interface: |
| |
| | Mode | Parameters | Use | |
| |---|---|---| |
| | Plain synthesis | `target_text` | Read the text aloud | |
| | Continuation | `target_text`, `prompt_text`, `prompt_wav_path` | Continue from an existing clip and its text | |
| | Reference clip | `target_text`, `reference_wav_path` | Mimic the reference speaker's timbre | |
| | Speaker vector | `target_text`, `speaker_centroid` | Clone a voice from a speaker vector | |
| |
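The table can be turned into a small pre-flight check that names the mode a set of `generate` arguments implies. This is a sketch, not library behavior: `infer_mode` is a hypothetical helper, and because combinations outside the table are not documented here, it treats them as errors. Relax that rule if your version accepts mixed inputs.

```python
def infer_mode(**kwargs):
    """Name the input mode implied by a set of `generate` keyword arguments."""
    if not kwargs.get("target_text"):
        raise ValueError("target_text is required")
    prompt_text = kwargs.get("prompt_text") or ""
    prompt_wav = kwargs.get("prompt_wav_path") or ""
    # Continuation needs the clip and its transcript together.
    if bool(prompt_text) != bool(prompt_wav):
        raise ValueError("continuation needs both prompt_text and prompt_wav_path")
    modes = []
    if prompt_text:
        modes.append("continuation")
    if kwargs.get("reference_wav_path"):
        modes.append("reference clip")
    if kwargs.get("speaker_centroid") is not None:
        modes.append("speaker vector")
    if len(modes) > 1:
        raise ValueError(f"more than one mode requested: {modes}")
    return modes[0] if modes else "plain synthesis"


print(infer_mode(target_text="hello"))  # plain synthesis
print(infer_mode(target_text="hello", reference_wav_path="reference.wav"))  # reference clip
```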
| ## Common `generate` parameters |
| |
| | Parameter | Default | Description | |
| |---|---|---| |
| | `target_text` | (required) | The text to synthesize | |
| | `prompt_text` | `""` | Prompt text, paired with `prompt_wav_path` for continuation | |
| | `prompt_wav_path` | `""` | Prompt audio path, for continuation | |
| | `reference_wav_path` | `""` | Reference audio path, for voice cloning | |
| | `speaker_centroid` | `None` | Speaker vector, to select a timbre | |
| | `cfg_value` | `2.0` | Guidance strength; higher follows the condition more closely but can sound less natural | |
| | `inference_timesteps` | `10` | Sampling steps; more usually means better quality and slower speed | |
| | `min_len` / `max_len` | `2` / `2000` | Lower / upper bound on output length | |
| | `retry_badcase` | `False` | Auto-retry on detected bad output (unsupported in streaming) | |
| |
| ## Batch serving engine (multi-request acceleration) |
| |
| To serve many synthesis requests at once for higher throughput, use the built-in |
| batch engine `BlueMagpieEngine`. It does **continuous batching**: requests are |
| decoded together as a batch, new requests can join mid-decode, and they do not |
| interfere with one another. |
| |
| Highlights: |
| |
| - **No extra dependencies** — torch only; no vLLM, flash-attn, etc. |
| - **Cross-device** — one code path on CUDA, Apple Silicon (MPS), and CPU. |
| CUDA-only optimizations are auto-detected and enabled, and skipped elsewhere. |
| - **Numerically identical to single-call `generate`** at batch=1 (`model.generate` |
| is always the reference). |
|
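To see why continuous batching raises throughput, the toy simulation below counts decode steps for a queue of requests of different lengths. It is not the engine's scheduler: the function names are hypothetical, each request is reduced to "number of decode steps remaining", and every batched step is assumed to cost the same.

```python
from collections import deque


def continuous_batching_steps(lengths, max_num_seqs):
    """Count decode steps when finished requests free their slot immediately."""
    waiting = deque(lengths)  # remaining steps per queued request
    active = []               # remaining steps per running request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(active) < max_num_seqs:
            active.append(waiting.popleft())
        # One batched decode step advances every active request by one unit.
        active = [n - 1 for n in active]
        steps += 1
        # Retire finished requests so their slots can be reused.
        active = [n for n in active if n > 0]
    return steps


def static_batching_steps(lengths, max_num_seqs):
    """Count steps when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), max_num_seqs):
        steps += max(lengths[i:i + max_num_seqs])
    return steps


lengths = [8, 2, 2, 2]
print(static_batching_steps(lengths, max_num_seqs=2))      # 10
print(continuous_batching_steps(lengths, max_num_seqs=2))  # 8
```

With static batching the short requests wait behind the long one; with continuous batching they reuse the slot that each finished request vacates.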
|
| ### Basic usage |
|
|
| ```python |
| import soundfile as sf |
| from bluemagpie.serving import BlueMagpieEngine, EngineConfig, Request |
| |
| # load `model` and `tokenizer` as shown above (from_local) |
| engine = BlueMagpieEngine(model, EngineConfig(max_num_seqs=16)) |
| |
| engine.add_request(Request(target_text="今天天氣真好。", seed=0)) |
| engine.add_request(Request(target_text="第二句話。", reference_wav_path="speaker.wav")) |
| |
| for out in engine.run(): # returned in request-id (submission) order |
|     # out.audio: 48 kHz waveform (when an AudioVAE is attached); out.latents: [T, p, d] |
|     sf.write(f"output_{out.request_id}.wav", out.audio.numpy(), out.sample_rate) |
| ``` |
|
|
| `Request` supports the same four input modes as `generate` (plain, continuation, |
| reference clip, speaker vector) via the fields `target_text`, `prompt_text`, |
| `prompt_wav_path`, `reference_wav_path`, `speaker_centroid`, `cfg_value`, |
| `inference_timesteps`, etc. Each request may set a `seed`, which makes its output |
| independent of how many neighbours share the batch and of admission order. |
|
|
| ### Streaming |
|
|
| `engine.stream()` is a generator that yields a chunk per request per step: |
|
|
| ```python |
| for chunk in engine.stream(): |
|     # chunk.request_id, chunk.latents, chunk.audio, chunk.finished |
|     play_or_write(chunk)  # placeholder for your own playback or file-writing function |
| ``` |
|
|
| > Plain synthesis, reference-clip, and speaker-vector modes stream audio |
| > (`chunk.audio`); prompt-audio continuation streams `latents` only — use `run()` |
| > when you need its audio. |
|
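Because `engine.stream()` interleaves chunks from different requests, a consumer usually needs to regroup them by `request_id`. The sketch below shows one way to do so. `FakeChunk` is a stand-in that carries only the field names listed above, not the engine's real class, and the assumption that `chunk.audio` is `None` when a request streams latents only is ours.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeChunk:
    """Stand-in with the documented field names; not the engine's real class."""
    request_id: int
    audio: Optional[Any]
    finished: bool


def collect_by_request(chunks):
    """Group streamed audio per request; yield each request once it finishes."""
    pending = defaultdict(list)
    for chunk in chunks:
        if chunk.audio is not None:  # assumed None for latents-only streams
            pending[chunk.request_id].append(chunk.audio)
        if chunk.finished:
            yield chunk.request_id, pending.pop(chunk.request_id, [])


# Two requests whose chunks arrive interleaved.
stream = [
    FakeChunk(0, "a0", False),
    FakeChunk(1, "b0", False),
    FakeChunk(0, "a1", True),
    FakeChunk(1, "b1", True),
]
for request_id, pieces in collect_by_request(stream):
    print(request_id, pieces)
```

In real use, iterate over `engine.stream()` instead of the `stream` list and concatenate each request's pieces before writing them out.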
|
| ### Configuration |
|
|
| Common `EngineConfig` parameters: |
|
|
| | Parameter | Default | Description | |
| |---|---|---| |
| | `max_num_seqs` | `16` | Max concurrent requests batched together | |
| | `max_model_len` | `2048` | Max length per sequence (prompt + generated) | |
| | `inference_timesteps` | `9` | Sampling steps | |
| | `cfg_value` | `2.8` | Guidance strength | |
| | `enforce_eager` | `True` | Keep the path numerically identical to single-call `generate` | |
| | `compile` | `False` | Enable `torch.compile` (CUDA only; auto-skipped elsewhere) | |
|
|
| > See [`src/bluemagpie/serving/DESIGN.md`](src/bluemagpie/serving/DESIGN.md) for the |
| > engine's design, trade-offs, and known limitations. |
|
|
| ### Why not just use vLLM? |
|
|
| People often expect "wrap it in vLLM and it gets fast", but for BlueMagpie that |
| does not work, for two reasons: |
|
|
| 1. **The real compute bottleneck is the diffusion decoder, not the language |
| model.** Per generated audio unit the DiT (LocDiT / CFM diffusion decoder) is |
| called ~16–18 times (sampling steps × the unconditional/conditional CFG |
| pair), while the language models (Barbet, RALM) run once each. vLLM is a |
| *text language-model* inference framework — it does not touch the diffusion |
| decoder at all, so even moving the LMs onto vLLM leaves the dominant compute |
| running eagerly and barely moves end-to-end latency. |
| 2. **vLLM does not support Barbet's hybrid architecture.** Barbet (the |
| text-semantic LM) is a Mamba2 + attention hybrid, and vLLM (as well as |
| nano-vllm and vllm-omni) has zero support for such a hybrid TSLM — you'd have |
| to implement a first-class hybrid model yourself (large effort, CUDA-only). |
|
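The call count in point 1 is simple arithmetic and can be checked directly. The sketch assumes, as the text describes, one DiT forward pass per sampling step for each of the two CFG branches; `dit_calls_per_unit` is a hypothetical helper.

```python
def dit_calls_per_unit(inference_timesteps, cfg_branches=2):
    """DiT forward passes needed to produce one audio unit."""
    return inference_timesteps * cfg_branches


for steps in (8, 9):
    print(steps, dit_calls_per_unit(steps))  # 8 -> 16, 9 -> 18
```

By the same formula, 10 sampling steps would give 20 calls, against a single pass each for Barbet and RALM.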
|
| So this engine **borrows vLLM's architectural techniques without depending on its |
| CUDA kernels**: |
|
|
| - **Continuous batching** of many requests (the main throughput win), sharing |
| batched compute across requests. |
| - A **padded KV cache + SDPA + masks** instead of vLLM's PagedAttention / |
| FlashAttention — trading peak speed and memory efficiency for cross-device, |
| zero-dependency portability. |
| - Barbet's Mamba state handled with a **pure-PyTorch single-step recurrence**, no |
| fused kernel required. |
| - Optional `compile=True` uses `torch.compile` (which captures CUDA graphs |
| internally) to accelerate the **DiT and LocEnc** — the actual hot path, and |
| exactly what wrapping in vLLM would *not* do for you. |
|
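The "padded KV cache + masks" idea can be illustrated with a padding mask: sequences of different lengths share one rectangular buffer, and a boolean mask marks which positions hold real data. The sketch below is a pure-Python toy, not the engine's mask layout, and `padding_mask` is a hypothetical helper.

```python
def padding_mask(lengths, max_len=None):
    """Return a [batch][max_len] mask: True where a position holds real data."""
    if max_len is None:
        max_len = max(lengths)
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")
    return [[pos < n for pos in range(max_len)] for n in lengths]


# Three sequences of lengths 3, 1, and 2 padded to a common length of 3.
for row in padding_mask([3, 1, 2]):
    print(row)
```

The `True` means "attend" convention matches boolean masks in PyTorch's `scaled_dot_product_attention`, so padded positions are ignored without a paged cache.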
|
| > In short: we don't aim to beat vLLM on a single op; we use vLLM-class **batch |
| > scheduling** plus **DiT-bottleneck optimization** to raise overall throughput |
| > with no extra dependencies, across CUDA / MPS / CPU. |
|
|
| ## Apple Silicon MLX acceleration (optional) |
|
|
| On Apple Silicon (M-series), a native **MLX** path runs inference directly on the |
| Apple GPU (Metal, unified memory) — typically faster than PyTorch's MPS backend. |
| It is an optional extra; the core package stays torch-only: |
|
|
| ```bash |
| pip install -e ".[mlx]" |
| ``` |
|
|
| ```python |
| import soundfile as sf |
| from bluemagpie import BlueMagpieModel |
| from bluemagpie.mlx import BlueMagpieMLX, mlx_generate |
| |
| # `model_dir` and `tokenizer` are set up as in "Load the model" above |
| model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cpu") |
| mlx_model = BlueMagpieMLX(model) # converts the weights once |
| |
| audio = mlx_generate(model, mlx_model, "今天天氣真好。", seed=0) # 48 kHz waveform |
| sf.write("output.wav", audio.numpy(), model.sample_rate) |
| ``` |
|
|
| - The whole inference path (Barbet, RALM, LocEnc, LocDiT/CFM, the **AudioVAE |
| decoder**, the AR loop) is re-implemented in MLX and numerically parity-checked, |
| module by module — generation can run torch-free (only tokenization and |
| reference-wav encoding stay in torch). |
| - Decode uses cached single-step kernels (it advances one position per step, not a |
| full re-run). |
| - `mlx_generate` supports the same four input modes as `generate`. |
| - On the real 7.75 GB model: end-to-end **RTF 0.77** (faster than real time) — |
| ~**1.45×** over torch-MPS and ~**3.27×** over torch-CPU (fp32, |
| `scripts/bench_rtf.py`). See [`src/bluemagpie/mlx/DESIGN.md`](src/bluemagpie/mlx/DESIGN.md). |
|
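For reference, the real-time factor (RTF) is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes that definition (the exact measurement in `scripts/bench_rtf.py` is not reproduced here) and reads the quoted speed-ups as ratios of RTFs; `real_time_factor` is a hypothetical helper.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# 10 s of 48 kHz audio produced in 7.7 s gives an RTF of 0.77.
print(round(real_time_factor(7.7, 480_000, 48_000), 2))  # 0.77

# RTFs implied for the other backends by the quoted speed-ups.
print(round(0.77 * 1.45, 2))  # torch-MPS: 1.12
print(round(0.77 * 3.27, 2))  # torch-CPU: 2.52
```

Under this reading, both PyTorch backends run slower than real time on the same hardware while the MLX path does not.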
|
| ## Notes |
|
|
| - The examples load the tokenizer from `tokenizer.json` and pass it to |
| `from_local`, which is stable on transformers 5.x. (`from_local`'s automatic |
| tokenizer loading can fail on 5.x — see Troubleshooting.) |
| - A GPU is optional: set `device="cpu"` (slower, but short utterances take only |
| tens of seconds). Output is 48 kHz mono. |
| - The bundled `hung_yi_lee` speaker vector is authorized for example use. For any |
| other speaker or voice cloning, use only reference audio or speaker vectors you |
| are authorized to use. |
| - Keep speaker-vector tables and synthesized audio private; do not distribute |
| them without authorization. |
|
|
| ## Troubleshooting |
|
|
| **Tokenizer loading on newer transformers (5.x).** The examples load the |
| tokenizer explicitly from `tokenizer.json`, so they work on transformers 5.x with |
| no extra steps (the model only uses the tokenizer's `encode`). |
|
|
| If you instead rely on `from_local`'s automatic tokenizer loading (passing no |
| `tokenizer`), transformers 5.x may fail while parsing `tokenizer_config.json` |
| with `TypeError: ..._patch_mistral_regex() got multiple values for keyword |
| argument 'fix_mistral_regex'`, or appear to load but raise `ValueError: No |
| tokenizer attached to BlueMagpieModel` when you call `generate()`. Use the |
| explicit loading shown above instead. |
| |
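If your code must run on both older and newer transformers releases, it can check the installed major version and warn when automatic tokenizer loading is likely to fail. This is a sketch with hypothetical helper names. It uses only `importlib.metadata` from the standard library and treats "5.x or newer" as the risky range, per the description above.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version such as "5.0.0rc1"."""
    head = version_string.split(".", 1)[0]
    digits = "".join(ch for ch in head if ch.isdigit())
    if not digits:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(digits)


def should_load_tokenizer_explicitly():
    """True when transformers is 5.x or newer, or its version cannot be read."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return True  # cannot check, so prefer the explicit path
    return major_version(installed) >= 5


print(should_load_tokenizer_explicitly())
```

Explicit loading from `tokenizer.json` works in either case, so the check is only useful for emitting a warning or choosing a code path in shared tooling.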