---
language:
- zh
- en
tags:
- text-to-speech
- speech-synthesis
- code-switching
- private
library_name: bluemagpie
pipeline_tag: text-to-speech
license: other
---

# BlueMagpie-TTS

Private BlueMagpie-TTS checkpoint for internal research and evaluation.

This repository contains the inference artifact only: model weights, AudioVAE
weights, tokenizer files, config, a default Hung-yi Lee speaker centroid, and
usage documentation. It does not include optimizer state, scheduler state,
training logs, local configs, or training-data metadata.

## Intended Use

- Mandarin and mixed Mandarin/English text-to-speech evaluation.
- Internal experiments with reference-audio prompting, continuation prompting,
  and optional speaker-centroid conditioning.
- Do not redistribute the checkpoint or generated speech unless rights and
  consent are cleared for the intended use.

## Install

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub
```

## Quick Start

```python
import os
from huggingface_hub import snapshot_download
import soundfile as sf
import torch
from transformers import PreTrainedTokenizerFast

from bluemagpie import BlueMagpieModel

# token=True uses the locally saved Hugging Face token; the repository is private.
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer directly from tokenizer.json (also works with transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
# Load the bundled centroid table and select the default speaker's row.
centroids = torch.load(
    os.path.join(model_dir, "checkpoints", "hung_yi_lee_speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

audio = model.generate(
    target_text="這是 AI TTS code switching 測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)

sf.write("sample.wav", audio.detach().cpu().numpy(), model.sample_rate)
```

## Reference Audio Prompting

```python
# Reuses `model` from the Quick Start example.
audio = model.generate(
    target_text="今天的會議改到下午三點。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)
```

Only use reference audio from speakers you have permission to synthesize.

## Recommended Defaults

The current recommended defaults are also recorded in `config.json` under
`generation_defaults` and in `release_metadata.json` under
`recommended_generation_defaults`.

- `cfg_value=2.8`
- `inference_timesteps=9`
- `max_len=2000`
- `retry_badcase=True`
- default speaker centroid: `checkpoints/hung_yi_lee_speaker_centroids.pt`
  (`speaker_id="hung_yi_lee"`, source dataset `voidful/hung-yi_lee`)
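
Because the defaults are recorded in `config.json`, they can be read at runtime instead of being hard-coded. The following is a minimal sketch. It assumes that the keys under `generation_defaults` match the keyword arguments of `model.generate` (for example `cfg_value`), which this card does not confirm. The helper names `generation_kwargs` and `load_config` are hypothetical.

```python
import json
import os


def generation_kwargs(config, **overrides):
    """Return the recommended generation defaults with optional overrides.

    `config` is the parsed content of config.json. Assumption: the keys under
    "generation_defaults" match the keyword arguments of model.generate.
    """
    kwargs = dict(config.get("generation_defaults", {}))
    kwargs.update(overrides)  # caller-supplied values take precedence
    return kwargs


def load_config(model_dir):
    """Parse config.json from a downloaded snapshot directory."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        return json.load(f)
```

With `model_dir` and `model` from Quick Start, a call such as `model.generate(target_text=text, **generation_kwargs(load_config(model_dir)))` would then pick up any change to the recorded defaults, provided the key names line up as assumed.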

The bundled `hung_yi_lee` speaker centroid is included with the speaker's
permission for use as an example voice. For any other speaker, obtain that
speaker's authorization before synthesizing.

These defaults were selected on `/home/voidful/tts_hard_sentences_zh_500.txt`
by transcribing the generated audio with `MediaTek-Research/Breeze-ASR-25` and
scoring normalized CER. The best trial was `hy_cfg2p8_steps9`: CER
`0.09669792733863977` (`1227/12689` character errors), TER
`0.0911015155363644`.
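
For reference, character error rate is conventionally the character-level edit distance between the ASR transcript and the reference text, divided by the reference length. The sketch below assumes that definition. The text normalization applied before scoring is not described in this card and is left out, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumption: the reported figure pools errors over all sentences before
# dividing, i.e. total character errors over total reference characters.
corpus_cer = 1227 / 12689
```

Under this pooled reading, `1227 / 12689` evaluates to roughly 0.0967, which agrees with the reported CER.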

## Long Text

For long-form synthesis, split text into sentence-sized chunks and concatenate
the generated waveforms. For stronger continuity, pass a short approved prompt
clip with `prompt_text` and `prompt_wav_path`, then synthesize the next chunk.
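
The chunking step can be sketched as follows. This is a minimal example, not the project's own splitter: it assumes sentences end with `。`, `！`, `？`, `!`, or `?`, and the helper names `split_sentences` and `synthesize_chunks` are hypothetical.

```python
import re

# Assumption: sentences end with one of these marks. The ASCII period is left
# out so that decimals such as "2.8" are not split.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentence-sized chunks, keeping the closing punctuation."""
    chunks = (m.strip() for m in _SENTENCE.findall(text))
    return [c for c in chunks if c]


def synthesize_chunks(text, synthesize):
    """Call `synthesize` on each sentence and return the waveforms in order.

    `synthesize` is any callable mapping a string to a waveform, for example
    a lambda wrapping model.generate with the recommended defaults.
    """
    return [synthesize(sentence) for sentence in split_sentences(text)]
```

With the Quick Start `model`, `waves = synthesize_chunks(long_text, lambda s: model.generate(target_text=s, cfg_value=2.8, inference_timesteps=9))` followed by `torch.cat(waves, dim=-1)` would yield a single waveform, assuming `generate` returns a 1-D tensor per call.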

## Evaluation

Numbers below are from an internal held-out evaluation set. The eval set and
training data are intentionally not described in this private model card.

| System | Setting | CER | WER |
| --- | --- | ---: | ---: |
| BlueMagpie-TTS | selected checkpoint | 4.81% | 5.36% |
| Reference baseline | same internal eval | 11.45% | 14.83% |

Selected checkpoint speed diagnostics on the same internal eval:

| Metric | Value |
| --- | ---: |
| Median duration-units/sec | 4.748 |
| Max duration-units/sec | 5.288 |

## Limitations

- The metrics come from internal evaluation sets, not a public benchmark, and
  should be used only for internal model selection.
- Speaker similarity depends on the quality of the supplied reference audio or
  centroid, which must also be rights-cleared.
- Very long passages should be chunked to avoid stop-token failures and prosody
  drift.
- Generated speech may contain errors; do not use it for real-world
  notifications without human review.

## Files

- `pytorch_model.bin`: BlueMagpie model weights.
- `audiovae.pth`: AudioVAE weights.
- `config.json`: BlueMagpie architecture/runtime config.
- `tokenizer.json`, `tokenizer_config.json`: tokenizer files.
- `checkpoints/hung_yi_lee_speaker_centroids.pt`: default speaker centroid
  table used by the recommended defaults.
- `USAGE.md`: expanded usage guide.