# BlueMagpie-TTS Usage

This is a private inference checkpoint. Install the local package first:

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub
```

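Download the checkpoint and load the model. Because the checkpoint is private, `token=True` tells `snapshot_download` to use your saved Hugging Face token (for example, from `huggingface-cli login`):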

```python
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer directly from tokenizer.json (this also works with transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

Generate a short utterance with the bundled `hung_yi_lee` speaker centroid:

```python
import soundfile as sf
import torch

centroids = torch.load(
    f"{model_dir}/checkpoints/hung_yi_lee_speaker_centroids.pt",
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

wav = model.generate(
    target_text="這是合成語音測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)
sf.write("short.wav", wav.detach().cpu().numpy(), model.sample_rate)
```
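The centroid lookup above raises a bare `ValueError` from `list.index` when the speaker ID is missing. A minimal sketch of a more readable lookup follows. It assumes the file holds a dict with a `speaker_ids` list and an index-aligned `centroids` entry, as the snippet above implies. `select_centroid` is a hypothetical helper, not part of the `bluemagpie` package.

```python
def select_centroid(centroids, speaker_id):
    """Return the centroid stored for `speaker_id`.

    Assumes `centroids["speaker_ids"]` and `centroids["centroids"]` are
    index-aligned, as in the loading snippet above (hypothetical helper).
    """
    speaker_ids = list(centroids["speaker_ids"])
    if speaker_id not in speaker_ids:
        raise KeyError(
            f"unknown speaker {speaker_id!r}; available: {speaker_ids}"
        )
    return centroids["centroids"][speaker_ids.index(speaker_id)]
```

With the objects from the example above, this would be called as `speaker_centroid = select_centroid(centroids, "hung_yi_lee")`.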

Generate with reference audio:

```python
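import soundfile as sf

# reference_speaker.wav must be a rights-cleared recording of the target voice.
wav = model.generate(
    target_text="明天 meeting 取消一次，我們大家放假。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)
sf.write("reference.wav", wav.detach().cpu().numpy(), model.sample_rate)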
```

Streaming:

```python
import torch
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(
    target_text="這是一段串流語音合成測試。",
    cfg_value=2.8,
    inference_timesteps=9,
):
    chunks.append(chunk.detach().cpu())

wav = torch.cat(chunks, dim=-1)
sf.write("streaming.wav", wav.numpy(), model.sample_rate)
```
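The example above collects every chunk in memory before writing. For long outputs it may be preferable to write each chunk as it arrives. The sketch below does this with the standard-library `wave` module. It assumes each chunk is a mono float waveform in the range [-1, 1], which this document does not confirm. `write_stream_to_wav` is a hypothetical helper, not part of the `bluemagpie` package.

```python
import wave

import numpy as np


def write_stream_to_wav(chunks, path, sample_rate):
    """Write float chunks to a 16-bit PCM WAV file as they arrive.

    Assumes mono audio with samples in [-1, 1]; values outside that
    range are clipped. Returns the number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)  # assumption: mono output
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(int(sample_rate))
        for chunk in chunks:
            samples = np.asarray(chunk, dtype=np.float32).reshape(-1)
            pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
            wf.writeframes(pcm.tobytes())
            total += samples.size
    return total


# Hypothetical usage with the streaming call shown above:
# stream = model.generate_streaming(target_text="...", cfg_value=2.8, inference_timesteps=9)
# write_stream_to_wav((c.detach().cpu().numpy() for c in stream), "streaming.wav", model.sample_rate)
```

Writing 16-bit PCM keeps the file playable everywhere. If float output is required, `soundfile.SoundFile` opened in write mode is an alternative.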

Long text:

1. Split the text by sentence or paragraph.
2. Generate each chunk separately.
3. Concatenate the waveforms with a short crossfade if needed.
4. For continuation-style context, pass a short approved prompt clip with
   `prompt_text` and `prompt_wav_path`.
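
Steps 1 to 3 can be sketched as follows. Both helpers are hypothetical illustrations, not part of the `bluemagpie` package. The punctuation set, the 20 ms linear crossfade, and the assumption that each generated waveform is 1-D are choices made for this example only.

```python
import re

import numpy as np


def split_sentences(text):
    """Split after Chinese or Western sentence-ending punctuation.

    "." is deliberately excluded so decimals and abbreviations stay intact.
    """
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p.strip() for p in parts if p.strip()]


def crossfade_concat(waves, sample_rate, fade_ms=20.0):
    """Concatenate 1-D waveforms, blending neighbours with a linear crossfade."""
    waves = [np.asarray(w, dtype=np.float32).reshape(-1) for w in waves]
    if not waves:
        return np.zeros(0, dtype=np.float32)
    fade = int(sample_rate * fade_ms / 1000.0)
    out = waves[0]
    for nxt in waves[1:]:
        n = min(fade, out.size, nxt.size)  # never fade longer than a chunk
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        overlap = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], overlap, nxt[n:]])
    return out


# Hypothetical usage with the model loaded earlier:
# pieces = [
#     model.generate(target_text=s, cfg_value=2.8, inference_timesteps=9).detach().cpu().numpy()
#     for s in split_sentences(long_text)
# ]
# sf.write("long.wav", crossfade_concat(pieces, model.sample_rate), model.sample_rate)
```

Each crossfade shortens the result by the overlap length, so heavily punctuated text loses a few tens of milliseconds per joint. Set `fade_ms=0` for plain concatenation.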

Recommended generation defaults:

- `cfg_value=2.8`
- `inference_timesteps=9`
- `retry_badcase=True` for offline generation
- `max_len=2000` unless a longer chunk is intentionally needed
- `speaker_centroid=...` from
  `checkpoints/hung_yi_lee_speaker_centroids.pt` when using the default
  `hung_yi_lee` voice

The same defaults are stored in `config.json` as `generation_defaults` and in
`release_metadata.json` as `recommended_generation_defaults`. They come from the
`hy_cfg2p8_steps9` trial on the 500-sentence hard Chinese TTS set, evaluated
using `MediaTek-Research/Breeze-ASR-25`: normalized CER `0.09669792733863977`
(`1227/12689` character errors, about 9.67%) and TER `0.0911015155363644`
(about 9.11%).
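
Since the defaults ship in `config.json`, they can be read from the checkpoint instead of being hard-coded. The sketch below is a hypothetical helper, not part of the `bluemagpie` package. It assumes `generation_defaults` is a flat JSON object whose keys match the keyword arguments of `model.generate`, which this document does not confirm.

```python
import json
import os


def load_generation_defaults(model_dir, overrides=None):
    """Read `generation_defaults` from config.json and apply optional overrides.

    Assumes the stored keys match `model.generate` keyword arguments;
    inspect the returned dict before passing it on.
    """
    config_path = os.path.join(model_dir, "config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    kwargs = dict(config.get("generation_defaults", {}))
    kwargs.update(overrides or {})
    return kwargs


# Hypothetical usage with the model loaded earlier:
# gen_kwargs = load_generation_defaults(model_dir, overrides={"retry_badcase": False})
# wav = model.generate(target_text="這是合成語音測試。", **gen_kwargs)
```

Passing `overrides` keeps per-call changes, such as disabling `retry_badcase`, separate from the released values.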

Safety:

- The bundled `hung_yi_lee` speaker centroid is included with the speaker's
  permission as an example; obtain authorization for any other speaker.
- Use only rights-cleared reference audio or speaker embeddings.
- Do not present generated speech as a real person or real notification unless
  that is explicitly authorized and reviewed.