BlueMagpie-TTS Usage
This is a private inference checkpoint. Install the local package first:
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub
Download and load:
import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# load tokenizer from tokenizer.json (works with newer transformers 5.x)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
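Because the checkpoint is private, a partial or failed download is easy to miss until loading breaks. The following is a minimal sketch of a sanity check on the snapshot directory. The helper name `missing_snapshot_files` is hypothetical and not part of the `bluemagpie` package, and the file list only covers paths mentioned in this README.

```python
import os

# Relative paths that this README refers to inside the downloaded snapshot.
EXPECTED_FILES = [
    "tokenizer.json",
    "config.json",
    os.path.join("checkpoints", "hung_yi_lee_speaker_centroids.pt"),
]


def missing_snapshot_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not present under model_dir."""
    return [
        rel for rel in expected
        if not os.path.isfile(os.path.join(model_dir, rel))
    ]
```

Calling `missing_snapshot_files(model_dir)` right after `snapshot_download` and raising if the list is non-empty gives a clearer error than a failure deep inside model loading.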
Generate a short utterance:
import soundfile as sf
import torch
centroids = torch.load(
    f"{model_dir}/checkpoints/hung_yi_lee_speaker_centroids.pt",
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]
wav = model.generate(
    target_text="這是合成語音測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)
sf.write("short.wav", wav.detach().cpu().numpy(), model.sample_rate)
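The centroid lookup above uses `list.index`, which raises a bare `ValueError` when the speaker id is missing. A small wrapper can make that failure easier to read. This is a sketch that assumes the file layout shown above (a `speaker_ids` list and a `centroids` container indexed in the same order); `select_centroid` is a hypothetical helper, not part of the package.

```python
def select_centroid(centroids, speaker_id):
    """Return the centroid stored for speaker_id.

    Assumes centroids["speaker_ids"] is a list of names and
    centroids["centroids"] is indexable in the same order.
    """
    speaker_ids = list(centroids["speaker_ids"])
    if speaker_id not in speaker_ids:
        raise KeyError(
            f"unknown speaker {speaker_id!r}; available: {', '.join(speaker_ids)}"
        )
    return centroids["centroids"][speaker_ids.index(speaker_id)]
```

With the loaded file this would be called as `select_centroid(centroids, "hung_yi_lee")`.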
Generate with reference audio:
wav = model.generate(
    target_text="明天 meeting 取消一次,我們大家放假。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)
Streaming:
import torch
import soundfile as sf
chunks = []
for chunk in model.generate_streaming(
    target_text="這是一段串流語音合成測試。",
    cfg_value=2.8,
    inference_timesteps=9,
):
    chunks.append(chunk.detach().cpu())
wav = torch.cat(chunks, dim=-1)
sf.write("streaming.wav", wav.numpy(), model.sample_rate)
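The streaming example above buffers every chunk before writing. If the goal is to put audio on disk as it arrives, the chunks can be appended to a WAV file one at a time. The sketch below uses only the standard library and makes several assumptions that this README does not confirm: each chunk can be flattened to a 1-D mono sequence of floats in [-1.0, 1.0], and 16-bit PCM output is acceptable. Both function names are hypothetical.

```python
import struct
import wave


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes.

    Values outside the range are clipped.
    """
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_chunks_to_wav(path, chunks, sample_rate):
    """Append each chunk to a mono 16-bit WAV file as it arrives.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # assumption: mono output
        wav_file.setsampwidth(2)   # 16-bit PCM
        wav_file.setframerate(int(sample_rate))
        for chunk in chunks:
            samples = list(chunk)
            wav_file.writeframes(floats_to_pcm16(samples))
            total += len(samples)
    return total
```

Under those assumptions, the generator could be passed in directly, for example `write_chunks_to_wav("streaming.wav", (c.detach().cpu().flatten().tolist() for c in model.generate_streaming(...)), model.sample_rate)`.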
Long text:
- Split the text by sentence or paragraph.
- Generate each chunk separately.
- Concatenate the waveforms with a short crossfade if needed.
- For continuation-style context, pass a short approved prompt clip with
  prompt_text and prompt_wav_path.
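The splitting and crossfading steps above can be sketched as follows. This is an illustration, not the project's own long-text pipeline: `split_sentences` and `crossfade_concat` are hypothetical helpers, the punctuation set is an assumption (ASCII periods are left out on purpose so numbers like 2.8 are not split), and the waveforms are assumed to be 1-D arrays at the same sample rate.

```python
import re

import numpy as np

# Split after the last mark in a run of sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。!?!?])(?![。!?!?])\s*")


def split_sentences(text):
    """Split text after sentence-ending punctuation, dropping empty pieces."""
    return [p.strip() for p in _SENTENCE_END.split(text) if p.strip()]


def crossfade_concat(waveforms, fade_samples):
    """Join 1-D waveforms, blending each boundary with a linear crossfade."""
    if not waveforms:
        return np.zeros(0, dtype=np.float32)
    out = np.asarray(waveforms[0], dtype=np.float32)
    for nxt in waveforms[1:]:
        nxt = np.asarray(nxt, dtype=np.float32)
        # Never fade over more samples than either side has.
        n = min(fade_samples, len(out), len(nxt))
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        blended = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], blended, nxt[n:]])
    return out
```

A typical use would generate one waveform per entry of `split_sentences(long_text)` with `model.generate`, convert each with `.detach().cpu().numpy()`, and join them with a fade of roughly 10 ms, for example `crossfade_concat(wavs, int(0.01 * model.sample_rate))`. Note that each crossfade shortens the total length by the number of overlapped samples.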
Recommended generation defaults:
- cfg_value=2.8
- inference_timesteps=9
- retry_badcase=True for offline generation
- max_len=2000 unless a longer chunk is intentionally needed
- speaker_centroid=... from checkpoints/hung_yi_lee_speaker_centroids.pt when using the default hung_yi_lee voice
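To avoid repeating the recommended values in every call, they can be kept in one dictionary and overridden per call. The sketch below simply restates the numeric defaults from this section; `GENERATION_DEFAULTS` and `generation_kwargs` are hypothetical names, and `speaker_centroid` is left out because it is a loaded tensor rather than a constant.

```python
# Restates the recommended defaults from this section.
GENERATION_DEFAULTS = {
    "cfg_value": 2.8,
    "inference_timesteps": 9,
    "retry_badcase": True,   # recommended for offline generation
    "max_len": 2000,
}


def generation_kwargs(**overrides):
    """Return a new dict of the recommended defaults with overrides applied."""
    return {**GENERATION_DEFAULTS, **overrides}
```

An offline call might then look like `model.generate(target_text=text, speaker_centroid=speaker_centroid, **generation_kwargs())`, and a deliberately longer chunk like `**generation_kwargs(max_len=4000)`. Whether `generate_streaming` accepts every one of these keys is not stated here, so this helper is best kept to `model.generate`.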
The same defaults are stored in config.json as generation_defaults and in
release_metadata.json as recommended_generation_defaults. They were selected
from the hy_cfg2p8_steps9 trial on the 500-sentence hard Chinese TTS set using
MediaTek-Research/Breeze-ASR-25: normalized CER 0.09669792733863977, TER
0.0911015155363644, 1227/12689 character errors.
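The reported CER matches a corpus-level ratio: 1227 character errors over 12689 reference characters. The sketch below shows how such a figure is commonly computed, with Levenshtein distance summed over all sentence pairs and divided by the total reference length. It is not the project's evaluation script; the text normalization applied before scoring is not described here and is not reproduced, and TER is not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


# The ratio reported above, rounded to four decimal places.
reported_cer = round(1227 / 12689, 4)
```

Here `pairs` would hold (reference text, ASR transcript) tuples, with the transcripts produced by running the ASR model on the generated audio.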
Safety:
- The bundled hung_yi_lee speaker centroid is included with the speaker's permission as an example; obtain authorization for any other speaker.
- Use only rights-cleared reference audio or speaker embeddings.
- Do not present generated speech as the voice of a real person or as a genuine notification unless that is explicitly authorized and reviewed.