BlueMagpie-TTS

Private BlueMagpie-TTS checkpoint for internal research and evaluation.

This repository contains the inference artifact only: model weights, AudioVAE weights, tokenizer files, config, a default Hung-yi Lee speaker centroid, and usage documentation. It does not include optimizer state, scheduler state, training logs, local configs, or training-data metadata.

Intended Use

  • Mandarin and mixed Mandarin/English text-to-speech evaluation.
  • Internal experiments with reference-audio prompting, continuation prompting, and optional speaker-centroid conditioning.
  • Do not redistribute the checkpoint or generated speech unless rights and consent are cleared for the intended use.

Install

git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub

Quick Start

import os
from huggingface_hub import snapshot_download
import soundfile as sf
import torch
from transformers import PreTrainedTokenizerFast

from bluemagpie import BlueMagpieModel

# Download the private snapshot; token=True uses the locally saved Hugging Face token.
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer directly from tokenizer.json (works with newer transformers 5.x).
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
# Load the bundled centroid table and pick the default speaker's row.
centroids = torch.load(
    os.path.join(model_dir, "checkpoints", "hung_yi_lee_speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

audio = model.generate(
    target_text="這是 AI TTS code switching 測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)

sf.write("sample.wav", audio.detach().cpu().numpy(), model.sample_rate)

Reference Audio Prompting

audio = model.generate(
    target_text="今天的會議改到下午三點。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)

Only use reference audio from speakers you have permission to synthesize.

Recommended Defaults

The current recommended defaults are also recorded in config.json under generation_defaults and in release_metadata.json under recommended_generation_defaults.

  • cfg_value=2.8
  • inference_timesteps=9
  • max_len=2000
  • retry_badcase=True
  • default speaker centroid: checkpoints/hung_yi_lee_speaker_centroids.pt (speaker_id="hung_yi_lee", source dataset voidful/hung-yi_lee)
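
As a minimal sketch, the recorded defaults could be read back from config.json instead of being hard-coded. The helper name load_generation_defaults is hypothetical, and the code assumes generation_defaults is a flat mapping of keyword name to value; the actual layout of config.json is not shown in this card.

```python
import json
from pathlib import Path


def load_generation_defaults(model_dir):
    """Return the generation_defaults mapping from config.json, or {} if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: generation_defaults is a flat dict such as
    # {"cfg_value": 2.8, "inference_timesteps": 9, ...}.
    return dict(config.get("generation_defaults", {}))
```

If the keys match the keyword arguments of model.generate (an assumption), the result can be unpacked into the call, for example model.generate(target_text=..., speaker_centroid=speaker_centroid, **defaults).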

The bundled hung_yi_lee speaker centroid is included with the speaker's permission for use as an example voice. For any other speaker, obtain that speaker's authorization before synthesizing.

These defaults were selected on /home/voidful/tts_hard_sentences_zh_500.txt, with generated audio transcribed by MediaTek-Research/Breeze-ASR-25 and scored by normalized CER. The best trial was hy_cfg2p8_steps9: CER 0.0967 (1227/12689 character errors) and TER 0.0911.
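
The reported CER equals the pooled ratio 1227/12689, which suggests errors and reference characters are summed over the whole set before dividing. The sketch below shows one way such a corpus-level character error rate can be computed. The normalization step (NFKC, lowercasing, dropping everything except letters, digits, and combining marks) is an assumption; the exact normalization used for the selection run is not documented here.

```python
import unicodedata


def normalize(text):
    """Assumed normalization: NFKC, lowercase, keep only letters, digits, marks."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in "LNM")


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Pooled CER over (reference_text, asr_transcript) pairs."""
    errors = 0
    ref_chars = 0
    for ref, hyp in pairs:
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_n, hyp_n)
        ref_chars += len(ref_n)
    return errors / ref_chars if ref_chars else 0.0
```

Pooling weights long sentences more heavily than averaging per-sentence CER would, so the two variants can give different figures on the same transcripts.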

Long Text

For long-form synthesis, split text into sentence-sized chunks and concatenate the generated waveforms. For stronger continuity, pass a short approved prompt clip with prompt_text and prompt_wav_path, then synthesize the next chunk.
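
The chunk-and-concatenate approach can be sketched as follows. This is an illustration under stated assumptions, not the repository's own helper: split_sentences, synthesize_long, and the pause_sec parameter are hypothetical names, and synthesize stands for any callable that maps one text chunk to a 1-D waveform. The splitter only breaks on 。！？!? and newlines, so an ASCII period does not end a chunk.

```python
import re

import numpy as np

# A sentence is a run of non-terminator characters plus any trailing terminators.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentence-sized chunks, keeping the final punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def synthesize_long(text, synthesize, sample_rate, pause_sec=0.0):
    """Synthesize each chunk and concatenate the waveforms.

    synthesize: hypothetical callable, chunk text -> 1-D float waveform.
    pause_sec: optional silence between chunks (an assumption, default none).
    """
    silence = np.zeros(int(sample_rate * pause_sec), dtype=np.float32)
    pieces = []
    for chunk in split_sentences(text):
        wav = np.asarray(synthesize(chunk), dtype=np.float32).reshape(-1)
        if pieces:
            pieces.append(silence)
        pieces.append(wav)
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```

With the Quick Start objects in scope, synthesize could wrap the documented call, for example lambda t: model.generate(target_text=t, cfg_value=2.8, inference_timesteps=9, speaker_centroid=speaker_centroid).detach().cpu().numpy(), and sample_rate would be model.sample_rate.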

Evaluation

Numbers below are from an internal held-out evaluation set. The eval set and training data are intentionally not described in this private model card.

| System | Setting | CER | WER |
|---|---|---|---|
| BlueMagpie-TTS | selected checkpoint | 4.81% | 5.36% |
| Reference baseline | same internal eval | 11.45% | 14.83% |

Selected checkpoint speed diagnostics on the same internal eval:

| Metric | Value |
|---|---|
| Median duration-units/sec | 4.748 |
| Max duration-units/sec | 5.288 |

Limitations

  • Metrics are not a public benchmark and should be used only for internal model selection.
  • Speaker similarity depends on the quality and rights-cleared status of the supplied reference audio or centroid.
  • Very long passages should be chunked to avoid stop-token and prosody drift.
  • Generated speech may contain errors; do not use it for real-world notifications without human review.

Files

  • pytorch_model.bin: BlueMagpie model weights.
  • audiovae.pth: AudioVAE weights.
  • config.json: BlueMagpie architecture/runtime config.
  • tokenizer.json, tokenizer_config.json: tokenizer files.
  • checkpoints/hung_yi_lee_speaker_centroids.pt: default speaker centroid table used by the recommended defaults.
  • USAGE.md: expanded usage guide.
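
After downloading, a quick completeness check against the list above can catch a partial snapshot before model loading fails. This is a small sketch: missing_files is a hypothetical helper, and it assumes the files sit at the listed paths relative to the snapshot root.

```python
from pathlib import Path

# Paths as listed in the Files section, relative to the snapshot root (assumed).
EXPECTED_FILES = [
    "pytorch_model.bin",
    "audiovae.pth",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "checkpoints/hung_yi_lee_speaker_centroids.pt",
    "USAGE.md",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present under model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from missing_files(model_dir) means every listed artifact was found.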