
BlueMagpie-TTS Usage

This is a private inference checkpoint. Install the local package first:

git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub

Download and load:

import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

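# The repo is private: token=True makes huggingface_hub use the locally saved access token
# (for example, the one stored by `huggingface-cli login`).
model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer directly from tokenizer.json (works with newer transformers 5.x).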
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")

Generate a short utterance:

import soundfile as sf
import torch

# Load the bundled speaker centroids, then pick the row for the default hung_yi_lee voice.
centroids = torch.load(
    f"{model_dir}/checkpoints/hung_yi_lee_speaker_centroids.pt",
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

wav = model.generate(
    target_text="這是合成語音測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)
sf.write("short.wav", wav.detach().cpu().numpy(), model.sample_rate)

Generate with reference audio:

wav = model.generate(
    target_text="明天 meeting 取消一次,我們大家放假。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)

Streaming:

import torch
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(
    target_text="這是一段串流語音合成測試。",
    cfg_value=2.8,
    inference_timesteps=9,
):
    chunks.append(chunk.detach().cpu())

wav = torch.cat(chunks, dim=-1)
sf.write("streaming.wav", wav.numpy(), model.sample_rate)
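
The loop above keeps every chunk in memory and writes the file once at the end. When the audio should reach disk as it is produced, the chunks can be appended to the file one at a time instead. The following is a minimal sketch using only the standard-library wave module and NumPy; write_stream_to_wav is a hypothetical helper, not part of bluemagpie, and it assumes each chunk is mono float audio in the range [-1, 1].

```python
import wave
from typing import Iterable

import numpy as np


def write_stream_to_wav(chunks: Iterable[np.ndarray], path: str, sample_rate: int) -> int:
    """Append audio chunks to a 16-bit mono WAV file as they arrive.

    Hypothetical helper: assumes mono float samples in [-1, 1].
    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit PCM
        wf.setframerate(sample_rate)
        for chunk in chunks:
            samples = np.asarray(chunk, dtype=np.float32).reshape(-1)
            # Clip to the valid range, then scale to little-endian int16.
            pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
            wf.writeframes(pcm.tobytes())
            total += samples.size
    return total
```

With the model loaded as above, the chunk source could be a generator such as (c.detach().cpu().numpy() for c in model.generate_streaming(target_text=..., cfg_value=2.8, inference_timesteps=9)), passed together with model.sample_rate.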

Long text:

  1. Split the text by sentence or paragraph.
  2. Generate each chunk separately.
  3. Concatenate the waveforms with a short crossfade if needed.
  4. For continuation-style context, pass a short approved prompt clip with prompt_text and prompt_wav_path.
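
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the project's own pipeline: split_sentences, crossfade_concat, and synthesize_long are hypothetical helpers, the splitter only breaks on 。！？!? and the synthesize argument stands in for any callable that returns a mono 1-D float waveform for one sentence.

```python
import re
from typing import Callable, List, Sequence

import numpy as np


def split_sentences(text: str) -> List[str]:
    """Split text after Chinese or ASCII sentence-ending marks (。！？!?)."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def crossfade_concat(waves: Sequence[np.ndarray], sample_rate: int, fade_ms: float = 20.0) -> np.ndarray:
    """Join mono waveforms, blending each boundary with a linear crossfade."""
    if not waves:
        return np.zeros(0, dtype=np.float32)
    fade = int(sample_rate * fade_ms / 1000)
    out = np.asarray(waves[0], dtype=np.float32)
    for nxt in waves[1:]:
        nxt = np.asarray(nxt, dtype=np.float32)
        n = min(fade, len(out), len(nxt))  # never fade longer than either clip
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        # Fade the tail of the previous clip out while the next clip fades in.
        overlap = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], overlap, nxt[n:]])
    return out


def synthesize_long(
    text: str,
    synthesize: Callable[[str], np.ndarray],
    sample_rate: int,
    fade_ms: float = 20.0,
) -> np.ndarray:
    """Generate each sentence separately, then crossfade the results together."""
    waves = [synthesize(sentence) for sentence in split_sentences(text)]
    return crossfade_concat(waves, sample_rate, fade_ms)


# Possible wiring with the model loaded earlier (assumes generate returns mono audio):
# synth = lambda s: model.generate(
#     target_text=s, cfg_value=2.8, inference_timesteps=9,
#     speaker_centroid=speaker_centroid,
# ).detach().cpu().numpy().reshape(-1)
# wav = synthesize_long(long_text, synth, model.sample_rate)
```

Each crossfade shortens the result by the fade length, so a 20 ms fade between N chunks removes (N - 1) × 20 ms in total. Set fade_ms=0 for plain concatenation.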

Recommended generation defaults:

  • cfg_value=2.8
  • inference_timesteps=9
  • retry_badcase=True for offline generation
  • max_len=2000 unless a longer chunk is intentionally needed
  • speaker_centroid=... from checkpoints/hung_yi_lee_speaker_centroids.pt when using the default hung_yi_lee voice

The same defaults are stored in config.json as generation_defaults and in release_metadata.json as recommended_generation_defaults. They were selected from the hy_cfg2p8_steps9 trial on the 500-sentence hard Chinese TTS set, with MediaTek-Research/Breeze-ASR-25 used to transcribe the output for scoring: normalized CER 0.09669792733863977 (1227/12689 character errors) and TER 0.0911015155363644.
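
Because the defaults live in config.json, they can be read from the downloaded checkpoint instead of being retyped. The sketch below assumes that generation_defaults is a flat JSON object whose keys match the keyword arguments of model.generate; that layout is an assumption, and load_generation_defaults is a hypothetical helper rather than part of bluemagpie.

```python
import json
import os
from typing import Any, Dict, Optional


def load_generation_defaults(
    model_dir: str,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Read generation_defaults from config.json and apply optional overrides.

    Assumes generation_defaults is a flat mapping of keyword arguments.
    Returns an empty dict (plus overrides) if the key is missing.
    """
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    kwargs = dict(config.get("generation_defaults", {}))
    kwargs.update(overrides or {})  # caller-supplied values take precedence
    return kwargs
```

Under that assumption, a call could look like model.generate(target_text=..., speaker_centroid=speaker_centroid, **load_generation_defaults(model_dir, {"max_len": 3000})), where the override raises max_len for one intentionally longer chunk.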

Safety:

  • The bundled hung_yi_lee speaker centroid is included with the speaker's permission as an example; obtain authorization for any other speaker.
  • Use only rights-cleared reference audio or speaker embeddings.
  • Do not present generated speech as a real person's voice or as a real notification unless that use is explicitly authorized and reviewed.