# BlueMagpie-TTS Usage

This is a private inference checkpoint. Install the local package first:

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub
```

Download and load:

```python
import os

from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# load tokenizer from tokenizer.json (works with newer transformers 5.x)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")
```

Generate a short utterance:

```python
import soundfile as sf
import torch

centroids = torch.load(
    f"{model_dir}/checkpoints/hung_yi_lee_speaker_centroids.pt",
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

wav = model.generate(
    target_text="這是合成語音測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)
sf.write("short.wav", wav.detach().cpu().numpy(), model.sample_rate)
```

Generate with reference audio:

```python
wav = model.generate(
    target_text="明天 meeting 取消一次,我們大家放假。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)
```

Streaming:

```python
import torch
import soundfile as sf

chunks = []
for chunk in model.generate_streaming(
    target_text="這是一段串流語音合成測試。",
    cfg_value=2.8,
    inference_timesteps=9,
):
    chunks.append(chunk.detach().cpu())

wav = torch.cat(chunks, dim=-1)
sf.write("streaming.wav", wav.numpy(), model.sample_rate)
```

Long text:

1. Split the text by sentence or paragraph.
2. Generate each chunk separately.
3. Concatenate the waveforms with a short crossfade if needed.
4. For continuation-style context, pass a short approved prompt clip with `prompt_text` and `prompt_wav_path`.

Recommended generation defaults:

- `cfg_value=2.8`
- `inference_timesteps=9`
- `retry_badcase=True` for offline generation
- `max_len=2000` unless a longer chunk is intentionally needed
- `speaker_centroid=...` from `checkpoints/hung_yi_lee_speaker_centroids.pt` when using the default `hung_yi_lee` voice

The same defaults are stored in `config.json` as `generation_defaults` and in `release_metadata.json` as `recommended_generation_defaults`. They were selected from the `hy_cfg2p8_steps9` trial on the 500-sentence hard Chinese TTS set using `MediaTek-Research/Breeze-ASR-25`: normalized CER `0.09669792733863977`, TER `0.0911015155363644`, `1227/12689` character errors.

Safety:

- The bundled `hung_yi_lee` speaker centroid is included with the speaker's permission as an example; obtain authorization for any other speaker.
- Use only rights-cleared reference audio or speaker embeddings.
- Do not present generated speech as a real person or real notification unless that is explicitly authorized and reviewed.
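
Long-text sketch: the split and crossfade steps from the long-text recipe could look roughly like the following. This is a minimal sketch, not part of the `bluemagpie` package. `split_sentences` and `crossfade_concat` are hypothetical helper names, the linear 20 ms crossfade is an assumed choice, and the code assumes each generated waveform has already been converted to a 1-D mono NumPy array.

```python
import re

import numpy as np


def split_sentences(text):
    """Hypothetical helper: split text after Chinese/ASCII sentence-ending marks."""
    parts = re.findall(r"[^。!?!?]+[。!?!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def crossfade_concat(chunks, sample_rate, fade_ms=20.0):
    """Hypothetical helper: join 1-D mono waveforms with a linear crossfade.

    Assumes every chunk is a 1-D array at the same sample rate.
    """
    if not chunks:
        raise ValueError("need at least one waveform chunk")
    out = np.asarray(chunks[0], dtype=np.float32)
    for chunk in chunks[1:]:
        nxt = np.asarray(chunk, dtype=np.float32)
        # Overlap length in samples, capped by the shorter of the two sides.
        n = min(int(sample_rate * fade_ms / 1000), len(out), len(nxt))
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        # Fade the tail of the previous chunk out while the next one fades in.
        overlap = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], overlap, nxt[n:]])
    return out


# Stand-in waveforms so the sketch runs without the model.
sentences = split_sentences("這是第一句。這是第二句!最後一句?")
fake_wavs = [np.ones(1000, dtype=np.float32) for _ in sentences]
joined = crossfade_concat(fake_wavs, sample_rate=16000, fade_ms=20.0)
```

With the real model, each entry of `fake_wavs` would instead come from a `model.generate(...)` call per sentence, flattened to one dimension (for example with `.detach().cpu().numpy().reshape(-1)`), and `sample_rate` would be `model.sample_rate`. Each crossfade shortens the result by the overlap length, so very short chunks may call for a smaller `fade_ms`.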