---
language:
- zh
- en
tags:
- text-to-speech
- speech-synthesis
- code-switching
- private
library_name: bluemagpie
pipeline_tag: text-to-speech
license: other
---

# BlueMagpie-TTS

Private BlueMagpie-TTS checkpoint for internal research and evaluation.

This repository contains the inference artifact only: model weights, AudioVAE weights, tokenizer files, config, a default Hung-yi Lee speaker centroid, and usage documentation. It does not include optimizer state, scheduler state, training logs, local configs, or training-data metadata.

## Intended Use

- Mandarin and mixed Mandarin/English text-to-speech evaluation.
- Internal experiments with reference-audio prompting, continuation prompting, and optional speaker-centroid conditioning.
- Do not redistribute the checkpoint or generated speech unless rights and consent are cleared for the intended use.

## Install

```bash
git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e ".[train]"
pip install soundfile huggingface_hub
```

## Quick Start

```python
import os
from huggingface_hub import snapshot_download
import soundfile as sf
import torch
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)

# load tokenizer from tokenizer.json (works with newer transformers 5.x)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")

centroids = torch.load(
    f"{model_dir}/checkpoints/hung_yi_lee_speaker_centroids.pt",
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

audio = model.generate(
    target_text="這是 AI TTS code switching 測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
    speaker_centroid=speaker_centroid,
)
sf.write("sample.wav", audio.detach().cpu().numpy(), model.sample_rate)
```

## Reference Audio Prompting

```python
audio = model.generate(
    target_text="今天的會議改到下午三點。",
    reference_wav_path="reference_speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)
```

Only use reference audio from speakers you have permission to synthesize.

## Recommended Defaults

The current recommended defaults are also recorded in `config.json` under `generation_defaults` and in `release_metadata.json` under `recommended_generation_defaults`.

- `cfg_value=2.8`
- `inference_timesteps=9`
- `max_len=2000`
- `retry_badcase=True`
- default speaker centroid: `checkpoints/hung_yi_lee_speaker_centroids.pt` (`speaker_id="hung_yi_lee"`, source dataset `voidful/hung-yi_lee`)

The bundled `hung_yi_lee` speaker centroid is included with the speaker's permission for use as an example voice. For any other speaker, obtain that speaker's authorization before synthesizing.

These defaults were selected on `/home/voidful/tts_hard_sentences_zh_500.txt` using `MediaTek-Research/Breeze-ASR-25` with normalized CER. The best trial was `hy_cfg2p8_steps9`: CER `0.09669792733863977`, TER `0.0911015155363644`, with `1227/12689` character errors.

## Long Text

For long-form synthesis, split text into sentence-sized chunks and concatenate the generated waveforms. For stronger continuity, pass a short approved prompt clip with `prompt_text` and `prompt_wav_path`, then synthesize the next chunk.

## Evaluation

Numbers below are from an internal held-out evaluation set. The eval set and training data are intentionally not described in this private model card.
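For readers less familiar with the metric, CER is the character-level edit distance between the ASR transcript and the reference text, divided by the number of reference characters. The `1227/12689` figure under Recommended Defaults is consistent with errors summed over the whole corpus and divided by the total reference length. The following is a minimal sketch of that computation, not the evaluation script used for this card. The `normalize` step (NFKC, lowercasing, keeping only letters and digits) is an assumption, since the card does not specify its normalization, and all function names are illustrative.

```python
import unicodedata


def normalize(text):
    """Illustrative normalization (an assumption, not the card's exact rule):
    NFKC-fold, lowercase, and keep only letters and digits."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if ch.isalnum())


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two DP rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Corpus-level CER: total character errors / total reference characters.

    `pairs` is an iterable of (reference_text, asr_transcript) tuples.
    """
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_n, hyp_n)
        total += len(ref_n)
    return errors / total if total else 0.0
```

Summing errors before dividing weights long sentences more heavily than averaging per-sentence CER would, so the two conventions can give different numbers on the same transcripts.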
| System | Setting | CER | WER |
| --- | --- | ---: | ---: |
| BlueMagpie-TTS | selected checkpoint | 4.81% | 5.36% |
| Reference baseline | same internal eval | 11.45% | 14.83% |

Selected checkpoint speed diagnostics on the same internal eval:

| Metric | Value |
| --- | ---: |
| Median duration-units/sec | 4.748 |
| Max duration-units/sec | 5.288 |

## Limitations

- Metrics are not a public benchmark and should be used only for internal model selection.
- Speaker similarity depends on the quality and rights-cleared status of the supplied reference audio or centroid.
- Very long passages should be chunked to avoid stop-token and prosody drift.
- Generated speech may be incorrect; do not use it as a real-world notification without human review.

## Files

- `pytorch_model.bin`: BlueMagpie model weights.
- `audiovae.pth`: AudioVAE weights.
- `config.json`: BlueMagpie architecture/runtime config.
- `tokenizer.json`, `tokenizer_config.json`: tokenizer files.
- `checkpoints/hung_yi_lee_speaker_centroids.pt`: default speaker centroid table used by the recommended defaults.
- `USAGE.md`: expanded usage guide.
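
The chunking advice under Long Text and Limitations can be sketched as follows. This is a minimal, standard-library-only illustration, not part of the `bluemagpie` package. The helper name `split_into_chunks`, the set of sentence-ending punctuation, and the `max_chars` budget are all assumptions to tune for your own text.

```python
import re

# Sentence-ending punctuation for Mandarin and English. This set is an
# assumption; "." is deliberately left out so decimals like "3.5" stay intact.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_into_chunks(text, max_chars=60):
    """Split text at sentence boundaries, then pack whole sentences into
    chunks of at most `max_chars` characters. A single sentence longer than
    `max_chars` is kept whole rather than cut mid-sentence."""
    chunks = []
    current = ""
    for sentence in _SENTENCE.findall(text):
        if not sentence.strip():
            continue
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


# Hypothetical usage with the model from Quick Start. This assumes
# model.generate returns a tensor whose last dimension is time:
#
#   waves = [model.generate(target_text=c, cfg_value=2.8, inference_timesteps=9)
#            for c in split_into_chunks(long_text)]
#   audio = torch.cat(waves, dim=-1)
```

Packing whole sentences rather than cutting at a fixed character count keeps each chunk prosodically complete, which is the reason the card recommends sentence-sized chunks.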