SE-Bridge-TTS Weights

This model repository hosts the public release checkpoints for SE-Bridge-TTS, the project accompanying the ICML 2026 paper "Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models".

Hugging Face Classification

  • Repository type: model
  • Task / pipeline: text-to-speech
  • Library: pytorch
  • Languages: Thai (th) and Lao (lo)
  • Primary tags: text-to-speech, speech-synthesis, thai, lao, low-resource, spoken-language-model

Files

File                 Description
thai_tts.pt          Public Thai TTS checkpoint.
lao_tts.pt           Public Lao TTS checkpoint.
release_config.json  Sanitized release metadata for the two checkpoints.
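
Since each checkpoint is a separate file, it is possible to fetch only the one you need rather than the whole repository. The following is a minimal sketch, assuming the repository id used in the inference example; `checkpoint_filename` and `fetch_checkpoint` are hypothetical helper names, not part of this release.

```python
from pathlib import Path

HF_REPO_ID = "isabeth/SE-Bridge-TTS"  # repository id taken from the inference example
CHECKPOINTS = {"thai": "thai_tts.pt", "lao": "lao_tts.pt"}


def checkpoint_filename(language: str) -> str:
    """Map a language name to its checkpoint file, rejecting unknown names."""
    key = language.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"unsupported language {language!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


def fetch_checkpoint(language: str) -> Path:
    """Download only the requested checkpoint and return its local path."""
    # Imported lazily so checkpoint_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(
        hf_hub_download(repo_id=HF_REPO_ID, filename=checkpoint_filename(language))
    )
```

Calling `fetch_checkpoint("thai")` would download `thai_tts.pt` into the local Hugging Face cache and return its path.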

Inference

The released files are CosyVoice2 LLM checkpoints. They are intended to be loaded with a CosyVoice2-compatible checkout and the standard CosyVoice2 base model assets. The base model directory should contain the normal CosyVoice2 configuration and acoustic/vocoder weights, while this repository supplies the Thai or Lao LLM checkpoint.
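
Before loading anything, it can help to confirm that the base model directory is complete. The sketch below checks for a set of file names that are an assumption about the usual CosyVoice2-0.5B layout (they are not listed in this repository), so adjust `EXPECTED_BASE_ASSETS` to match your own download; `missing_base_assets` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names for a CosyVoice2-0.5B base directory; verify against your copy.
EXPECTED_BASE_ASSETS = (
    "cosyvoice2.yaml",           # model configuration
    "llm.pt",                    # base LLM weights, later overridden by this repo's checkpoint
    "flow.pt",                   # acoustic model weights
    "hift.pt",                   # vocoder weights
    "campplus.onnx",             # speaker embedding extractor
    "speech_tokenizer_v2.onnx",  # speech tokenizer
)


def missing_base_assets(base_model_dir, expected=EXPECTED_BASE_ASSETS):
    """Return the expected file names that are absent from base_model_dir."""
    base = Path(base_model_dir)
    return [name for name in expected if not (base / name).is_file()]
```

An empty list from `missing_base_assets("pretrained_models/CosyVoice2-0.5B")` suggests the directory is ready; any names returned point to an incomplete download.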

Install or prepare CosyVoice first:

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
pip install -r requirements.txt
pip install huggingface_hub torchaudio
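
The inference example adds `third_party/Matcha-TTS` to the import path, which only works if that git submodule has been checked out. A clone made without submodules leaves the directory empty or absent; in that case `git submodule update --init --recursive` inside the checkout should populate it. A small sanity check, with `matcha_submodule_ready` as a hypothetical helper name:

```python
from pathlib import Path


def matcha_submodule_ready(repo_root="."):
    """Return True when third_party/Matcha-TTS exists and is not empty."""
    matcha_dir = Path(repo_root) / "third_party" / "Matcha-TTS"
    return matcha_dir.is_dir() and any(matcha_dir.iterdir())
```

Run it from the CosyVoice repository root, or pass the path to the checkout as `repo_root`.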

Minimal zero-shot inference example:

import sys
from pathlib import Path

import torch
import torchaudio
from huggingface_hub import snapshot_download

# Run from the CosyVoice repository root so the Matcha-TTS submodule is importable.
sys.path.append("third_party/Matcha-TTS")

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav


HF_REPO_ID = "isabeth/SE-Bridge-TTS"
BASE_MODEL_DIR = Path("pretrained_models/CosyVoice2-0.5B")

language = "lao"  # choose "thai" or "lao"
checkpoint_name = {
    "thai": "thai_tts.pt",
    "lao": "lao_tts.pt",
}[language]

weights_dir = Path(snapshot_download(HF_REPO_ID))
checkpoint_path = weights_dir / checkpoint_name

cosyvoice = CosyVoice2(
    str(BASE_MODEL_DIR),
    load_jit=False,
    load_trt=False,
    load_vllm=False,
    fp16=False,
)
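# Replace the base LLM weights with the selected Thai or Lao checkpoint.
# strict=False tolerates key mismatches, so check the returned missing/unexpected keys.
state_dict = torch.load(checkpoint_path, map_location="cpu")
cosyvoice.model.llm.load_state_dict(state_dict, strict=False)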

prompt_speech_16k = load_wav("prompt.wav", 16000)
prompt_text = "Transcript of prompt.wav."
tts_text = "Text to synthesize in the selected language."

for idx, output in enumerate(
    cosyvoice.inference_zero_shot(
        tts_text,
        prompt_text,
        prompt_speech_16k,
        stream=False,
    )
):
    torchaudio.save(
        f"se_bridge_tts_{language}_{idx}.wav",
        output["tts_speech"],
        cosyvoice.sample_rate,
    )
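
Because the example loads the checkpoint with `strict=False`, a checkpoint whose parameter names do not match the model would load without raising an error and leave the base weights in place. PyTorch's `load_state_dict` returns the missing and unexpected keys, and the same comparison can be made up front. The sketch below is one way to do that; the helper names are hypothetical, and the `allowed_extra` entries are an assumption about bookkeeping keys that training checkpoints sometimes carry.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in model, absent from checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # in checkpoint, unknown to model
    return missing, unexpected


def check_checkpoint_keys(model_keys, checkpoint_keys, allowed_extra=("epoch", "step")):
    """Raise if the checkpoint does not cover the model's parameters."""
    missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
    # Ignore assumed non-parameter bookkeeping entries.
    unexpected = [key for key in unexpected if key not in allowed_extra]
    if missing or unexpected:
        raise RuntimeError(
            f"checkpoint mismatch: {len(missing)} missing, {len(unexpected)} unexpected keys"
        )


# Usage with the objects from the example above:
#   check_checkpoint_keys(cosyvoice.model.llm.state_dict().keys(), state_dict.keys())
```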

For cross-lingual prompting, use the same loaded model and replace the generation loop with:

for idx, output in enumerate(
    cosyvoice.inference_cross_lingual(
        tts_text,
        prompt_speech_16k,
        stream=False,
    )
):
    torchaudio.save(
        f"se_bridge_tts_{language}_cross_lingual_{idx}.wav",
        output["tts_speech"],
        cosyvoice.sample_rate,
    )
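
Both generation modes depend on the prompt recording, so it may be worth inspecting `prompt.wav` before synthesis. The sketch below uses only the standard library and therefore handles PCM WAV files only. The 16 kHz minimum mirrors the rate passed to `load_wav` in the example; the 30-second ceiling is an assumption about CosyVoice's prompt handling and should be checked against your checkout. Both helper names are hypothetical.

```python
import wave

MIN_SAMPLE_RATE = 16000     # rate requested from load_wav in the example
MAX_PROMPT_SECONDS = 30.0   # assumed upper bound; verify against your CosyVoice version


def describe_prompt_wav(path):
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / float(sample_rate)
    return {"sample_rate": sample_rate, "channels": channels, "seconds": seconds}


def prompt_problems(info, max_seconds=MAX_PROMPT_SECONDS):
    """Return a list of human-readable issues; an empty list means none were found."""
    problems = []
    if info["sample_rate"] < MIN_SAMPLE_RATE:
        problems.append("sample rate is below 16 kHz")
    if info["seconds"] > max_seconds:
        problems.append(f"prompt is longer than {max_seconds:g} s")
    if info["seconds"] == 0:
        problems.append("prompt contains no audio frames")
    return problems
```

For example, `prompt_problems(describe_prompt_wav("prompt.wav"))` returns an empty list when none of these checks fail.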

Release Notes

This release package has been sanitized for public distribution. Internal server paths, private data paths, training-stage names, and operational configuration details are intentionally omitted. The repository does not describe per-stage checkpoint construction methods.
