
🗣️ Hmong TTS — Orpheus‑3B (Fine‑Tuned) | LocalVoice.org


license: apache-2.0

A Hmong Text‑To‑Speech (TTS) model fine‑tuned from Orpheus‑3B‑TTS with Unsloth and paired with the SNAC neural audio codec. Built by LocalVoice.org to support Hmong language technology.

🙏 Special thanks to ThaiSC & HPC Ignite Program for HPC resources.


🌟 Model Highlights

  • ⚙️ Base Model: Orpheus‑3B TTS
  • 🔉 Codec: SNAC 24kHz (hubertsiuzdak/snac_24khz)
  • 🌍 Language: Hmong (Hmoob / Hmong Daw)
  • 🧠 Finetuned using Unsloth PEFT LoRA
  • 🎙️ Supports emotion tags: <giggle>, <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
  • 🎭 Optional multi‑speaker prompt prefix
  • ⚡ Real‑time inference on a single GPU

🧪 Quick Inference Example

from unsloth import FastLanguageModel
import torch
from snac import SNAC

# === Load Language Model (4bit optional) ===
model_path = "Pakorn2112/Orpheus-3B-TTS-hmong/model-single-speaker"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,            # Auto-detect precision
    load_in_4bit = False,    # Set True for 4-bit inference
)

# === Load SNAC codec ===
snac_path = "hubertsiuzdak/snac_24khz"
snac_model = SNAC.from_pretrained(snac_path).to("cuda")

# === Optional Voice ID (multi-speaker) ===
chosen_voice = 3   # Set None for single‑speaker

# === Emotion tags supported ===
# <giggle> <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>

prompts = [
    "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas.",
]
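For the multi‑speaker checkpoint, the speaker must be encoded in the prompt itself. The exact format for this checkpoint is not documented here; a common Orpheus‑style convention is a `{voice}: {text}` prefix, sketched below with a hypothetical `format_prompt` helper. Verify the convention against inference.py in this repository.

```python
def format_prompt(text, voice=None):
    """Prepend an optional speaker ID; plain text for the single-speaker model."""
    if voice is None:
        return text               # single-speaker model: use the text as-is
    return f"{voice}: {text}"     # assumed multi-speaker prefix convention

chosen_voice = 3  # as above; None for the single-speaker model
prompts = ["kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas."]
formatted = [format_prompt(p, chosen_voice) for p in prompts]
```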

# Enable fast inference
FastLanguageModel.for_inference(model)

# Optionally move the codec to CPU to free GPU memory
snac_model = snac_model.to("cpu")

🎧 Full Token Generation + Decoding (SNAC)

This script generates SNAC tokens and reconstructs audio. For full code, see: inference.py in this repository.

# (Token formatting, generation & decoding)
# Extract 128xxx audio tokens → reshape → decode via SNAC
# Full example in repository (same as provided in training logs)

🛑 Note: Output tokens must be split into 7‑tuple quantized layers before SNAC decoding.
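The 7‑tuple split mentioned above follows the standard Orpheus/SNAC layout: the 24 kHz SNAC model uses three quantizer layers, and each audio frame is emitted as 7 codes (1 + 2 + 4), where slot k carries a positional offset of k times the codebook size. A minimal sketch, assuming that layout and a 4096‑entry codebook; check the exact offsets against inference.py:

```python
CODEBOOK_SIZE = 4096  # SNAC codebook size (assumed; matches snac_24khz)

def redistribute_codes(code_list):
    """Split a flat stream of audio codes into SNAC's three quantizer layers.

    Each 7-code frame maps to: 1 code for layer 1, 2 codes for layer 2,
    and 4 codes for layer 3; slot k's k * CODEBOOK_SIZE offset is removed.
    """
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        frame = code_list[7 * i : 7 * i + 7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * CODEBOOK_SIZE)
        layer_3.append(frame[2] - 2 * CODEBOOK_SIZE)
        layer_3.append(frame[3] - 3 * CODEBOOK_SIZE)
        layer_2.append(frame[4] - 4 * CODEBOOK_SIZE)
        layer_3.append(frame[5] - 5 * CODEBOOK_SIZE)
        layer_3.append(frame[6] - 6 * CODEBOOK_SIZE)
    return layer_1, layer_2, layer_3

# Wrap each layer in a LongTensor of shape (1, T) before calling
# snac_model.decode([t1, t2, t3]) to reconstruct the 24 kHz waveform.
```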


🎙️ Example Usage With Audio Output (IPython)

from IPython.display import display, Audio

# Generate & play audio
# `my_samples` is the list of decoded waveform tensors from the SNAC step above
for prompt, samples in zip(prompts, my_samples):
    print(prompt)
    display(Audio(samples.detach().squeeze().cpu().numpy(), rate=24000))

📌 Recommended Dataset Format (metadata.json)

[
  {
    "audio": "wavs/001.wav",
    "text": "koj nyob li cas?",
    "speaker": "spk_f1"
  },
  {
    "audio": "wavs/002.wav",
    "text": "kuv nyob zoo ua tsaug.",
    "speaker": "spk_f1"
  }
]

💡 Tips for Best Quality

  • Use 24kHz mono WAV recordings
  • Trim silence and remove heavy noise
  • Keep clips 1‑8 seconds long per utterance
  • Use clear, natural speaking tone
  • Add optional emotion tokens for expressive voices

📄 License

apache-2.0

This model is released publicly for research & educational use. Commercial applications may require dataset rights & additional review.


🤝 Credits

  • Hmong TTS Model: LocalVoice.org
  • HPC Support: ThaiSC Supercomputer (LANTA) — HPC Ignite Program
  • SNAC Codec Team: hubertsiuzdak (24kHz codec)
  • Fine‑Tuning Framework: Unsloth

🎉 Thank you for supporting Hmong language technology! 🖤💚💙
