# 🗣️ Hmong TTS — Orpheus‑3B (Fine‑Tuned) | LocalVoice.org

License: apache-2.0
A Hmong text‑to‑speech (TTS) model fine‑tuned from Orpheus‑3B‑TTS with Unsloth and the SNAC audio codec. Built by LocalVoice.org to support Hmong language technology.
🙏 Special thanks to ThaiSC & HPC Ignite Program for HPC resources.
## 🌟 Model Highlights
- ⚙️ Base Model: Orpheus‑3B TTS
- 🔉 Codec: SNAC 24kHz (hubertsiuzdak/snac_24khz)
- 🌍 Language: Hmong (Hmoob / Hmong Daw)
- 🧠 Finetuned using Unsloth PEFT LoRA
- 🎙️ Supports emotion tags: `<giggle>`, `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, `<gasp>`
- 🎭 Optional multi‑speaker prompt prefix
- ⚡ Real‑time inference on a single GPU
## 🧪 Quick Inference Example
```python
from unsloth import FastLanguageModel
import torch
from snac import SNAC

# === Load language model (4-bit optional) ===
model_path = "Pakorn2112/Orpheus-3B-TTS-hmong/model-single-speaker"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length = 2048,
    dtype = None,          # Auto-detect precision
    load_in_4bit = False,  # Set True for 4-bit inference
)

# === Load SNAC codec ===
snac_path = "hubertsiuzdak/snac_24khz"
snac_model = SNAC.from_pretrained(snac_path).to("cuda")

# === Optional voice ID (multi-speaker) ===
chosen_voice = 3  # Set to None for the single-speaker model

# === Emotion tags supported ===
# <giggle> <laugh> <chuckle> <sigh> <cough> <sniffle> <groan> <yawn> <gasp>
prompts = [
    "kuv hu ua paj ntaub, <giggle> Koj lub npe hu li cas.",
]

# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Move the codec to CPU to free GPU memory for generation
snac_model = snac_model.to("cpu")
```
## 🎧 Full Token Generation + Decoding (SNAC)
This script generates SNAC tokens and reconstructs audio. For the full code, see `inference.py` in this repository.
```python
# (Token formatting, generation & decoding)
# Extract the 128xxx-range audio tokens → reshape → decode via SNAC
# Full example in the repository (same as provided in the training logs)
```
🛑 Note: Output tokens must be split into 7‑token frames and redistributed across SNAC's quantizer layers before decoding.
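The 7‑token frame split can be sketched as follows. This follows the layer redistribution used in the public Orpheus reference code; the per‑slot 4096 offset is an assumption carried over from that code, not something this card specifies, so check `inference.py` for the exact values used here:

```python
def redistribute_codes(code_list):
    """Split a flat list of audio tokens into SNAC's three quantizer layers.

    Each 7-token frame contributes 1 code to layer 1, 2 codes to layer 2,
    and 4 codes to layer 3. Slot i within a frame carries an i * 4096
    offset that is removed before decoding.
    """
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        frame = code_list[7 * i : 7 * i + 7]
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * 4096)
        layer_3.append(frame[2] - 2 * 4096)
        layer_3.append(frame[3] - 3 * 4096)
        layer_2.append(frame[4] - 4 * 4096)
        layer_3.append(frame[5] - 5 * 4096)
        layer_3.append(frame[6] - 6 * 4096)
    return layer_1, layer_2, layer_3

# The three lists are then wrapped as tensors and passed to
# snac_model.decode(...); see inference.py for the full pipeline.
```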
## 🎙️ Example Usage With Audio Output (IPython)
```python
from IPython.display import display, Audio

# Generate & play audio
# `my_samples` holds the waveforms decoded by SNAC in the previous step
for i in range(len(prompts)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().cpu().numpy(), rate=24000))
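Outside a notebook, the same float waveform can be written to disk. A minimal sketch using only the standard library; the clipping and 16‑bit scaling are conventional choices, not part of this card:

```python
import struct
import wave

def save_wav(samples, path, rate=24000):
    """Write float samples in [-1, 1] to a 16-bit mono WAV file."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)     # mono
        f.setsampwidth(2)     # 16-bit PCM
        f.setframerate(rate)  # 24 kHz, matching the SNAC codec
        f.writeframes(pcm)
```

Usage: `save_wav(samples.detach().squeeze().cpu().numpy(), "out.wav")`.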
## 📌 Recommended Dataset Format (`metadata.json`)
```json
[
  {
    "audio": "wavs/001.wav",
    "text": "koj nyob li cas?",
    "speaker": "spk_f1"
  },
  {
    "audio": "wavs/002.wav",
    "text": "kuv nyob zoo ua tsaug.",
    "speaker": "spk_f1"
  }
]
```
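A quick sanity check for this format. The field names match the example above; the specific validation rules are only a suggestion:

```python
import json

REQUIRED_FIELDS = {"audio", "text", "speaker"}

def validate_metadata(raw):
    """Return (index, problem) pairs for entries that don't match the
    metadata.json format shown above."""
    problems = []
    for i, entry in enumerate(json.loads(raw)):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        elif not entry["audio"].endswith(".wav"):
            problems.append((i, "audio is not a .wav path"))
        elif not entry["text"].strip():
            problems.append((i, "empty text"))
    return problems
```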
## 💡 Tips for Best Quality
- Use 24kHz mono WAV recordings
- Trim silence and remove heavy noise
- Keep clips 1‑8 seconds long per utterance
- Use clear, natural speaking tone
- Add optional emotion tokens for expressive voices
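The 1‑8 second guideline can be enforced mechanically when preparing a dataset. A minimal sketch that checks a clip's duration from its sample count at 24 kHz; the bounds simply mirror the tips above:

```python
SAMPLE_RATE = 24000  # 24 kHz mono, per the tips above
MIN_SECONDS, MAX_SECONDS = 1.0, 8.0

def clip_ok(num_samples, rate=SAMPLE_RATE):
    """True if a clip's duration falls in the recommended 1-8 s window."""
    duration = num_samples / rate
    return MIN_SECONDS <= duration <= MAX_SECONDS
```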
## 📄 License

apache-2.0

This model is released publicly for research and educational use. Commercial applications may require dataset rights and additional review.
## 🤝 Credits
- Hmong TTS Model: LocalVoice.org
- HPC Support: ThaiSC Supercomputer (LANTA) — HPC Ignite Program
- SNAC Codec Team: hubertsiuzdak (24kHz codec)
- Fine‑Tuning Framework: Unsloth
🎉 Thank you for supporting Hmong language technology! 🖤💚💙
## 🌳 Model Tree for Pakorn2112/Orpheus-3B-TTS-hmong

- Base model: meta-llama/Llama-3.2-3B-Instruct
- Fine-tuned: canopylabs/orpheus-3b-0.1-pretrained
- Fine-tuned: canopylabs/orpheus-3b-0.1-ft
- Fine-tuned: unsloth/orpheus-3b-0.1-ft