lanna-voice — Kham Mueang STT (Pathumma + LoRA)
⚠️ Status: experimental — known regression on in-domain SLSCU. This adapter is published for transparency / reproducibility while a retrained version (v3, higher LoRA rank) is in progress. Do not use for production. See Known regression below.
LoRA adapter for nectec/Pathumma-llm-audio-1.0.0,
fine-tuned for Kham Mueang (Northern Thai / คำเมือง) on a mix of:
- SLSCU Khummuang (~33 h, CMKL/Porjai-Thai-voice-dataset-khummuang)
- CMKL Porjai Central Thai (~50 h, CMKL/Porjai-Thai-voice-dataset-central)
This replaces the previous stt_lora/ + stt_ct2/ Whisper-LoRA setup, which
collapsed onto SLSCU's narrow market-template distribution and produced
unusable output on out-of-domain audio. The Pathumma swap fixed the OOD
collapse but introduced a different problem (see below).
- Adapter rank: 8 (LoRA on q_proj, v_proj; ~2.5 M trainable params)
- Training: 1 epoch, 12,203 steps, batch 4, bf16, LR 1e-4
- Hardware: single L4 24 GB
- Final loss: ~0.27
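The ~2.5 M figure can be sanity-checked with a quick count. The sketch below assumes the language model inside Pathumma has the Qwen2-7B shape (28 decoder layers, hidden size 3584, 4 key/value heads of dimension 128). Those dimensions are not stated in this card, so treat them as an assumption.

```python
# Assumed decoder shape (Qwen2-7B config values; NOT stated in this card).
NUM_LAYERS = 28
HIDDEN = 3584
KV_DIM = 4 * 128  # num_key_value_heads * head_dim


def lora_params(rank, in_features, out_features):
    """LoRA adds A (rank x in) and B (out x rank) to each wrapped linear layer."""
    return rank * (in_features + out_features)


def adapter_params(rank):
    """Trainable parameters for LoRA on q_proj and v_proj in every layer."""
    q_proj = lora_params(rank, HIDDEN, HIDDEN)
    v_proj = lora_params(rank, HIDDEN, KV_DIM)
    return NUM_LAYERS * (q_proj + v_proj)


print(adapter_params(8))  # 2523136
```

Under the same assumption the count scales linearly with rank, so a rank-32 adapter on the same two projections would have four times as many trainable parameters.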
Known regression
On a 20-clip sample of the SLSCU validation set (in-domain), this adapter (v2) scores ~90% CER, versus 5.39% for the previous Whisper-LoRA on the same set. The adapter has memorized three high-frequency SLSCU phrases and emits one of them for most short clips with SLSCU-like acoustics:
| v2 output (greedy) | Hit rate in 20-clip sample |
|---|---|
| จ้วย ปิด หน้าต่าง หื้อ ตวย | 6 / 20 |
| จ้วย ปิด ไฟ หื้อกำ | 5 / 20 |
| ก๋าน จ่าย สตังค์ ของ จ้าว หั้น ยัง บ่ได้ ลง บัญชี | 3 / 20 |
Beam search does not help (beam=4 corpus CER 90.6% vs greedy 90.1%).
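For readers who want to reproduce a corpus-level CER, here is a minimal sketch: total character edits divided by total reference characters. The card does not say how spaces or Thai combining marks were handled, so the space stripping and the use of Python code points as "characters" are assumptions here, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def corpus_cer(refs, hyps, strip_spaces=True):
    """Sum of edits over all clips divided by the sum of reference lengths."""
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        if strip_spaces:  # assumption: word-boundary spaces are not scored
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        edits += edit_distance(ref, hyp)
        total += len(ref)
    return edits / total


# Toy strings, not real transcripts: 1 substitution over 6 reference characters.
print(corpus_cer(["abcd", "xy"], ["abxd", "xy"]))
```

Because insertions count as errors, this ratio can exceed 100% when a model emits a long memorized phrase for a short reference.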
Root cause: rank-8 LoRA × 1 epoch × 50:50 data mix is under-capacity. With a broader training distribution than v1, the rank-8 adapter (2.5 M params) can only memorize a few high-frequency patterns rather than the full SLSCU template space. v3 will use rank 32 + lower dropout + 2 epochs + 2× oversampled SLSCU to fix this.
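One plausible reading of "2× oversampled SLSCU" is to repeat every SLSCU item twice when building each epoch. The sketch below shows that reading only. The function name, the item lists, and the seed are hypothetical, and the actual v3 pipeline may sample differently.

```python
import random


def build_epoch(slscu_items, central_items, slscu_factor=2, seed=0):
    """Repeat SLSCU items `slscu_factor` times, add Central Thai, then shuffle."""
    epoch = list(slscu_items) * slscu_factor + list(central_items)
    random.Random(seed).shuffle(epoch)  # seeded so the order is reproducible
    return epoch


# Placeholder IDs standing in for manifest entries.
epoch = build_epoch(["s1", "s2"], ["c1", "c2", "c3"])
print(len(epoch))  # 7
```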
Where v2 still helps
The medical-dialogue audio used for our OOD test (53 s patient describing
diabetic-like symptoms) is so far from SLSCU acoustics that the LoRA's
template trigger doesn't fire — instead the base Pathumma transcribes the
content correctly and the LoRA applies modest Kham-Mueang flavoring. With
silero-vad chunking + beam search, v2 produces mostly-coherent medical
content with เจ้า / เปิ้น / ฮู / หื้อ / ตวย particles preserved.
So: v2 is roughly base Pathumma + Kham-Mueang accent overlay on real OOD audio, but it is broken on short, SLSCU-acoustics clips. Use base Pathumma directly for production until v3 is ready.
Repo layout
pathumma_lora/
    adapter_config.json
    adapter_model.safetensors
README.md    ← you are here
How to use (with caveats)
PEFT's save_pretrained strips the adapter name from key paths, so when
loading we have to insert .default. back into each LoRA tensor name.
import torch
import librosa
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel

device = "cuda"

# 1) Base Pathumma in inference mode
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch.bfloat16,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True,
)

# 2) Overlay our LoRA, with the .default. rename rule
adapter_path = hf_hub_download(
    "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
)
sd = load_file(adapter_path)
renamed = {}
for k, v in sd.items():
    if ".lora_A.weight" in k:
        renamed[k.replace(".lora_A.weight", ".lora_A.default.weight")] = v
    elif ".lora_B.weight" in k:
        renamed[k.replace(".lora_B.weight", ".lora_B.default.weight")] = v
    else:
        renamed[k] = v
model.qwen2_model.load_state_dict(renamed, strict=False)
model = model.to(device).eval()

# 3) Transcribe (use beam search; greedy collapses on short clips)
audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
out = model.generate(
    raw_wave=audio,
    prompts="ถอดเสียงตามต้นฉบับโดยไม่แปลและไม่ดัดแปลง",
    device=device,
    max_new_tokens=200,
    num_beams=4,
    repetition_penalty=1.2,
    length_penalty=0.8,
    no_repeat_ngram_size=4,
)
print(out[0])
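With strict=False, load_state_dict does not raise on names that fail to match, so a wrong rename would silently leave the base weights unchanged. PyTorch reports such names in the unexpected_keys field of the value it returns. The sketch below is a pure-Python check along the same lines that can run before loading. The helper names and the toy key names are hypothetical.

```python
def add_adapter_name(key, adapter_name="default"):
    """Insert the adapter name after lora_A / lora_B, as in the loading step."""
    for part in ("lora_A", "lora_B"):
        key = key.replace(f".{part}.weight", f".{part}.{adapter_name}.weight")
    return key


def unmatched_keys(adapter_keys, model_keys, adapter_name="default"):
    """Renamed adapter keys that the target module does not contain."""
    model_keys = set(model_keys)
    renamed = (add_adapter_name(k, adapter_name) for k in adapter_keys)
    return sorted(k for k in renamed if k not in model_keys)


# Toy key names for illustration only; real names depend on the checkpoint.
adapter = ["layers.0.q_proj.lora_A.weight", "layers.0.q_proj.lora_B.weight"]
target = ["layers.0.q_proj.lora_A.default.weight"]
print(unmatched_keys(adapter, target))
```

In the loading code this would be called as `unmatched_keys(sd.keys(), model.qwen2_model.state_dict().keys())`. An empty list suggests every adapter tensor has a destination.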
For audio longer than ~15 s, segment with silero-vad into ≤ 5 s chunks and stitch the outputs.
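The chunk-and-stitch step can be sketched as follows. It assumes the speech segments arrive as a list of `{"start", "end"}` dicts in samples, which is the default output shape of silero-vad's `get_speech_timestamps`. The packing policy and the `transcribe` callable (a caller-supplied wrapper around `model.generate`) are illustrative assumptions, not the exact procedure used for the results in this card.

```python
SAMPLE_RATE = 16000


def pack_chunks(segments, max_s=5.0, sr=SAMPLE_RATE):
    """Turn VAD speech segments into (start, end) sample ranges of at most max_s."""
    max_len = int(max_s * sr)
    chunks = []
    for seg in segments:
        start, end = seg["start"], seg["end"]
        # Hard-split a single segment that is longer than the limit.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        # Extend the previous chunk if the merged span still fits the limit.
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def transcribe_long(audio, segments, transcribe, max_s=5.0, sr=SAMPLE_RATE):
    """Run `transcribe` on each chunk and join the non-empty texts with spaces."""
    texts = [transcribe(audio[s:e]) for s, e in pack_chunks(segments, max_s, sr)]
    return " ".join(t.strip() for t in texts if t.strip())


# Toy segments in samples: 1 s, then 1.5 s, then an 8.5 s stretch.
segs = [{"start": 0, "end": 16000},
        {"start": 24000, "end": 48000},
        {"start": 64000, "end": 200000}]
print(pack_chunks(segs))  # [(0, 48000), (64000, 144000), (144000, 200000)]
```

Merging neighbouring segments keeps short utterances together at the cost of including the silence between them. Hard splits can cut mid-word, so a tighter VAD setting may be preferable where available.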
Training data
| Source | Hours | Style | License |
|---|---|---|---|
| SLSCU Khummuang | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
| CMKL Porjai Central Thai | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
Total ~83 h, ~50:50 split. Validation is SLSCU-only (487 clips, 0.7 h).
Citation
@inproceedings{suwanbandit23_interspeech,
author = {Artit Suwanbandit and Burin Naowarat and Orathai Sangpetch and Ekapol Chuangsuwanich},
title = {{Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition}},
year = 2023,
booktitle = {Proc. INTERSPEECH 2023},
pages = {4069--4073}
}
Built by
Mae Fah Luang University — internal academic toolkit. Source: https://github.com/cnacha-mfu/mfu-lanna-voice