lanna-voice — Kham Mueang STT (Pathumma + LoRA)

⚠️ Status: experimental — known regression on in-domain SLSCU. This adapter is published for transparency / reproducibility while a retrained version (v3, higher LoRA rank) is in progress. Do not use for production. See Known regression below.

LoRA adapter for nectec/Pathumma-llm-audio-1.0.0, fine-tuned for Kham Mueang (Northern Thai / คำเมือง) on a mix of:

SLSCU Khummuang (~33 h, CMKL/Porjai-Thai-voice-dataset-khummuang)
CMKL Porjai Central Thai (~50 h, CMKL/Porjai-Thai-voice-dataset-central)

This replaces the previous stt_lora/ + stt_ct2/ Whisper-LoRA setup, which collapsed onto SLSCU's narrow market-template distribution and produced unusable output on out-of-domain audio. The Pathumma swap fixed the OOD collapse but introduced a different problem (see below).

Adapter rank: 8 (LoRA on q_proj, v_proj; ~2.5 M trainable params)
Training: 1 epoch, 12,203 steps, batch 4, bf16, LR 1e-4
Hardware: single L4 24 GB
Final loss: ~0.27
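
The ~2.5 M figure can be sanity-checked with a little arithmetic. The sketch below assumes the language backbone has Qwen2-7B-style dimensions (hidden size 3584, 28 layers, 4 key/value heads of size 128). These numbers are not stated in this card, so treat them as assumptions to verify against the base model's config; lora_param_count is a hypothetical helper name.

```python
# Assumed Qwen2-7B-style backbone dimensions (NOT stated in this card).
HIDDEN = 3584        # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 4 * 128     # v_proj maps HIDDEN -> 4 KV heads x 128 dims
NUM_LAYERS = 28


def lora_param_count(rank: int) -> int:
    """Trainable params for LoRA on q_proj and v_proj in every layer."""
    # Each adapted linear layer adds A (rank x in) and B (out x rank).
    q_proj = rank * (HIDDEN + HIDDEN)
    v_proj = rank * (HIDDEN + KV_DIM)
    return NUM_LAYERS * (q_proj + v_proj)


print(f"rank 8:  {lora_param_count(8):,}")
print(f"rank 32: {lora_param_count(32):,}")
```

Under these assumptions rank 8 comes out at about 2.5 M parameters, consistent with the figure above, and the count scales linearly with rank.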

Known regression

On a 20-clip sample of the in-domain SLSCU validation set, this adapter (v2) scores ~90% CER, versus 5.39% for the previous Whisper-LoRA on the same set. The adapter has memorized three high-frequency SLSCU phrases and emits one of them on most short clips with SLSCU-like acoustics (14 of the 20 sampled):

v2 output (greedy)	Hit rate in 20-clip sample
`จ้วย ปิด หน้าต่าง หื้อ ตวย`	6 / 20
`จ้วย ปิด ไฟ หื้อกำ`	5 / 20
`ก๋าน จ่าย สตังค์ ของ จ้าว หั้น ยัง บ่ได้ ลง บัญชี`	3 / 20

Beam search does not help (beam=4 corpus CER 90.6% vs greedy 90.1%).
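
For reference, corpus-level CER is commonly computed as the total character edit distance divided by the total reference length. The sketch below is one such implementation. How the figures above were normalized (for example, whether spaces are stripped) is not stated here, so the space-stripping step and the function names are assumptions.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def _strip_spaces(text: str) -> str:
    # Assumption: word-segmentation spaces are ignored when scoring.
    return text.replace(" ", "")


def corpus_cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edits over total reference characters across the corpus."""
    edits = sum(
        edit_distance(_strip_spaces(r), _strip_spaces(h))
        for r, h in zip(refs, hyps)
    )
    total = sum(len(_strip_spaces(r)) for r in refs)
    return edits / total
```

A hypothesis that is an unrelated memorized template typically needs nearly as many edits as the reference has characters, which is consistent with the collapse pushing CER toward 100%.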

Root cause: rank-8 LoRA × 1 epoch × 50:50 data mix is under-capacity. With a broader training distribution than v1, the rank-8 adapter (2.5 M params) can only memorize a few high-frequency patterns rather than the full SLSCU template space. v3 will use rank 32 + lower dropout + 2 epochs + 2× oversampled SLSCU to fix this.

Where v2 still helps

The medical-dialogue audio used for our OOD test (a 53 s clip of a patient describing diabetes-like symptoms) is so far from SLSCU acoustics that the LoRA's template trigger doesn't fire. Instead, the base Pathumma transcribes the content correctly and the LoRA applies modest Kham Mueang flavoring. With silero-vad chunking + beam search, v2 produces mostly coherent medical content with the เจ้า / เปิ้น / ฮู / หื้อ / ตวย particles preserved.

So: v2 is roughly base Pathumma + Kham-Mueang accent overlay on real OOD audio, but it is broken on short, SLSCU-acoustics clips. Use base Pathumma directly for production until v3 is ready.

Repo layout

pathumma_lora/
  adapter_config.json
  adapter_model.safetensors
README.md  ← you are here

How to use (with caveats)

PEFT's save_pretrained strips the adapter name from key paths, so when loading we have to insert .default. back into each LoRA tensor name.

import torch, librosa
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModel

device = "cuda"

# 1) Base Pathumma in inference mode
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch.bfloat16,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True,
)

# 2) Overlay our LoRA, with the .default. rename rule
adapter_path = hf_hub_download(
    "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
)
sd = load_file(adapter_path)
renamed = {}
for k, v in sd.items():
    if ".lora_A.weight" in k:
        renamed[k.replace(".lora_A.weight", ".lora_A.default.weight")] = v
    elif ".lora_B.weight" in k:
        renamed[k.replace(".lora_B.weight", ".lora_B.default.weight")] = v
    else:
        renamed[k] = v
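# strict=False silently skips keys that do not match, so verify that every
# adapter tensor actually landed in the model.
result = model.qwen2_model.load_state_dict(renamed, strict=False)
if result.unexpected_keys:
    raise RuntimeError(
        f"{len(result.unexpected_keys)} adapter keys did not match the model, "
        f"e.g. {result.unexpected_keys[0]}"
    )
model = model.to(device).eval()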

# 3) Transcribe. Beam search helped on our OOD audio, but it does not fix the
#    template collapse on short SLSCU-style clips (see Known regression).
audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
out = model.generate(
    raw_wave=audio,
    prompts="ถอดเสียงตามต้นฉบับโดยไม่แปลและไม่ดัดแปลง",
    device=device,
    max_new_tokens=200,
    num_beams=4,
    repetition_penalty=1.2,
    length_penalty=0.8,
    no_repeat_ngram_size=4,
)
print(out[0])

For audio longer than ~15 s, segment with silero-vad into ≤ 5 s chunks and stitch the outputs.
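
The chunk-and-stitch step could look roughly like the sketch below. It assumes the speech segments arrive as a list of dicts with "start" and "end" sample offsets, which is the shape silero-vad's get_speech_timestamps returns by default. pack_segments, transcribe_long, and the transcribe_fn callback are hypothetical names, not part of this repo.

```python
from typing import Callable, Dict, List, Sequence, Tuple

SAMPLE_RATE = 16_000
MAX_CHUNK_S = 5.0


def pack_segments(
    segments: List[Dict[str, int]],
    sample_rate: int = SAMPLE_RATE,
    max_chunk_s: float = MAX_CHUNK_S,
) -> List[Tuple[int, int]]:
    """Merge neighbouring speech segments into chunks of at most max_chunk_s."""
    max_len = int(max_chunk_s * sample_rate)
    chunks: List[Tuple[int, int]] = []
    for seg in segments:
        start, end = seg["start"], seg["end"]
        # Split a segment that is too long into fixed-size pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if end <= start:
            continue
        # Extend the previous chunk when the merged span still fits.
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def transcribe_long(
    audio: Sequence[float],
    segments: List[Dict[str, int]],
    transcribe_fn: Callable[[Sequence[float]], str],
) -> str:
    """Transcribe each chunk with transcribe_fn and join the texts with spaces."""
    texts = [transcribe_fn(audio[s:e]) for s, e in pack_segments(segments)]
    return " ".join(t.strip() for t in texts if t.strip())
```

Here transcribe_fn would wrap the model.generate call shown above and return out[0] for a single chunk. Merging neighbouring segments keeps the silence between them inside the chunk, which is a simplification.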

Training data

Source	Hours	Style	License
SLSCU Khummuang	33	Read e-commerce + survey	CC-BY-SA-4.0
CMKL Porjai Central Thai	50 (capped)	Read news + Wikipedia	CC-BY-SA-4.0

Total ~83 h, ~50:50 split. Validation is SLSCU-only (487 clips, 0.7 h).

Citation

@inproceedings{suwanbandit23_interspeech,
  author = {Artit Suwanbandit and Burin Naowarat and Orathai Sangpetch and Ekapol Chuangsuwanich},
  title  = {{Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition}},
  year   = 2023,
  booktitle = {Proc. INTERSPEECH 2023},
  pages  = {4069--4073}
}

Built by

Mae Fah Luang University — internal academic toolkit. Source: https://github.com/cnacha-mfu/mfu-lanna-voice
