Mark v2 as experimental — known SLSCU val regression
20-clip SLSCU val sample shows v2 ~90% CER vs v1's 5.39%; LoRA rank 8 collapsed to ~3 templates. v3 (rank 32, 2 epochs, oversampled) in progress.
README.md
CHANGED
|
@@ -8,6 +8,7 @@ tags:
|
|
| 8 |
- speech-recognition
|
| 9 |
- lora
|
| 10 |
- pathumma
|
| 11 |
license: cc-by-sa-4.0
|
| 12 |
base_model: nectec/Pathumma-llm-audio-1.0.0
|
| 13 |
library_name: peft
|
|
@@ -15,21 +16,60 @@ library_name: peft
|
|
| 15 |
|
| 16 |
# lanna-voice — Kham Mueang STT (Pathumma + LoRA)
|
| 17 |
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
**
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
- **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
|
| 30 |
-
- **Training:** 1 epoch, 12,203 steps, batch 4
|
| 31 |
- **Hardware:** single L4 24 GB
|
| 32 |
-
- **Final loss:** ~0.27
|
| 33 |
|
| 34 |
## Repo layout
|
| 35 |
|
|
@@ -40,21 +80,20 @@ pathumma_lora/
|
|
| 40 |
README.md ← you are here
|
| 41 |
```
|
| 42 |
|
| 43 |
-
## How to use
|
| 44 |
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
when loading we have to insert `.default.` back into each LoRA tensor name.
|
| 48 |
|
| 49 |
```python
|
| 50 |
-
import torch
|
| 51 |
from huggingface_hub import hf_hub_download
|
| 52 |
from safetensors.torch import load_file
|
| 53 |
from transformers import AutoModel
|
| 54 |
|
| 55 |
device = "cuda"
|
| 56 |
|
| 57 |
-
# 1)
|
| 58 |
model = AutoModel.from_pretrained(
|
| 59 |
"nectec/Pathumma-llm-audio-1.0.0",
|
| 60 |
torch_dtype=torch.bfloat16,
|
|
@@ -63,7 +102,7 @@ model = AutoModel.from_pretrained(
|
|
| 63 |
trust_remote_code=True,
|
| 64 |
)
|
| 65 |
|
| 66 |
-
# 2)
|
| 67 |
adapter_path = hf_hub_download(
|
| 68 |
"mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
|
| 69 |
)
|
|
@@ -79,8 +118,7 @@ for k, v in sd.items():
|
|
| 79 |
model.qwen2_model.load_state_dict(renamed, strict=False)
|
| 80 |
model = model.to(device).eval()
|
| 81 |
|
| 82 |
-
# 3) Transcribe (
|
| 83 |
-
import librosa
|
| 84 |
audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
|
| 85 |
out = model.generate(
|
| 86 |
raw_wave=audio,
|
|
@@ -95,65 +133,17 @@ out = model.generate(
|
|
| 95 |
print(out[0])
|
| 96 |
```
|
| 97 |
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
| Setting | Why |
|
| 101 |
-
|---|---|
|
| 102 |
-
| `num_beams=4` | Greedy decoding lets the LLM prior dominate when audio is ambiguous → hallucinated content. Beam search picks the joint-probable transcript. |
|
| 103 |
-
| `repetition_penalty=1.2` | Discourages the loop-tail behaviour that LoRA can induce on long clips. |
|
| 104 |
-
| `length_penalty=0.8` | The training data is mostly 1–15 s; without this, the model EOSes early on dense audio. |
|
| 105 |
-
| `no_repeat_ngram_size=4` | Cheap insurance against repeated 4-grams in failure modes. |
|
| 106 |
-
|
| 107 |
-
### Audio handling for clips longer than ~15 s
|
| 108 |
-
|
| 109 |
-
The model is trained on 1–15 s clips (median ~5 s). For longer audio, segment
|
| 110 |
-
with [silero-vad](https://github.com/snakers4/silero-vad) and pack each
|
| 111 |
-
segment into ≤ 5 s chunks before transcribing. Stitch the per-chunk outputs.
|
| 112 |
-
Larger chunks cause the model to EOS early, dropping content.
|
| 113 |
-
|
| 114 |
-
## Performance
|
| 115 |
-
|
| 116 |
-
### In-domain (SLSCU Khummuang val, 487 clips)
|
| 117 |
-
|
| 118 |
-
The v1 (Whisper-LoRA) hit **5.39 % CER** on this set but only because the
|
| 119 |
-
val set is in-distribution; on real out-of-domain audio it produced
|
| 120 |
-
SLSCU-template hallucinations regardless of the input. v2 (this model)
|
| 121 |
-
is harder to benchmark with a single number because the training mix is
|
| 122 |
-
broader; honest evaluation is in progress.
|
| 123 |
-
|
| 124 |
-
### Out-of-domain qualitative comparison
|
| 125 |
-
|
| 126 |
-
On a 53-second medical dialogue (patient describing diabetic-like symptoms):
|
| 127 |
-
|
| 128 |
-
| Model | Result |
|
| 129 |
-
|---|---|
|
| 130 |
-
| v1 (Whisper-LoRA, SLSCU only) | Completely off-topic — "cookies, honey, jasmine rice in 28 packs" (template hallucination) |
|
| 131 |
-
| Base Pathumma (no LoRA) | Mostly correct medical dialogue in Central Thai register |
|
| 132 |
-
| **v2 (Pathumma + this LoRA)** | **Mostly correct medical dialogue, with Kham Mueang particles preserved (เจ้า, เปิ้น, ฮู, หื้อ, ตวย)** |
|
| 133 |
-
|
| 134 |
-
## Known limitations
|
| 135 |
-
|
| 136 |
-
1. **Last-segment SLSCU bleed.** When a sentence has the prosodic pattern of
|
| 137 |
-
SLSCU's "X has Y of Z, N units" templates, the model still occasionally
|
| 138 |
-
collapses to that template. Most visible on the final segment of a long
|
| 139 |
-
utterance.
|
| 140 |
-
2. **No conversational training data.** Both SLSCU (read e-commerce) and
|
| 141 |
-
Porjai (read news/wiki) are read speech. Natural conversation with
|
| 142 |
-
hesitations and prosodic emphasis is genuinely OOD.
|
| 143 |
-
3. **No medical-domain Kham Mueang.** Adding ~100 h of synthesized medical
|
| 144 |
-
dialogue (via Gemini TTS) is the planned next step.
|
| 145 |
-
4. **Space-tokenized output.** Both training corpora use space-separated
|
| 146 |
-
tokens, so output is space-tokenized. Strip if you want continuous script.
|
| 147 |
|
| 148 |
## Training data
|
| 149 |
|
| 150 |
| Source | Hours | Style | License |
|
| 151 |
|---|---|---|---|
|
| 152 |
-
|
|
| 153 |
-
|
|
| 154 |
|
| 155 |
-
Total
|
| 156 |
-
0.7 h) so the in-domain CER is comparable across versions.
|
| 157 |
|
| 158 |
## Citation
|
| 159 |
|
|
|
|
| 8 |
- speech-recognition
|
| 9 |
- lora
|
| 10 |
- pathumma
|
| 11 |
+
- experimental
|
| 12 |
license: cc-by-sa-4.0
|
| 13 |
base_model: nectec/Pathumma-llm-audio-1.0.0
|
| 14 |
library_name: peft
|
|
|
|
| 16 |
|
| 17 |
# lanna-voice — Kham Mueang STT (Pathumma + LoRA)
|
| 18 |
|
| 19 |
+
> ⚠️ **Status: experimental — known regression on in-domain SLSCU.**
|
| 20 |
+
> This adapter is published for transparency / reproducibility while a
|
| 21 |
+
> retrained version (v3, higher LoRA rank) is in progress. **Do not use
|
| 22 |
+
> for production.** See *Known regression* below.
|
| 23 |
|
| 24 |
+
LoRA adapter for [`nectec/Pathumma-llm-audio-1.0.0`](https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0),
|
| 25 |
+
fine-tuned for **Kham Mueang (Northern Thai / คำเมือง)** on a mix of:
|
| 26 |
+
|
| 27 |
+
- **SLSCU Khummuang** (~33 h, [CMKL/Porjai-Thai-voice-dataset-khummuang](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-khummuang))
|
| 28 |
+
- **CMKL Porjai Central Thai** (~50 h, [CMKL/Porjai-Thai-voice-dataset-central](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central))
|
| 29 |
+
|
| 30 |
+
This replaces the previous `stt_lora/` + `stt_ct2/` Whisper-LoRA setup, which
|
| 31 |
+
collapsed onto SLSCU's narrow market-template distribution and produced
|
| 32 |
+
unusable output on out-of-domain audio. The Pathumma swap fixed the OOD
|
| 33 |
+
collapse but introduced a different problem (see below).
|
| 34 |
|
| 35 |
- **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
|
| 36 |
+
- **Training:** 1 epoch, 12,203 steps, batch 4, bf16, LR 1e-4
|
| 37 |
- **Hardware:** single L4 24 GB
|
| 38 |
+
- **Final loss:** ~0.27
|
| 39 |
+
|
| 40 |
+
## Known regression
|
| 41 |
+
|
| 42 |
+
On a 20-clip sample of the **SLSCU val set** (in-domain), v2 hits
|
| 43 |
+
**~90% CER** — vs the previous Whisper-LoRA's 5.39% on the same set. The
|
| 44 |
+
adapter has memorized 3 high-frequency SLSCU phrases and emits one of them
|
| 45 |
+
on most short SLSCU-acoustics clips:
|
| 46 |
+
|
| 47 |
+
| v2 output (greedy) | Hit rate in 20-clip sample |
|
| 48 |
+
|---|---|
|
| 49 |
+
| `จ้วย ปิด หน้าต่าง หื้อ ตวย` | 6 / 20 |
|
| 50 |
+
| `จ้วย ปิด ไฟ หื้อกำ` | 5 / 20 |
|
| 51 |
+
| `ก๋าน จ่าย สตังค์ ของ จ้าว หั้น ยัง บ่ได้ ลง บัญชี` | 3 / 20 |
|
| 52 |
+
|
| 53 |
+
Beam search does not help (beam=4 corpus CER 90.6% vs greedy 90.1%).
|
| 54 |
+
|
| 55 |
+
**Root cause:** rank-8 LoRA × 1 epoch × 50:50 data mix is **under-capacity**.
|
| 56 |
+
With a broader training distribution than v1, the rank-8 adapter (2.5 M params)
|
| 57 |
+
can only memorize a few high-frequency patterns rather than the full SLSCU
|
| 58 |
+
template space. v3 will use rank 32 + lower dropout + 2 epochs + 2× oversampled
|
| 59 |
+
SLSCU to fix this.
|
| 60 |
+
|
| 61 |
+
## Where v2 still helps
|
| 62 |
+
|
| 63 |
+
The medical-dialogue audio used for our OOD test (53 s patient describing
|
| 64 |
+
diabetic-like symptoms) is so far from SLSCU acoustics that the LoRA's
|
| 65 |
+
template trigger doesn't fire — instead the base Pathumma transcribes the
|
| 66 |
+
content correctly and the LoRA applies modest Kham-Mueang flavoring. With
|
| 67 |
+
silero-vad chunking + beam search, v2 produces mostly-coherent medical
|
| 68 |
+
content with `เจ้า / เปิ้น / ฮู / หื้อ / ตวย` particles preserved.
|
| 69 |
+
|
| 70 |
+
So: v2 is roughly *base Pathumma + Kham-Mueang accent overlay* on real OOD
|
| 71 |
+
audio, but it is *broken* on short, SLSCU-acoustics clips. Use base Pathumma
|
| 72 |
+
directly for production until v3 is ready.
|
| 73 |
|
| 74 |
## Repo layout
|
| 75 |
|
|
|
|
| 80 |
README.md ← you are here
|
| 81 |
```
|
| 82 |
|
| 83 |
+
## How to use (with caveats)
|
| 84 |
|
| 85 |
+
PEFT's `save_pretrained` strips the adapter name from key paths, so when
|
| 86 |
+
loading we have to insert `.default.` back into each LoRA tensor name.
|
|
|
|
| 87 |
|
| 88 |
```python
|
| 89 |
+
import torch, librosa
|
| 90 |
from huggingface_hub import hf_hub_download
|
| 91 |
from safetensors.torch import load_file
|
| 92 |
from transformers import AutoModel
|
| 93 |
|
| 94 |
device = "cuda"
|
| 95 |
|
| 96 |
+
# 1) Base Pathumma in inference mode
|
| 97 |
model = AutoModel.from_pretrained(
|
| 98 |
"nectec/Pathumma-llm-audio-1.0.0",
|
| 99 |
torch_dtype=torch.bfloat16,
|
|
|
|
| 102 |
trust_remote_code=True,
|
| 103 |
)
|
| 104 |
|
| 105 |
+
# 2) Overlay our LoRA, with the .default. rename rule
|
| 106 |
adapter_path = hf_hub_download(
|
| 107 |
"mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
|
| 108 |
)
|
|
|
|
| 118 |
model.qwen2_model.load_state_dict(renamed, strict=False)
|
| 119 |
model = model.to(device).eval()
|
| 120 |
|
| 121 |
+
# 3) Transcribe with beam search (helps on OOD audio; does not fix the collapse on short SLSCU clips)
|
|
|
|
| 122 |
audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
|
| 123 |
out = model.generate(
|
| 124 |
raw_wave=audio,
|
|
|
|
| 133 |
print(out[0])
|
| 134 |
```
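The `.default.` rename mentioned before the snippet can be sketched in isolation. This is a minimal illustration, not the repo's actual loop (that part of the snippet is hidden in this diff). The helper name `add_adapter_name` and the assumption that saved LoRA keys end in `.lora_A.weight` or `.lora_B.weight` are illustrative; check the real key names in `adapter_model.safetensors` before relying on it.

```python
def add_adapter_name(state_dict, adapter_name="default"):
    """Re-insert the adapter name into LoRA tensor keys.

    Assumption: saved keys look like '<module>.lora_A.weight' and the
    in-memory model expects '<module>.lora_A.<adapter_name>.weight'.
    Keys that do not match are passed through unchanged.
    """
    renamed = {}
    for key, value in state_dict.items():
        for marker in ("lora_A", "lora_B"):
            suffix = f".{marker}.weight"
            if key.endswith(suffix):
                key = key[: -len(suffix)] + f".{marker}.{adapter_name}.weight"
                break
        renamed[key] = value
    return renamed


# Example with placeholder values instead of tensors.
saved = {
    "layers.0.self_attn.q_proj.lora_A.weight": "A",
    "layers.0.self_attn.q_proj.lora_B.weight": "B",
    "layers.0.self_attn.q_proj.weight": "W",
}
renamed = add_adapter_name(saved)
print(sorted(renamed))
```

Because the snippet loads with `strict=False`, a wrong rename fails silently. Comparing the renamed keys against `model.qwen2_model.state_dict().keys()` is a cheap way to confirm the adapter weights were actually matched.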
|
| 135 |
|
| 136 |
+
For audio longer than ~15 s, segment with [silero-vad](https://github.com/snakers4/silero-vad)
|
| 137 |
+
into ≤ 5 s chunks and stitch the outputs.
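The chunk-and-stitch step can be sketched as below. This is a hedged sketch under a few assumptions: the VAD output is a list of `{"start": ..., "end": ...}` dicts in sample indices (the shape silero-vad's `get_speech_timestamps` is assumed to return by default), and `pack_segments`, `transcribe_long` and `transcribe_fn` are hypothetical names, not part of this repo. Neighbouring segments are merged while the combined span stays within the limit, and any single segment longer than the limit is split.

```python
def pack_segments(timestamps, sample_rate=16000, max_seconds=5.0):
    """Turn VAD speech segments into (start, end) chunks of at most max_seconds."""
    max_len = int(max_seconds * sample_rate)
    chunks = []
    for seg in timestamps:
        start, end = seg["start"], seg["end"]
        # Split a single segment that is longer than the limit.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        # Merge into the previous chunk if the combined span still fits.
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


def transcribe_long(audio, timestamps, transcribe_fn, sample_rate=16000):
    """Transcribe each chunk with transcribe_fn and join the texts with spaces."""
    texts = []
    for start, end in pack_segments(timestamps, sample_rate):
        text = transcribe_fn(audio[start:end]).strip()
        if text:
            texts.append(text)
    return " ".join(texts)


# Example with fake VAD output and a stand-in transcriber.
vad_out = [
    {"start": 0, "end": 16000},
    {"start": 24000, "end": 48000},
    {"start": 100000, "end": 260000},
]
chunks = pack_segments(vad_out)
stitched = transcribe_long([0] * 260000, vad_out, lambda c: str(len(c)))
print(chunks, stitched)
```

In real use, `transcribe_fn` would wrap the `model.generate(raw_wave=...)` call from the snippet above. Splitting an over-long segment at a fixed offset can cut a word in half, so a lower VAD threshold or a small overlap may be worth trying.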
|
| 138 |
|
| 139 |
## Training data
|
| 140 |
|
| 141 |
| Source | Hours | Style | License |
|
| 142 |
|---|---|---|---|
|
| 143 |
+
| SLSCU Khummuang | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
|
| 144 |
+
| CMKL Porjai Central Thai | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
|
| 145 |
|
| 146 |
+
Total ~83 h, ~50:50 split. Validation is SLSCU-only (487 clips, 0.7 h).
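For reference, a corpus-level CER of the kind quoted for the SLSCU validation set can be computed as total character edits divided by total reference characters. The sketch below is not the evaluation script used for this model. The function names are illustrative, and stripping spaces before scoring is an assumption (the outputs are space-tokenized, and the README does not say whether spaces were counted). Characters are Python code points, so Thai combining vowels and tone marks count as separate characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_cer(references, hypotheses, strip_spaces=True):
    """Sum of edits over sum of reference lengths across all clips."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if strip_spaces:
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


print(corpus_cer(["a b c d"], ["a b c f"]))
```

Corpus-level CER weights long clips more than a mean of per-clip CERs does, so the two can differ noticeably on a 20-clip sample.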
|
|
|
|
| 147 |
|
| 148 |
## Citation
|
| 149 |
|