cnacha-mfu committed on
Commit dce0f08 · verified · 1 Parent(s): 1fb5d0c

Mark v2 as experimental — known SLSCU val regression

20-clip SLSCU val sample shows v2 ~90% CER vs v1's 5.39%; LoRA rank 8 collapsed to ~3 templates. v3 (rank 32, 2 epochs, oversampled) in progress.

Files changed (1)
  1. README.md +63 -73
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
8
  - speech-recognition
9
  - lora
10
  - pathumma
 
11
  license: cc-by-sa-4.0
12
  base_model: nectec/Pathumma-llm-audio-1.0.0
13
  library_name: peft
@@ -15,21 +16,60 @@ library_name: peft
15
 
16
  # lanna-voice — Kham Mueang STT (Pathumma + LoRA)
17
 
18
- Speech-to-text fine-tune for **Kham Mueang (Northern Thai / คำเมือง)**, built on
19
- top of [`nectec/Pathumma-llm-audio-1.0.0`](https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0)
20
- with a small LoRA adapter trained on a mix of **SLSCU Khummuang (~33 h)** and
21
- **CMKL Porjai Central Thai (~50 h)**.
22
 
23
- This replaces the previous `stt_lora/` + `stt_ct2/` Whisper-LoRA setup. The
24
- Whisper-LoRA fine-tune collapsed onto SLSCU's narrow market-template
25
- distribution and produced unusable output on out-of-domain audio. Switching to
26
- the Pathumma audio LLM (Whisper encoder → BEATs → Q-Former → Qwen2 8B with
27
- LoRA) plus mixing in Central Thai data fixed the collapse.
28
 
29
  - **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
30
- - **Training:** 1 epoch, 12,203 steps, batch 4 (1 × 4 grad accum), bf16
31
  - **Hardware:** single L4 24 GB
32
- - **Final loss:** ~0.27 (no SLSCU-template overfit; v1 had collapsed to 0.02)
33
 
34
  ## Repo layout
35
 
@@ -40,21 +80,20 @@ pathumma_lora/
40
  README.md ← you are here
41
  ```
42
 
43
- ## How to use
44
 
45
- The Pathumma base model already wires LoRA into Qwen2; we overlay our trained
46
- weights on top. PEFT's `save_pretrained` strips the adapter name from keys, so
47
- when loading we have to insert `.default.` back into each LoRA tensor name.
48
 
49
  ```python
50
- import torch
51
  from huggingface_hub import hf_hub_download
52
  from safetensors.torch import load_file
53
  from transformers import AutoModel
54
 
55
  device = "cuda"
56
 
57
- # 1) Load base Pathumma in inference mode
58
  model = AutoModel.from_pretrained(
59
  "nectec/Pathumma-llm-audio-1.0.0",
60
  torch_dtype=torch.bfloat16,
@@ -63,7 +102,7 @@ model = AutoModel.from_pretrained(
63
  trust_remote_code=True,
64
  )
65
 
66
- # 2) Pull our LoRA adapter from this repo and rename keys
67
  adapter_path = hf_hub_download(
68
  "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
69
  )
@@ -79,8 +118,7 @@ for k, v in sd.items():
79
  model.qwen2_model.load_state_dict(renamed, strict=False)
80
  model = model.to(device).eval()
81
 
82
- # 3) Transcribe (prompt should match training exactly)
83
- import librosa
84
  audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
85
  out = model.generate(
86
  raw_wave=audio,
@@ -95,65 +133,17 @@ out = model.generate(
95
  print(out[0])
96
  ```
97
 
98
- ### Recommended inference settings
99
-
100
- | Setting | Why |
101
- |---|---|
102
- | `num_beams=4` | Greedy decoding lets the LLM prior dominate when audio is ambiguous → hallucinated content. Beam search picks the joint-probable transcript. |
103
- | `repetition_penalty=1.2` | Discourages the loop-tail behaviour that LoRA can induce on long clips. |
104
- | `length_penalty=0.8` | The training data is mostly 1–15 s; without this, the model EOSes early on dense audio. |
105
- | `no_repeat_ngram_size=4` | Cheap insurance against repeated 4-grams in failure modes. |
106
-
107
- ### Audio handling for clips longer than ~15 s
108
-
109
- The model is trained on 1–15 s clips (median ~5 s). For longer audio, segment
110
- with [silero-vad](https://github.com/snakers4/silero-vad) and pack each
111
- segment into ≤ 5 s chunks before transcribing. Stitch the per-chunk outputs.
112
- Larger chunks cause the model to EOS early, dropping content.
113
-
114
- ## Performance
115
-
116
- ### In-domain (SLSCU Khummuang val, 487 clips)
117
-
118
- The v1 (Whisper-LoRA) hit **5.39 % CER** on this set but only because the
119
- val set is in-distribution; on real out-of-domain audio it produced
120
- SLSCU-template hallucinations regardless of the input. v2 (this model)
121
- is harder to benchmark with a single number because the training mix is
122
- broader; honest evaluation is in progress.
123
-
124
- ### Out-of-domain qualitative comparison
125
-
126
- On a 53-second medical dialogue (patient describing diabetic-like symptoms):
127
-
128
- | Model | Result |
129
- |---|---|
130
- | v1 (Whisper-LoRA, SLSCU only) | Completely off-topic — "cookies, honey, jasmine rice in 28 packs" (template hallucination) |
131
- | Base Pathumma (no LoRA) | Mostly correct medical dialogue in Central Thai register |
132
- | **v2 (Pathumma + this LoRA)** | **Mostly correct medical dialogue, with Kham Mueang particles preserved (เจ้า, เปิ้น, ฮู, หื้อ, ตวย)** |
133
-
134
- ## Known limitations
135
-
136
- 1. **Last-segment SLSCU bleed.** When a sentence has the prosodic pattern of
137
- SLSCU's "X has Y of Z, N units" templates, the model still occasionally
138
- collapses to that template. Most visible on the final segment of a long
139
- utterance.
140
- 2. **No conversational training data.** Both SLSCU (read e-commerce) and
141
- Porjai (read news/wiki) are read speech. Natural conversation with
142
- hesitations and prosodic emphasis is genuinely OOD.
143
- 3. **No medical-domain Kham Mueang.** Adding ~100 h of synthesized medical
144
- dialogue (via Gemini TTS) is the planned next step.
145
- 4. **Space-tokenized output.** Both training corpora use space-separated
146
- tokens, so output is space-tokenized. Strip if you want continuous script.
147
 
148
  ## Training data
149
 
150
  | Source | Hours | Style | License |
151
  |---|---|---|---|
152
- | [SLSCU Khummuang](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-khummuang) | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
153
- | [CMKL Porjai Central Thai](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central) | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
154
 
155
- Total: ~83 h, ~50:50 split. Held-out validation is SLSCU-only (487 clips,
156
- 0.7 h) so the in-domain CER is comparable across versions.
157
 
158
  ## Citation
159
 
 
8
  - speech-recognition
9
  - lora
10
  - pathumma
11
+ - experimental
12
  license: cc-by-sa-4.0
13
  base_model: nectec/Pathumma-llm-audio-1.0.0
14
  library_name: peft
 
16
 
17
  # lanna-voice — Kham Mueang STT (Pathumma + LoRA)
18
 
19
+ > ⚠️ **Status: experimental, with a known regression on in-domain SLSCU.**
20
+ > This adapter is published for transparency / reproducibility while a
21
+ > retrained version (v3, higher LoRA rank) is in progress. **Do not use
22
+ > for production.** See *Known regression* below.
23
 
24
+ LoRA adapter for [`nectec/Pathumma-llm-audio-1.0.0`](https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0),
25
+ fine-tuned for **Kham Mueang (Northern Thai / คำเมือง)** on a mix of:
26
+
27
+ - **SLSCU Khummuang** (~33 h, [CMKL/Porjai-Thai-voice-dataset-khummuang](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-khummuang))
28
+ - **CMKL Porjai Central Thai** (~50 h, [CMKL/Porjai-Thai-voice-dataset-central](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central))
29
+
30
+ This replaces the previous `stt_lora/` + `stt_ct2/` Whisper-LoRA setup, which
31
+ collapsed onto SLSCU's narrow market-template distribution and produced
32
+ unusable output on out-of-domain audio. The Pathumma swap fixed the OOD
33
+ collapse but introduced a different problem (see below).
34
 
35
  - **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
36
+ - **Training:** 1 epoch, 12,203 steps, batch 4, bf16, LR 1e-4
37
  - **Hardware:** single L4 24 GB
38
+ - **Final loss:** ~0.27
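
As a rough sanity check on the ~2.5 M figure, the sketch below counts LoRA parameters for rank-`r` adapters on `q_proj` and `v_proj`. The layer dimensions (28 layers, hidden size 3584, 4 KV heads of size 128) are assumptions taken from the public Qwen2-7B configuration, not values read from this checkpoint, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank, hidden=3584, num_layers=28, kv_heads=4, head_dim=128):
    """Count LoRA parameters for adapters on q_proj and v_proj.

    Each adapted Linear(in, out) adds A (rank x in) and B (out x rank),
    i.e. rank * (in + out) parameters. The dimension defaults are assumed
    Qwen2-7B-style values, not read from the checkpoint.
    """
    q_out = hidden                # q_proj: hidden -> hidden
    v_out = kv_heads * head_dim   # v_proj: hidden -> kv_heads * head_dim
    per_layer = rank * (hidden + q_out) + rank * (hidden + v_out)
    return per_layer * num_layers


print(lora_param_count(8))   # 2523136, i.e. ~2.5 M
print(lora_param_count(32))  # 10092544, i.e. ~10 M at the rank planned for v3
```

Under these assumed dimensions the rank-8 count lands on ~2.5 M, and moving to rank 32 scales it by exactly 4×.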
39
+
40
+ ## Known regression
41
+
42
+ On a 20-clip sample of the **SLSCU val set** (in-domain), v2 hits
43
+ **~90% CER** — vs the previous Whisper-LoRA's 5.39% on the same set. The
44
+ adapter has memorized 3 high-frequency SLSCU phrases and emits one of them
45
+ on most short SLSCU-acoustics clips:
46
+
47
+ | v2 output (greedy) | Hit rate in 20-clip sample |
48
+ |---|---|
49
+ | `จ้วย ปิด หน้าต่าง หื้อ ตวย` | 6 / 20 |
50
+ | `จ้วย ปิด ไฟ หื้อกำ` | 5 / 20 |
51
+ | `ก๋าน จ่าย สตังค์ ของ จ้าว หั้น ยัง บ่ได้ ลง บัญชี` | 3 / 20 |
52
+
53
+ Beam search does not help (beam=4 corpus CER 90.6% vs greedy 90.1%).
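
For reference, corpus-level CER can be read as total character edit distance divided by total reference characters. The sketch below is a minimal pure-Python version, not the evaluation script used for the numbers above. Whether those numbers were computed with or without the spaces between tokens is not stated, so the `strip_spaces` flag is an assumption, and the example strings are illustrative rather than taken from the val set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def corpus_cer(refs, hyps, strip_spaces=True):
    """Sum of edit distances over sum of reference lengths."""
    errors = chars = 0
    for ref, hyp in zip(refs, hyps):
        if strip_spaces:  # assumption: score on continuous script
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        errors += edit_distance(ref, hyp)
        chars += len(ref)
    return errors / chars


# Illustrative strings only: one substitution over seven reference characters.
print(corpus_cer(["ab cd", "xyz"], ["ab xd", "xyz"]))
```

Because every character of an unrelated hypothesis counts as an error, a clip answered with a memorized template can score close to 100 % on its own, which is how a few such phrases can dominate the corpus figure.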
54
+
55
+ **Root cause:** rank-8 LoRA × 1 epoch × 50:50 data mix is **under-capacity**.
56
+ With a broader training distribution than v1, the rank-8 adapter (2.5 M params)
57
+ can only memorize a few high-frequency patterns rather than the full SLSCU
58
+ template space. v3 will use rank 32 + lower dropout + 2 epochs + 2× oversampled
59
+ SLSCU to fix this.
60
+
61
+ ## Where v2 still helps
62
+
63
+ The medical-dialogue audio used for our OOD test (53 s patient describing
64
+ diabetic-like symptoms) is so far from SLSCU acoustics that the LoRA's
65
+ template trigger doesn't fire — instead the base Pathumma transcribes the
66
+ content correctly and the LoRA applies modest Kham-Mueang flavoring. With
67
+ silero-vad chunking + beam search, v2 produces mostly-coherent medical
68
+ content with `เจ้า / เปิ้น / ฮู / หื้อ / ตวย` particles preserved.
69
+
70
+ So: v2 is roughly *base Pathumma + Kham-Mueang accent overlay* on real OOD
71
+ audio, but it is *broken* on short, SLSCU-acoustics clips. Use base Pathumma
72
+ directly for production until v3 is ready.
73
 
74
  ## Repo layout
75
 
 
80
  README.md ← you are here
81
  ```
82
 
83
+ ## How to use (with caveats)
84
 
85
+ PEFT's `save_pretrained` strips the adapter name from key paths, so when
86
+ loading we have to insert `.default.` back into each LoRA tensor name.
 
87
 
88
  ```python
89
+ import torch, librosa
90
  from huggingface_hub import hf_hub_download
91
  from safetensors.torch import load_file
92
  from transformers import AutoModel
93
 
94
  device = "cuda"
95
 
96
+ # 1) Base Pathumma in inference mode
97
  model = AutoModel.from_pretrained(
98
  "nectec/Pathumma-llm-audio-1.0.0",
99
  torch_dtype=torch.bfloat16,
 
102
  trust_remote_code=True,
103
  )
104
 
105
+ # 2) Overlay our LoRA, with the .default. rename rule
106
  adapter_path = hf_hub_download(
107
  "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
108
  )
 
118
  model.qwen2_model.load_state_dict(renamed, strict=False)
119
  model = model.to(device).eval()
120
 
121
+ # 3) Transcribe (beam search, as in the OOD test; both beam and greedy collapse on short SLSCU clips)
 
122
  audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
123
  out = model.generate(
124
  raw_wave=audio,
 
133
  print(out[0])
134
  ```
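
The rename loop itself falls between hunks in this diff, so it is not shown above. A minimal sketch of what such a rule might look like follows. The key prefixes in the toy `sd` dict and the `add_adapter_name` helper are assumptions for illustration; the actual keys in `adapter_model.safetensors` may differ.

```python
def add_adapter_name(key, adapter_name="default"):
    """Turn '...lora_A.weight' into '...lora_A.default.weight'."""
    for part in ("lora_A", "lora_B"):
        old = f".{part}.weight"
        if key.endswith(old):
            return key[: -len(old)] + f".{part}.{adapter_name}.weight"
    return key  # non-LoRA tensors pass through unchanged


# Toy state dict with hypothetical key names; the values stand in for tensors.
sd = {
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0",
}
renamed = {add_adapter_name(k): v for k, v in sd.items()}
for k in renamed:
    print(k)
```

Because `load_state_dict(..., strict=False)` ignores keys it cannot match, it may be worth checking that the `unexpected_keys` field of its return value is empty after loading; otherwise a wrong rename would silently leave the base weights untouched.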
135
 
136
+ For audio longer than ~15 s, segment with [silero-vad](https://github.com/snakers4/silero-vad)
137
+ into ≤ 5 s chunks and stitch the outputs.
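
One way to do the packing step is sketched below. It assumes the VAD has already produced speech segments as `(start, end)` pairs in seconds; `pack_segments` and its greedy strategy are hypothetical, not the procedure used for the results in this README.

```python
def pack_segments(segments, max_len=5.0):
    """Greedily merge consecutive (start, end) speech segments, in seconds,
    into chunks that span at most max_len; over-long segments are split."""
    chunks = []
    for start, end in segments:
        # Split a single over-long segment into max_len pieces first.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))        # start a new chunk
    return chunks


# Illustrative timestamps, not taken from real audio.
segments = [(0.0, 1.5), (2.0, 3.0), (3.5, 6.0), (7.0, 19.0)]
for s, e in pack_segments(segments):
    print(f"{s:.1f}-{e:.1f} s")
```

Each chunk can then be cut from the 16 kHz waveform as `audio[int(s * 16000):int(e * 16000)]`, transcribed on its own, and the outputs joined in order.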
138
 
139
  ## Training data
140
 
141
  | Source | Hours | Style | License |
142
  |---|---|---|---|
143
+ | SLSCU Khummuang | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
144
+ | CMKL Porjai Central Thai | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
145
 
146
+ Total ~83 h, ~50:50 split. Validation is SLSCU-only (487 clips, 0.7 h).
 
147
 
148
  ## Citation
149