Mark v2 as experimental — known SLSCU val regression

20-clip SLSCU val sample shows v2 ~90% CER vs v1's 5.39%; LoRA rank 8 collapsed to ~3 templates. v3 (rank 32, 2 epochs, oversampled) in progress.

README.md +63 -73

README.md CHANGED

@@ -8,6 +8,7 @@ tags:
   - speech-recognition
   - lora
   - pathumma
 license: cc-by-sa-4.0
 base_model: nectec/Pathumma-llm-audio-1.0.0
 library_name: peft
@@ -15,21 +16,60 @@ library_name: peft
 # lanna-voice — Kham Mueang STT (Pathumma + LoRA)
-Speech-to-text fine-tune for **Kham Mueang (Northern Thai / คำเมือง)**, built on
-top of [`nectec/Pathumma-llm-audio-1.0.0`](https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0)
-with a small LoRA adapter trained on a mix of **SLSCU Khummuang (~33 h)** and
-**CMKL Porjai Central Thai (~50 h)**.
-This replaces the previous `stt_lora/` + `stt_ct2/` Whisper-LoRA setup. The
-Whisper-LoRA fine-tune collapsed onto SLSCU's narrow market-template
-distribution and produced unusable output on out-of-domain audio. Switching to
-the Pathumma audio LLM (Whisper encoder → BEATs → Q-Former → Qwen2 8B with
-LoRA) plus mixing in Central Thai data fixed the collapse.
 - **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
-- **Training:** 1 epoch, 12,203 steps, batch 4 (1 × 4 grad accum), bf16
 - **Hardware:** single L4 24 GB
-- **Final loss:** ~0.27 (no SLSCU-template overfit; v1 had collapsed to 0.02)
 ## Repo layout
@@ -40,21 +80,20 @@ pathumma_lora/
 README.md  ← you are here
 ```
-## How to use
-The Pathumma base model already wires LoRA into Qwen2; we overlay our trained
-weights on top. PEFT's `save_pretrained` strips the adapter name from keys, so
-when loading we have to insert `.default.` back into each LoRA tensor name.
 ```python
-import torch
 from huggingface_hub import hf_hub_download
 from safetensors.torch import load_file
 from transformers import AutoModel
 device = "cuda"
-# 1) Load base Pathumma in inference mode
 model = AutoModel.from_pretrained(
     "nectec/Pathumma-llm-audio-1.0.0",
     torch_dtype=torch.bfloat16,
@@ -63,7 +102,7 @@ model = AutoModel.from_pretrained(
     trust_remote_code=True,
 )
-# 2) Pull our LoRA adapter from this repo and rename keys
 adapter_path = hf_hub_download(
     "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
 )
@@ -79,8 +118,7 @@ for k, v in sd.items():
 model.qwen2_model.load_state_dict(renamed, strict=False)
 model = model.to(device).eval()
-# 3) Transcribe (prompt should match training exactly)
-import librosa
 audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
 out = model.generate(
     raw_wave=audio,
@@ -95,65 +133,17 @@ out = model.generate(
 print(out[0])
 ```
-### Recommended inference settings
-| Setting | Why |
-|---|---|
-| `num_beams=4` | Greedy decoding lets the LLM prior dominate when audio is ambiguous → hallucinated content. Beam search picks the joint-probable transcript. |
-| `repetition_penalty=1.2` | Discourages the loop-tail behaviour that LoRA can induce on long clips. |
-| `length_penalty=0.8` | The training data is mostly 1–15 s; without this, the model EOSes early on dense audio. |
-| `no_repeat_ngram_size=4` | Cheap insurance against repeated 4-grams in failure modes. |
-### Audio handling for clips longer than ~15 s
-The model is trained on 1–15 s clips (median ~5 s). For longer audio, segment
-with [silero-vad](https://github.com/snakers4/silero-vad) and pack each
-segment into ≤ 5 s chunks before transcribing. Stitch the per-chunk outputs.
-Larger chunks cause the model to EOS early, dropping content.
-## Performance
-### In-domain (SLSCU Khummuang val, 487 clips)
-The v1 (Whisper-LoRA) hit **5.39 % CER** on this set but only because the
-val set is in-distribution; on real out-of-domain audio it produced
-SLSCU-template hallucinations regardless of the input. v2 (this model)
-is harder to benchmark with a single number because the training mix is
-broader; honest evaluation is in progress.
-### Out-of-domain qualitative comparison
-On a 53-second medical dialogue (patient describing diabetic-like symptoms):
-| Model | Result |
-|---|---|
-| v1 (Whisper-LoRA, SLSCU only) | Completely off-topic — "cookies, honey, jasmine rice in 28 packs" (template hallucination) |
-| Base Pathumma (no LoRA) | Mostly correct medical dialogue in Central Thai register |
-| **v2 (Pathumma + this LoRA)** | **Mostly correct medical dialogue, with Kham Mueang particles preserved (เจ้า, เปิ้น, ฮู, หื้อ, ตวย)** |
-## Known limitations
-1. **Last-segment SLSCU bleed.** When a sentence has the prosodic pattern of
-   SLSCU's "X has Y of Z, N units" templates, the model still occasionally
-   collapses to that template. Most visible on the final segment of a long
-   utterance.
-2. **No conversational training data.** Both SLSCU (read e-commerce) and
-   Porjai (read news/wiki) are read speech. Natural conversation with
-   hesitations and prosodic emphasis is genuinely OOD.
-3. **No medical-domain Kham Mueang.** Adding ~100 h of synthesized medical
-   dialogue (via Gemini TTS) is the planned next step.
-4. **Space-tokenized output.** Both training corpora use space-separated
-   tokens, so output is space-tokenized. Strip if you want continuous script.
 ## Training data
 | Source | Hours | Style | License |
 |---|---|---|---|
-| [SLSCU Khummuang](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-khummuang) | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
-| [CMKL Porjai Central Thai](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central) | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
-Total: ~83 h, ~50:50 split. Held-out validation is SLSCU-only (487 clips,
-0.7 h) so the in-domain CER is comparable across versions.
 ## Citation

   - speech-recognition
   - lora
   - pathumma
+  - experimental
 license: cc-by-sa-4.0
 base_model: nectec/Pathumma-llm-audio-1.0.0
 library_name: peft
 # lanna-voice — Kham Mueang STT (Pathumma + LoRA)
+> ⚠️ **Status: experimental — known regression on in-domain SLSCU.**
+> This adapter is published for transparency / reproducibility while a
+> retrained version (v3, higher LoRA rank) is in progress. **Do not use
+> for production.** See *Known regression* below.
+LoRA adapter for [`nectec/Pathumma-llm-audio-1.0.0`](https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0),
+fine-tuned for **Kham Mueang (Northern Thai / คำเมือง)** on a mix of:
+- **SLSCU Khummuang** (~33 h, [CMKL/Porjai-Thai-voice-dataset-khummuang](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-khummuang))
+- **CMKL Porjai Central Thai** (~50 h, [CMKL/Porjai-Thai-voice-dataset-central](https://huggingface.co/datasets/CMKL/Porjai-Thai-voice-dataset-central))
+This replaces the previous `stt_lora/` + `stt_ct2/` Whisper-LoRA setup, which
+collapsed onto SLSCU's narrow market-template distribution and produced
+unusable output on out-of-domain audio. The Pathumma swap fixed the OOD
+collapse but introduced a different problem (see below).
 - **Adapter rank:** 8 (LoRA on `q_proj`, `v_proj`; ~2.5 M trainable params)
+- **Training:** 1 epoch, 12,203 steps, batch 4, bf16, LR 1e-4
 - **Hardware:** single L4 24 GB
+- **Final loss:** ~0.27
+## Known regression
+On a 20-clip sample of the **SLSCU val set** (in-domain), v2 hits
+**~90% CER**, versus 5.39% for the previous Whisper-LoRA (v1) on the full
+487-clip val set. The adapter has memorized three high-frequency SLSCU
+phrases and emits one of them on 14 of the 20 sampled clips:
+| v2 output (greedy) | Hit rate in 20-clip sample |
+|---|---|
+| `จ้วย ปิด หน้าต่าง หื้อ ตวย` | 6 / 20 |
+| `จ้วย ปิด ไฟ หื้อกำ` | 5 / 20 |
+| `ก๋าน จ่าย สตังค์ ของ จ้าว หั้น ยัง บ่ได้ ลง บัญชี` | 3 / 20 |
+Beam search does not help (beam=4 corpus CER 90.6% vs greedy 90.1%).
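For reference, a corpus-level CER like the figures quoted above is the summed character edit distance divided by the summed reference length. The sketch below is a minimal pure-Python version; the function names are hypothetical, and stripping spaces before scoring is an assumption, since the README does not say how whitespace was handled in the reported numbers.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a reference character
                curr[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps, strip_spaces=True):
    """Corpus-level CER: total edits / total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(refs, hyps):
        if strip_spaces:  # assumption: score on continuous script
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return total_edits / total_chars if total_chars else 0.0
```

Python strings are sequences of Unicode code points, so Thai vowel and tone marks each count as one character here. A model that emits the same memorized phrase for unrelated references will score close to 100% under this metric regardless of the decoding strategy.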
+**Root cause:** rank-8 LoRA × 1 epoch × 50:50 data mix is **under-capacity**.
+With a broader training distribution than v1, the rank-8 adapter (2.5 M params)
+can only memorize a few high-frequency patterns rather than the full SLSCU
+template space. v3 will use rank 32 + lower dropout + 2 epochs + 2× oversampled
+SLSCU to fix this.
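The ~2.5 M figure can be sanity-checked with simple arithmetic: a LoRA pair on a linear layer mapping `d_in` to `d_out` adds `r * (d_in + d_out)` parameters. The dimensions below are an assumption, not something this README states: they are the published Qwen2-7B configuration (hidden size 3584, 28 layers, 4 KV heads of dimension 128). Treat this as a rough check rather than the exact accounting.

```python
def lora_params(rank, shapes, num_layers):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# ASSUMED dimensions (Qwen2-7B config); not confirmed by this README.
HIDDEN = 3584
KV_DIM = 4 * 128          # 4 key/value heads x head dimension 128
NUM_LAYERS = 28
SHAPES = [
    (HIDDEN, HIDDEN),     # q_proj
    (HIDDEN, KV_DIM),     # v_proj
]

for rank in (8, 32):
    print(rank, lora_params(rank, SHAPES, NUM_LAYERS))
```

Under these assumptions rank 8 gives 2,523,136 parameters, consistent with the ~2.5 M quoted above, and rank 32 would give about 10.1 M on the same two projections.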
+## Where v2 still helps
+The medical-dialogue audio used for our OOD test (53 s patient describing
+diabetic-like symptoms) is so far from SLSCU acoustics that the LoRA's
+template trigger doesn't fire — instead the base Pathumma transcribes the
+content correctly and the LoRA applies modest Kham-Mueang flavoring. With
+silero-vad chunking + beam search, v2 produces mostly-coherent medical
+content with `เจ้า / เปิ้น / ฮู / หื้อ / ตวย` particles preserved.
+So: v2 is roughly *base Pathumma + Kham-Mueang accent overlay* on real OOD
+audio, but it is *broken* on short, SLSCU-acoustics clips. Use base Pathumma
+directly for production until v3 is ready.
 ## Repo layout
 README.md  ← you are here
 ```
+## How to use (with caveats)
+PEFT's `save_pretrained` strips the adapter name from key paths, so when
+loading we have to insert `.default.` back into each LoRA tensor name.
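The rename rule described above can be sketched as a small helper. This is an illustration only: the helper name and the example key are hypothetical, and the exact key prefix in this repo's `adapter_model.safetensors` may differ, so the loader code in the README itself remains authoritative.

```python
import re

# Matches a saved LoRA key ending such as "...q_proj.lora_A.weight".
_LORA_SUFFIX = re.compile(r"\.(lora_[AB])\.weight$")


def add_adapter_name(state_dict, adapter_name="default"):
    """Return a copy of state_dict with the adapter name re-inserted.

    "x.lora_A.weight" becomes "x.lora_A.<adapter_name>.weight"; keys that
    are not LoRA weights, or already carry a name, pass through unchanged.
    """
    renamed = {}
    for key, tensor in state_dict.items():
        new_key = _LORA_SUFFIX.sub(rf".\1.{adapter_name}.weight", key)
        renamed[new_key] = tensor
    return renamed
```

Because `load_state_dict(..., strict=False)` silently ignores keys that do not match, it is worth inspecting the `missing_keys` and `unexpected_keys` it returns to confirm the LoRA tensors were actually loaded.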
 ```python
+import torch, librosa
 from huggingface_hub import hf_hub_download
 from safetensors.torch import load_file
 from transformers import AutoModel
 device = "cuda"
+# 1) Base Pathumma in inference mode
 model = AutoModel.from_pretrained(
     "nectec/Pathumma-llm-audio-1.0.0",
     torch_dtype=torch.bfloat16,
     trust_remote_code=True,
 )
+# 2) Overlay our LoRA, with the .default. rename rule
 adapter_path = hf_hub_download(
     "mfuni/lanna-voice", "pathumma_lora/adapter_model.safetensors",
 )
 model.qwen2_model.load_state_dict(renamed, strict=False)
 model = model.to(device).eval()
+# 3) Transcribe (beam search does not fix the short-clip collapse; see Known regression)
 audio, _ = librosa.load("clip.wav", sr=16000, mono=True)
 out = model.generate(
     raw_wave=audio,
 print(out[0])
 ```
+For audio longer than ~15 s, segment with [silero-vad](https://github.com/snakers4/silero-vad)
+into ≤ 5 s chunks and stitch the outputs.
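One way to do the packing step is sketched below. It assumes the VAD output is a list of dicts with `start` and `end` in samples, which is the shape silero-vad's `get_speech_timestamps` is understood to return; the function name and the greedy merge strategy are illustrative choices, not the method used to produce the results in this README.

```python
def pack_segments(timestamps, sample_rate=16000, max_chunk_s=5.0):
    """Pack VAD speech segments into (start, end) sample ranges <= max_chunk_s."""
    max_len = int(max_chunk_s * sample_rate)

    # Hard-split any single speech segment that exceeds the limit.
    pieces = []
    for ts in timestamps:
        start, end = ts["start"], ts["end"]
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        if end > start:
            pieces.append((start, end))

    # Greedily merge consecutive pieces while the merged span fits the limit.
    chunks = []
    for start, end in pieces:
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks
```

Each resulting range can be sliced from the waveform (`audio[start:end]`), transcribed on its own, and the outputs joined in order. Merged chunks keep the silence between segments, and the hard split can cut through a word in a long unbroken segment, so a smarter split point may be preferable in practice.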
 ## Training data
 | Source | Hours | Style | License |
 |---|---|---|---|
+| SLSCU Khummuang | 33 | Read e-commerce + survey | CC-BY-SA-4.0 |
+| CMKL Porjai Central Thai | 50 (capped) | Read news + Wikipedia | CC-BY-SA-4.0 |
+Total ~83 h, ~50:50 split. Validation is SLSCU-only (487 clips, 0.7 h).
 ## Citation