majentik committed on
Commit 2bd7bb3 · verified · 1 Parent(s): 83de0b8

Improve model card and add LoRA sidecar config

Files changed (3)
  1. README.md +124 -95
  2. config.json +17 -5
  3. lora/lora_config.json +19 -0
README.md CHANGED
@@ -1,143 +1,172 @@
1
  ---
2
- license: gemma
 
 
3
  language:
4
- - en
5
  library_name: mlx
6
- tags:
7
- - mlx
8
- - apple-silicon
9
- - gemma
10
- - gemma-4
11
- - meralion
12
- - speech
13
- - asr
14
- - lora
15
- - singapore-english
16
- - singlish
17
  pipeline_tag: automatic-speech-recognition
18
  base_model:
19
- - google/gemma-4-E4B-it
20
- - MERaLiON/MERaLiON-3-10B
21
  datasets:
22
- - MERaLiON/Multitask-National-Speech-Corpus-v1
23
  ---
24
 
25
  # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)
26
 
27
- A Singapore-English ASR composition that pairs the **MERaLiON-3** speech encoder with a **BF16 Gemma-4-E4B** language model, glued by a small projector and a LoRA adapter trained for speech understanding.
28
 
29
- This is the **BF16 native** edition of the bundle — the decoder runs at full bfloat16 precision (no quantization). It improves on our 8-bit release and beats the standalone MERaLiON-3 baseline by **~9.7 WER points** on the MNSC ASR Part 2 test set.
30
 
31
- > Designed for Apple Silicon (MLX). One unified pipeline, runnable from a single Python process.
32
 
33
- ## Highlights
34
 
35
- - **Speech encoder**: MERaLiON-3 acoustic encoder + frame-level adaptor (frozen, fp16)
36
- - **Language model**: Gemma-4-E4B in **bfloat16** (no quantization, no calibration artifacts)
37
- - **Projector**: 3-stage MLP (3584 → 3072 → 2560) bridging speech → text embedding space
38
- - **LoRA adapter**: rank-16 on `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` across all 42 decoder layers
39
- - **Format**: native MLX `safetensors` throughout
40
- - **Bundle size**: ~16 GB
41
 
42
- ## Results
43
 
44
- Evaluated on **MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test** (3000 clips), Singapore English with code-switched proper nouns (Malay, Tamil, Mandarin place/person names).
 
 
45
 
46
- | System | WER ↓ |
47
- |---|---|
48
- | MERaLiON-3 (encoder + native decoder, baseline) | 25.78% |
49
- | **This release (BF16 + speech LoRA)** | **16.09%** |
50
 
51
- Δ = **−9.69 percentage points** vs. baseline.
52
 
53
- Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed (jiwer default), speaker prefix tags removed.
54
 
55
- ## Bundle layout
56
-
57
- ```
58
- .
59
- ├── config.json # composition manifest
60
- ├── PROVENANCE.md # data sources, eval methodology, license chain
61
- ├── README.md # this file
62
- ├── decoder/ # Gemma-4-E4B BF16 (MLX safetensors, 4 shards)
63
- ├── speech_encoder/ # MERaLiON-3 encoder + adaptor (fp16)
64
- ├── projector/ # 3-stage MLP, fp32
65
- └── lora/ # rank-16 LoRA adapters, fp32
66
  ```
67
 
68
  ## Quickstart
69
 
70
- This bundle is composed; loading it requires the small **`elderwise`** runtime that wires the four components together. The runtime ships in [ajentik/elderwise-mlx](https://github.com/ajentik/elderwise-mlx) (private). The pipeline class handles:
71
 
72
- - mel-spectrogram extraction from raw audio
73
- - forward through MERaLiON-3 encoder + adaptor
74
- - projection into Gemma-4 embedding space
75
- - LoRA-augmented decoder generation with proper chat templating
76
 
77
- Sketch:
78
 
79
  ```python
80
- from elderwise.inference import load_pipeline
81
 
82
- pipe = load_pipeline(
83
- decoder_path="path/to/decoder",
84
- speech_encoder_path="path/to/speech_encoder",
85
- projector_path="path/to/projector",
86
- lora_path="path/to/lora/adapters.safetensors",
87
  lora_rank=16,
88
- lora_target_names=("q_proj", "k_proj", "v_proj", "o_proj",
89
- "gate_proj", "up_proj", "down_proj"),
90
- lora_scale=20.0,
 
91
  )
92
 
93
- text = pipe.transcribe("clip.wav", prompt="Transcribe the following audio: ")
94
  print(text)
95
  ```
96
 
97
- For mode-switching (text-only vs. speech), set the LoRA scale to `0.0` for clean text generation and `20.0` for speech transcription. The same decoder weights serve both modes.
98
 
99
  ## Intended use
100
 
101
- - Singapore English ASR, especially conversational and code-switched speech
102
- - Research and prototyping on Apple Silicon
103
- - Component swap-in for downstream LM-augmented voice apps (notes, agents, IVRs)
104
 
105
- Not intended for safety-critical transcription, courtroom records, or medical use.
 
 
106
 
107
  ## Limitations
108
 
109
- - Trained primarily on Singapore English (MNSC v1, Parts 1–2). Out-of-distribution accents may degrade.
110
- - Code-switched proper nouns (Malay/Tamil/Mandarin lexical items) remain the dominant residual error class.
111
- - Long-form audio (>30 s clips) was not in scope; the bundle is tuned for utterance-level inputs.
112
- - Mandarin and other non-English languages are **not** covered by this LoRA. A separate switchable adapter is planned for Mandarin.
113
 
114
- ## Composition lineage
115
 
116
- ```
117
- MERaLiON-3 (encoder + adaptor)
118
-
119
- ▼ speech embeds (3584-d)
120
- ┌────────────┐
121
- projector │ 3584 3072 2560
122
- └────────────┘
123
- │ text-space embeds
124
-
125
- Gemma-4-E4B BF16 decoder + LoRA (rank-16)
126
-
127
-
128
- transcription
129
- ```
130
 
131
- The decoder shares one set of weights between text and speech; the LoRA is gated by a runtime scale.
132
 
133
- ## Provenance & licensing
134
 
135
- See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license terms. Summary:
136
 
137
- - **Decoder weights** inherit from `google/gemma-4-E4B-it` and are subject to the **Gemma Terms of Use**.
138
- - **Speech encoder** inherits from `MERaLiON/MERaLiON-3-10B`.
139
- - **Speech corpus** for LoRA training: `MERaLiON/Multitask-National-Speech-Corpus-v1`.
140
- - **Projector + LoRA** weights are released under the same Gemma terms.
141
 
142
  ## Citation
143
 
@@ -146,11 +175,11 @@ See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license
146
  title = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
147
  author = {majentik},
148
  year = {2026},
149
- howpublished = {\url{https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}}
150
  }
151
  ```
152
 
153
  ## Related releases
154
 
155
- - 8-bit edition: [`majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX`](https://huggingface.co/majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX) — smaller (~10 GB), 18.86% WER.
156
- - This BF16 edition is the recommended default when memory budget allows.
 
1
  ---
2
+ license: other
3
+ license_name: gemma-terms-and-meralion-release-terms
4
+ license_link: https://ai.google.dev/gemma/terms
5
  language:
6
+ - en
7
  library_name: mlx
8
  pipeline_tag: automatic-speech-recognition
9
+ tags:
10
+ - mlx
11
+ - safetensors
12
+ - gemma
13
+ - gemma-4
14
+ - meralion
15
+ - speech
16
+ - speech-to-text
17
+ - automatic-speech-recognition
18
+ - lora
19
+ - bfloat16
20
+ - singapore-english
21
+ - singlish
22
  base_model:
23
+ - google/gemma-4-E4B-it
24
+ - MERaLiON/MERaLiON-3-10B
25
  datasets:
26
+ - MERaLiON/Multitask-National-Speech-Corpus-v1
27
+ metrics:
28
+ - wer
29
+ model-index:
30
+ - name: Gemma-4-E4B-BF16 + MERaLiON Speech LoRA
31
+ results:
32
+ - task:
33
+ type: automatic-speech-recognition
34
+ name: Automatic Speech Recognition
35
+ dataset:
36
+ type: MERaLiON/Multitask-National-Speech-Corpus-v1
37
+ name: MNSC ASR Part 2 Test
38
+ split: test
39
+ metrics:
40
+ - type: wer
41
+ name: WER
42
+ value: 16.09
43
  ---
44
 
45
  # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)
46
 
47
+ A composed Singapore-English ASR model that connects the **MERaLiON-3** speech encoder to a **BF16 Gemma-4-E4B** decoder through a trained projector and rank-16 speech LoRA.
48
 
49
+ This BF16 release is the recommended quality-first edition: it keeps the decoder in native bfloat16, avoids quantization artifacts, and improves on the standalone MERaLiON-3 baseline by **9.69 WER points** (25.78% to 16.09%) on the MNSC ASR Part 2 test set.
50
 
51
+ > Important: this is a **composed MLX bundle**, not a vanilla `transformers.pipeline` checkpoint. Use the `elderwise` runtime (or equivalent wiring) to connect `speech_encoder/`, `projector/`, `decoder/`, and `lora/`.
52
 
53
+ ## Result summary
54
 
55
+ Evaluated on **MERaLiON Multitask National Speech Corpus v1 ASR Part 2 Test** (3000 utterance-level clips).
56
 
57
+ | System | WER ↓ | Notes |
58
+ |---|---:|---|
59
+ | MERaLiON-3 baseline | 25.78% | stock MERaLiON-3 encoder + native decoder |
60
+ | 8-bit Gemma-4 + MERaLiON speech LoRA | 18.86% | smaller sibling release |
61
+ | **This BF16 release** | **16.09%** | best-quality bundle |
62
 
63
+ - Absolute improvement vs. MERaLiON-3 baseline: **−9.69pp**
64
+ - Absolute improvement vs. 8-bit sibling: **−2.77pp**
65
+ - Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed from reference and hypothesis.
66
 
67
+ ## What is inside
68
 
69
+ | Path | Contents | Precision |
70
+ |---|---|---|
71
+ | `decoder/` | Gemma-4-E4B instruction decoder, MLX format | bfloat16 |
72
+ | `speech_encoder/` | MERaLiON-3 acoustic encoder + frame adaptor | fp16 |
73
+ | `projector/` | `LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm` | fp32 |
74
+ | `lora/` | rank-16 speech-alignment LoRA adapters + `lora_config.json` | fp32 |
75
+ | `config.json` | composition manifest | JSON |
76
+ | `PROVENANCE.md` | chain of custody, evaluation, license notes | Markdown |
77
 
78
+ The speech path is:
79
 
80
+ ```text
81
+ audio -> Whisper-style log-mel -> MERaLiON-3 encoder/adaptor -> 3584-d speech embeddings
82
+ -> projector -> 2560-d Gemma embedding space -> Gemma-4-E4B BF16 + speech LoRA -> text
83
  ```
84
 
85
  ## Quickstart
86
 
87
+ Install or clone the `elderwise` runtime that wires the components together:
88
 
89
+ ```bash
90
+ pip install git+https://github.com/ajentik/elderwise-mlx.git
91
+ # or: git clone https://github.com/ajentik/elderwise-mlx && pip install -e elderwise-mlx
92
+ ```
93
 
94
+ Then load the composed bundle:
95
 
96
  ```python
97
+ from pathlib import Path
98
+
99
+ from elderwise.inference import load_pipeline, transcribe_with_pipeline
100
+ from huggingface_hub import snapshot_download
101
+
102
+ bundle = Path(snapshot_download("majentik/gemma-4-e4b-mlx-elderwise-MERaLiON"))
103
 
104
+ pipeline = load_pipeline(
105
+ meralion_dir=str(bundle / "speech_encoder"),
106
+ gemma_id=str(bundle / "decoder"),
107
+ projector_path=str(bundle / "projector"),
108
+ lora_path=str(bundle / "lora"),
109
  lora_rank=16,
110
+ lora_target_names=(
111
+ "q_proj", "k_proj", "v_proj", "o_proj",
112
+ "gate_proj", "up_proj", "down_proj",
113
+ ),
114
  )
115
 
116
+ text = transcribe_with_pipeline(pipeline, "your_audio.wav", max_tokens=128)
117
  print(text)
118
  ```
119
 
120
+ Runtime notes:
121
+
122
+ - `lora_path` should point to the **directory** containing `adapters.safetensors` (`lora/`), not to the file itself.
123
+ - The target module list must match the adapter: `q/k/v/o/gate/up/down` across all 42 decoder layers.
124
+ - Use the prompt `Transcribe the following audio: ` unless you intentionally fine-tune/evaluate a different prompt contract.
125
+ - The speech LoRA is switchable in the runtime: enable speech mode for ASR, disable/scale to `0.0` for plain text generation.
126
 
127
  ## Intended use
128
 
129
+ Good fits:
130
+
131
+ - Singapore English / Singlish automatic speech recognition
132
+ - utterance-level voice notes, routing, search, and agent input
133
+ - MLX-native speech-language research with a shared text decoder
134
+
135
+ Not intended for:
136
 
137
+ - safety-critical or legal/medical transcription
138
+ - diarization, timestamps, speaker identification, or streaming ASR
139
+ - Mandarin-only ASR; a separate switchable Mandarin LoRA is planned
140
 
141
  ## Limitations
142
 
143
+ - The LoRA is specialized for Singapore English. Other accents and languages may degrade.
144
+ - Residual errors mostly cluster around rare or ambiguous proper nouns, especially code-switched names and places.
145
+ - Long-form audio was not the optimization target; split long recordings into utterance-sized chunks.
146
+ - This repo is a composed bundle. Generic hub inference widgets will not know how to run it without the `elderwise` runtime.
147
 
148
+ ## Architecture details
149
 
150
+ - Speech encoder output dimension: **3584**
151
+ - Projector hidden dimension: **3072**
152
+ - Decoder embedding dimension: **2560**
153
+ - Decoder depth: **42 layers**
154
+ - LoRA rank: **16**
155
+ - LoRA targets: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
156
+ - Speech-mode LoRA scale used by the release runtime: **20.0**
157
+
158
+ Gemma-4's per-layer embedding side channel is handled in the runtime by supplying explicit per-layer inputs for speech positions instead of forcing speech embeddings through token nearest-neighbor recovery.
159
 
160
+ ## Provenance and licenses
161
 
162
+ See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody. Summary:
163
 
164
+ - Decoder: `google/gemma-4-E4B-it`, converted to MLX bfloat16; Gemma Terms of Use apply.
165
+ - Speech tower: `MERaLiON/MERaLiON-3-10B`; MERaLiON release terms apply.
166
+ - Training data source: `MERaLiON/Multitask-National-Speech-Corpus-v1`; MNSC terms apply.
167
+ - Projector + LoRA: trained alignment components for this composition; distributed with the same upstream obligations.
168
 
169
+ Internal optimization recipe and hardware details are intentionally omitted from the public package.
170
 
171
  ## Citation
172
 
 
175
  title = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
176
  author = {majentik},
177
  year = {2026},
178
+ url = {https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}
179
  }
180
  ```
181
 
182
  ## Related releases
183
 
184
+ - 8-bit sibling: [`majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX`](https://huggingface.co/majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX) — smaller, 18.86% WER.
185
+ - This BF16 edition is the recommended release for best transcription quality.
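
The normalization and WER figures quoted in the README diff above can be illustrated with a small, dependency-free sketch. This is not the evaluation script used for the release: the speaker-prefix pattern (`<Speaker1>:`), the helper names, and the choice to pool errors over the whole corpus are all assumptions made for illustration.

```python
import re
import string

# Assumption: speaker tags look like "<Speaker1>: "; the README does not give the real format.
SPEAKER_TAG = re.compile(r"<speaker\d*>:?\s*", re.IGNORECASE)
# ASCII punctuation only, matching the "ASCII punctuation stripped" rule.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Drop speaker tags, lowercase, strip ASCII punctuation, collapse whitespace."""
    text = SPEAKER_TAG.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses) -> float:
    """Total word errors divided by total reference words (assumed pooling)."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref).split()
        hyp_words = normalize(hyp).split()
        errors += word_errors(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total if total else 0.0


refs = ["<Speaker1>: I stay at Ang Mo Kio."]
hyps = ["i stay at ang mo kyo"]
print(f"WER = {corpus_wer(refs, hyps):.2%}")  # one substitution over six reference words
```

Both reference and hypothesis pass through the same `normalize` step, which mirrors the README's note that speaker-prefix tags are removed from both sides before scoring.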
config.json CHANGED
@@ -3,7 +3,9 @@
3
  "version": "1.0.0-bf16",
4
  "kind": "composed_speech_to_text",
5
  "task": "automatic-speech-recognition",
6
- "language": ["en"],
 
 
7
  "domain": "Singapore English (MNSC)",
8
  "framework": "mlx",
9
  "dtype": "bfloat16",
@@ -23,7 +25,11 @@
23
  "projector": {
24
  "path": "projector",
25
  "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
26
- "dims": [3584, 3072, 2560],
27
  "dtype": "float32"
28
  },
29
  "lora": {
@@ -31,11 +37,17 @@
31
  "rank": 16,
32
  "scale": 20.0,
33
  "targets": [
34
- "q_proj", "k_proj", "v_proj", "o_proj",
35
- "gate_proj", "up_proj", "down_proj"
36
  ],
37
  "applied_layers": "all 42 decoder layers",
38
- "dtype": "float32"
 
39
  }
40
  },
41
  "inference": {
 
3
  "version": "1.0.0-bf16",
4
  "kind": "composed_speech_to_text",
5
  "task": "automatic-speech-recognition",
6
+ "language": [
7
+ "en"
8
+ ],
9
  "domain": "Singapore English (MNSC)",
10
  "framework": "mlx",
11
  "dtype": "bfloat16",
 
25
  "projector": {
26
  "path": "projector",
27
  "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
28
+ "dims": [
29
+ 3584,
30
+ 3072,
31
+ 2560
32
+ ],
33
  "dtype": "float32"
34
  },
35
  "lora": {
 
37
  "rank": 16,
38
  "scale": 20.0,
39
  "targets": [
40
+ "q_proj",
41
+ "k_proj",
42
+ "v_proj",
43
+ "o_proj",
44
+ "gate_proj",
45
+ "up_proj",
46
+ "down_proj"
47
  ],
48
  "applied_layers": "all 42 decoder layers",
49
+ "dtype": "float32",
50
+ "config": "lora/lora_config.json"
51
  }
52
  },
53
  "inference": {
lora/lora_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "format": "elderwise-switchable-lora",
3
+ "adapter_file": "adapters.safetensors",
4
+ "rank": 16,
5
+ "scale": 20.0,
6
+ "target_modules": [
7
+ "q_proj",
8
+ "k_proj",
9
+ "v_proj",
10
+ "o_proj",
11
+ "gate_proj",
12
+ "up_proj",
13
+ "down_proj"
14
+ ],
15
+ "decoder_layers": 42,
16
+ "dtype": "float32",
17
+ "speech_mode_scale": 20.0,
18
+ "text_mode_scale": 0.0
19
+ }
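
Because this commit duplicates the LoRA settings in two places (the `lora` block of `config.json` and the new `lora/lora_config.json` sidecar), a consistency check can catch drift between them. The sketch below is a minimal illustration, not part of the `elderwise` runtime: `check_lora_consistency` and `lora_scale_for` are hypothetical helper names, and the two dictionaries are excerpts copied from the diff rather than files read from disk.

```python
import json

# Excerpt of the "lora" block from config.json as shown in the diff (other keys omitted).
MANIFEST_LORA = json.loads("""
{
  "rank": 16,
  "scale": 20.0,
  "targets": ["q_proj", "k_proj", "v_proj", "o_proj",
              "gate_proj", "up_proj", "down_proj"],
  "dtype": "float32",
  "config": "lora/lora_config.json"
}
""")

# Contents of lora/lora_config.json as added by this commit.
SIDECAR = json.loads("""
{
  "format": "elderwise-switchable-lora",
  "adapter_file": "adapters.safetensors",
  "rank": 16,
  "scale": 20.0,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
  "decoder_layers": 42,
  "dtype": "float32",
  "speech_mode_scale": 20.0,
  "text_mode_scale": 0.0
}
""")


def check_lora_consistency(manifest_lora: dict, sidecar: dict) -> list:
    """Return a list of human-readable mismatches (empty when the two agree)."""
    problems = []
    # Pairs of (manifest key, sidecar key) that should carry the same value.
    for manifest_key, sidecar_key in [("rank", "rank"), ("scale", "scale"),
                                      ("dtype", "dtype"), ("targets", "target_modules")]:
        if manifest_lora[manifest_key] != sidecar[sidecar_key]:
            problems.append(
                f"{manifest_key}={manifest_lora[manifest_key]!r} "
                f"!= {sidecar_key}={sidecar[sidecar_key]!r}"
            )
    # The default scale is expected to equal the speech-mode scale.
    if sidecar["scale"] != sidecar["speech_mode_scale"]:
        problems.append("scale differs from speech_mode_scale")
    return problems


def lora_scale_for(mode: str, sidecar: dict) -> float:
    """Pick the LoRA scale for 'speech' or 'text' mode from the sidecar (hypothetical helper)."""
    key = {"speech": "speech_mode_scale", "text": "text_mode_scale"}[mode]
    return float(sidecar[key])


print(check_lora_consistency(MANIFEST_LORA, SIDECAR))
print(lora_scale_for("speech", SIDECAR), lora_scale_for("text", SIDECAR))
```

Keeping the mode scales in the sidecar means a runtime can switch between speech transcription and plain text generation by changing one number, without reloading the decoder weights.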