Improve model card and add LoRA sidecar config
- README.md +124 -95
- config.json +17 -5
- lora/lora_config.json +19 -0
README.md
CHANGED
@@ -1,143 +1,172 @@
 ---
-license:
 language:
-
 library_name: mlx
-tags:
-- mlx
-- apple-silicon
-- gemma
-- gemma-4
-- meralion
-- speech
-- asr
-- lora
-- singapore-english
-- singlish
 pipeline_tag: automatic-speech-recognition
 base_model:
-
-
 datasets:
-
 ---

 # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)

-A Singapore-English ASR

-This is the

->

-##

-
-- **Language model**: Gemma-4-E4B in **bfloat16** (no quantization, no calibration artifacts)
-- **Projector**: 3-stage MLP (3584 → 3072 → 2560) bridging speech → text embedding space
-- **LoRA adapter**: rank-16 on `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` across all 42 decoder layers
-- **Format**: native MLX `safetensors` throughout
-- **Bundle size**: ~16 GB

-

-

-
-|---|---|
-| MERaLiON-3 (encoder + native decoder, baseline) | 25.78% |
-| **This release (BF16 + speech LoRA)** | **16.09%** |

-

-

-
-
-
-.
-├── config.json # composition manifest
-├── PROVENANCE.md # data sources, eval methodology, license chain
-├── README.md # this file
-├── decoder/ # Gemma-4-E4B BF16 (MLX safetensors, 4 shards)
-├── speech_encoder/ # MERaLiON-3 encoder + adaptor (fp16)
-├── projector/ # 3-stage MLP, fp32
-└── lora/ # rank-16 LoRA adapters, fp32
 ```

 ## Quickstart

-

-
-
-
-

-

 ```python
-from

-
-
-
-    projector_path=
-    lora_path=
     lora_rank=16,
-    lora_target_names=(
-
-
 )

-text =
 print(text)
 ```

-

 ## Intended use

-
-
--

-

 ## Limitations

--
--
-- Long-form audio
--

-##

-
-
-
-
-
-
-
-
-
-Gemma-4-E4B BF16 decoder + LoRA (rank-16)
-│
-▼
-transcription
-```

-

-

-

-
-- **Speech encoder** inherits from `MERaLiON/MERaLiON-3-10B`.
-- **Speech corpus** for LoRA training: `MERaLiON/Multitask-National-Speech-Corpus-v1`.
-- **Projector + LoRA** weights are released under the same Gemma terms.

 ## Citation

@@ -146,11 +175,11 @@ See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license
   title = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
   author = {majentik},
   year = {2026},
-
 }
 ```

 ## Related releases

-- 8-bit
-- This BF16 edition is the recommended
 ---
+license: other
+license_name: gemma-terms-and-meralion-release-terms
+license_link: https://ai.google.dev/gemma/terms
 language:
+- en
 library_name: mlx
 pipeline_tag: automatic-speech-recognition
+tags:
+- mlx
+- safetensors
+- gemma
+- gemma-4
+- meralion
+- speech
+- speech-to-text
+- automatic-speech-recognition
+- lora
+- bfloat16
+- singapore-english
+- singlish
 base_model:
+- google/gemma-4-E4B-it
+- MERaLiON/MERaLiON-3-10B
 datasets:
+- MERaLiON/Multitask-National-Speech-Corpus-v1
+metrics:
+- wer
+model-index:
+- name: Gemma-4-E4B-BF16 + MERaLiON Speech LoRA
+  results:
+  - task:
+      type: automatic-speech-recognition
+      name: Automatic Speech Recognition
+    dataset:
+      type: MERaLiON/Multitask-National-Speech-Corpus-v1
+      name: MNSC ASR Part 2 Test
+      split: test
+    metrics:
+    - type: wer
+      name: WER
+      value: 16.09
 ---

 # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)

+A composed Singapore-English ASR model that connects the **MERaLiON-3** speech encoder to a **BF16 Gemma-4-E4B** decoder through a trained projector and rank-16 speech LoRA.

+This BF16 release is the recommended quality-first edition: it keeps the decoder in native bfloat16, avoids quantization artifacts, and improves the standalone MERaLiON-3 baseline by **9.69 WER points** on the MNSC ASR Part 2 test set.

+> Important: this is a **composed MLX bundle**, not a vanilla `transformers.pipeline` checkpoint. Use the `elderwise` runtime (or equivalent wiring) to connect `speech_encoder/`, `projector/`, `decoder/`, and `lora/`.

+## Result summary

+Evaluated on **MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test** (3000 utterance-level clips).

+| System | WER ↓ | Notes |
+|---|---:|---|
+| MERaLiON-3 baseline | 25.78% | stock MERaLiON-3 encoder + native decoder |
+| 8-bit Gemma-4 + MERaLiON speech LoRA | 18.86% | smaller sibling release |
+| **This BF16 release** | **16.09%** | best-quality bundle |

+- Absolute improvement vs. MERaLiON-3 baseline: **−9.69pp**
+- Absolute improvement vs. 8-bit sibling: **−2.77pp**
+- Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed from reference and hypothesis.

+## What is inside

+| Path | Contents | Precision |
+|---|---|---|
+| `decoder/` | Gemma-4-E4B instruction decoder, MLX format | bfloat16 |
+| `speech_encoder/` | MERaLiON-3 acoustic encoder + frame adaptor | fp16 |
+| `projector/` | `LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm` | fp32 |
+| `lora/` | rank-16 speech-alignment LoRA adapters + `lora_config.json` | fp32 |
+| `config.json` | composition manifest | JSON |
+| `PROVENANCE.md` | chain of custody, evaluation, license notes | Markdown |

+The speech path is:

+```text
+audio -> Whisper-style log-mel -> MERaLiON-3 encoder/adaptor -> 3584-d speech embeddings
+  -> projector -> 2560-d Gemma embedding space -> Gemma-4-E4B BF16 + speech LoRA -> text
 ```

 ## Quickstart

+Install or clone the `elderwise` runtime that wires the components together:

+```bash
+pip install git+https://github.com/ajentik/elderwise-mlx.git
+# or: git clone https://github.com/ajentik/elderwise-mlx && pip install -e elderwise-mlx
+```

+Then load the composed bundle:

 ```python
+from pathlib import Path
+
+from elderwise.inference import load_pipeline, transcribe_with_pipeline
+from huggingface_hub import snapshot_download
+
+bundle = Path(snapshot_download("majentik/gemma-4-e4b-mlx-elderwise-MERaLiON"))

+pipeline = load_pipeline(
+    meralion_dir=str(bundle / "speech_encoder"),
+    gemma_id=str(bundle / "decoder"),
+    projector_path=str(bundle / "projector"),
+    lora_path=str(bundle / "lora"),
     lora_rank=16,
+    lora_target_names=(
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ),
 )

+text = transcribe_with_pipeline(pipeline, "your_audio.wav", max_tokens=128)
 print(text)
 ```

+Runtime notes:
+
+- `lora_path` should point to the **directory** containing `adapters.safetensors` (`lora/`), not to the file itself.
+- The target module list must match the adapter: `q/k/v/o/gate/up/down` across all 42 decoder layers.
+- Use the prompt `Transcribe the following audio: ` unless you intentionally fine-tune/evaluate a different prompt contract.
+- The speech LoRA is switchable in the runtime: enable speech mode for ASR, disable/scale to `0.0` for plain text generation.

 ## Intended use

+Good fits:
+
+- Singapore English / Singlish automatic speech recognition
+- utterance-level voice notes, routing, search, and agent input
+- MLX-native speech-language research with a shared text decoder
+
+Not intended for:

+- safety-critical or legal/medical transcription
+- diarization, timestamps, speaker identification, or streaming ASR
+- Mandarin-only ASR; a separate switchable Mandarin LoRA is planned

 ## Limitations

+- The LoRA is specialized for Singapore English. Other accents and languages may degrade.
+- Residual errors mostly cluster around rare or ambiguous proper nouns, especially code-switched names and places.
+- Long-form audio was not the optimization target; split long recordings into utterance-sized chunks.
+- This repo is a composed bundle. Generic hub inference widgets will not know how to run it without the `elderwise` runtime.

+## Architecture details

+- Speech encoder output dimension: **3584**
+- Projector hidden dimension: **3072**
+- Decoder embedding dimension: **2560**
+- Decoder depth: **42 layers**
+- LoRA rank: **16**
+- LoRA targets: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
+- Speech-mode LoRA scale used by the release runtime: **20.0**
+
+Gemma-4's per-layer embedding side channel is handled in the runtime by supplying explicit per-layer inputs for speech positions instead of forcing speech embeddings through token nearest-neighbor recovery.

+## Provenance and licenses

+See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody. Summary:

+- Decoder: `google/gemma-4-E4B-it`, converted to MLX bfloat16; Gemma Terms of Use apply.
+- Speech tower: `MERaLiON/MERaLiON-3-10B`; MERaLiON release terms apply.
+- Training data source: `MERaLiON/Multitask-National-Speech-Corpus-v1`; MNSC terms apply.
+- Projector + LoRA: trained alignment components for this composition; distributed with the same upstream obligations.

+Internal optimization recipe and hardware details are intentionally omitted from the public package.

 ## Citation

@@ -146,11 +175,11 @@ See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license
   title = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
   author = {majentik},
   year = {2026},
+  url = {https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}
 }
 ```

 ## Related releases

+- 8-bit sibling: [`majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX`](https://huggingface.co/majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX) — smaller, 18.86% WER.
+- This BF16 edition is the recommended release for best transcription quality.
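The normalization listed in the README's result summary (lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed) can be sketched in plain Python together with a word-level edit-distance WER. This is only a minimal sketch, not the evaluation script behind the reported numbers. The speaker-tag pattern (`<Speaker1>:`), the helper names `normalize`, `word_errors`, and `corpus_wer`, and the corpus-level aggregation (total errors over total reference words) are all assumptions that the README does not confirm.

```python
import re
import string

# Assumption: speaker-prefix tags look like "<Speaker1>:"; the README does not give the format.
SPEAKER_TAG = re.compile(r"<\s*speaker\s*\d*\s*>\s*:?", re.IGNORECASE)
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Drop speaker tags, lowercase, strip ASCII punctuation, collapse whitespace."""
    text = SPEAKER_TAG.sub(" ", text)   # must run before punctuation removal eats "<", ">"
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words) after normalization."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))    # distance from an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs) -> float:
    """Corpus-level WER in percent: total errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words if words else 0.0
```

Applying the same `normalize` to both reference and hypothesis matters more than the exact rules: a tag or comma left on one side only would be counted as a word error.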
config.json
CHANGED
@@ -3,7 +3,9 @@
   "version": "1.0.0-bf16",
   "kind": "composed_speech_to_text",
   "task": "automatic-speech-recognition",
-  "language": [
   "domain": "Singapore English (MNSC)",
   "framework": "mlx",
   "dtype": "bfloat16",
@@ -23,7 +25,11 @@
     "projector": {
       "path": "projector",
       "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
-      "dims": [
       "dtype": "float32"
     },
     "lora": {
@@ -31,11 +37,17 @@
       "rank": 16,
       "scale": 20.0,
       "targets": [
-        "q_proj",
-        "
       ],
       "applied_layers": "all 42 decoder layers",
-      "dtype": "float32"
     }
   },
   "inference": {
@@ -3,7 +3,9 @@
   "version": "1.0.0-bf16",
   "kind": "composed_speech_to_text",
   "task": "automatic-speech-recognition",
+  "language": [
+    "en"
+  ],
   "domain": "Singapore English (MNSC)",
   "framework": "mlx",
   "dtype": "bfloat16",
@@ -23,7 +25,11 @@
     "projector": {
       "path": "projector",
       "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
+      "dims": [
+        3584,
+        3072,
+        2560
+      ],
       "dtype": "float32"
     },
     "lora": {
@@ -31,11 +37,17 @@
       "rank": 16,
       "scale": 20.0,
       "targets": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj",
+        "gate_proj",
+        "up_proj",
+        "down_proj"
       ],
       "applied_layers": "all 42 decoder layers",
+      "dtype": "float32",
+      "config": "lora/lora_config.json"
     }
   },
   "inference": {
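The `dims` list in the `projector` block repeats information that the `arch` string already carries, so the two could drift apart in a later edit. One way to guard against that, sketched here under assumptions, is to derive the dimension chain from `arch` and compare it with `dims`. The helper `dims_from_arch` is hypothetical and not part of the `elderwise` runtime, and the regular expression assumes the `Linear(in,out)` notation used in this manifest.

```python
import re

# Matches "Linear(3584,3072)" and tolerates optional spaces around the numbers.
LINEAR = re.compile(r"Linear\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def dims_from_arch(arch: str) -> list[int]:
    """Derive [input, hidden..., output] from the Linear layers named in an arch string."""
    layers = [(int(fan_in), int(fan_out)) for fan_in, fan_out in LINEAR.findall(arch)]
    if not layers:
        raise ValueError("no Linear(in,out) layers found in arch string")
    dims = [layers[0][0]]
    for fan_in, fan_out in layers:
        # Each layer must consume exactly what the previous layer produced.
        if fan_in != dims[-1]:
            raise ValueError(f"layer expects {fan_in} inputs but receives {dims[-1]}")
        dims.append(fan_out)
    return dims


# Values copied from the projector block of config.json in this commit.
projector = {
    "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
    "dims": [3584, 3072, 2560],
}
assert dims_from_arch(projector["arch"]) == projector["dims"]
```

The normalization and activation layers are ignored on purpose, since they do not change the feature dimension.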
lora/lora_config.json
ADDED
@@ -0,0 +1,19 @@
+{
+  "format": "elderwise-switchable-lora",
+  "adapter_file": "adapters.safetensors",
+  "rank": 16,
+  "scale": 20.0,
+  "target_modules": [
+    "q_proj",
+    "k_proj",
+    "v_proj",
+    "o_proj",
+    "gate_proj",
+    "up_proj",
+    "down_proj"
+  ],
+  "decoder_layers": 42,
+  "dtype": "float32",
+  "speech_mode_scale": 20.0,
+  "text_mode_scale": 0.0
+}
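The two mode scales in `lora/lora_config.json` suggest how a runtime might switch the adapter on and off. The sketch below is an assumption about that behavior, not the `elderwise` implementation: `lora_scale` and `apply_lora` are hypothetical helpers, and the update rule is the conventional LoRA form in which the layer output is the base output plus `scale` times the low-rank output.

```python
import json

# Sidecar contents copied from lora/lora_config.json in this commit.
LORA_CONFIG = json.loads("""
{
  "format": "elderwise-switchable-lora",
  "adapter_file": "adapters.safetensors",
  "rank": 16,
  "scale": 20.0,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
  "decoder_layers": 42,
  "dtype": "float32",
  "speech_mode_scale": 20.0,
  "text_mode_scale": 0.0
}
""")


def lora_scale(config: dict, mode: str) -> float:
    """Look up the adapter scale for a mode ("speech" or "text") in the sidecar config."""
    key = f"{mode}_mode_scale"
    if key not in config:
        raise ValueError(f"unknown mode: {mode!r}")
    return float(config[key])


def apply_lora(base_out: list[float], lora_out: list[float], scale: float) -> list[float]:
    """Conventional LoRA combination (assumed): base output + scale * low-rank output."""
    return [b + scale * d for b, d in zip(base_out, lora_out)]


base, delta = [1.0, 2.0], [0.5, -0.5]
speech = apply_lora(base, delta, lora_scale(LORA_CONFIG, "speech"))
text = apply_lora(base, delta, lora_scale(LORA_CONFIG, "text"))
```

Under this formulation a scale of `0.0` leaves the base decoder output unchanged, which is consistent with the README's note that text mode disables the speech LoRA.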