majentik/gemma-4-e4b-mlx-elderwise-MERaLiON

@@ -1,143 +1,172 @@
 ---
-license: gemma
 language:
-  - en
 library_name: mlx
-tags:
-  - mlx
-  - apple-silicon
-  - gemma
-  - gemma-4
-  - meralion
-  - speech
-  - asr
-  - lora
-  - singapore-english
-  - singlish
 pipeline_tag: automatic-speech-recognition
 base_model:
-  - google/gemma-4-E4B-it
-  - MERaLiON/MERaLiON-3-10B
 datasets:
-  - MERaLiON/Multitask-National-Speech-Corpus-v1
 ---
 # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)
-A Singapore-English ASR composition that pairs the **MERaLiON-3** speech encoder with a **BF16 Gemma-4-E4B** language model, glued by a small projector and a LoRA adapter trained for speech understanding.
-This is the **BF16 native** edition of the bundle — the decoder runs at full bfloat16 precision (no quantization). It improves on our 8-bit release and beats the standalone MERaLiON-3 baseline by **~9.7 WER points** on the MNSC ASR Part 2 test set.
-> Designed for Apple Silicon (MLX). One unified pipeline, runnable from a single Python process.
-## Highlights
-- **Speech encoder**: MERaLiON-3 acoustic encoder + frame-level adaptor (frozen, fp16)
-- **Language model**: Gemma-4-E4B in **bfloat16** (no quantization, no calibration artifacts)
-- **Projector**: 3-stage MLP (3584 → 3072 → 2560) bridging speech → text embedding space
-- **LoRA adapter**: rank-16 on `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` across all 42 decoder layers
-- **Format**: native MLX `safetensors` throughout
-- **Bundle size**: ~16 GB
-## Results
-Evaluated on **MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test** (3000 clips), Singapore English with code-switched proper nouns (Malay, Tamil, Mandarin place/person names).
-| System | WER ↓ |
-|---|---|
-| MERaLiON-3 (encoder + native decoder, baseline) | 25.78% |
-| **This release (BF16 + speech LoRA)** | **16.09%** |
-Δ = **−9.69 percentage points** vs. baseline.
-Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed (jiwer default), speaker prefix tags removed.
-## Bundle layout
-```
-.
-├── config.json             # composition manifest
-├── PROVENANCE.md           # data sources, eval methodology, license chain
-├── README.md               # this file
-├── decoder/                # Gemma-4-E4B BF16 (MLX safetensors, 4 shards)
-├── speech_encoder/         # MERaLiON-3 encoder + adaptor (fp16)
-├── projector/              # 3-stage MLP, fp32
-└── lora/                   # rank-16 LoRA adapters, fp32
 ```
 ## Quickstart
-This bundle is composed; loading it requires the small **`elderwise`** runtime that wires the four components together. The runtime ships in [ajentik/elderwise-mlx](https://github.com/ajentik/elderwise-mlx) (private). The pipeline class handles:
-- mel-spectrogram extraction from raw audio
-- forward through MERaLiON-3 encoder + adaptor
-- projection into Gemma-4 embedding space
-- LoRA-augmented decoder generation with proper chat templating
-Sketch:
 ```python
-from elderwise.inference import load_pipeline
-pipe = load_pipeline(
-    decoder_path="path/to/decoder",
-    speech_encoder_path="path/to/speech_encoder",
-    projector_path="path/to/projector",
-    lora_path="path/to/lora/adapters.safetensors",
     lora_rank=16,
-    lora_target_names=("q_proj", "k_proj", "v_proj", "o_proj",
-                       "gate_proj", "up_proj", "down_proj"),
-    lora_scale=20.0,
 )
-text = pipe.transcribe("clip.wav", prompt="Transcribe the following audio: ")
 print(text)
 ```
-For mode-switching (text-only vs. speech), set the LoRA scale to `0.0` for clean text generation and `20.0` for speech transcription. The same decoder weights serve both modes.
 ## Intended use
-- Singapore English ASR, especially conversational and code-switched speech
-- Research and prototyping on Apple Silicon
-- Component swap-in for downstream LM-augmented voice apps (notes, agents, IVRs)
-Not intended for safety-critical transcription, courtroom records, or medical use.
 ## Limitations
-- Trained primarily on Singapore English (MNSC v1, Parts 1–2). Out-of-distribution accents may degrade.
-- Code-switched proper nouns (Malay/Tamil/Mandarin lexical items) remain the dominant residual error class.
-- Long-form audio (>30 s clips) was not in scope; the bundle is tuned for utterance-level inputs.
-- Mandarin and other non-English languages are **not** covered by this LoRA. A separate switchable adapter is planned for Mandarin.
-## Composition lineage
-```
-       MERaLiON-3 (encoder + adaptor)
-                  │
-                  ▼   speech embeds (3584-d)
-            ┌────────────┐
-            │ projector  │   3584 → 3072 → 2560
-            └────────────┘
-                  │   text-space embeds
-                  ▼
-       Gemma-4-E4B BF16 decoder + LoRA (rank-16)
-                  │
-                  ▼
-              transcription
-```
-The decoder shares one set of weights between text and speech; the LoRA is gated by a runtime scale.
-## Provenance & licensing
-See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license terms. Summary:
-- **Decoder weights** inherit from `google/gemma-4-E4B-it` and are subject to the **Gemma Terms of Use**.
-- **Speech encoder** inherits from `MERaLiON/MERaLiON-3-10B`.
-- **Speech corpus** for LoRA training: `MERaLiON/Multitask-National-Speech-Corpus-v1`.
-- **Projector + LoRA** weights are released under the same Gemma terms.
 ## Citation
@@ -146,11 +175,11 @@ See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody and license
   title  = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
   author = {majentik},
   year   = {2026},
-  howpublished = {\url{https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}}
 }
 ```
 ## Related releases
-- 8-bit edition: [`majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX`](https://huggingface.co/majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX) — smaller (~10 GB), 18.86% WER.
-- This BF16 edition is the recommended default when memory budget allows.

 ---
+license: other
+license_name: gemma-terms-and-meralion-release-terms
+license_link: https://ai.google.dev/gemma/terms
 language:
+- en
 library_name: mlx
 pipeline_tag: automatic-speech-recognition
+tags:
+- mlx
+- safetensors
+- gemma
+- gemma-4
+- meralion
+- speech
+- speech-to-text
+- automatic-speech-recognition
+- lora
+- bfloat16
+- singapore-english
+- singlish
 base_model:
+- google/gemma-4-E4B-it
+- MERaLiON/MERaLiON-3-10B
 datasets:
+- MERaLiON/Multitask-National-Speech-Corpus-v1
+metrics:
+- wer
+model-index:
+- name: Gemma-4-E4B-BF16 + MERaLiON Speech LoRA
+  results:
+  - task:
+      type: automatic-speech-recognition
+      name: Automatic Speech Recognition
+    dataset:
+      type: MERaLiON/Multitask-National-Speech-Corpus-v1
+      name: MNSC ASR Part 2 Test
+      split: test
+    metrics:
+    - type: wer
+      name: WER
+      value: 16.09
 ---
 # Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)
+A composed Singapore-English ASR model that connects the **MERaLiON-3** speech encoder to a **BF16 Gemma-4-E4B** decoder through a trained projector and rank-16 speech LoRA.
+This BF16 release is the recommended quality-first edition: it keeps the decoder in native bfloat16, avoids quantization artifacts, and improves on the standalone MERaLiON-3 baseline by **9.69 WER points** (25.78% → 16.09%) on the MNSC ASR Part 2 test set.
+> Important: this is a **composed MLX bundle**, not a vanilla `transformers.pipeline` checkpoint. Use the `elderwise` runtime (or equivalent wiring) to connect `speech_encoder/`, `projector/`, `decoder/`, and `lora/`.
+## Result summary
+Evaluated on **MERaLiON Multitask National Speech Corpus v1 — ASR Part 2 Test** (3000 utterance-level clips).
+| System | WER ↓ | Notes |
+|---|---:|---|
+| MERaLiON-3 baseline | 25.78% | stock MERaLiON-3 encoder + native decoder |
+| 8-bit Gemma-4 + MERaLiON speech LoRA | 18.86% | smaller sibling release |
+| **This BF16 release** | **16.09%** | best-quality bundle |
+- Absolute WER change vs. MERaLiON-3 baseline: **−9.69 pp**
+- Absolute WER change vs. 8-bit sibling: **−2.77 pp**
+- Normalization: lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed from reference and hypothesis.
+## What is inside
+| Path | Contents | Precision |
+|---|---|---|
+| `decoder/` | Gemma-4-E4B instruction decoder, MLX format | bfloat16 |
+| `speech_encoder/` | MERaLiON-3 acoustic encoder + frame adaptor | fp16 |
+| `projector/` | `LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm` | fp32 |
+| `lora/` | rank-16 speech-alignment LoRA adapters + `lora_config.json` | fp32 |
+| `config.json` | composition manifest | JSON |
+| `PROVENANCE.md` | chain of custody, evaluation, license notes | Markdown |
+The speech path is:
+```text
+audio -> Whisper-style log-mel -> MERaLiON-3 encoder/adaptor -> 3584-d speech embeddings
+      -> projector -> 2560-d Gemma embedding space -> Gemma-4-E4B BF16 + speech LoRA -> text
 ```
 ## Quickstart
+Install or clone the `elderwise` runtime that wires the components together:
+```bash
+pip install git+https://github.com/ajentik/elderwise-mlx.git
+# or: git clone https://github.com/ajentik/elderwise-mlx && pip install -e elderwise-mlx
+```
+Then load the composed bundle:
 ```python
+from pathlib import Path
+from elderwise.inference import load_pipeline, transcribe_with_pipeline
+from huggingface_hub import snapshot_download
+bundle = Path(snapshot_download("majentik/gemma-4-e4b-mlx-elderwise-MERaLiON"))
+pipeline = load_pipeline(
+    meralion_dir=str(bundle / "speech_encoder"),
+    gemma_id=str(bundle / "decoder"),
+    projector_path=str(bundle / "projector"),
+    lora_path=str(bundle / "lora"),
     lora_rank=16,
+    lora_target_names=(
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ),
 )
+text = transcribe_with_pipeline(pipeline, "your_audio.wav", max_tokens=128)
 print(text)
 ```
+Runtime notes:
+- `lora_path` should point to the **directory** containing `adapters.safetensors` (`lora/`), not to the file itself.
+- The target module list must match the adapter: `q/k/v/o/gate/up/down` across all 42 decoder layers.
+- Use the prompt `Transcribe the following audio: ` unless you intentionally fine-tune/evaluate a different prompt contract.
+- The speech LoRA is switchable in the runtime: use the speech-mode scale (`20.0`) for ASR, and set the scale to `0.0` for plain text generation.
 ## Intended use
+Good fits:
+- Singapore English / Singlish automatic speech recognition
+- utterance-level voice notes, routing, search, and agent input
+- MLX-native speech-language research with a shared text decoder
+Not intended for:
+- safety-critical or legal/medical transcription
+- diarization, timestamps, speaker identification, or streaming ASR
+- Mandarin-only ASR; a separate switchable Mandarin LoRA is planned
 ## Limitations
+- The LoRA is specialized for Singapore English. Other accents and languages may degrade.
+- Residual errors mostly cluster around rare or ambiguous proper nouns, especially code-switched names and places.
+- Long-form audio was not the optimization target; split long recordings into utterance-sized chunks.
+- This repo is a composed bundle. Generic hub inference widgets will not know how to run it without the `elderwise` runtime.
+## Architecture details
+- Speech encoder output dimension: **3584**
+- Projector hidden dimension: **3072**
+- Decoder embedding dimension: **2560**
+- Decoder depth: **42 layers**
+- LoRA rank: **16**
+- LoRA targets: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
+- Speech-mode LoRA scale used by the release runtime: **20.0**
+Gemma-4's per-layer embedding side channel is handled in the runtime by supplying explicit per-layer inputs for speech positions instead of forcing speech embeddings through token nearest-neighbor recovery.
+## Provenance and licenses
+See [`PROVENANCE.md`](./PROVENANCE.md) for the full chain of custody. Summary:
+- Decoder: `google/gemma-4-E4B-it`, converted to MLX bfloat16; Gemma Terms of Use apply.
+- Speech tower: `MERaLiON/MERaLiON-3-10B`; MERaLiON release terms apply.
+- Training data source: `MERaLiON/Multitask-National-Speech-Corpus-v1`; MNSC terms apply.
+- Projector + LoRA: trained alignment components for this composition; distributed with the same upstream obligations.
+Internal optimization recipe and hardware details are intentionally omitted from the public package.
 ## Citation
   title  = {Gemma-4-E4B-BF16 + MERaLiON Speech LoRA for Singapore English (MLX)},
   author = {majentik},
   year   = {2026},
+  url    = {https://huggingface.co/majentik/gemma-4-e4b-mlx-elderwise-MERaLiON}
 }
 ```
 ## Related releases
+- 8-bit sibling: [`majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX`](https://huggingface.co/majentik/Gemma-4-E4B-MERaLiON-Speech-LoRA-MNSC-MLX) — smaller, 18.86% WER.
+- This BF16 edition is the recommended release for best transcription quality.
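
The normalization listed in the README's result summary (lowercase, ASCII punctuation stripped, whitespace collapsed, speaker-prefix tags removed) can be sketched in plain Python as below. This is an illustration only, not the evaluation script behind the reported numbers: the speaker-tag pattern `<SpeakerN>:` is a hypothetical format, the function names are invented for this example, and pooling errors over the whole corpus is an assumption about how the WER was aggregated.

```python
import re
import string

# Hypothetical speaker-prefix format such as "<Speaker1>:"; the real MNSC tags may differ.
SPEAKER_TAG = re.compile(r"<speaker\d+>\s*:?\s*", re.IGNORECASE)
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Remove speaker tags, lowercase, strip punctuation, split on whitespace."""
    text = SPEAKER_TAG.sub(" ", text)  # must run before punctuation is stripped
    text = text.lower().translate(PUNCT_TABLE)
    return text.split()  # split() also collapses repeated whitespace


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Total word errors divided by total reference words over (ref, hyp) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = normalize(ref), normalize(hyp)
        errors += word_errors(ref_words, hyp_words)
        words += len(ref_words)
    return errors / words


# One substituted word out of four reference words gives 0.25.
print(corpus_wer([("<Speaker1>: Go to Bedok, lah!", "go to bedok la")]))
```

Applying the same normalization to both reference and hypothesis matters here: stripping the speaker tag from only one side would count the tag as an extra word error on every clip.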

config.json CHANGED

@@ -3,7 +3,9 @@
   "version": "1.0.0-bf16",
   "kind": "composed_speech_to_text",
   "task": "automatic-speech-recognition",
-  "language": ["en"],
   "domain": "Singapore English (MNSC)",
   "framework": "mlx",
   "dtype": "bfloat16",
@@ -23,7 +25,11 @@
     "projector": {
       "path": "projector",
       "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
-      "dims": [3584, 3072, 2560],
       "dtype": "float32"
     },
     "lora": {
@@ -31,11 +37,17 @@
       "rank": 16,
       "scale": 20.0,
       "targets": [
-        "q_proj", "k_proj", "v_proj", "o_proj",
-        "gate_proj", "up_proj", "down_proj"
       ],
       "applied_layers": "all 42 decoder layers",
-      "dtype": "float32"
     }
   },
   "inference": {

   "version": "1.0.0-bf16",
   "kind": "composed_speech_to_text",
   "task": "automatic-speech-recognition",
+  "language": [
+    "en"
+  ],
   "domain": "Singapore English (MNSC)",
   "framework": "mlx",
   "dtype": "bfloat16",
     "projector": {
       "path": "projector",
       "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
+      "dims": [
+        3584,
+        3072,
+        2560
+      ],
       "dtype": "float32"
     },
     "lora": {
       "rank": 16,
       "scale": 20.0,
       "targets": [
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj",
+        "gate_proj",
+        "up_proj",
+        "down_proj"
       ],
       "applied_layers": "all 42 decoder layers",
+      "dtype": "float32",
+      "config": "lora/lora_config.json"
     }
   },
   "inference": {

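Because the manifest states the projector shape twice (once in the `arch` string and once in `dims`), a loader could cross-check the two before building the module. The sketch below assumes exactly that; the fragment is copied from the hunk above with its enclosing keys omitted, and `linear_chain` is a hypothetical helper, not part of the `elderwise` runtime.

```python
import re

# Fragment of the manifest as shown in the diff; enclosing keys are omitted.
projector_cfg = {
    "path": "projector",
    "arch": "LayerNorm -> Linear(3584,3072) -> SiLU -> Linear(3072,2560) -> RMSNorm",
    "dims": [3584, 3072, 2560],
    "dtype": "float32",
}


def linear_chain(arch: str) -> list[int]:
    """Extract the dimension chain implied by the Linear(in,out) stages of `arch`."""
    pairs = [(int(a), int(b))
             for a, b in re.findall(r"Linear\((\d+),\s*(\d+)\)", arch)]
    if not pairs:
        raise ValueError("no Linear(in,out) stage found in arch string")
    # Each stage's output width must equal the next stage's input width.
    for (_, out_dim), (in_dim, _) in zip(pairs, pairs[1:]):
        if out_dim != in_dim:
            raise ValueError(f"stage mismatch: {out_dim} -> {in_dim}")
    return [pairs[0][0]] + [out_dim for _, out_dim in pairs]


chain = linear_chain(projector_cfg["arch"])
if chain != projector_cfg["dims"]:
    raise ValueError(f"arch implies {chain}, dims says {projector_cfg['dims']}")
print(chain)  # [3584, 3072, 2560]
```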
lora/lora_config.json ADDED

@@ -0,0 +1,19 @@

+{
+  "format": "elderwise-switchable-lora",
+  "adapter_file": "adapters.safetensors",
+  "rank": 16,
+  "scale": 20.0,
+  "target_modules": [
+    "q_proj",
+    "k_proj",
+    "v_proj",
+    "o_proj",
+    "gate_proj",
+    "up_proj",
+    "down_proj"
+  ],
+  "decoder_layers": 42,
+  "dtype": "float32",
+  "speech_mode_scale": 20.0,
+  "text_mode_scale": 0.0
+}
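
To illustrate how `speech_mode_scale` and `text_mode_scale` could gate the adapter, here is a minimal pure-Python sketch of a switchable LoRA linear layer. It assumes the scale multiplies the low-rank update directly (`y = W x + scale * B (A x)`); whether the `elderwise` runtime applies the scale this way is not confirmed by the diff. The toy matrices and helper names are hypothetical, and only the config values come from the file above.

```python
import json

# Subset of lora/lora_config.json as added in this change.
LORA_CONFIG = json.loads("""
{
  "rank": 16,
  "scale": 20.0,
  "speech_mode_scale": 20.0,
  "text_mode_scale": 0.0
}
""")


def scale_for(mode: str, cfg: dict) -> float:
    """Look up the LoRA scale for a mode name ("speech" or "text")."""
    return cfg[f"{mode}_mode_scale"]


def matvec(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_linear(x, W, A, B, scale):
    """Base projection plus the scaled low-rank update: W x + scale * B (A x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Toy rank-1 adapter on a 2x2 layer (the released adapter is rank 16).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # shape (rank, d_in)
B = [[0.5], [-0.5]]   # shape (d_out, rank)
x = [1.0, 2.0]

text_out = lora_linear(x, W, A, B, scale_for("text", LORA_CONFIG))
speech_out = lora_linear(x, W, A, B, scale_for("speech", LORA_CONFIG))
print(text_out)    # [1.0, 2.0]   adapter contributes nothing at scale 0.0
print(speech_out)  # [31.0, -28.0]
```

Under this assumption, a scale of `0.0` recovers the base decoder output exactly, which is consistent with the README's claim that one set of decoder weights serves both text and speech modes.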