	---
	license: mit
	language:
	- en
	tags:
	- automatic-speech-recognition
	- whisper
	- mamba
	- ssm
	- distillation
	- hedgehog
	- cross-architecture-distillation
	- librispeech
	- apple-silicon
	- mlx
	base_model: openai/whisper-tiny
	datasets:
	- librispeech_asr
	metrics:
	- wer
	---

	# WhisperMamba — HedgeMamba Distillation of Whisper-tiny

	Unofficial implementation of ["Attention to Mamba: A Recipe for Cross-Architecture Distillation"](https://arxiv.org/abs/2604.14191) (Moudgil et al., Apple + MILA, April 2026) applied to Whisper-tiny.

	The decoder's self-attention layers are replaced with HedgeMamba SSM mixers and trained with two-stage knowledge distillation. The encoder stays frozen at the original Whisper-tiny weights throughout; the decoder's cross-attention is frozen in Stage 1 and fine-tuned in Stage 2.

	> Not the authors' code, not affiliated with Apple or MILA.

	Code: [github.com/akashicMarga/hedge-mamba-distil](https://github.com/akashicMarga/hedge-mamba-distil)

	---

	## What this model is

	Whisper-tiny has 4 decoder layers, each with a self-attention block. This student replaces every self-attention with a HedgeMambaMixer — a selective SSM with:

	- Hedgehog projection on B and C: `φ(x) = softmax([Wx, −Wx])`; the concatenation doubles the effective state size, and B and C take over the roles of K and Q
	- Selective scan with input-dependent Δt (ZOH discretization)
	- SiLU gate on the output
	- Fix-B state caching for O(1) per-step autoregressive inference (no KV cache growth)
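
The components listed above can be sketched as a single recurrent decoding step. This is a minimal pure-Python illustration, not the repo's implementation: the function names, the diagonal `A`, and the exact zero-order-hold formula for `B` are assumptions made for the sketch (Mamba-style implementations often use the simpler approximation `B̄ ≈ Δt · B`), and the projections that produce `dt`, `b`, `c`, and `z` from the input are omitted.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]

def hedgehog(W, x):
    # phi(x) = softmax([Wx, -Wx]); the output has 2 * len(W) entries.
    wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(wx + [-val for val in wx])

def silu(v):
    return v / (1.0 + math.exp(-v))

def ssm_step(h, u, dt, A, b, c, z):
    """Advance a diagonal SSM by one token, updating h in place.

    h:  state, shape [D][N]      u:  input, shape [D]
    dt: step sizes, shape [D]    A:  diagonal dynamics, shape [D][N], negative
    b, c: feature vectors, shape [N] (e.g. Hedgehog outputs)
    z:  gate pre-activations, shape [D]
    """
    y = []
    for d in range(len(u)):
        acc = 0.0
        for n in range(len(b)):
            a_bar = math.exp(dt[d] * A[d][n])       # ZOH: exp(dt * A)
            b_bar = (a_bar - 1.0) / A[d][n] * b[n]  # ZOH: (exp(dt * A) - 1) / A * B
            h[d][n] = a_bar * h[d][n] + b_bar * u[d]
            acc += c[n] * h[d][n]
        y.append(acc * silu(z[d]))                  # SiLU gate on the output
    return y
```

In this sketch `b` and `c` would come from `hedgehog(W_b, x_t)` and `hedgehog(W_c, x_t)`, so a projection with 64 rows yields 128 state dimensions. The state `h` has the same shape at every step, which is what makes per-token inference cost independent of how many tokens have already been generated.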

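	The encoder (4 Transformer layers plus the convolutional frontend) is fully frozen. The SSM mixers are the only new modules: their B and C projections are warm-initialized from the teacher's key and query projections, and their remaining parameters are learned during distillation.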

	---

	## Files in this repo

	| File | Description |
	|------|-------------|
	| `pytorch/whisper_mamba_final.pt` | Final PyTorch state dict (Stage 1 + Stage 2, 144 MB) |
	| `pytorch/stage1_final.pt` | Stage 1 only (cosine-distilled SSM, before ASR fine-tuning) |
	| `mlx/whisper_mamba_mlx_final.npz` | Final MLX weights (142 MB, Apple Silicon inference) |
	| `mlx/whisper_mamba_mlx_final.json` | MLX checkpoint metadata |

	The `.pt` files are raw `state_dict` `OrderedDict`s — load with `torch.load(..., map_location="cpu")`. The `.npz` is an MLX weight archive — load with `mlx.core.load(...)`.

	---

	## Results

	### WER on LibriSpeech test splits (greedy decoding, lowercase, no punctuation)

	| Model | Split | WER |
	|-------|-------|-----|
	| Whisper-tiny teacher | test.clean | 9.65% |
	| WhisperMamba student (PyTorch) | test.clean | 8.49% |
	| Whisper-tiny teacher | test.other | 20.23% |
	| WhisperMamba student (PyTorch) | test.other | 18.0% |

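	The student outperforms the teacher on both splits: by 1.16 points absolute on `test.clean` and 2.23 points on `test.other` (roughly 12% and 11% relative). The larger absolute gap on `test.other` may reflect scheduled sampling making the student more robust to its own decoding errors. Stage 2 also fine-tunes the student on LibriSpeech transcripts, so part of the gain is likely in-domain adaptation.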
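
For reference, the metric and the normalization named in the table heading (lowercase, no punctuation) can be sketched without dependencies. This is an illustrative implementation, not the evaluation script: the install command in the Usage section lists `jiwer`, which presumably does the scoring, and the exact normalization rules used there are not documented here. The function names are hypothetical.

```python
import string

def normalize(text):
    # Lowercase, drop ASCII punctuation (including apostrophes), split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Corpus-level WER is normally computed as total errors over total reference words across all utterances, rather than as the mean of per-utterance scores.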

	### Validation WER during Stage 2 (PyTorch, training on LibriSpeech train-clean-100)

	| Epoch | Val WER |
	|-------|---------|
	| 3 | — |
	| 5 | ~5% |

	### Inference latency (single utterance, 20 samples, Apple M-series)

	| Model | Backend | Latency |
	|-------|---------|---------|
	| Whisper-tiny teacher | PyTorch MPS | ~154 ms |
	| WhisperMamba student | PyTorch MPS | ~129 ms |
	| WhisperMamba student | MLX | ~41 ms |

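	The MLX student is ~3.8× faster than the PyTorch MPS teacher (~41 ms vs ~154 ms) and ~3.1× faster than the PyTorch MPS student (~129 ms). Because each HedgeMamba mixer carries a fixed-size state, the per-token cost of the replaced self-attention layers stays constant as the output grows, whereas the self-attention KV cache in a standard Whisper decoder grows with every generated token.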
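
A latency measurement of this kind can be sketched with a small timing harness. This is a generic illustration under assumptions, not the repo's benchmark script: `benchmark` and its parameters are hypothetical names, and the warm-up count is arbitrary. On asynchronous backends the timed callable has to block until the result is ready (for example by calling `torch.mps.synchronize()` for PyTorch MPS, or `mx.eval(...)` on the outputs for MLX), otherwise only the dispatch time is measured.

```python
import statistics
import time

def benchmark(fn, n_samples=20, warmup=3):
    """Time fn() and return (mean_ms, median_ms) over n_samples runs."""
    # Warm-up runs absorb one-time costs such as kernel compilation and caching.
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(n_samples):
        start = time.perf_counter()
        fn()  # must block until the result is actually computed
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.median(times_ms)
```

Reporting the median alongside the mean makes the figure less sensitive to an occasional slow run.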

	---

	## Training

	### Two-stage distillation

	Stage 1 — Cosine distillation (warm-up, ~3 h on M-series):
	- Loss: layer-wise cosine-similarity loss that aligns the student's decoder hidden states with the teacher's
	- Only SSM weights trained; everything else frozen
	- Warm-initializes SSM from teacher attention projections (Appendix B parameter surgery: `B_proj ← k_proj`, `C_proj ← q_proj`)
	- 2 epochs, LibriSpeech train-clean-100, batch size 8
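
The Stage 1 objective can be illustrated in plain Python. This is a sketch under assumptions: the card only says the loss is a layer-wise cosine similarity between hidden states, so the `1 - cos` form and the uniform averaging over tokens and layers are guesses, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)  # eps guards against zero vectors

def cosine_distill_loss(student_layers, teacher_layers):
    """Mean over layers of (1 - mean token-wise cosine similarity).

    Each argument is a list over decoder layers; each layer is a list of
    per-token hidden vectors with matching shapes in student and teacher.
    """
    per_layer = []
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        sims = [cosine_similarity(s, t) for s, t in zip(s_layer, t_layer)]
        per_layer.append(1.0 - sum(sims) / len(sims))
    return sum(per_layer) / len(per_layer)
```

With this form the loss is 0 when every student hidden state points in the same direction as the teacher's and 2 when they point in opposite directions; vector magnitudes are ignored.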

	Stage 2 — ASR fine-tuning (~5 h on M-series):
	- Loss: cross-entropy on LibriSpeech transcripts
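	- Scheduled sampling: the fraction of ground-truth input tokens replaced by the model's own predictions ramps from 0% to 50% over the first half of training, narrowing the gap between teacher-forced training and free-running inference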
	- SSM, cross-attn, FFN, and layer norms all trained
	- 5 epochs, LibriSpeech train-clean-100, batch size 8
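
The scheduled-sampling ramp described above can be sketched as follows. The card gives only the endpoints (0% to 50% over the first half of training), so the linear shape, the per-token independent replacement, and the function names are assumptions made for illustration.

```python
import random

def sampling_prob(step, total_steps, max_p=0.5):
    """Replacement probability: linear ramp to max_p over the first half, then flat."""
    ramp_steps = max(1, total_steps // 2)
    return max_p * min(1.0, step / ramp_steps)

def mix_inputs(gold_tokens, predicted_tokens, p, rng=random):
    """Replace each ground-truth input token with the model's prediction with probability p."""
    return [
        pred if rng.random() < p else gold
        for gold, pred in zip(gold_tokens, predicted_tokens)
    ]
```

Here `max_p` corresponds to `ss_max_p` in the config. The predicted tokens would typically come from a first teacher-forced forward pass without gradients; that detail is also an assumption.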

	An MLX re-implementation trains both stages end-to-end in ~3.5 h.

	### Config

	```yaml
	teacher: openai/whisper-tiny
	state_size: 64 # ×2 after Hedgehog = 128 effective
	batch_size: 8
	stage1_lr: 0.0005
	stage2_lr: 0.0001
	ss_max_p: 0.5 # scheduled sampling ceiling
	stage1_epochs: 2
	stage2_epochs: 5
	data: librispeech_asr train.100 / validation
	```

	---

	## Usage

	Install the source repo, then load the checkpoint:

	```bash
	pip install torch transformers datasets jiwer
	git clone https://github.com/akashicMarga/hedge-mamba-distil
	cd hedge-mamba-distil
	```

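	```python
	import torch
	from transformers import WhisperProcessor

	# Requires the cloned source repo; run from its root so `src` is importable.
	from src.student.whisper_mamba import WhisperMambaStudent

	# The processor (feature extractor + tokenizer) is shared with the teacher.
	processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

	# Path to the checkpoint downloaded from this model repo.
	state_dict = torch.load("pytorch/whisper_mamba_final.pt", map_location="cpu")

	# Rebuild the student from the teacher, then load the distilled weights.
	# state_size must match the training config (64).
	model = WhisperMambaStudent.from_teacher("openai/whisper-tiny", state_size=64)
	model.load_state_dict(state_dict)
	model.eval()
	```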
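
The loading snippet stops at `model.eval()`, and this card does not document the student's forward or generation API. The WER results use greedy decoding, which in generic form looks like the sketch below. `next_logits` is a hypothetical callable standing in for whatever the student exposes; the default `max_len=448` is the decoder context length of Whisper, and the start tokens and end-of-text id should be taken from the processor's tokenizer rather than hard-coded.

```python
def greedy_decode(next_logits, start_tokens, eos_id, max_len=448):
    """Repeatedly append the highest-scoring next token until eos_id or max_len.

    next_logits: hypothetical callable mapping the current token list to a
                 list of scores, one per vocabulary entry.
    """
    tokens = list(start_tokens)
    while len(tokens) < max_len:
        logits = next_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

With the cached SSM state described earlier, a real implementation would feed only the newest token at each step instead of the whole prefix.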

	For the MLX backend (Apple Silicon):

	```bash
	pip install mlx mlx-whisper
	python scripts/mlx_inference.py # benchmarks student vs teacher
	python scripts/mic_demo.py # live microphone
	```

	---

	## Deviations from the paper

	| Paper | This repo | Reason |
	|-------|-----------|--------|
	| RoPE on B and C | Omitted | Whisper already has positional embeddings |
	| `state_size = hidden_size` (N = D = 384) | `state_size = 64` (128 after Hedgehog) | N = D makes the scan state (B, 768, 768), which is too slow on MPS |
	| Parallel associative scan | Python for-loop | No fused Metal/Triton kernel yet |
	| Per-head Hedgehog (H heads × D/H dim) | Single virtual head of size N | Avoids the H × D_h = D constraint when N ≠ D_h |
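
The state-size trade-off in the table can be checked with a little arithmetic. The sketch assumes the first 768 in the quoted shape (B, 768, 768) is the mixer's inner channel width and the second is the Hedgehog-doubled state size; the helper name is hypothetical.

```python
def scan_state_elements(d_inner, state_size, hedgehog=True):
    """Number of state elements per batch item and per layer."""
    n_eff = 2 * state_size if hedgehog else state_size  # Hedgehog doubles N
    return d_inner * n_eff

paper_setting = scan_state_elements(768, 384)  # 768 * 768 = 589_824
repo_setting = scan_state_elements(768, 64)    # 768 * 128 = 98_304
ratio = paper_setting / repo_setting           # 6.0
```

Under these assumptions the smaller state cuts the work per step of the Python scan loop by about 6×, and the fp32 state shrinks from roughly 2.25 MiB to 0.375 MiB per batch item and layer.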

	---

	## Citation

	```bibtex
	@misc{moudgil2026hedgemamba,
	title = {Attention to Mamba: A Recipe for Cross-Architecture Distillation},
	author = {Moudgil, Abhinav and others},
	year = {2026},
	url = {https://arxiv.org/abs/2604.14191}
	}
	```