	---
	license: apache-2.0
	language:
	- en
	- multilingual
	pipeline_tag: automatic-speech-recognition
	tags:
	- audio
	- speech-recognition
	- gguf
	- gemma
	- conformer
	library_name: ggml
	base_model: google/gemma-4-E2B-it
	---

	# Gemma-4-E2B-it — GGUF

	GGUF conversion of [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) for use with [CrispStrobe/CrispASR](https://github.com/CrispStrobe/CrispASR).

	## Available variants

	| File | Quant | Size | Notes |
	|---|---|---|---|
	| `gemma4-e2b-it.gguf` | F16 | ~9.5 GB | Unquantized (half precision) |
	| `gemma4-e2b-it-q8_0.gguf` | Q8_0 | ~5.0 GB | Near-lossless quant |
	| `gemma4-e2b-it-q4_k.gguf` | Q4_K | ~2.8 GB | Standard quant |
	| `gemma4-e2b-it-q2_k.gguf` | Q2_K | ~2.2 GB | Smallest; reduced quality |

	## Model details

	- Architecture: USM Conformer audio encoder (12L, 1024d, chunked-local attention with relative position bias, LightConv1d, ClippableLinear with QAT scalars) + Gemma4 LLM decoder (35L, 1536d, GQA 8Q/1KV, per-layer embeddings, hybrid sliding/full attention, GeGLU)
	- Parameters: 2.3B effective (5.1B with embeddings)
	- Audio: `Gemma4AudioFeatureExtractor` with 128-bin mel, 16 kHz, frame_length=320 (20 ms), hop=160 (10 ms), fft_length=512, semicausal padding, log(mel + mel_floor) with mel_floor=0.001, no normalisation
	- Languages: 140+ (ASR + speech translation)
	- License: Apache 2.0
	- Source: [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
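As a rough illustration of the audio front-end numbers above, the sketch below converts the window and hop sizes to milliseconds and applies the `log(mel + mel_floor)` compression to a few made-up mel energies. The helper names are hypothetical, and the frame count deliberately ignores the semicausal padding that the real extractor applies, so treat it as an approximation rather than the extractor's exact output length.

```python
import math

SAMPLE_RATE = 16_000   # Hz, from the model details above
FRAME_LENGTH = 320     # samples per analysis window
HOP_LENGTH = 160       # samples between consecutive window starts
MEL_FLOOR = 0.001      # added before the log so that log(0) never occurs


def samples_to_ms(n_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count to milliseconds."""
    return 1000.0 * n_samples / sample_rate


def approx_num_frames(n_samples: int) -> int:
    """Frame count with no padding at all (assumption: the real extractor pads)."""
    if n_samples < FRAME_LENGTH:
        return 0
    return 1 + (n_samples - FRAME_LENGTH) // HOP_LENGTH


def log_mel(mel_energies):
    """Apply log(mel + mel_floor) element-wise, with no further normalisation."""
    return [math.log(m + MEL_FLOOR) for m in mel_energies]


print(samples_to_ms(FRAME_LENGTH))     # window length in ms
print(samples_to_ms(HOP_LENGTH))       # hop length in ms
print(approx_num_frames(SAMPLE_RATE))  # unpadded frames in one second of audio
print(log_mel([0.0, 0.5, 2.0]))        # made-up energies, floor keeps log finite
```

With a 10 ms hop, one second of audio yields roughly 100 mel frames before the encoder's subsampling.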

	## What's included vs an upstream Gemma-4 GGUF

	This GGUF is built specifically for ASR with CrispASR and includes the audio path that
	standard text/vision Gemma-4 GGUFs (unsloth, ggml-org) omit:

	- 12-layer audio conformer encoder (~872 tensors total).
	- Gemma4MultimodalEmbedder audio→LLM adapter (`embed_audio.embedding_projection`,
	pre-projection RMSNorm).
	- All ClippableLinear QAT clipping scalars (`.input_min/max`, `.output_min/max`) — these
	are NOT QAT-only artefacts. HF applies them at inference via
	`Gemma4ClippableLinear.forward`. Skipping them collapses the encoder past layer 5.
	- `num_kv_shared_layers`, `layer_full_mask`, `partial_rotary_factor`,
	`global_head_dim`, `use_double_wide_mlp`, `attention_k_eq_v`: the per-layer
	flags that the LLM forward pass must honour.
	- Mel filterbank + Hann window resources (HTK no-norm filters,
	`frame_length=320` window; the runtime regenerates these too).

	Vision tower tensors are excluded.
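To make the role of the ClippableLinear scalars concrete, here is a minimal pure-Python sketch. It assumes the forward pass clamps the input to `[input_min, input_max]`, applies the linear layer, then clamps the result to `[output_min, output_max]`, which is what the scalar names suggest; `Gemma4ClippableLinear.forward` in HF transformers remains the authoritative definition. The function names, weights, and bounds below are invented for illustration.

```python
def clamp(value: float, lo: float, hi: float) -> float:
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def clippable_linear(x, weight, input_min, input_max, output_min, output_max):
    """Clamp the input, compute y = W x, then clamp the output (assumed order)."""
    x_clipped = [clamp(v, input_min, input_max) for v in x]
    y = [sum(w * v for w, v in zip(row, x_clipped)) for row in weight]
    return [clamp(v, output_min, output_max) for v in y]


# Toy 2x2 layer with made-up weights and bounds.
weight = [[1.0, 2.0], [0.5, -1.0]]
x = [10.0, -3.0]
inf = float("inf")

with_clip = clippable_linear(x, weight, -4.0, 4.0, -5.0, 5.0)
without_clip = clippable_linear(x, weight, -inf, inf, -inf, inf)

print(with_clip)     # result when the scalars are applied
print(without_clip)  # result when the scalars are ignored
```

Even in this toy case the two outputs differ. In a 12-layer encoder such differences can compound from layer to layer, which is consistent with the collapse described above when the scalars are skipped.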

	## Usage with CrispASR

	```bash
	# Auto-download (recommended)
	./build/bin/crispasr --backend gemma4-e2b -m auto --auto-download -f audio.wav

	# Or explicit path
	./build/bin/crispasr --backend gemma4-e2b -m gemma4-e2b-it-q4_k.gguf -f audio.wav
	```

	## Differential testing

	CrispASR ships a stage-by-stage differential test against the HF PyTorch
	reference. Per-stage cosine similarity vs HF `Gemma4AudioModel`:

	```
	mel_spectrogram         1.0000     bit-exact (HF FE faithfully reproduced)
	audio_subsample_output  0.9994     conv2d + LayerNorm + ReLU
	audio_layer_0..11       0.97-0.99  (with QAT clip scalars)
	audio_tower_output      0.99+
	```
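For readers unfamiliar with the metric, the sketch below shows how a per-stage cosine similarity can be computed between two flattened activation tensors. It is a generic standard-library illustration, not the code used by `crispasr-diff`; the function name and the sample vectors are made up.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero inputs
    return dot / (norm_a * norm_b)


# Made-up "reference" and "candidate" activations for one stage.
reference = [0.12, -0.40, 0.88, 0.05]
candidate = [0.11, -0.42, 0.90, 0.04]
print(f"{cosine_similarity(reference, candidate):.4f}")
```

Cosine similarity ignores overall scale, so a value near 1.0 means the two tensors point in nearly the same direction, not that they are numerically identical.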

	Run it yourself:

	```bash
	# 1. Dump HF reference
	HF_HOME=/path/to/hf-cache python tools/dump_reference.py \
	  --backend gemma4 --model-dir google/gemma-4-E2B-it \
	  --audio samples/jfk.wav --output /tmp/gemma4-ref.gguf

	# 2. Compare
	build/bin/crispasr-diff gemma4 \
	  gemma4-e2b-it-q4_k.gguf /tmp/gemma4-ref.gguf samples/jfk.wav
	```

	## Conversion provenance

	This GGUF was produced by `models/convert-gemma4-e2b-to-gguf.py` (CrispASR repo)
	running on Kaggle T4 nodes (16 GB RAM). Conversion config:

	- Converted with `--outtype f16`; the quantized variants were then produced with `crispasr-quantize`.
	- ClippableLinear QAT scalars persisted as 1-element F32 tensors named
	`audio.layers.{i}.{linear}.input_min/max, output_min/max`.
	- Vision tower tensors (prefixes `model.vision_tower.` and `model.embed_vision.`) are skipped.
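The naming and skipping rules above can be sketched as follows. This is an illustration of the conventions listed in this section, not an excerpt from `convert-gemma4-e2b-to-gguf.py`; the helper names, the `attn_q` linear name, and the example tensor names are placeholders.

```python
SKIPPED_PREFIXES = ("model.vision_tower.", "model.embed_vision.")
QAT_SUFFIXES = ("input_min", "input_max", "output_min", "output_max")


def is_skipped(hf_name: str) -> bool:
    """True for source tensors that are left out of the GGUF (vision path)."""
    return hf_name.startswith(SKIPPED_PREFIXES)


def qat_scalar_names(layer_index: int, linear: str):
    """GGUF names of the four 1-element F32 clip scalars for one linear layer."""
    return [f"audio.layers.{layer_index}.{linear}.{suffix}" for suffix in QAT_SUFFIXES]


# Placeholder tensor names, used only to exercise the helpers.
print(is_skipped("model.vision_tower.example.weight"))  # vision tensor
print(is_skipped("model.audio_tower.example.weight"))   # audio tensor
print(qat_scalar_names(0, "attn_q"))                     # "attn_q" is hypothetical
```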