MLX 4-bit conversion of gemma-4-12B-it — with vision + audio embedder weights preserved

#27

by jokernifty - opened about 4 hours ago

Hi all,

Shipped what I believe is the first MLX conversion of the 12B Unified model that keeps the encoder-free vision_embedder and audio_embedder weights intact (not just the text backbone):

https://huggingface.co/jokernifty/gemma-4-12B-it-mlx-4bit-multimodal

What works today

Text inference on Apple Silicon (via mlx_lm with a 4-line gemma4_unified.py alias — README has the snippet)
VisionEmbedder + AudioEmbedder ported to MLX, weights load cleanly, forward passes produce sensible outputs
Token-splicing forward pass (image_token_id / audio_token_id / video_token_id) implemented and tested with synthetic inputs
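
For readers unfamiliar with token splicing, here is a minimal sketch of the idea, not the code from the repo. Plain Python lists stand in for arrays, and the helper names and `IMAGE_TOKEN_ID = 9` are hypothetical. Placeholder positions are found as integer indices, then the text-embedding rows at those positions are overwritten, in order, with the rows produced by the modality embedder.

```python
def placeholder_positions(token_ids, placeholder_id):
    """Return the integer positions at which token_ids equals placeholder_id."""
    return [i for i, t in enumerate(token_ids) if t == placeholder_id]


def splice_rows(text_embeds, positions, modality_embeds):
    """Copy text_embeds and overwrite the rows at `positions`, in order."""
    if len(positions) != len(modality_embeds):
        raise ValueError(
            f"{len(positions)} placeholder tokens but "
            f"{len(modality_embeds)} embedding rows"
        )
    out = [list(row) for row in text_embeds]  # leave the input untouched
    for pos, row in zip(positions, modality_embeds):
        out[pos] = list(row)
    return out


IMAGE_TOKEN_ID = 9  # hypothetical id, for illustration only
token_ids = [1, 9, 9, 5]
text_embeds = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
image_embeds = [[7.0, 7.0], [8.0, 8.0]]

positions = placeholder_positions(token_ids, IMAGE_TOKEN_ID)
spliced = splice_rows(text_embeds, positions, image_embeds)
```

Working with explicit integer positions instead of a boolean mask means the same pattern should carry over to an array library that only offers integer indexing. The length check catches a mismatch between the number of placeholder tokens and the number of embedder outputs.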

What's WIP

🚧 Image preprocessor (resize → bucket to 70/140/280/560/1120 tokens → 48×48 patches → factorized 2D position IDs)
🚧 Audio preprocessor (16 kHz waveform → 640-sample frames)
🚧 Chat-template integration for <boi>…<eoi> / <boa>…<eoa> blocks
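
To make two of the preprocessing steps above concrete, here is a rough sketch under stated assumptions, not the planned implementation. The function names are hypothetical. Zero-padding the final audio frame is an assumption (the post does not say whether a partial frame is padded or dropped), and reading "factorized 2D position IDs" as one (row, col) pair per patch in row-major order is likewise one interpretation.

```python
SAMPLE_RATE = 16_000  # Hz, from the post
FRAME_LEN = 640       # samples per frame, from the post


def frame_waveform(samples, frame_len=FRAME_LEN):
    """Split samples into non-overlapping frames.

    Assumption: the last frame is zero-padded to frame_len.
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame.extend([0.0] * (frame_len - len(frame)))
        frames.append(frame)
    return frames


def patch_position_ids(rows, cols):
    """Return one (row, col) pair per patch, in row-major order."""
    return [(r, c) for r in range(rows) for c in range(cols)]


frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE  # duration of one frame in ms
one_second = [0.0] * SAMPLE_RATE
frames = frame_waveform(one_second)
ids = patch_position_ids(2, 3)
```

At 16 kHz a 640-sample frame covers 40 ms, so one second of audio yields 25 frames. How the image bucket (70/140/280/560/1120 tokens) is chosen is not covered here, since the post does not describe the selection rule.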

The companion package with the embedders + converter is at https://github.com/jokernifty/mlx-gemma4-unified (currently scaffolded — preprocessors next).

A few notes on what was tricky

mlx-lm doesn't yet register the gemma4_unified model_type — solved with a small alias that subclasses gemma4.Model and adds the encoder-free weight names to the sanitize skip list.
Standard mlx_lm.convert hits the macOS Metal 5-second command-buffer watchdog because the final eval+save of all quantized weights happens in one giant graph. Worked around with a layer-by-layer converter that calls mx.eval() after each transformer block (~1.2 s each on M-series).
The 262k × 3840 token embedding can't be quantized in one shot for the same watchdog reason, so it is kept in bf16. The bf16 table accounts for ~2 GB of the file, but leaving embeddings unquantized is standard practice for many mlx-community checkpoints.
MLX doesn't yet support boolean indexing, so the masked_scatter equivalent uses a numpy-host helper for the integer indices.
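
As a back-of-the-envelope check on the size of the bf16 embedding table mentioned above, the arithmetic can be sketched as follows. The exact row count of 262,144 is an assumption (the post only says "262k"), as is the comparison point: 4-bit quantization with a group size of 64 and one 2-byte scale plus one 2-byte bias per group.

```python
VOCAB_ROWS = 262_144  # assumption: the post only says "262k"
HIDDEN = 3_840        # from the post
GROUP_SIZE = 64       # assumption: one scale + one bias per 64 weights
BYTES_BF16 = 2

params = VOCAB_ROWS * HIDDEN

# Unquantized table: two bytes per weight.
bf16_bytes = params * BYTES_BF16

# Hypothetical 4-bit table: half a byte per weight, plus a 2-byte scale
# and a 2-byte bias for every group of GROUP_SIZE weights.
q4_weight_bytes = params // 2
q4_meta_bytes = (params // GROUP_SIZE) * 2 * BYTES_BF16
q4_bytes = q4_weight_bytes + q4_meta_bytes

print(f"bf16 table : {bf16_bytes / 1e9:.2f} GB")
print(f"4-bit table: {q4_bytes / 1e9:.2f} GB")
print(f"difference : {(bf16_bytes - q4_bytes) / 1e9:.2f} GB")
```

Under these assumptions the bf16 table comes to about 2.01 GB and a 4-bit version would be about 0.57 GB, so the net cost of leaving the embedding unquantized is closer to 1.45 GB.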

Would love a sanity check from anyone who's been working on gemma4_unified support in mlx-lm / mlx-vlm upstream — happy to coordinate or PR the alias module if it's useful. And of course feedback / issues / PRs on the embedder ports are very welcome.

Massive thanks to Google DeepMind for the open release.
