MLX 4-bit conversion of gemma-4-12B-it — with vision + audio embedder weights preserved

#27
by jokernifty - opened

Hi all,

Shipped what I believe is the first MLX conversion of the 12B Unified model that keeps the encoder-free vision_embedder and audio_embedder weights intact (not just the text backbone):

https://huggingface.co/jokernifty/gemma-4-12B-it-mlx-4bit-multimodal

What works today

  • Text inference on Apple Silicon (via mlx_lm with a 4-line gemma4_unified.py alias — README has the snippet)
  • VisionEmbedder + AudioEmbedder ported to MLX, weights load cleanly, forward passes produce sensible outputs
  • Token-splicing forward pass (image_token_id / audio_token_id / video_token_id) implemented and tested with synthetic inputs
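For anyone following along, the token-splicing step in the last bullet can be sketched roughly as below. This is a NumPy-only illustration, not the code from the repo: the function name `splice_embeddings`, the placeholder id `7`, and the tiny shapes are all made up for the example. It finds placeholder positions as integer indices (rather than a boolean mask) and overwrites those rows of the text embeddings with the embedder's outputs.

```python
import numpy as np


def splice_embeddings(input_ids, text_embeds, soft_tokens, placeholder_id):
    """Overwrite placeholder rows of text_embeds with soft_tokens.

    input_ids:   (seq_len,) integer token ids
    text_embeds: (seq_len, hidden) embeddings from the text embedding table
    soft_tokens: (n_placeholders, hidden) outputs of a vision/audio embedder
    """
    # Integer positions of the placeholder token, computed on the host.
    positions = np.flatnonzero(input_ids == placeholder_id)
    if positions.size != soft_tokens.shape[0]:
        raise ValueError(
            f"{positions.size} placeholders but {soft_tokens.shape[0]} soft tokens"
        )
    out = text_embeds.copy()  # leave the caller's array untouched
    out[positions] = soft_tokens
    return out


# Synthetic inputs; IMAGE_TOKEN_ID is a hypothetical value, not the real id.
IMAGE_TOKEN_ID = 7
input_ids = np.array([1, 7, 7, 2, 7])
text_embeds = np.zeros((5, 4), dtype=np.float32)
soft_tokens = np.arange(12, dtype=np.float32).reshape(3, 4)

spliced = splice_embeddings(input_ids, text_embeds, soft_tokens, IMAGE_TOKEN_ID)
```

The count check is there because a mismatch between placeholder tokens and embedder outputs is the most likely failure once a real preprocessor is feeding this step.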

What's WIP

  • 🚧 Image preprocessor (resize → bucket to 70/140/280/560/1120 tokens → 48×48 patches → factorized 2D position IDs)
  • 🚧 Audio preprocessor (16 kHz waveform → 640-sample frames)
  • 🚧 Chat-template integration for <boi>…<eoi> / <boa>…<eoa> blocks
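Since the preprocessors are not written yet, here is only a rough NumPy sketch of two of the pieces named above: framing a waveform into 640-sample chunks and cutting an image into 48×48 patches with factorized (row, column) position IDs. Several things are assumptions on my part and not confirmed by the post: that frames do not overlap, that the tail is zero-padded, that "48×48" means pixels per patch, and that the image has already been resized to a multiple of the patch size. Bucket selection (70/140/280/560/1120 tokens) is not covered. The function names are hypothetical.

```python
import numpy as np

FRAME_SIZE = 640  # samples per frame (40 ms at 16 kHz)
PATCH_SIZE = 48   # ASSUMPTION: pixels per patch side


def frame_waveform(waveform, frame_size=FRAME_SIZE):
    """Split a mono waveform into non-overlapping frames, zero-padding the tail."""
    waveform = np.asarray(waveform, dtype=np.float32)
    n_frames = -(-waveform.shape[0] // frame_size)  # ceiling division
    padded = np.zeros(n_frames * frame_size, dtype=np.float32)
    padded[: waveform.shape[0]] = waveform
    return padded.reshape(n_frames, frame_size)


def patchify(image, patch_size=PATCH_SIZE):
    """Cut an (H, W, C) image into flat patches plus (row, col) position IDs."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("resize the image to a multiple of patch_size first")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(rows * cols, patch_size * patch_size * c)
    # Factorized 2D positions: one row index and one column index per patch.
    row_ids, col_ids = np.divmod(np.arange(rows * cols), cols)
    position_ids = np.stack([row_ids, col_ids], axis=-1)
    return patches, position_ids


frames = frame_waveform(np.ones(1000, dtype=np.float32))
image = np.arange(96 * 144 * 3, dtype=np.float32).reshape(96, 144, 3)
patches, position_ids = patchify(image)
```

Keeping row and column IDs separate, instead of a single flattened index, is what lets the same position scheme serve images with different aspect ratios.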

The companion package with the embedders + converter is at https://github.com/jokernifty/mlx-gemma4-unified (currently scaffolded — preprocessors next).

A few notes on what was tricky

  • mlx-lm doesn't yet register the gemma4_unified model_type — solved with a small alias that subclasses gemma4.Model and adds the encoder-free weight names to the sanitize skip list.
  • Standard mlx_lm.convert hits the macOS Metal 5-second command-buffer watchdog because the final eval+save of all quantized weights happens in one giant graph. Worked around with a layer-by-layer converter that calls mx.eval() after each transformer block (~1.2 s each on M-series).
  • The 262k × 3840 token embedding can't be quantized in one shot for the same watchdog reason, so I kept it in bf16. That adds ~2 GB to the file size, but it is standard practice for many mlx-community checkpoints.
  • MLX doesn't yet support boolean-mask indexing, so the masked_scatter equivalent uses a small helper that computes the integer indices on the host with NumPy.
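The layer-by-layer idea from the watchdog bullet can be sketched without depending on MLX at all. The snippet below is a hypothetical outline, not the converter from the repo: `convert_blockwise`, `quantize_block`, and `materialize` are names invented here, and the quantize and evaluate steps are passed in as callables so the control flow can be exercised with stubs.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Weights = Dict[str, Any]


def convert_blockwise(
    blocks: Iterable[Weights],
    quantize_block: Callable[[Weights], Weights],
    materialize: Callable[[Weights], None],
) -> Iterator[Tuple[int, Weights]]:
    """Quantize one transformer block at a time, forcing evaluation after each.

    Evaluating per block keeps every lazily built graph small, instead of
    deferring all the work to one large evaluation at save time.
    """
    for index, block_weights in enumerate(blocks):
        quantized = quantize_block(block_weights)
        materialize(quantized)  # with MLX this would plausibly be mx.eval
        yield index, quantized


# Stub demo: rounding stands in for quantization, a log stands in for eval.
evaluated = []


def fake_quantize(weights: Weights) -> Weights:
    return {name: round(value) for name, value in weights.items()}


def fake_materialize(tree: Weights) -> None:
    evaluated.append(sorted(tree))


blocks = [{"layers.0.w": 1.4}, {"layers.1.w": 2.6}]
results = list(convert_blockwise(blocks, fake_quantize, fake_materialize))
```

In a real run, `materialize` could be `mx.eval` (which accepts a dict of arrays) and `quantize_block` could wrap `mx.quantize`, but how the actual converter wires these together is not shown in this post, so treat that mapping as a guess.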

Would love a sanity check from anyone who's been working on gemma4_unified support in mlx-lm / mlx-vlm upstream — happy to coordinate or PR the alias module if it's useful. And of course feedback / issues / PRs on the embedder ports are very welcome.

Massive thanks to Google DeepMind for the open release.
