MLX 4-bit conversion of gemma-4-12B-it — with vision + audio embedder weights preserved
Hi all,
Shipped what I believe is the first MLX conversion of the 12B Unified model that keeps the encoder-free vision_embedder and audio_embedder weights intact (not just the text backbone):
https://huggingface.co/jokernifty/gemma-4-12B-it-mlx-4bit-multimodal
What works today
- Text inference on Apple Silicon (via `mlx_lm` with a 4-line `gemma4_unified.py` alias; README has the snippet)
- `VisionEmbedder` + `AudioEmbedder` ported to MLX, weights load cleanly, forward passes produce sensible outputs
- Token-splicing forward pass (image_token_id / audio_token_id / video_token_id) implemented and tested with synthetic inputs
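
For anyone unfamiliar with the token-splicing idea, here is a minimal NumPy sketch of what that step could look like. It is not the code from the repo: the function name `splice_embeddings`, the placeholder id `9`, and the toy shapes are all assumptions made for illustration. It looks up the integer positions of the placeholder tokens and overwrites those rows of the text embeddings with the media embeddings, in order.

```python
import numpy as np

def splice_embeddings(token_ids, text_embeds, media_embeds, placeholder_id):
    """Overwrite rows of text_embeds at placeholder positions with media_embeds.

    Hypothetical helper for illustration; not the repo's actual implementation.
    """
    token_ids = np.asarray(token_ids)
    media_embeds = np.asarray(media_embeds)
    # Integer positions of the placeholder tokens (no boolean-mask indexing
    # on the embedding tensor itself).
    positions = np.flatnonzero(token_ids == placeholder_id)
    if positions.size != media_embeds.shape[0]:
        raise ValueError(
            f"{positions.size} placeholder tokens but "
            f"{media_embeds.shape[0]} media embeddings"
        )
    out = np.array(text_embeds, copy=True)  # leave the input untouched
    out[positions] = media_embeds
    return out

# Toy example: 5 tokens, hidden size 4, two image placeholders.
IMAGE_TOKEN_ID = 9  # assumed value, for illustration only
token_ids = [1, 9, 9, 5, 2]
text_embeds = np.zeros((5, 4), dtype=np.float32)
image_embeds = np.ones((2, 4), dtype=np.float32)
spliced = splice_embeddings(token_ids, text_embeds, image_embeds, IMAGE_TOKEN_ID)
```

The same call would be repeated per modality (image, audio, video), each with its own placeholder id and embedder output.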
What's WIP
- 🚧 Image preprocessor (resize → bucket to 70/140/280/560/1120 tokens → 48×48 patches → factorized 2D position IDs)
- 🚧 Audio preprocessor (16 kHz waveform → 640-sample frames)
- 🚧 Chat-template integration for `<boi>…<eoi>` / `<boa>…<eoa>` blocks
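
To make the planned preprocessing steps a bit more concrete, here is a rough NumPy sketch of three of them. Only the bucket sizes (70/140/280/560/1120), the 16 kHz rate, and the 640-sample frame length come from the list above. The function names, the "smallest bucket that fits" rule, the row-major patch order, and the zero-padding of the last audio frame are my assumptions, not confirmed behaviour of the reference preprocessor.

```python
import numpy as np

TOKEN_BUDGETS = (70, 140, 280, 560, 1120)  # bucket sizes listed above
FRAME_LEN = 640  # samples per frame; 40 ms at 16 kHz

def pick_token_budget(tokens_needed):
    """Smallest bucket that fits, else the largest one (assumed rule)."""
    for budget in TOKEN_BUDGETS:
        if tokens_needed <= budget:
            return budget
    return TOKEN_BUDGETS[-1]

def factorized_position_ids(rows, cols):
    """Separate row and column ids per patch, in row-major order (assumed)."""
    row_ids = np.repeat(np.arange(rows), cols)
    col_ids = np.tile(np.arange(cols), rows)
    return row_ids, col_ids

def frame_audio(waveform, frame_len=FRAME_LEN):
    """Zero-pad a 1-D waveform to a multiple of frame_len, then reshape
    to (num_frames, frame_len). The padding choice is an assumption."""
    waveform = np.asarray(waveform, dtype=np.float32)
    pad = (-len(waveform)) % frame_len
    return np.pad(waveform, (0, pad)).reshape(-1, frame_len)

budget = pick_token_budget(100)                   # 140
row_ids, col_ids = factorized_position_ids(2, 3)  # [0,0,0,1,1,1], [0,1,2,0,1,2]
frames = frame_audio(np.zeros(16000))             # one second -> shape (25, 640)
```

Resizing and patch extraction are left out here because they depend on details of the reference implementation that the list does not spell out.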
The companion package with the embedders + converter is at https://github.com/jokernifty/mlx-gemma4-unified (currently scaffolded — preprocessors next).
A few notes on what was tricky
- `mlx-lm` doesn't yet register the `gemma4_unified` model_type; solved with a small alias that subclasses `gemma4.Model` and adds the encoder-free weight names to the sanitize skip list.
- Standard `mlx_lm.convert` hits the macOS Metal 5-second command-buffer watchdog because the final eval+save of all quantized weights happens in one giant graph. Worked around with a layer-by-layer converter that calls `mx.eval()` after each transformer block (~1.2 s each on M-series).
- The 262k × 3840 token embedding can't be quantized in one shot for the same watchdog reason; kept in bf16. That adds ~2 GB to the file size, but it is standard practice for many mlx-community checkpoints.
- MLX doesn't yet support boolean indexing, so the masked_scatter equivalent uses a numpy-host helper for the integer indices.
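
The layer-by-layer workaround from the notes above can be sketched roughly as follows. This is not the converter from the repo: `group_by_block` and `convert_blockwise` are hypothetical names, the `layers.<N>.` weight-name pattern is an assumption about the checkpoint layout, and the quantize and materialize steps are passed in as callables so the sketch runs without MLX. With MLX, `materialize` would presumably be `mx.eval`, which forces the lazy graph for that chunk before the next block is touched.

```python
import re
from collections import defaultdict

# Assumed naming pattern, e.g. "model.layers.12.self_attn.q_proj.weight".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")

def group_by_block(weight_names):
    """Group weight names by transformer block index.

    Names that do not belong to a block (embeddings, final norm, ...)
    are collected under the key None.
    """
    groups = defaultdict(list)
    for name in weight_names:
        match = _LAYER_RE.search(name)
        groups[int(match.group(1)) if match else None].append(name)
    return dict(groups)

def convert_blockwise(weights, quantize, materialize):
    """Quantize one block at a time, materializing each before the next.

    weights:     dict mapping weight name -> array
    quantize:    callable(dict) -> dict (hypothetical; caller supplies it)
    materialize: callable(dict) that forces evaluation of the chunk
    """
    groups = group_by_block(weights)
    # Numbered blocks first in ascending order, non-block tensors last.
    order = sorted(groups, key=lambda k: (k is None, -1 if k is None else k))
    converted = {}
    for key in order:
        chunk = quantize({name: weights[name] for name in groups[key]})
        materialize(chunk)  # keeps each evaluated graph small
        converted.update(chunk)
    return converted

# Toy run with stand-in callables that only record the processing order.
weights = {
    "model.layers.1.mlp.up_proj.weight": 1.0,
    "model.layers.0.self_attn.q_proj.weight": 2.0,
    "model.embed_tokens.weight": 3.0,
}
calls = []
converted = convert_blockwise(
    weights,
    quantize=lambda chunk: chunk,  # identity stand-in
    materialize=lambda chunk: calls.append(sorted(chunk)),
)
```

In this sketch the non-block group also goes through `quantize`; a converter that keeps the token embedding in bf16, as described above, would leave that tensor out inside its own quantize function.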
Would love a sanity check from anyone who's been working on gemma4_unified support in mlx-lm / mlx-vlm upstream — happy to coordinate or PR the alias module if it's useful. And of course feedback / issues / PRs on the embedder ports are very welcome.
Massive thanks to Google DeepMind for the open release