⚠️ Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output due to quantizing PLE (Per-Layer Embedding) layers.
#1
by Alkd - opened
This was fixed two days ago, during the release window: https://github.com/Blaizzy/mlx-vlm/pull/893
The following works fine for me:
pip install --upgrade mlx-vlm
mlx_vlm.generate --model mlx-community/gemma-4-e2b-it-bf16 --prompt "Who are you?"
This works for bf16 but not for quantized models. See http://github.com/jundot/omlx/issues/534 — never mind, it seems to work now!
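For anyone re-quantizing locally, the root cause described above suggests excluding the PLE (Per-Layer Embedding) weights when selecting which modules to quantize. A minimal sketch of that filtering step, assuming a simple substring match on parameter paths (the marker strings below are illustrative guesses, not the actual names used by mlx-vlm or the Gemma checkpoints):

```python
# Sketch: decide which module paths are safe to quantize,
# leaving Per-Layer Embedding (PLE) weights in full precision.
# The marker substrings are assumptions for illustration only.
PLE_MARKERS = ("per_layer_embed", "embed_tokens_per_layer")

def should_quantize(path: str) -> bool:
    """Return True if the module at `path` may be quantized."""
    return not any(marker in path for marker in PLE_MARKERS)

paths = [
    "language_model.layers.0.mlp.down_proj",   # regular weight: quantize
    "language_model.per_layer_embed.0",        # PLE weight: keep as bf16
]
print([p for p in paths if should_quantize(p)])
```

In practice you would pass a predicate like this to your quantization routine so the PLE tensors are skipped, rather than quantizing every linear/embedding layer uniformly.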