Instructions to use agentmish/gemma-4-12B-it-mlx-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use agentmish/gemma-4-12B-it-mlx-8bit with MLX:
```python
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("agentmish/gemma-4-12B-it-mlx-8bit")
config = load_config("agentmish/gemma-4-12B-it-mlx-8bit")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use agentmish/gemma-4-12B-it-mlx-8bit with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "agentmish/gemma-4-12B-it-mlx-8bit"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "agentmish/gemma-4-12B-it-mlx-8bit" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use agentmish/gemma-4-12B-it-mlx-8bit with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "agentmish/gemma-4-12B-it-mlx-8bit"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default agentmish/gemma-4-12B-it-mlx-8bit
```
Run Hermes
hermes
- gemma-4-12B-it-mlx-8bit
- How it was converted
- Quantization / PLE note (why this is 8.512 bits-per-weight, not 8.0)
- Footprint & performance (Apple M-series, 96 GB unified memory)
- Verified working examples (proof of correctness)
- Usage (mlx-vlm)
- Benchmarks — making it snappy (honest findings)
- Optional speedup — MTP speculative decoding (add-on, not required)
- License & attribution
gemma-4-12B-it-mlx-8bit
An 8-bit MLX conversion of google/gemma-4-12B-it,
the encoder-free unified multimodal Gemma 4 12B instruction-tuned model (text + image + audio +
video in a single decoder, model_type="gemma4_unified", 48 layers, vocab 262144, 256K context).
Converted with mlx-vlm for Apple Silicon. Runs comfortably
on an M-series Mac with unified memory.
mlx-vlm was used because the 12B is multimodal/encoder-free:
mlx-lm would convert only the text decoder and silently drop the vision and audio towers. Use this VLM build for the full model.
How it was converted
Resolved toolchain: mlx-vlm 0.6.1, mlx 0.31.2, transformers 5.9.0, huggingface_hub 1.17.0 (Python 3.12).
Exact command:
```bash
mlx_vlm.convert \
  --hf-path google/gemma-4-12B-it \
  --mlx-path ./gemma-4-12B-it-mlx-8bit \
  -q --q-bits 8 --q-group-size 64 --dtype bfloat16
```
Quantization / PLE note (why this is 8.512 bits-per-weight, not 8.0)
Gemma 4 uses Per-Layer Embeddings (PLE) with scaled embedding/linear tables. Naive low-bit
quantization of those tables produces incoherent (looping / garbage) output. The mlx-vlm 0.6.1
converter applies a PLE-aware predicate: it keeps the per-layer PLE tables (per_layer_*,
altup, laurel), every normalization layer, and all per-layer scalars in bf16, and quantizes
the large matmuls (MLP gate/up/down, attention q/k/v/o, and the token/vision/audio embedding
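projections) to 8-bit (affine, group size 64). Affine quantization also stores a 16-bit scale and
bias for each group of 64 weights, which adds about 0.5 bits per weight; that overhead plus the
bf16-kept tensors is why the measured average is 8.512 bits per weight rather than a flat 8.0.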
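The bits-per-weight arithmetic can be sketched as follows. This is a minimal illustration, assuming that affine quantization stores one 16-bit scale and one 16-bit bias per group; the function names and the parameter counts in the usage note are hypothetical and were not measured from this model.

```python
def quantized_bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective bits per weight of an affine-quantized tensor.

    Assumes each group of `group_size` weights carries one scale and one
    bias, each stored in `meta_bits` bits (an assumption, see lead-in).
    """
    return q_bits + 2 * meta_bits / group_size


def average_bits_per_weight(
    quantized_params: int,
    full_precision_params: int,
    q_bits: int = 8,
    group_size: int = 64,
    full_bits: int = 16,
) -> float:
    """Parameter-weighted average over quantized and bf16-kept tensors."""
    total_params = quantized_params + full_precision_params
    total_bits = (
        quantized_params * quantized_bits_per_weight(q_bits, group_size)
        + full_precision_params * full_bits
    )
    return total_bits / total_params


# 8-bit weights with group size 64: 8 + 32/64 bits per weight.
print(quantized_bits_per_weight(8, 64))
```

Under these assumptions the quantized tensors alone already cost 8.5 bits per weight, so only a small share of bf16 parameters is needed to lift the average slightly above that. Calling `average_bits_per_weight` with the real per-tensor parameter counts would show how the measured figure decomposes.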
This 8-bit build was validated by real generations (correct factual + reasoning answers, fluent non-repetitive prose) before publishing.
Footprint & performance (Apple M-series, 96 GB unified memory)
| metric | value |
|---|---|
| weights on disk | ~12 GB (3 safetensors shards) |
| peak inference memory | ~12.85 GB |
| generation speed | ~40–43 tokens/sec (text, batch 1, no KV-cache quant) |
| cold model load | ~3.5 s |
Verified working examples (proof of correctness)
This build was checked with assertion-style examples, not just eyeballed. All passed.
Text — correctness assertions (both PASS):
```
CHECK: capital-of-France -> PASS
  A: The capital of France is **Paris**. (+ Eiffel Tower, Louvre Museum)
CHECK: 17-sheep-all-but-9 -> PASS
  A: There are **9** sheep left. ("all but 9" = every sheep except that number)
ALL CHECKS PASSED
```
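An assertion-style check of this kind could look like the sketch below. It is not the script used for this card: `normalize` and `run_checks` are hypothetical helpers, and the answers are passed in as plain strings so the example runs without loading the model.

```python
import re
from typing import Dict, List, Tuple

# (check name, model answer, substrings that must all appear in the answer)
Case = Tuple[str, str, List[str]]


def normalize(text: str) -> str:
    """Strip markdown emphasis characters and lowercase for matching."""
    return re.sub(r"[*_`]", "", text).lower()


def run_checks(cases: List[Case]) -> Dict[str, bool]:
    """Return a pass/fail flag per check and print a short report."""
    results: Dict[str, bool] = {}
    for name, answer, required in cases:
        cleaned = normalize(answer)
        passed = all(term.lower() in cleaned for term in required)
        results[name] = passed
        print(f"CHECK: {name} -> {'PASS' if passed else 'FAIL'}")
    if all(results.values()):
        print("ALL CHECKS PASSED")
    return results


# Answers copied from the log above; in practice they would come from generate().
cases = [
    ("capital-of-France", "The capital of France is **Paris**.", ["paris"]),
    ("17-sheep-all-but-9", "There are **9** sheep left.", ["9", "sheep"]),
]
results = run_checks(cases)
```

Substring matching is deliberately loose: it tolerates extra prose around the answer, at the cost of not catching an answer that mentions the right term while stating the wrong conclusion.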
Multimodal — vision path (proof the image tower survived conversion):
Given a sample photo of two cats, the 8-bit model returned:
Two tabby cats are resting on a bright pink blanket inside a vehicle. They are positioned near
two white remote controls, with one cat stretching out and the other curled slightly inward.
The description is accurate (two tabby cats, pink blanket, two white remotes), confirming the
vision encoder converted correctly. This is the concrete reason mlx-vlm was required over
mlx-lm, which would have dropped the vision and audio towers.
Usage (mlx-vlm)
Install: pip install mlx-vlm (this is a VLM, not a plain text model).
Text generation:
```bash
python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --prompt "In one sentence, what is Apple MLX?" \
  --temperature 0.0 --max-tokens 200
```
Sampling flags:
mlx_vlm.generate does not accept --top-p / --top-k (those are mlx_lm flags). Pass them through --gen-kwargs instead, or rely on the model's generation_config.json defaults (temperature 1.0, top_k 64, top_p 0.95):
```bash
python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --prompt "Explain unified memory on Apple Silicon." \
  --temperature 1.0 --max-tokens 350 --gen-kwargs '{"top_p": 0.95, "top_k": 64}'
```
Multimodal (image example):
```bash
python -m mlx_vlm.generate --model agentmish/gemma-4-12B-it-mlx-8bit \
  --image path/or/url.jpg --prompt "Describe this image." --max-tokens 200
```
Chat format: Gemma 4 uses <|turn> … <turn|> turn delimiters and auto-prepends <bos> via its
chat template (the template is bundled here as chat_template.jinja); do not add <bos> yourself.
Thinking is opt-in (--enable-thinking); by default the template emits an empty thought channel so
the model answers directly.
Benchmarks — making it snappy (honest findings)
Measured on a 96 GB Apple Silicon Mac, batch size 1, greedy decoding. The headline is that the dominant snappiness lever is the 8-bit weight quantization itself (this build) plus keeping the model resident and using a prompt cache — not KV-cache quantization.
| lever | effect | when to use |
|---|---|---|
| 8-bit weights (this build) | ~12 GB, ~40–43 tok/s, ~3.5 s cold load | always (the main win) |
| prompt cache (repeated long context) | ~19.7x faster prefill on a shared prefix (e.g. 1380 cached tokens: 2.52 s → 0.128 s; 2nd-turn TTFT ~2.7 s → ~0.31 s) | multi-turn / shared-context apps |
| TurboQuant KV (--kv-quant-scheme turboquant --kv-bits 3.5) | negligible here: only ~75 MB saved at 6.4k tokens, and decode ~1.5 tok/s slower (dequant per step) | very long context / batched serving only |
| uniform KV quant | broken for Gemma 4: crashes (RotatingKVCache Quantization NYI) | avoid; use the TurboQuant scheme instead |
Why KV quantization barely helps at normal lengths: under Gemma's grouped-query + sliding-window attention the KV cache stays small, so peak memory is dominated by the one-shot prefill activation spike plus the 12 GB of weights, not by the cache. TurboQuant's benefit only scales with steady-state KV (tens of thousands of tokens, or many concurrent requests). The prompt cache is the single biggest latency lever for repeated/shared context, and it triggers when the cached prefix is strictly shorter than the new prompt (shared context + a new trailing question), not when you re-send an identical prompt.
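The prompt-cache trigger rule described above can be written down as a small predicate. This is only an illustration of the rule as stated, not mlx-vlm's cache implementation; `reusable_prefix_len` is a hypothetical helper operating on token-id lists.

```python
from typing import List


def reusable_prefix_len(cached_tokens: List[int], new_prompt: List[int]) -> int:
    """Number of cached tokens that can be reused for `new_prompt`.

    The cache applies only when the cached sequence is a strict prefix of
    the new prompt (shared context plus a new trailing question). An
    identical or diverging prompt reuses nothing under this rule.
    """
    n = len(cached_tokens)
    if n < len(new_prompt) and new_prompt[:n] == cached_tokens:
        return n
    return 0


shared_context = [101, 7, 7, 42]   # toy token ids for the shared prefix
follow_up = shared_context + [9, 3]  # shared prefix plus a new question

print(reusable_prefix_len(shared_context, follow_up))       # strict prefix: reused
print(reusable_prefix_len(shared_context, shared_context))  # identical: not reused

# Prefill speedup from the table row: 2.52 s uncached vs 0.128 s cached.
prefill_speedup = 2.52 / 0.128
```

With a strict-prefix hit, only the trailing tokens need a prefill pass, which is where the roughly 19.7x figure in the table comes from.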
Optional speedup — MTP speculative decoding (add-on, not required)
An optional Multi-Token-Prediction drafter is published separately at
agentmish/gemma-4-12B-it-assistant-mlx-bf16.
It is a true add-on: this base model runs on its own, and the drafter only accelerates decoding when
you opt in with --draft-model … --draft-kind mtp --draft-block-size 3. On this hardware it gives a
real ~1.70x speedup on short factual prompts and ~1.36x on long generation. See that repo's
card for the full block-size sweep and an honest note on temperature-0 output equivalence.
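For readers unfamiliar with speculative decoding, the generic draft-and-verify loop under greedy decoding can be sketched with toy models. This is not mlx-vlm's MTP implementation: `toy_target` and `toy_draft` are made-up deterministic functions, and a real system verifies a whole block in one batched forward pass rather than token by token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def toy_target(seq: List[int]) -> int:
    """Stand-in for the base model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    """Stand-in drafter: agrees with the target except at every 4th position."""
    tok = toy_target(seq)
    return tok if len(seq) % 4 else (tok + 1) % 50


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int],
    n_new: int, block_size: int = 3,
) -> Tuple[List[int], int]:
    """Return (new tokens, number of verification rounds)."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_new:
        # 1) The drafter proposes a block of tokens.
        proposal: List[int] = []
        for _ in range(block_size):
            proposal.append(draft(seq + proposal))
        # 2) The target verifies the block, keeping the longest agreeing prefix.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # Whole block accepted: the target contributes one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], rounds


prompt = [1, 2, 3]
spec_tokens, rounds = speculative_decode(toy_target, toy_draft, prompt, n_new=20)
baseline = greedy_decode(toy_target, prompt, n_new=20)
print(spec_tokens == baseline, rounds)
```

Every emitted token is one the target would have chosen itself, so in this toy setting the output matches plain greedy decoding while needing fewer target rounds. Whether exact equivalence holds on real hardware is a separate question, which the drafter repo's card addresses.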
License & attribution
This conversion is released under Apache-2.0, inheriting the license of the base model.
NOTICE
This is a format conversion (8-bit MLX quantization) of google/gemma-4-12B-it
(© Google, licensed under Apache-2.0). All model weights and the underlying architecture are the
work of Google; this repository only re-encodes them for the Apple MLX runtime via mlx-vlm. No
model behavior was intentionally modified beyond 8-bit quantization. Refer to the base model card
for intended use, capabilities, and limitations:
https://huggingface.co/google/gemma-4-12B-it