Gemma 4 12B Instruction-Tuned — MLX (Apple Silicon)
Local MLX quants of google/gemma-4-12B-it for native inference on Apple Silicon via mlx-vlm.
| Property | Value |
|---|---|
| Parameters | ~12B dense |
| Modalities | Text, vision, audio (native in backbone) |
| License | Apache 2.0 |
| Runtime | mlx-vlm (not mlx-lm, because Gemma 4 is multimodal) |
| Format | MLX safetensors, one subfolder per quant |
Also available: Edmon02/gemma-4-12B-it-GGUF for llama.cpp / LM Studio.
Why this repo exists
- One download hub for curated MLX quants (4bit, mxfp4, 6bit, 8bit).
- PLE-safe conversion from the official Google checkpoint with mlx-vlm >= 0.6.0.
- Documented recipes in gemma-4-12b-local.
Available quants
See mlx-manifest.json for the live file list.
| Subfolder | Use |
|---|---|
| 4bit/ | Default; best balance on 16 GB unified memory |
| mxfp4/ | Apple-optimized 4-bit; often fastest on M-series |
| 6bit/ | Higher quality |
| 8bit/ | Max quality that still fits ~16 GB at inference |
To load a specific quant, download its subfolder and point load() at the resulting local path.
Download
pip install -U mlx-vlm huggingface_hub
# Recommended quant (4bit)
huggingface-cli download Edmon02/gemma-4-12B-it-MLX --include "4bit/*" --local-dir ./models/gemma-4-12b-mlx
Accept the license on google/gemma-4-12B-it before using weights.
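The same download can be scripted from Python. The sketch below assumes `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. The helper names `quant_patterns` and `download_quant` are illustrative and not part of this repo.

```python
from pathlib import Path

REPO_ID = "Edmon02/gemma-4-12B-it-MLX"
QUANTS = ("4bit", "mxfp4", "6bit", "8bit")  # subfolders listed in this card


def quant_patterns(quant: str) -> list[str]:
    """Return the glob patterns that select one quant subfolder."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant {quant!r}; expected one of {QUANTS}")
    return [f"{quant}/*"]


def download_quant(quant: str = "4bit", local_dir: str = "./models/gemma-4-12b-mlx") -> Path:
    """Download a single quant subfolder and return its local path (hypothetical helper)."""
    # Imported lazily so quant_patterns() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=quant_patterns(quant),
        local_dir=local_dir,
    )
    return Path(root) / quant
```

Calling `download_quant("4bit")` should leave the weights under `./models/gemma-4-12b-mlx/4bit`, which is the path the quick-start commands expect.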
Quick start
Text chat (CLI)
python -m mlx_vlm.generate \
--model ./models/gemma-4-12b-mlx/4bit \
--prompt "List three benefits of encoder-free multimodal models." \
--max-tokens 256 --temperature 0.7
Text chat (Python — use chat template)
Gemma 4 requires the chat template; generate() does not apply it automatically:
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template

# Load the 4-bit quant from the local download directory
model, processor = load("./models/gemma-4-12b-mlx/4bit")

# Wrap the conversation in the chat template before generating
prompt = apply_chat_template(
    processor,
    model.config,
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
)
print(generate(model=model, processor=processor, prompt=prompt, max_tokens=256).text)
Vision (image + text)
python -m mlx_vlm.generate \
--model ./models/gemma-4-12b-mlx/4bit \
--prompt "Describe this image in one sentence." \
--image photo.jpg \
--max-tokens 128
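A Python counterpart to the vision command might look like the sketch below. It assumes that `apply_chat_template` accepts a `num_images` argument, that `generate` takes a list of image paths as its fourth argument, and that the result exposes `.text` as in the text-chat example. The wrapper name `describe_image` is hypothetical.

```python
def describe_image(model_path: str, image_path: str, question: str, max_tokens: int = 128) -> str:
    """Answer a question about one image with a local MLX quant (hypothetical helper)."""
    # Imported lazily so that defining the function does not require mlx-vlm.
    from mlx_vlm import generate, load
    from mlx_vlm.prompt_utils import apply_chat_template

    model, processor = load(model_path)

    # num_images tells the template how many image placeholders to insert.
    prompt = apply_chat_template(processor, model.config, question, num_images=1)

    result = generate(model, processor, prompt, [image_path], max_tokens=max_tokens)
    return result.text


# Example call (needs the downloaded weights and a local photo.jpg):
# print(describe_image("./models/gemma-4-12b-mlx/4bit", "photo.jpg",
#                      "Describe this image in one sentence."))
```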
Hardware guide
| Unified memory | Suggested quant |
|---|---|
| 8 GB | 4bit/ only, short context |
| 16 GB | 4bit/ or mxfp4/ |
| 24 GB+ | 6bit/ or 8bit/ |
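The table can be encoded as a small lookup, shown here with a back-of-the-envelope weight-size estimate. This is a sketch under stated assumptions: memory sizes between tiers fall back to the lower tier, anything under 8 GB returns no suggestion, and the estimate treats "~12B" as exactly 12 billion parameters while ignoring quantization scales, the KV cache, and activations. Both function names are illustrative.

```python
# Tiers copied from the hardware guide: (minimum unified memory in GB, suggested quants)
HARDWARE_GUIDE = [
    (24, ["6bit", "8bit"]),
    (16, ["4bit", "mxfp4"]),
    (8, ["4bit"]),
]


def suggest_quants(unified_memory_gb: float) -> list[str]:
    """Return the suggested quant subfolders for a given amount of unified memory."""
    for min_gb, quants in HARDWARE_GUIDE:
        if unified_memory_gb >= min_gb:
            return list(quants)
    return []  # assumption: the guide gives no suggestion below 8 GB


def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


print(suggest_quants(16))        # ['4bit', 'mxfp4']
print(approx_weight_gb(12, 4))   # 6.0
```

The estimate covers only the weights. Real memory use at inference is higher and grows with context length, which is why the 8 GB row is limited to short context.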
Provenance
| Item | Source |
|---|---|
| Base model | google/gemma-4-12B-it |
| Conversion | Local mlx_vlm.convert via scripts/convert_gemma4_mlx_quants.py |
| Maintainer | Edmon02/audio_set |
Limitations
- Converted locally; validate quality on your tasks against the official BF16 weights.
- Audio support depends on your mlx-vlm version; confirm processor_config.json is present.
- Gated upstream: an HF token and license acceptance are required for google/* repos.
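The audio-related checks above can be run before loading a model. The sketch below uses only the standard library. The function names are illustrative, and it only reports what is installed and present; it does not decide whether a given mlx-vlm version supports audio.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def installed_mlx_vlm_version():
    """Return the installed mlx-vlm version string, or None if it is not installed."""
    try:
        return version("mlx-vlm")
    except PackageNotFoundError:
        return None


def has_processor_config(quant_dir) -> bool:
    """Check that processor_config.json exists in a downloaded quant folder."""
    return (Path(quant_dir) / "processor_config.json").is_file()


print("mlx-vlm version:", installed_mlx_vlm_version())
print("processor_config.json present:", has_processor_config("./models/gemma-4-12b-mlx/4bit"))
```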
Citation
@article{gemma_2026,
title={Gemma 4},
author={Google DeepMind},
year={2026},
url={https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/}
}