ToPo-ToPo/gemma-4-31b-it-mlx-8bit

MLX 8-bit conversion of google/gemma-4-31b-it for Apple Silicon (mlx-vlm).

Provenance (self-converted from official weights)

  • Source: google/gemma-4-31b-it (license: gemma)
  • Tool: mlx-vlm 0.6.3, invoked as:
    mlx_vlm.convert --hf-path google/gemma-4-31b-it --mlx-path . -q --q-bits 8 --q-group-size 64
  • Effective: 8.643 bits/weight
  • Validation: reproduced geometrically exact CAD output in an agentic CAD+FEM pipeline (volumes match the reference mlx-community conversion).
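The effective figure above is higher than 8 because group-wise affine quantization stores metadata alongside the packed weights. The sketch below works through that arithmetic. It assumes one scale and one bias per group, each stored as a 16-bit float; the helper name `affine_bits_per_weight` is hypothetical and is not part of mlx-vlm. The remaining gap up to 8.643 is presumably due to tensors left unquantized, which this card does not itemize.

```python
def affine_bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits per weight for group-wise affine quantization.

    Assumption: each group of `group_size` weights carries one scale and
    one bias, each stored in `meta_bits` bits.
    """
    return q_bits + 2 * meta_bits / group_size


# Settings taken from the conversion command: --q-bits 8 --q-group-size 64
quantized = affine_bits_per_weight(8, 64)

# Figure reported in this card
reported = 8.643

# Overhead not explained by per-group scale and bias
extra = reported - quantized

print(f"quantized layers: {quantized} bits/weight")
print(f"unexplained overhead: {extra:.3f} bits/weight")
```

Under these assumptions the quantized layers alone come to 8.5 bits/weight, so about 0.143 bits/weight of the reported figure comes from elsewhere in the checkpoint.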

Usage

from mlx_vlm import load, generate
model, processor = load("ToPo-ToPo/gemma-4-31b-it-mlx-8bit")

License

This is a derivative of Google Gemma. Use is governed by the Gemma Terms of Use and the Gemma Prohibited Use Policy. Weights were converted/quantized to MLX format (modification notice per the Gemma Terms).

⚡ Faster generation with MTP (speculative decoding, lossless)

Recommended drafter: google/gemma-4-31b-it-assistant — Google's official MTP drafter for this model. It loads directly in mlx-vlm (no conversion needed) and gives up to ~3x faster generation (≈1.4–1.5x measured on short prompts); output is identical to non-MTP decoding.

# requires:  pip install "mlx-vlm>=0.6.3"
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("ToPo-ToPo/gemma-4-31b-it-mlx-8bit")
draft_model, _   = load("google/gemma-4-31b-it-assistant")
config = load_config("ToPo-ToPo/gemma-4-31b-it-mlx-8bit")

prompt = apply_chat_template(processor, config, "Hello!", num_images=0)
out = generate(model, processor, prompt,
               draft_model=draft_model, draft_kind="mtp", max_tokens=256)

CLI (draft_kind auto-detected): mlx_vlm.generate --model ToPo-ToPo/gemma-4-31b-it-mlx-8bit --draft-model google/gemma-4-31b-it-assistant

Notes

  • draft_kind="mtp" is required in the Python API (the CLI auto-detects it).
  • Use this model's own drafter above — drafters are size-specific and not interchangeable across Gemma 4 variants.
  • Needs mlx-vlm >= 0.6.3. MTP is lossless — if output differs from non-MTP, your versions are mismatched.
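The losslessness note can be checked by generating the same prompt twice with deterministic (greedy) sampling, once with the drafter and once without, and comparing the resulting text. The helper below is a minimal sketch of that comparison only; `first_divergence` is a hypothetical name, and how the text is extracted from the return value of `generate` may depend on the installed mlx-vlm version.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(baseline: str, speculative: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal.

    A length mismatch counts as a divergence at the end of the shorter string.
    """
    for i, (a, b) in enumerate(zip_longest(baseline, speculative)):
        if a != b:
            return i
    return None


# Placeholder strings; in practice these would be the two generated texts.
plain_text = "The quick brown fox"
mtp_text = "The quick brown fox"

idx = first_divergence(plain_text, mtp_text)
if idx is None:
    print("outputs match")
else:
    print(f"outputs diverge at character {idx}")
```

If the two outputs diverge, the note above suggests checking that the installed mlx-vlm version and the drafter match this model before suspecting anything else.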