How to use with the MLX library
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm
# if on a CUDA device, also pip install mlx[cuda]

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Daizee/Gemma3-Callous-Calla-4B-mlx")

prompt = "Once upon a time in"
text = generate(model, tokenizer, prompt=prompt, verbose=True)

Gemma3-Callous-Calla-4B: MLX builds (Apple Silicon)

This repo hosts MLX-converted variants of Daizee/Gemma3-Callous-Calla-4B for fast, local inference on Apple Silicon (M-series).
Tokenizer/config are included at the repo root. MLX weight folders live under mlx/.

Note on vocab padding: For MLX compatibility, the tokenizer/embeddings were padded to the next multiple of 64 tokens.
In this build: 262,208 tokens (added 64 placeholder tokens named <pad_ex_*>).
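The figures in the note can be sanity-checked with a few lines of arithmetic. This is only a sketch built from the numbers quoted above; the variable names are illustrative and nothing here reads the actual tokenizer files.

```python
# Figures taken from the note above; names are illustrative only.
GROUP = 64               # padding granularity stated in the note
PADDED_VOCAB = 262_208   # vocabulary size stated for this build
ADDED_TOKENS = 64        # number of <pad_ex_*> placeholders stated in the note

# Size of the vocabulary before the placeholders were appended.
base_vocab = PADDED_VOCAB - ADDED_TOKENS

# The padded size should divide evenly into groups of 64.
is_aligned = PADDED_VOCAB % GROUP == 0

print(base_vocab, is_aligned)
```

If you inspect the tokenizer yourself, the placeholder entries should account for the difference between the two sizes.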

Variants

Path        Bits  Group size  Notes
mlx/g128/   int4  128         Smallest & fastest
mlx/g64/    int4  64          Slightly larger, often steadier
mlx/int8/   int8  -           Closest to fp16 quality (slower)
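To see why a smaller group size gives a slightly larger build, here is a rough estimate of the effective bits per weight. It assumes group-wise affine quantization that stores one 16-bit scale and one 16-bit bias per group. That storage layout is an assumption made for illustration and is not stated in this repo, and the helper name is hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Estimate storage cost per weight for group-wise quantization.

    Assumption (not confirmed by this repo): every group of `group_size`
    weights carries one scale and one bias, each stored in 16 bits.
    """
    overhead = (scale_bits + bias_bits) / group_size
    return bits + overhead


# The two int4 variants from the table; the int8 row lists no group size,
# so it is left out of this estimate.
g128_bits = effective_bits_per_weight(4, 128)
g64_bits = effective_bits_per_weight(4, 64)

print(f"mlx/g128: ~{g128_bits} bits per weight")
print(f"mlx/g64:  ~{g64_bits} bits per weight")
```

Under these assumptions the g64 build spends a little more storage on per-group metadata than g128, which matches the "slightly larger" note in the table.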

Quickstart (MLX-LM)

Run from Hugging Face (no cloning needed)

python -m mlx_lm.generate \
  --model hf://Daizee/Gemma3-Callous-Calla-4B-mlx/mlx/g64 \
  --prompt "Summarize the Bill of Rights for 7th graders in 4 bullet points." \
  --max-tokens 180 --temp 0.3 --top-p 0.92
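To try the other variants without retyping the command, a small helper can assemble the same invocation for any folder in the Variants table. This is a hypothetical convenience sketch: `build_generate_command` is not part of mlx-lm, and it simply reuses the model path form and sampling flags from the command above.

```python
import shlex

REPO = "Daizee/Gemma3-Callous-Calla-4B-mlx"
VARIANTS = ("g128", "g64", "int8")  # folder names under mlx/ from the Variants table


def build_generate_command(variant, prompt, max_tokens=180, temp=0.3, top_p=0.92):
    """Hypothetical helper: return the argv list for the mlx_lm.generate call."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [
        "python", "-m", "mlx_lm.generate",
        "--model", f"hf://{REPO}/mlx/{variant}",  # same path form as the command above
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temp", str(temp),
        "--top-p", str(top_p),
    ]


cmd = build_generate_command("g128", "Once upon a time in")
print(shlex.join(cmd))  # shell-safe string you can paste into a terminal
```

The returned list can also be passed to `subprocess.run(cmd, check=True)` on a machine where mlx-lm is installed.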