gemma-3-4b-it-qat-4bit-lite

Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB

Optimized version of gemma-3-4b-it-qat-4bit for Apple Silicon edge devices. Reduces model size from 2.8 GB to 2.3 GB with lower runtime memory and significantly reduced thermal output, while preserving text and image understanding quality.

For an even smaller version (2.1 GB) with weight splitting and neuron pruning, see gemma-3-4b-it-qat-4bit-mobile.

Optimizations Applied

| Step | Optimization | Effect |
|------|--------------|--------|
| 1 | Vocabulary pruning 262K → 144K tokens | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
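The ~3x figure follows from self-attention cost scaling with the square of the patch count. A quick back-of-the-envelope check in Python, using the 14-pixel patch size listed in the Architecture section:

```python
# Patches = (image_size / patch_size)^2; self-attention cost scales
# with the square of the patch count.
def attn_cost(image_size: int, patch_size: int = 14) -> int:
    patches = (image_size // patch_size) ** 2
    return patches ** 2

ratio = attn_cost(896) / attn_cost(672)
print(f"attention compute ratio: {ratio:.2f}x")  # ~3.16x
```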

Architecture

Text model:
  vocab_size: 262,208 (token_map → 144,257 compact embeddings)
  hidden_size: 2560
  intermediate_size: 10240
  num_hidden_layers: 31
  num_attention_heads: 8 (GQA, 4 KV heads)
  head_dim: 256
  quantization: 4-bit, group_size=64
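For a sense of what 4-bit, group_size=64 quantization costs per weight, here is a rough calculation assuming MLX-style affine quantization with one fp16 scale and one fp16 bias per group of 64 weights (the scale/bias layout is an assumption, not stated in this card):

```python
GROUP_SIZE = 64

# 4-bit packed value, plus fp16 scale and fp16 bias amortized over the group.
bits_per_weight = 4 + 2 * 16 / GROUP_SIZE
print(bits_per_weight)  # 4.5 bits/weight
```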

Vision model (SigLIP):
  hidden_size: 1152
  intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
  num_hidden_layers: 27
  image_size: 672
  patch_size: 14
  mm_tokens_per_image: 144
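Two of the vision numbers above can be cross-checked: the fc2 padding 4304 → 4352 rounds the intermediate size up to a multiple of the 64-weight quantization group, and mm_tokens_per_image = 144 is consistent with 672/14 = 48 patches per side pooled down by a factor of 4 in each dimension (the 4x4 pooling factor is inferred from the numbers, not stated in this card):

```python
import math

# fc2 padding: round intermediate_size up to a multiple of group_size=64
group = 64
padded = math.ceil(4304 / group) * group
print(padded)  # 4352

# image tokens: (672/14) = 48 patches per side, 4x4 average-pooled
patches_per_side = 672 // 14
tokens = (patches_per_side // 4) ** 2
print(tokens)  # 144
```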

Model Files

| File | Size | Description |
|------|------|-------------|
| model.safetensors | 2.3 GB | All weights (language + vision) |
| config.json | - | Model configuration with vocab_pruning metadata |

Requirements

This model uses a token_map for vocabulary pruning, so a stock Gemma 3 loader will not work unmodified. The inference engine must:

  1. Read vocab_pruning.compact_vocab_size (144,257) from config.json and initialize the embedding table at the compact size.
  2. Load language_model.model.embed_tokens.token_map (int32[262208]) and remap token IDs before lookup: embedding(token_map[input_ids]).
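The remapping step above can be sketched with NumPy as follows. Toy sizes are used so the snippet runs instantly; the real model uses full_vocab = 262,208, compact_vocab = 144,257, and hidden = 2560, with token_map and the embedding matrix loaded from model.safetensors rather than constructed here:

```python
import numpy as np

# Toy sizes for illustration (real: 262_208, 144_257, 2560).
full_vocab, compact_vocab, hidden = 1000, 600, 16

# token_map[i] gives the row in the compact embedding table for original
# token id i (in practice, loaded from
# language_model.model.embed_tokens.token_map).
token_map = np.zeros(full_vocab, dtype=np.int32)
token_map[:compact_vocab] = np.arange(compact_vocab, dtype=np.int32)

# Compact embedding table (in practice, loaded from model.safetensors).
embedding = np.random.randn(compact_vocab, hidden).astype(np.float32)

# Remap original token ids, then look up compact embeddings.
input_ids = np.array([[2, 105, 364]])  # arbitrary example ids
hidden_states = embedding[token_map[input_ids]]
print(hidden_states.shape)  # (1, 3, 16)
```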

Usage

Swift (swift-gemma-cli)

A native Swift CLI for running this model on Apple Silicon, with full support for token_map.

git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0

Benchmarks (Apple Silicon)

| Metric | Original | This Model |
|--------|----------|------------|
| Disk size | 2.8 GB | 2.3 GB |
| Peak memory (image) | ~5500 MB | 4590 MB |
| Prompt speed (text) | 109 t/s | ~120 t/s |
| Generation speed (text) | 90 t/s | ~110 t/s |
| Prompt speed (image) | 54 t/s | 123 t/s |
| Generation speed (image) | 27 t/s | 66 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Good |

Quality Notes

  • Image understanding is fully preserved: correctly identifies objects, colors, composition
  • Text quality is better than the mobile variant since no neuron pruning is applied
  • Recommended when text quality is prioritized over minimum model size

Base Model

gemma-3-4b-it-qat-4bit

License

Same as the base model. See Gemma Terms of Use.
