gemma-3-4b-it-qat-4bit-lite
Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB
Optimized version of gemma-3-4b-it-qat-4bit for Apple Silicon edge devices. Reduces model size from 2.8 GB to 2.3 GB with lower runtime memory and significantly reduced thermal output, while preserving text and image understanding quality.
For an even smaller version (2.1 GB) with weight splitting and neuron pruning, see gemma-3-4b-it-qat-4bit-mobile.
Optimizations Applied
| Step | Optimization | Effect |
|---|---|---|
| 1 | Vocabulary pruning 262K → 144K tokens | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
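The disk and compute figures in the table can be roughly reproduced with back-of-envelope arithmetic. This sketch assumes 4-bit weights carry one shared scale and bias per 64-weight group (~0.5625 bytes/weight); that storage-format detail is an assumption, not stated in the card.

```python
# Back-of-envelope checks for the optimization table above.
# 0.5625 bytes/weight = 4-bit value + (scale+bias)/group of 64 (assumed format).
BYTES_PER_WEIGHT = 0.5 + 4 / 64  # = 0.5625

# Step 1: pruning 262,208 -> 144,257 embedding rows of width 2560
removed_rows = 262_208 - 144_257
embed_savings_mb = removed_rows * 2560 * BYTES_PER_WEIGHT / 1e6
print(f"embedding savings: ~{embed_savings_mb:.0f} MB")  # ~170 MB

# Step 4: 896px -> 672px at patch_size 14; attention cost scales ~n^2 in tokens
hi_tokens = (896 // 14) ** 2   # 4096 patches
lo_tokens = (672 // 14) ** 2   # 2304 patches
attn_ratio = (hi_tokens / lo_tokens) ** 2
print(f"vision attention compute ratio: ~{attn_ratio:.1f}x")  # ~3.2x
```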
Architecture
Text model:
vocab_size: 262,208 (token_map → 144,257 compact embeddings)
hidden_size: 2560
intermediate_size: 10240
num_hidden_layers: 31
num_attention_heads: 8 (GQA, 4 KV heads)
head_dim: 256
quantization: 4-bit, group_size=64
Vision model (SigLIP):
hidden_size: 1152
intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
num_hidden_layers: 27
image_size: 672
patch_size: 14
mm_tokens_per_image: 144
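Two of the vision numbers above can be cross-checked with simple arithmetic. Note the 4x pooling factor below is inferred from the config values (48x48 patch grid vs. 144 image tokens), not stated in the card.

```python
GROUP_SIZE = 64  # quantization group size from the text-model config above

# fc2 padding: 4-bit group quantization needs row length divisible by the group
assert 4304 % GROUP_SIZE != 0  # 67.25 groups -- cannot quantize directly
assert 4352 % GROUP_SIZE == 0  # 68 whole groups after padding 4304 -> 4352

# Image tokens: 672px / 14px patches = 48x48 grid; 144 tokens = 12x12,
# i.e. an apparent 4x4 pooling of the patch grid (pool factor inferred)
grid = 672 // 14           # 48 patches per side
pooled = int(144 ** 0.5)   # 12 tokens per side
print(grid, pooled, grid // pooled)  # 48 12 4
```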
Model Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 2.3 GB | All weights (language + vision) |
| config.json | - | Model configuration with vocab_pruning metadata |
Requirements
This model uses a token_map for vocabulary pruning. The inference engine must:
- Token map: read `vocab_pruning.compact_vocab_size` (144,257) from config.json, initialize the embedding with the compact size, load `language_model.model.embed_tokens.token_map` (int32[262208]), and remap ids as `embedding(token_map[input_ids])`.
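The remapping step is straightforward to sketch. This is an illustrative toy (small stand-in sizes and a made-up identity-style mapping; the real model uses 262,208 → 144,257 ids and hidden_size 2560), not the actual engine code:

```python
import numpy as np

# Toy stand-ins for vocab_pruning.compact_vocab_size and hidden_size
FULL_VOCAB, COMPACT_VOCAB, HIDDEN = 1000, 600, 8

# token_map[i] = compact embedding row for full-vocab id i. Here a toy
# mapping that clamps pruned ids to the last compact row for demonstration.
token_map = np.minimum(np.arange(FULL_VOCAB), COMPACT_VOCAB - 1).astype(np.int32)

# Embedding table initialized with the compact size, not the full vocab
embed_tokens = np.random.randn(COMPACT_VOCAB, HIDDEN).astype(np.float32)

def embed(input_ids: np.ndarray) -> np.ndarray:
    """embedding(token_map[input_ids]) from the requirements above."""
    return embed_tokens[token_map[input_ids]]

hidden = embed(np.array([2, 100, 900]))
print(hidden.shape)  # (3, 8)
```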
Usage
Swift (swift-gemma-cli)
A native Swift CLI for running this model on Apple Silicon, with full support for token_map.
```bash
git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0
```
Benchmarks (Apple Silicon)
| Metric | Original | This Model |
|---|---|---|
| Disk size | 2.8 GB | 2.3 GB |
| Peak memory (image) | ~5500 MB | 4590 MB |
| Prompt speed (text) | 109 t/s | ~120 t/s |
| Generation speed (text) | 90 t/s | ~110 t/s |
| Prompt speed (image) | 54 t/s | 123 t/s |
| Generation speed (image) | 27 t/s | 66 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Good |
Quality Notes
- Image understanding is fully preserved: correctly identifies objects, colors, composition
- Text quality is better than the mobile variant, since no neuron pruning is applied
- Recommended when text quality is prioritized over minimum model size
Base Model
mlx-community/gemma-3-4b-it-qat-4bit
License
Same as the base model. See the Gemma Terms of Use.
Model tree for AtomGradient/gemma-3-4b-it-qat-4bit-lite
Base model: mlx-community/gemma-3-4b-it-qat-4bit