Bonsai

Prism ML Website  |  Whitepaper  |  Demo & Examples  |  Colab Notebook  |  Discord

Bonsai-4B-mlx-1bit

End-to-end 1-bit language model for Apple Silicon

12.8x smaller than FP16 | 4.8x faster on M4 Pro | 60 tok/s on iPhone | runs on Mac, iPhone, iPad

Highlights

  • Deployed footprint — runs comfortably on any Mac or iPhone
  • End-to-end 1-bit weights across embeddings, attention projections, MLP projections, and LM head
  • MLX-native format (1-bit g128) with inline dequantization kernels — no FP16 materialization
  • Cross-platform companion: also available as GGUF Q1_0_g128 for llama.cpp


Resources

  • Google Colab — try Bonsai in your browser, no setup required
  • Whitepaper — for more details on Bonsai
  • Demo repo — comprehensive examples for serving, benchmarking, and integrating Bonsai
  • Discord — join the community for support, discussion, and updates
  • 1-bit kernels: MLX fork (Apple Silicon) · mlx-swift fork (iOS/macOS) · llama.cpp fork (CUDA + Metal)
  • Locally AI — we have partnered with Locally AI for iPhone support

Model Overview

| Item | Specification |
| --- | --- |
| Parameters | 4.0B (~3.6B non-embedding) |
| Architecture | Qwen3-4B dense: GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm |
| Layers | 36 Transformer decoder blocks |
| Context length | 32,768 tokens |
| Vocab size | 151,936 |
| Weight format | MLX 1-bit g128 |
| Deployed size | 0.63 GB (12.8x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |
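
While the 1-bit weights fit in 0.63 GB, the KV cache is a separate cost that GQA keeps in check: 8 KV heads instead of 32 means a 4x smaller cache. A back-of-the-envelope sketch, assuming head_dim = 128 (the standard Qwen3-4B value, not stated in the table) and an FP16 cache, so treat the exact figures as illustrative:

```python
layers, kv_heads, head_dim = 36, 8, 128  # head_dim = 128 is an assumption (Qwen3-4B default)
bytes_fp16 = 2

# K and V tensors per token, across all decoder layers, in an FP16 cache.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
print(kv_bytes_per_token / 1024)         # 144.0 KB per token

# At the full 32,768-token context:
print(kv_bytes_per_token * 32768 / 1e9)  # ~4.83 GB
```

With the full 32 KV heads the same context would need roughly 4x that (~19 GB), which is why GQA matters for long contexts on memory-constrained devices.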

Quantization Format: 1-bit g128

Each weight is a single bit: 0 maps to −scale, 1 maps to +scale. Every group of 128 weights shares one FP16 scale factor.

MLX's quantization formats generally store both a scale and a bias per group: w = mlx_scale * bit + mlx_bias. To pack our scale-only 1-bit weights into this format:

mlx_scale = 2 * original_scale
mlx_bias  = -original_scale

This reconstructs −scale when bit=0 and +scale when bit=1. Because MLX stores two FP16 values per group (scale + bias) instead of one, the effective bits per weight is slightly higher than in the GGUF format:

  • MLX 1-bit g128: 1.25 bpw (1 sign bit + two 16-bit values amortized over 128 weights)
  • GGUF Q1_0_g128: 1.125 bpw (1 sign bit + one 16-bit scale amortized over 128 weights)
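
The mapping above can be checked numerically. A minimal numpy sketch; the mean-absolute-value scale below is an illustrative choice, since the model card does not specify how the per-group scales are computed:

```python
import numpy as np

GROUP = 128  # group size: weights in a group share one scale
rng = np.random.default_rng(0)

# Toy weights for one group, quantized to sign bits plus one shared scale.
w = rng.standard_normal(GROUP).astype(np.float32)
scale = np.abs(w).mean()           # illustrative scale choice (not specified in the card)
bits = (w >= 0).astype(np.uint8)   # 1 bit per weight: the sign

# Scale-only reconstruction: bit 0 -> -scale, bit 1 -> +scale.
w_hat = np.where(bits == 1, scale, -scale)

# MLX's affine form w = mlx_scale * bit + mlx_bias, with mlx_scale = 2*scale
# and mlx_bias = -scale, reproduces exactly the same values.
mlx_scale = 2.0 * scale
mlx_bias = -scale
w_hat_mlx = mlx_scale * bits.astype(np.float32) + mlx_bias
assert np.allclose(w_hat, w_hat_mlx)

# Effective storage: 128 sign bits + two FP16 values per group = 1.25 bits/weight.
assert (GROUP + 2 * 16) / GROUP == 1.25
```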

Memory Requirement

Parameter memory only (weights and scales loaded into memory):

| Format | Size | Reduction | Ratio |
| --- | --- | --- | --- |
| FP16 | 8.04 GB | | 1.0x |
| MLX 1-bit g128 | 0.63 GB | 92.2% | 12.8x |
| GGUF Q1_0_g128 | 0.57 GB | 93.0% | 14.2x |

The model directory on disk is 0.64 GB (16 MB larger) because it also includes tokenizer, config, and other metadata files alongside the weights.
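
These sizes follow directly from the bits-per-weight figures. A quick back-of-the-envelope check (approximate: the table's figures are measured on the actual checkpoint, which carries a little extra per-tensor metadata):

```python
params = 4.0e9  # all 4.0B weights are 1-bit, including embeddings and the LM head

def size_gb(bits_per_weight: float) -> float:
    """Parameter memory in GB for a given effective bits per weight."""
    return params * bits_per_weight / 8 / 1e9

fp16 = size_gb(16.0)    # 8.0 GB    (table: 8.04 GB)
mlx = size_gb(1.25)     # 0.625 GB  (table: 0.63 GB, MLX 1-bit g128)
gguf = size_gb(1.125)   # 0.5625 GB (table: 0.57 GB, GGUF Q1_0_g128)

print(round(fp16 / mlx, 1), round(fp16 / gguf, 1))  # 12.8 14.2
```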

Best Practices

Generation Parameters

| Parameter | Default | Suggested range |
| --- | --- | --- |
| Temperature | 0.5 | 0.5 – 0.7 |
| Top-k | 20 | 20 – 40 |
| Top-p | 0.9 | 0.85 – 0.95 |
| Repetition penalty | 1.0 | |
| Presence penalty | 0.0 | |
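
These knobs compose in a fixed order: temperature rescales the logits, top-k caps the candidate set, and top-p then keeps the smallest prefix of that set covering probability p. A toy numpy sampler illustrating the interaction (a sketch, not the mlx-lm implementation):

```python
import numpy as np

def sample(logits, temperature=0.5, top_k=20, top_p=0.9, rng=None):
    """Toy temperature + top-k + top-p (nucleus) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # softmax with overflow guard
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # tokens from most to least likely
    order = order[:top_k]            # top-k: keep the k best candidates
    cum = np.cumsum(probs[order])
    # top-p: smallest prefix whose cumulative probability reaches p (at least 1 token).
    keep = order[: max(1, np.searchsorted(cum, top_p) + 1)]

    p = probs[keep] / probs[keep].sum()  # renormalize over the kept tokens
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
token = sample(logits, rng=np.random.default_rng(0))
```

Lower temperature sharpens the distribution, so the nucleus at a fixed top-p shrinks; that is why a low default temperature (0.5) pairs naturally with a fairly tight top-p.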

System Prompt

You can use a simple system prompt such as:

You are a helpful assistant

Quickstart

MLX (Python)

Requires the PrismML fork of MLX with 1-bit kernel support (upstream PR pending):

pip install mlx-lm
pip install "mlx @ git+https://github.com/PrismML-Eng/mlx.git@prism"

from mlx_lm import load, generate

model, tokenizer = load("prism-ml/Bonsai-4B-mlx-1bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=256,
)
print(response)

MLX Swift (iOS / macOS)

1-bit Bonsai 4B runs natively on iPhone and iPad via MLX Swift. Requires our mlx-swift fork with 1-bit kernels (upstream PR pending).

Throughput (MLX / Apple Silicon)

| Platform | Backend | TG128 (tok/s) | FP16 TG (tok/s) | TG vs FP16 | PP512 (tok/s) | FP16 PP512 (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| M4 Pro 48 GB | MLX (Python) | 132 | 28 | 4.8x | 806 | 728 |
| M4 Pro 48 GB | llama.cpp Metal | 136 | 29 | 4.7x | 915 | 915 |
| iPhone 17 Pro Max | MLX Swift | 60 | | | 651 | |

Citation

If you use 1-bit Bonsai 4B, please cite:

@techreport{bonsai,
    title   = {Bonsai: End-to-End 1-bit Language Model Deployment
               Across Apple, GPU, and Mobile Runtimes},
    author  = {Prism ML},
    year    = {2026},
    month   = {March},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
