Bonsai-8B-mlx-1bit
End-to-end 1-bit language model for Apple Silicon
12.8x smaller than FP16 | 8.4x faster on M4 Pro | 44 tok/s on iPhone | runs on Mac, iPhone, iPad
Highlights
- 1.28 GB parameter memory (down from 16.38 GB FP16) — runs comfortably on any Mac or iPhone
- End-to-end 1-bit weights across embeddings, attention projections, MLP projections, and LM head
- MLX-native format (1-bit g128) with inline dequantization kernels — no FP16 materialization
- Competitive benchmarks: 70.5 avg score across 6 categories, matching full-precision 8B models at 1/14th the size
- Cross-platform companion: also available as GGUF Q1_0_g128 for llama.cpp
Resources
- Google Colab — try Bonsai in your browser, no setup required
- Whitepaper — full details on Bonsai's 1-bit training and quantization approach
- Demo repo — comprehensive examples for serving, benchmarking, and integrating Bonsai
- Discord — join the community for support, discussion, and updates
- 1-bit kernels: MLX fork (Apple Silicon) · mlx-swift fork (iOS/macOS) · llama.cpp fork (CUDA + Metal)
- Locally AI — our partner for running Bonsai on iPhone
Model Overview
| Item | Specification |
|---|---|
| Parameters | 8.19B (~6.95B non-embedding) |
| Architecture | Qwen3-8B dense: GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm |
| Layers | 36 Transformer decoder blocks |
| Context length | 65,536 tokens |
| Vocab size | 151,936 |
| Weight format | MLX 1-bit g128 |
| Deployed size | 1.28 GB (12.8x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |
Quantization Format: 1-bit g128
Each weight is a single bit: 0 maps to −scale, 1 maps to +scale. Every group of 128 weights shares one FP16 scale factor.
MLX's quantization formats generally store both a scale and a bias per group: w = mlx_scale * bit + mlx_bias. To pack our scale-only 1-bit weights into this format:
```
mlx_scale = 2 * original_scale
mlx_bias = -original_scale
```
This reconstructs −scale when bit=0 and +scale when bit=1. Because MLX stores two FP16 values per group (scale + bias) instead of one, the effective bits per weight is slightly higher than the GGUF format:
- MLX 1-bit g128: 1.25 bpw (1 sign bit + two 16-bit values amortized over 128 weights)
- GGUF Q1_0_g128: 1.125 bpw (1 sign bit + one 16-bit scale amortized over 128 weights)
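The scale-to-(scale, bias) mapping and the bits-per-weight arithmetic above can be checked with a short sketch. The helper names (`to_mlx_params`, `dequant_mlx`) are illustrative, not the actual MLX API:

```python
# Sketch of packing scale-only 1-bit weights into MLX's affine (scale, bias)
# group format, as described above. Helper names are illustrative only.

def dequant_mlx(bit: int, mlx_scale: float, mlx_bias: float) -> float:
    """MLX-style affine dequantization: w = mlx_scale * bit + mlx_bias."""
    return mlx_scale * bit + mlx_bias

def to_mlx_params(original_scale: float) -> tuple[float, float]:
    """Map a scale-only 1-bit group (w = ±scale) to MLX's (scale, bias)."""
    return 2 * original_scale, -original_scale

s = 0.037  # one FP16 scale shared by a group of 128 weights (example value)
mlx_scale, mlx_bias = to_mlx_params(s)
assert dequant_mlx(0, mlx_scale, mlx_bias) == -s  # bit = 0 -> -scale
assert dequant_mlx(1, mlx_scale, mlx_bias) == +s  # bit = 1 -> +scale

# Effective bits per weight for a group of 128:
mlx_bpw = 1 + 2 * 16 / 128   # sign bit + FP16 scale + FP16 bias -> 1.25
gguf_bpw = 1 + 16 / 128      # sign bit + FP16 scale             -> 1.125
```

This makes explicit why the MLX format costs an extra 0.125 bpw over GGUF: one additional FP16 value (the bias) amortized over each 128-weight group.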
Memory Requirement
Parameter memory only (weights and scales loaded into memory):
| Format | Size | Reduction | Ratio |
|---|---|---|---|
| FP16 | 16.38 GB | — | 1.0x |
| MLX 1-bit g128 | 1.28 GB | 92.2% | 12.8x |
| GGUF Q1_0_g128 | 1.15 GB | 93.0% | 14.2x |
The model directory on disk is 1.30 GB (16 MB larger) because it also includes tokenizer, config, and other metadata files alongside the weights.
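The table is easy to reproduce from the parameter count and the bits-per-weight figures given earlier; a back-of-envelope check in decimal GB:

```python
# Reproduce the memory table from 8.19B parameters and bits-per-weight.
params = 8.19e9

def size_gb(bits_per_weight: float) -> float:
    """Parameter memory in decimal GB: params * bpw / 8 bits-per-byte."""
    return params * bits_per_weight / 8 / 1e9

fp16 = size_gb(16)     # ~16.38 GB
mlx = size_gb(1.25)    # ~1.28 GB (1-bit g128 with FP16 scale + bias)
gguf = size_gb(1.125)  # ~1.15 GB (1-bit g128 with FP16 scale only)

print(round(fp16, 2), round(mlx, 2), round(gguf, 2))  # 16.38 1.28 1.15
print(round(fp16 / mlx, 1), round(fp16 / gguf, 1))    # 12.8 14.2
```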
Best Practices
Generation Parameters
| Parameter | Default | Suggested range |
|---|---|---|
| Temperature | 0.5 | 0.5 -- 0.7 |
| Top-k | 20 | 20 -- 40 |
| Top-p | 0.9 | 0.85 -- 0.95 |
| Repetition penalty | 1.0 | |
| Presence penalty | 0.0 | |
System Prompt
You can use a simple system prompt such as:

```
You are a helpful assistant.
```
Quickstart
MLX (Python)
Requires PrismML fork of MLX with 1-bit kernel support (upstream PR pending):
```bash
pip install mlx-lm
pip install "mlx @ git+https://github.com/PrismML-Eng/mlx.git@prism"
```
```python
from mlx_lm import load, generate

model, tokenizer = load("prism-ml/Bonsai-8B-mlx-1bit")
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms.",
    max_tokens=256,
)
print(response)
```
MLX Swift (iOS / macOS)
1-bit Bonsai 8B runs natively on iPhone and iPad via MLX Swift at 44 tok/s on iPhone 17 Pro Max. Requires our mlx-swift fork with 1-bit kernels (upstream PR pending).
Throughput (MLX / Apple Silicon)
| Platform | Backend | TG128 (tok/s) | FP16 TG (tok/s) | TG vs FP16 | PP512 (tok/s) | FP16 PP512 (tok/s) |
|---|---|---|---|---|---|---|
| M4 Pro 48 GB | MLX (Python) | 131 | 16 | 8.4x | 472 | 434 |
| M4 Pro 48 GB | llama.cpp Metal | 85 | 16 | 5.4x | 498 | 490 |
iPhone 17 Pro Max (MLX Swift)
FP16 does not fit on-device; baseline is 4-bit.
| | 1-bit (tok/s) | 4-bit (tok/s) | 1-bit vs 4-bit |
|---|---|---|---|
| Token generation | 44 | 14 | 3.1x |
| Prompt processing | 377 | 348 | 1.08x |
Energy Efficiency
| Platform | Bonsai E_tg (mWh/tok) | Baseline E_tg | Advantage |
|---|---|---|---|
| Mac M4 Pro (MLX) | 0.074 | 0.415 (FP16) | 5.6x |
| Mac M4 Pro (Metal) | 0.091 | 0.471 (FP16) | 5.1x |
| iPhone 17 Pro Max | ~0.068 | ~0.143 (4-bit) | 2.1x vs 4-bit |
Higher instantaneous power does not preclude lower energy: token generation is fast enough that energy per output token drops 5–6x on Mac, and about 2x versus the 4-bit baseline on iPhone.
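The advantage column follows directly from the measured energies (advantage = baseline / Bonsai). A quick check, computed from the rounded table values, so the last digit can differ slightly from the published column:

```python
# Sanity-check the "Advantage" column: advantage = baseline E_tg / Bonsai E_tg.
# Values are the (Bonsai, baseline) mWh/token pairs from the table above;
# the iPhone figures are estimates.
measurements = {
    "Mac M4 Pro (MLX)": (0.074, 0.415),    # baseline is FP16
    "Mac M4 Pro (Metal)": (0.091, 0.471),  # baseline is FP16
    "iPhone 17 Pro Max": (0.068, 0.143),   # baseline is 4-bit
}
advantage = {k: base / bonsai for k, (bonsai, base) in measurements.items()}

# Unit conversion: 1 mWh = 3.6 J, so 0.074 mWh/token is ~0.27 J per token.
joules_per_token = 0.074 * 3.6
```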
Benchmarks
Evaluated with EvalScope v1.4.2 + vLLM 0.15.1 on NVIDIA H100 under identical infrastructure, generation parameters, and scoring. All models are in the 6B–9B parameter range.
| Model | Company | Size | Avg | MMLU-R | MuSR | GSM8K | HE+ | IFEval | BFCL |
|---|---|---|---|---|---|---|---|---|---|
| Qwen 3 8B | Alibaba | 16 GB | 79.3 | 83 | 55 | 93 | 82.3 | 84.2 | 81 |
| RNJ 8B | EssentialAI | 16 GB | 73.1 | 75.5 | 50.4 | 93.7 | 84.2 | 73.8 | 61.1 |
| Mistral3 8B | Mistral | 16 GB | 71.0 | 73.9 | 53.8 | 87.2 | 67.4 | 75.4 | 45.4 |
| Olmo 3 7B | Allen Inst | 14 GB | 70.9 | 72 | 56.1 | 92.5 | 79.3 | 37.1 | 38.4 |
| 1-bit Bonsai 8B | PrismML | 1.15 GB | 70.5 | 65.7 | 50 | 88 | 73.8 | 79.8 | 65.7 |
| LFM2 8B | LiquidAI | 16 GB | 69.6 | 72.7 | 49.5 | 90.1 | 81 | 82.2 | 62.0 |
| Llama 3.1 8B | Meta | 16 GB | 67.1 | 72.9 | 51.3 | 87.9 | 75 | 51.5 | — |
| GLM v6 9B | ZhipuAI | 16 GB | 65.7 | 61.9 | 43.2 | 93.4 | 78.7 | 69.3 | 21.9 |
| Hermes 8B | Nous Research | 16 GB | 65.4 | 67.4 | 52.2 | 82.9 | 51.2 | 65 | 73.5 |
| Trinity Nano 6B | Arcee | 12 GB | 61.2 | 68.8 | 52.6 | 81.1 | 54 | 50 | 62.5 |
| Marin 8B | Stanford CRFM | 16 GB | 56.6 | 64.8 | 42.6 | 86.4 | 51 | 50 | — |
| R1-D 7B | DeepSeek | 14 GB | 55.1 | 62.5 | 29.1 | 92.7 | 81.7 | 48.8 | 15.4 |
Despite being 1/14th the size, 1-bit Bonsai 8B is competitive with leading full-precision 8B instruct models.
Intelligence Density
Intelligence density captures the ratio of a model's capability to its deployed size:
alpha = -ln(1 - score/100) / size_GB
| Model | Size | Intelligence Density (1/GB) |
|---|---|---|
| 1-bit Bonsai 8B | 1.15 GB | 1.062 |
| Qwen 3 8B | 16 GB | 0.098 |
| Llama 3.1 8B | 16 GB | 0.074 |
| Mistral3 8B | 16 GB | 0.077 |
Bonsai 8B achieves 10.8x higher intelligence density than full-precision Qwen 3 8B.
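The density figures can be recomputed directly from the average benchmark scores and deployed sizes in the tables above:

```python
# Intelligence density: alpha = -ln(1 - score/100) / size_GB,
# using average benchmark score and deployed size from the tables above.
import math

def density(score: float, size_gb: float) -> float:
    return -math.log(1 - score / 100) / size_gb

bonsai = density(70.5, 1.15)  # ~1.062
qwen = density(79.3, 16)      # ~0.098
print(round(bonsai, 3), round(qwen, 3), round(bonsai / qwen, 1))
```

The log transform rewards closing the remaining gap to a perfect score, so a higher-scoring model is not penalized linearly; the denominator then normalizes by deployed size.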
Use Cases
- On-device assistants: interactive AI on Mac, iPhone, and iPad with low latency and strong privacy
- Mobile deployment: runs on a wide variety of phones due to low memory footprint
- Edge robotics and autonomy: compact deployment on devices with thermal, memory, or connectivity constraints
- Cost-sensitive GPU serving: higher throughput and lower energy per token on commodity GPU deployments
- Enterprise and private inference: local or controlled-environment inference for data residency requirements
Limitations
- No native 1-bit hardware exists yet — current gains are software-kernel optimizations on general-purpose hardware
- Mobile power measurement is estimated (Xcode Power Profiler) rather than hardware-metered
- The full-precision benchmark frontier continues to advance; the 1-bit methodology is architecture-agnostic and will be applied to newer bases
Citation
If you use 1-bit Bonsai 8B, please cite:
```bibtex
@techreport{bonsai8b,
  title  = {1-bit Bonsai 8B: End-to-End 1-bit Language Model Deployment
            Across Apple, GPU, and Mobile Runtimes},
  author = {Prism ML},
  year   = {2026},
  month  = {March},
  url    = {https://prismml.com}
}
```
Contact
For questions, feedback, or collaboration inquiries: contact@prismml.com