# Bonsai Llamafiles
Self-contained, portable llamafile executables for PrismML's 1-bit Bonsai model family. Download a single file, make it executable, and run: no dependencies, no Python, no package managers.
These llamafiles are built for CPU-only inference using the Cosmopolitan toolchain, and include support for PrismML's custom Q1_0_g128 1-bit quantization format.
## Available Models
| File | Parameters | GGUF Size | Llamafile Size | Architecture |
|---|---|---|---|---|
| Bonsai-1.7B.llamafile | 1.7B | 237 MB | 267 MB | Qwen3-1.7B |
| Bonsai-4B.llamafile | 4.0B | 546 MB | 576 MB | Qwen3-4B |
| Bonsai-8B.llamafile | 8.19B | 1.1 GB | 1.2 GB | Qwen3-8B |
All models use the Q1_0_g128 quantization format: every weight is a single bit, with one FP16 scale factor shared across each group of 128 weights (an effective 1.125 bits/weight). This achieves a ~14x size reduction compared to FP16.
## Quickstart

```sh
# Download (pick your size)
wget https://huggingface.co/Zetaphor/Bonsai-llamafile/resolve/main/Bonsai-8B.llamafile

# Make executable
chmod +x Bonsai-8B.llamafile

# Run (launches TUI chat + HTTP server)
./Bonsai-8B.llamafile
```
## Usage Modes
### Interactive Chat (default)

```sh
./Bonsai-8B.llamafile
```

Opens a combined TUI chat interface and an HTTP server on port 8080.
### Server Only (OpenAI-compatible API)

```sh
./Bonsai-8B.llamafile --server --host 0.0.0.0 --port 8080
```

Starts an OpenAI-compatible HTTP server. Access the web UI at http://localhost:8080 or use the API:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bonsai",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
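Sampling settings can also be overridden per request. A hedged sketch: the fields below (`temperature`, `top_p`, `max_tokens`) are standard OpenAI-compatible parameters, and support for each depends on the server build; the payload is built in a variable so it can be sanity-checked locally before sending.

```shell
# Hypothetical request with per-request sampling overrides
PAYLOAD='{
  "model": "bonsai",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256
}'

# Validate the JSON body before sending it
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send to a running server (assumes the default port from above)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```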
### CLI Mode (single prompt)

```sh
./Bonsai-8B.llamafile --cli -p "Explain quantum computing in simple terms."
```
## Embedded Parameters
Each llamafile is pre-configured with the following generation parameters (recommended by PrismML):
| Parameter | Value |
|---|---|
| Temperature | 0.5 |
| Top-k | 20 |
| Top-p | 0.9 |
These can be overridden on the command line, e.g. `--temp 0.7 --top-k 40`.
## How These Were Built

These llamafiles were built by integrating PrismML's Q1_0_g128 quantization support (from their llama.cpp fork) into the llamafile build system:

- Cherry-picked PrismML's Q1_0 commits onto the llamafile `llama.cpp` submodule
- Added generic CPU fallback definitions for the Q1_0 dot product functions
- Compiled with `cosmocc` for portable CPU-only binaries
- Packaged the GGUF weights and default arguments into the executable using `zipalign`
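The default arguments mentioned in the last step live in a small `.args` file that `zipalign` embeds next to the weights, one argument per line. A hypothetical fragment matching the parameters table above (the exact contents of the shipped files may differ):

```
-m
Bonsai-8B.gguf
--temp
0.5
--top-k
20
--top-p
0.9
...
```

The trailing `...` is llamafile's placeholder marking where any arguments passed on the command line are spliced in, which is what makes the embedded defaults overridable.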
## Original Models

These are repackaged versions of PrismML's official 1-bit Bonsai GGUF models.
For benchmarks, technical details, and the whitepaper, see PrismML's model cards linked above or visit prismml.com.
## License
Apache 2.0 (same as the original Bonsai models).