# Bonsai Llamafiles

Self-contained, portable llamafile executables for PrismML's 1-bit Bonsai model family. Download a single file, make it executable, and run: no dependencies, no Python, no package managers.

These llamafiles are built for CPU-only inference using the Cosmopolitan toolchain, and include support for PrismML's custom Q1_0_g128 1-bit quantization format.

## Available Models

| File | Parameters | GGUF Size | Llamafile Size | Architecture |
|------|------------|-----------|----------------|--------------|
| Bonsai-1.7B.llamafile | 1.7B | 237 MB | 267 MB | Qwen3-1.7B |
| Bonsai-4B.llamafile | 4.0B | 546 MB | 576 MB | Qwen3-4B |
| Bonsai-8B.llamafile | 8.19B | 1.1 GB | 1.2 GB | Qwen3-8B |

All models use the Q1_0_g128 quantization format: every weight is a single bit, with one FP16 scale factor shared across each group of 128 weights (an effective 1.125 bits/weight). This achieves roughly a 14x size reduction compared to FP16.
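The arithmetic behind those numbers can be checked in a few lines. This is just a back-of-the-envelope sketch; the only inputs are the group size and the FP16 scale described above:

```python
# Storage cost of Q1_0_g128: each group of 128 weights stores
# 128 one-bit values plus one FP16 (16-bit) scale factor.

GROUP_SIZE = 128
BITS_PER_WEIGHT = 1
SCALE_BITS = 16  # one FP16 scale per group

def effective_bits_per_weight(group_size=GROUP_SIZE):
    """Total bits stored for one group, divided by the number of weights."""
    return (group_size * BITS_PER_WEIGHT + SCALE_BITS) / group_size

bits = effective_bits_per_weight()   # 1.125 bits/weight
ratio = 16 / bits                    # ~14.2x smaller than FP16
print(f"{bits} bits/weight, {ratio:.1f}x smaller than FP16")
```

The 30 MB gap between the GGUF and llamafile sizes in the table is the embedded runtime itself.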

## Quickstart

```sh
# Download (pick your size)
wget https://huggingface.co/Zetaphor/Bonsai-llamafile/resolve/main/Bonsai-8B.llamafile

# Make executable
chmod +x Bonsai-8B.llamafile

# Run (launches TUI chat + HTTP server)
./Bonsai-8B.llamafile
```

## Usage Modes

### Interactive Chat (default)

```sh
./Bonsai-8B.llamafile
```

Opens a combined TUI chat interface and HTTP server on port 8080.

### Server Only (OpenAI-compatible API)

```sh
./Bonsai-8B.llamafile --server --host 0.0.0.0 --port 8080
```

Starts an OpenAI-compatible HTTP server. Access the web UI at http://localhost:8080 or use the API:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bonsai",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
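The same request can be issued from Python with only the standard library. A minimal sketch; the endpoint path, model name, and payload fields are taken from the curl example above, and the server is assumed to be running on localhost:8080:

```python
# Build a request for the OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

def build_chat_request(content, model="bonsai", base_url="http://localhost:8080"):
    """Construct the HTTP request; sending it is left to the caller."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello!")
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library should also work by pointing its base URL at the server.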

### CLI Mode (single prompt)

```sh
./Bonsai-8B.llamafile --cli -p "Explain quantum computing in simple terms."
```

## Embedded Parameters

Each llamafile is pre-configured with the following generation parameters (recommended by PrismML):

| Parameter | Value |
|-----------|-------|
| Temperature | 0.5 |
| Top-k | 20 |
| Top-p | 0.9 |

These can be overridden from the command line, e.g. `--temp 0.7 --top-k 40`.
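To illustrate how these three parameters interact, here is a toy filter in Python. It is not llamafile's actual sampling code, just the conventional temperature / top-k / top-p (nucleus) pipeline those flags refer to:

```python
import math

def sample_filter(logits, temperature=0.5, top_k=20, top_p=0.9):
    """Apply temperature, keep the top-k tokens, then nucleus (top-p) filter.
    Returns the (token_id, probability) pairs still eligible for sampling."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [(i, l / temperature) for i, l in enumerate(logits)]
    scaled.sort(key=lambda pair: pair[1], reverse=True)
    scaled = scaled[:top_k]                       # top-k cut
    z = sum(math.exp(l) for _, l in scaled)
    probs = [(i, math.exp(l) / z) for i, l in scaled]
    kept, total = [], 0.0
    for i, p in probs:                            # nucleus (top-p) cut
        kept.append((i, p))
        total += p
        if total >= top_p:
            break
    return kept

print(sample_filter([2.0, 1.0, 0.5, -1.0], top_k=3))
```

Lower temperature and smaller top-k/top-p all narrow the candidate set, which is why the conservative defaults above produce focused output.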

## How These Were Built

These llamafiles were built by integrating PrismML's Q1_0_g128 quantization support (from their llama.cpp fork) into the llamafile build system:

  1. Cherry-picked PrismML's Q1_0 commits onto the llamafile llama.cpp submodule
  2. Added generic CPU fallback definitions for the Q1_0 dot product functions
  3. Compiled with cosmocc for portable CPU-only binaries
  4. Packaged the GGUF weights and default arguments into the executable using zipalign
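For intuition about step 2, here is a rough illustration of what a generic (non-SIMD) fallback for a 1-bit group-quantized dot product looks like. This is not PrismML's actual kernel; the bit-to-sign mapping below is an assumption, and only the group structure (one scale per 128 weights) comes from the format description above:

```python
# Sketch of a scalar dot product over 1-bit weights: each weight bit
# selects +1 or -1, and one scale per 128-weight group restores magnitude.

GROUP = 128

def q1_dot(weight_bits, scales, activations):
    """Dot product of 1-bit weights (given as 0/1 ints) with activations."""
    assert len(weight_bits) == len(activations)
    total = 0.0
    for g in range(0, len(weight_bits), GROUP):
        acc = 0.0
        for b, a in zip(weight_bits[g:g+GROUP], activations[g:g+GROUP]):
            acc += a if b else -a          # bit -> {+1, -1}
        total += scales[g // GROUP] * acc  # one scale per group
    return total
```

Real kernels pack 8 weights per byte and vectorize the inner loop; the scalar form exists so the build works on any CPU the SIMD paths do not cover.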

## Original Models

These are repackaged versions of PrismML's official 1-bit Bonsai GGUF models:

For benchmarks, technical details, and the whitepaper, see PrismML's model cards linked above or visit prismml.com.

## License

Apache 2.0 (same as the original Bonsai models).
