# Bonsai Llamafiles
Self-contained, portable llamafile executables for PrismML's 1-bit Bonsai model family. Download a single file, make it executable, and run: no dependencies, no Python, no package managers.
These llamafiles are built for CPU-only inference using the Cosmopolitan toolchain, and include support for PrismML's custom Q1_0_g128 1-bit quantization format.
## Available Models
| File | Parameters | GGUF Size | Llamafile Size | Architecture |
|---|---|---|---|---|
| Bonsai-1.7B.llamafile | 1.7B | 237 MB | 267 MB | Qwen3-1.7B |
| Bonsai-4B.llamafile | 4.0B | 546 MB | 576 MB | Qwen3-4B |
| Bonsai-8B.llamafile | 8.19B | 1.1 GB | 1.2 GB | Qwen3-8B |
All models use the Q1_0_g128 quantization format: every weight is a single bit, with one FP16 scale factor shared across each group of 128 weights (an effective 1.125 bits/weight). This achieves a ~14x size reduction compared to FP16.
## Quickstart

```sh
# Download (pick your size)
wget https://huggingface.co/Zetaphor/Bonsai-llamafile/resolve/main/Bonsai-8B.llamafile

# Make executable
chmod +x Bonsai-8B.llamafile

# Run (launches TUI chat + HTTP server)
./Bonsai-8B.llamafile
```
## Usage Modes
### Interactive Chat (default)

```sh
./Bonsai-8B.llamafile
```

Opens a combined TUI chat interface and an HTTP server on port 8080.
### Server Only (OpenAI-compatible API)

```sh
./Bonsai-8B.llamafile --server --host 0.0.0.0 --port 8080
```

Starts an OpenAI-compatible HTTP server. Access the web UI at http://localhost:8080 or use the API:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bonsai",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
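Sampling settings can also be overridden per request. A hedged sketch: the fields below (`temperature`, `top_p`, `max_tokens`) are standard OpenAI-compatible parameters, and support for each depends on the server build; the payload is built in a variable so it can be sanity-checked locally before sending.

```shell
# Hypothetical request with per-request sampling overrides
PAYLOAD='{
  "model": "bonsai",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256
}'

# Validate the JSON body before sending it
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send to a running server (assumes the default port from above)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```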
### CLI Mode (single prompt)

```sh
./Bonsai-8B.llamafile --cli -p "Explain quantum computing in simple terms."
```
## Embedded Parameters
Each llamafile is pre-configured with the following generation parameters (recommended by PrismML):
| Parameter | Value |
|---|---|
| Temperature | 0.5 |
| Top-k | 20 |
| Top-p | 0.9 |
These can be overridden on the command line, e.g. `--temp 0.7 --top-k 40`.
## How These Were Built

These llamafiles were built by integrating PrismML's Q1_0_g128 quantization support (from their llama.cpp fork) into the llamafile build system:

- Cherry-picked PrismML's Q1_0 commits onto the llamafile `llama.cpp` submodule
- Added generic CPU fallback definitions for the Q1_0 dot product functions
- Compiled with `cosmocc` for portable CPU-only binaries
- Packaged the GGUF weights and default arguments into the executable using `zipalign`
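The default arguments mentioned in the last step live in a small `.args` file that `zipalign` embeds next to the weights, one argument per line. A hypothetical fragment matching the parameters table above (the exact contents of the shipped files may differ):

```
-m
Bonsai-8B.gguf
--temp
0.5
--top-k
20
--top-p
0.9
...
```

The trailing `...` is llamafile's placeholder marking where any arguments passed on the command line are spliced in, which is what makes the embedded defaults overridable.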
## Original Models

These are repackaged versions of PrismML's official 1-bit Bonsai GGUF models.
For benchmarks, technical details, and the whitepaper, see PrismML's model cards linked above or visit prismml.com.
## License
Apache 2.0 (same as the original Bonsai models).