---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt eval | Generation | Board |
|-------|--------------|----------|-------------|------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 tok/s | 1.1 tok/s | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 tok/s | 1.6 tok/s | Jetson Nano 4GB |

That is an 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of RAM shared between the CPU and GPU.
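The 1.1 GB on-disk figure is roughly what the format predicts. As a back-of-the-envelope check, assume Q1_0_g128 stores one bit per weight plus one fp16 scale per 128-weight group (an assumption about the layout for illustration, not taken from the GGUF spec):

```python
# Rough size estimate for an 8B-parameter 1-bit model.
# Assumed layout (hypothetical): 1 bit per weight, plus one fp16
# scale per 128-weight group (the "g128" in Q1_0_g128).
params = 8e9

weight_bytes = params / 8          # 1 bit per weight
scale_bytes  = params / 128 * 2    # one 2-byte scale per group of 128
total_gb = (weight_bytes + scale_bytes) / 1e9

print(f"~{total_gb:.2f} GB")       # ~1.12 GB, in line with 1.1 GB on disk
```

The group scales add only ~12% on top of the raw 1-bit weights, which is why the 8B model still fits comfortably in the Nano's 4 GB of shared memory.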

## What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

1. **C++17 to C++14**: replaced `if constexpr`, `std::is_same_v`, structured bindings, and fold expressions with C++14 equivalents
2. **CUDA 10.2 API stubs**: `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
3. **SM 5.3 Maxwell**: warp-size macros, MMQ parameters, flash attention disabled with stubs
4. **ARM NEON on GCC 8**: custom struct types for the broken `vld1q_*_x*` intrinsics
5. **Linker**: `-lstdc++fs` for `std::filesystem`
6. **Critical correctness fix**: a fold expression in `binbcast.cu` that was silently computing nothing
7. **Build system**: `CUDA_STANDARD 14`, flash attention template exclusion

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke **all** binary operations (add, multiply, subtract, divide): the model loaded, allocated memory, ran inference, and produced complete garbage. The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) – original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) – Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) – 1-bit LLMs (Apache 2.0)