---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---
# llamita.cpp
> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.
**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that builds and runs [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).
## Results
| Model | Size on disk | RAM used | Prompt (tok/s) | Generation (tok/s) | Board |
|-------|--------------|----------|----------------|--------------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 | 1.1 | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 | 1.6 | Jetson Nano 4GB |
An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.
## What Was Changed
27 files modified, ~3,200 lines of patches across 7 categories:
1. **C++17 to C++14** – back-ported `if constexpr`, `std::is_same_v`, structured bindings, and fold expressions (first sketch below)
2. **CUDA 10.2 API stubs** – `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF` (second sketch below)
3. **SM 5.3 Maxwell** – warp size macros, MMQ parameters, flash attention disabled with stubs
4. **ARM NEON on GCC 8** – custom struct types to work around the broken `vld1q_*_x*` intrinsics
5. **Linker** – `-lstdc++fs` so `std::filesystem` resolves under GCC 8
6. **Critical correctness fix** – a `binbcast.cu` fold expression that silently computed nothing (see the next section)
7. **Build system** – `CUDA_STANDARD 14`, flash attention template exclusion
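To make the first category concrete, here is a minimal sketch of the back-port pattern. The `half_t` type and `load_as_float` helper are hypothetical, invented for illustration; the real diffs live in [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md).

```cpp
// Illustrative only, not the fork's actual code: how a C++17
// `if constexpr` branch is back-ported to C++14 tag dispatch.
#include <cstdint>
#include <cstdio>
#include <type_traits>

// Hypothetical 16-bit storage type standing in for a device half type.
struct half_t { uint16_t bits; };

// The C++17 original would read:
//   template <typename T> float load_as_float(T v) {
//       if constexpr (std::is_same_v<T, half_t>) { /* decode bits */ }
//       else                                     { return static_cast<float>(v); }
//   }

// C++14: one overload per branch, selected at compile time by a tag.
inline float load_as_float(half_t h, std::true_type /*is_half*/) {
    return h.bits / 65535.0f; // placeholder decode; real code converts properly
}
template <typename T>
inline float load_as_float(T v, std::false_type /*is_half*/) {
    return static_cast<float>(v);
}

template <typename T>
float load_as_float(T v) {
    // std::is_same<T, half_t>{} replaces std::is_same_v<T, half_t>.
    return load_as_float(v, std::is_same<T, half_t>{});
}

int main() {
    printf("%.1f %.1f\n", load_as_float(3), load_as_float(half_t{0}));
}
```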
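Category 2 in the same hedged spirit: CUDA 10.2 predates `<cuda_bf16.h>`, so one plausible approach is a storage-only stand-in with the right size, just enough for bf16-touching code paths to compile even though they never run on Maxwell. A sketch of the approach, not the fork's actual stub:

```cpp
// Illustrative sketch, not the fork's actual stub.
// CUDART_VERSION comes from cuda_runtime_api.h; CUDA 11 is 11000.
#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
struct nv_bfloat16 {
    unsigned short x; // raw bfloat16 bits; no arithmetic defined
};
#else
#include <cuda_bf16.h> // real type on CUDA 11+
#endif
```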
## The Bug That Broke Everything
During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke **all** binary operations (add, multiply, subtract, divide): the model loaded, allocated memory, and ran inference, yet produced complete garbage. The fix was one line. The sketch below reproduces the failure pattern in plain C++.
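This is not the actual `binbcast.cu` code, just a minimal reproduction of the same class of mistake: a C++17 pack-expanding fold replaced by a no-op, next to the C++14 expander idiom that restores the behavior.

```cpp
#include <cstdio>

// C++17 original pattern: a fold expression applies the operation to
// every element of the parameter pack:
//   template <typename... Ts>
//   void apply_all(Ts&... xs) { ((xs = xs * 2), ...); }

// Broken "port": the fold was replaced with a no-op. It still compiles
// and runs, but computes nothing.
template <typename... Ts>
void apply_all_broken(Ts&... /*xs*/) { (void)0; }

// Correct C++14 port: expand the pack into a dummy initializer list,
// evaluating the operation once per element.
template <typename... Ts>
void apply_all_fixed(Ts&... xs) {
    int expand[] = { 0, ((xs = xs * 2), 0)... };
    (void)expand; // silence unused-variable warnings
}

int main() {
    int a = 1, b = 2, c = 3;
    apply_all_broken(a, b, c);
    printf("broken: %d %d %d\n", a, b, c); // 1 2 3 -> garbage downstream
    apply_all_fixed(a, b, c);
    printf("fixed:  %d %d %d\n", a, b, c); // 2 4 6
}
```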
## Links
- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)
## Credits
- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) – original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) – Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) – 1-bit LLMs (Apache 2.0)