---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---
# llamita.cpp
> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.
**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that builds and runs [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).
## Results
| Model | Size on disk | RAM used | Prompt (tok/s) | Generation (tok/s) | Board |
|-------|--------------|----------|----------------|--------------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 | 1.1 | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 | 1.6 | Jetson Nano 4GB |
An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.
## What Was Changed
27 files modified, ~3,200 lines of patches across 7 categories:
1. **C++17 to C++14** – back-ported `if constexpr`, `std::is_same_v`, structured bindings, and fold expressions (first sketch below)
2. **CUDA 10.2 API stubs** – `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF` (second sketch below)
3. **SM 5.3 Maxwell** – warp size macros, MMQ parameters, flash attention disabled with stubs
4. **ARM NEON on GCC 8** – custom struct types to work around the broken `vld1q_*_x*` intrinsics
5. **Linker** – `-lstdc++fs` so `std::filesystem` resolves under GCC 8
6. **Critical correctness fix** – a `binbcast.cu` fold expression that silently computed nothing (see the next section)
7. **Build system** – `CUDA_STANDARD 14`, flash attention template exclusion
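To make the first category concrete, here is a minimal sketch of the back-port pattern. The `half_t` type and `load_as_float` helper are hypothetical, invented for illustration; the real diffs live in [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md).

```cpp
// Illustrative only, not the fork's actual code: how a C++17
// `if constexpr` branch is back-ported to C++14 tag dispatch.
#include <cstdint>
#include <cstdio>
#include <type_traits>

// Hypothetical 16-bit storage type standing in for a device half type.
struct half_t { uint16_t bits; };

// The C++17 original would read:
//   template <typename T> float load_as_float(T v) {
//       if constexpr (std::is_same_v<T, half_t>) { /* decode bits */ }
//       else                                     { return static_cast<float>(v); }
//   }

// C++14: one overload per branch, selected at compile time by a tag.
inline float load_as_float(half_t h, std::true_type /*is_half*/) {
    return h.bits / 65535.0f; // placeholder decode; real code converts properly
}
template <typename T>
inline float load_as_float(T v, std::false_type /*is_half*/) {
    return static_cast<float>(v);
}

template <typename T>
float load_as_float(T v) {
    // std::is_same<T, half_t>{} replaces std::is_same_v<T, half_t>.
    return load_as_float(v, std::is_same<T, half_t>{});
}

int main() {
    printf("%.1f %.1f\n", load_as_float(3), load_as_float(half_t{0}));
}
```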
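Category 2 in the same hedged spirit: CUDA 10.2 predates `<cuda_bf16.h>`, so one plausible approach is a storage-only stand-in with the right size, just enough for bf16-touching code paths to compile even though they never run on Maxwell. A sketch of the approach, not the fork's actual stub:

```cpp
// Illustrative sketch, not the fork's actual stub.
// CUDART_VERSION comes from cuda_runtime_api.h; CUDA 11 is 11000.
#if defined(CUDART_VERSION) && CUDART_VERSION < 11000
struct nv_bfloat16 {
    unsigned short x; // raw bfloat16 bits; no arithmetic defined
};
#else
#include <cuda_bf16.h> // real type on CUDA 11+
#endif
```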
## The Bug That Broke Everything
During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke **all** binary operations (add, multiply, subtract, divide): the model loaded, allocated memory, and ran inference, yet produced complete garbage. The fix was one line. The sketch below reproduces the failure pattern in plain C++.
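This is not the actual `binbcast.cu` code, just a minimal reproduction of the same class of mistake: a C++17 pack-expanding fold replaced by a no-op, next to the C++14 expander idiom that restores the behavior.

```cpp
#include <cstdio>

// C++17 original pattern: a fold expression applies the operation to
// every element of the parameter pack:
//   template <typename... Ts>
//   void apply_all(Ts&... xs) { ((xs = xs * 2), ...); }

// Broken "port": the fold was replaced with a no-op. It still compiles
// and runs, but computes nothing.
template <typename... Ts>
void apply_all_broken(Ts&... /*xs*/) { (void)0; }

// Correct C++14 port: expand the pack into a dummy initializer list,
// evaluating the operation once per element.
template <typename... Ts>
void apply_all_fixed(Ts&... xs) {
    int expand[] = { 0, ((xs = xs * 2), 0)... };
    (void)expand; // silence unused-variable warnings
}

int main() {
    int a = 1, b = 2, c = 3;
    apply_all_broken(a, b, c);
    printf("broken: %d %d %d\n", a, b, c); // 1 2 3 -> garbage downstream
    apply_all_fixed(a, b, c);
    printf("fixed:  %d %d %d\n", a, b, c); // 2 4 6
}
```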
## Links
- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)
## Credits
- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) – original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) – Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) – 1-bit LLMs (Apache 2.0)