---
base_model: HuggingFaceTB/SmolLM3-3B
language:
- en
library_name: gguf
tags:
- quantization
- llama.cpp
- gguf
- smollm
- nim-kernels
- 4-bit
- 8-bit
- consumer-hardware
---
# SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels
Quantized GGUF builds of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
that run on consumer hardware in under 6 GB of VRAM, with custom Nim kernels for the
hot-path dequantize-and-matrix-multiply operations.

Full write-up: [Run LLMs on a 6 GB Laptop GPU & CPU](https://medium.com/@devbyankit/run-llms-on-a-6-gb-laptop-gpu-cpu-smollm-3-quantization-nim-kernels-6c9cbb233930) on Medium.
---
## What's in this repo
| File | Format | Approx. size | Use case |
|------|--------|--------------|----------|
| `smolLM3-q4_k_m.gguf` | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| `smolLM3-q8_0.gguf` | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |
The 4-bit build stores weights in roughly 4 bits instead of 16, cutting the memory
footprint ~75% from the FP16 baseline (≈3 B parameters × 2 bytes ≈ 6 GB).
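
As a rough sanity check on the sizes above, here is a back-of-envelope estimate; the
bits-per-weight figures (~4.85 for Q4_K_M, 8.5 for Q8_0) are typical llama.cpp values,
not exact per-tensor accounting, and the parameter count is approximate:

```cpp
// Back-of-envelope GGUF size estimate (a sketch, not exact accounting).
#include <cstdio>

int main() {
    const double params = 3.08e9;            // SmolLM3-3B parameters (approx.)
    auto gb = [&](double bits_per_weight) {  // model size in gigabytes
        return params * bits_per_weight / 8 / 1e9;
    };
    std::printf("FP16   : %.1f GB\n", gb(16.0)); // ~6.2 GB baseline
    std::printf("Q4_K_M : %.1f GB\n", gb(4.85)); // ~1.9 GB
    std::printf("Q8_0   : %.1f GB\n", gb(8.5));  // ~3.3 GB (32 int8 weights
    return 0;                                    //  + fp16 scale per block)
}
```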
---
## Performance (512-token prompt, averaged over 3 runs)
| Configuration | Prompt processing (tok/s) | Text generation (tok/s) | VRAM / RAM |
|---------------|---------------------------|-------------------------|------------|
| 4-bit GPU | 11.25 | **14.60** | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 | 12.65 | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 | 15.37 | ~3–4 GB RAM |
| 8-bit CPU | 7.79 | 11.83 | ~3–4 GB RAM |
Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.
---
## How to run
```bash
# 4-bit model, fully offloaded to the GPU (-ngl 36 offloads all layers)
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit model, CPU only (-ngl 0 keeps every layer on the CPU)
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```
Requires [llama.cpp](https://github.com/ggerganov/llama.cpp) 0.3.14 or newer; GPU runs
need a build with CUDA support.
---
## Custom Nim Kernels
The repo ships two companion DLLs, `libsmolkernels_q4.dll` and `libsmolkernels_q8.dll`,
written in Nim and compiled with `-O3` and ORC memory management. They implement a
custom dequantization + matrix-multiply path (`mulQ4Mat` / `mulQ8Mat`); a patched
`main.cpp` loads the matching DLL at runtime based on the detected model type:
```
llama-cli.exe detects "q4_k_m" in filename → loads libsmolkernels_q4.dll → mulQ4Mat()
llama-cli.exe detects "q8_0"   in filename → loads libsmolkernels_q8.dll → mulQ8Mat()
```
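
For illustration, a minimal C++ sketch of that dispatch on Windows; the `MulFn`
signature here is an assumption made for the sketch, not the DLLs' actual export
signature:

```cpp
// Minimal sketch of the runtime kernel dispatch (Windows).
// The kernel signature below is assumed, not the DLLs' real export.
#include <windows.h>
#include <cstdio>
#include <cstring>

// Assumed export: dequantize a quantized weight matrix and multiply it
// against an FP32 activation vector.
using MulFn = void (*)(const void* qweights, const float* x,
                       float* out, int rows, int cols);

int main(int argc, char** argv) {
    const char* model = (argc > 1) ? argv[1] : "smolLM3-q4_k_m.gguf";

    // Choose the companion DLL from the quantization tag in the filename.
    const bool  isQ4  = std::strstr(model, "q4_k_m") != nullptr;
    const char* dll   = isQ4 ? "libsmolkernels_q4.dll" : "libsmolkernels_q8.dll";
    const char* entry = isQ4 ? "mulQ4Mat" : "mulQ8Mat";

    HMODULE lib = LoadLibraryA(dll);
    if (!lib) { std::fprintf(stderr, "failed to load %s\n", dll); return 1; }

    auto mul = reinterpret_cast<MulFn>(GetProcAddress(lib, entry));
    if (!mul) { std::fprintf(stderr, "%s not found in %s\n", entry, dll); return 1; }

    std::printf("loaded %s -> %s\n", dll, entry);
    // ...the patched hot path would now call mul(...) instead of the
    // built-in matrix multiply for this tensor type.
    FreeLibrary(lib);
    return 0;
}
```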
Full build instructions, kernel source code, and the main.cpp patch are in the
Medium article linked above.
---
## System requirements
- NVIDIA GPU 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.x+
- Windows (the kernel loader uses DLLs); a Linux port via `.so` shared objects is straightforward
---
## Related project
This work is the research foundation for
[QBench CLI](https://github.com/AnkitTsj/qbench), a C++ command-line tool that
automates LLM quantization workflows, model selection, and hardware compatibility
checks using the same quantization and kernel techniques developed here.