# SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels
Quantized versions of SmolLM3-3B that run on consumer hardware in under 6 GB of VRAM, with hot-path matrix-multiply kernels written in Nim.
Full write-up: *Run LLMs on a 6 GB Laptop GPU & CPU* on Medium
## What's in this repo
| File | Format | Approx. size | Use case |
|---|---|---|---|
| smolLM3-q4_k_m.gguf | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| smolLM3-q8_0.gguf | Q8_0 | ~3.3 GB | Near FP16 quality, fits 6 GB |
Memory footprint is reduced ~75% from the FP16 baseline.
## Performance (512-token prompt, averaged over 3 runs)
| Configuration | Prompt Processing | Text Generation | VRAM / RAM |
|---|---|---|---|
| 4-bit GPU | 11.25 tok/s | 14.60 tok/s | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 tok/s | 12.65 tok/s | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 tok/s | 15.37 tok/s | ~3–4 GB RAM |
| 8-bit CPU | 7.79 tok/s | 11.83 tok/s | ~3–4 GB RAM |
Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.
## How to run
```shell
# 4-bit, GPU offload (36 layers)
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit, CPU only
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```
Requires llama.cpp 0.3.14+ built with CUDA support for GPU runs.
## Custom Nim Kernels
The repo ships two companion DLLs, libsmolkernels_q4.dll and libsmolkernels_q8.dll, written in Nim and compiled with -O3 and ORC memory management. They implement a custom dequantization + matrix-multiply path (mulQ4Mat / mulQ8Mat) that a patched main.cpp loads dynamically at runtime based on the detected model type.
```
llama-cli.exe detects "q4_k_m" in filename → loads libsmolkernels_q4.dll → mulQ4Mat()
llama-cli.exe detects "q8_0"   in filename → loads libsmolkernels_q8.dll → mulQ8Mat()
```
Full build instructions, kernel source code, and the main.cpp patch are in the
Medium article linked above.
## System requirements
- NVIDIA GPU 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.x+
- Windows (DLL-based kernel loader); a Linux port would be straightforward with .so equivalents
## Related project
This work is the research foundation for QBench CLI, a C++ command-line tool that automates LLM quantization workflows, model selection, and hardware compatibility checks using the same quantization and kernel techniques developed here.
## Model tree for devankit/smollm3_q4
- Base model: HuggingFaceTB/SmolLM3-3B-Base