---
base_model: HuggingFaceTB/SmolLM3-3B
language:
- en
library_name: gguf
tags:
- quantization
- llama.cpp
- gguf
- smollm
- nim-kernels
- 4-bit
- 8-bit
- consumer-hardware
---

# SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels

Quantized versions of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) 
running on consumer hardware under 6 GB VRAM, with custom matrix kernels written in Nim 
for hot-path matrix-multiply operations.

Full write-up: [Run LLMs on a 6 GB Laptop GPU & CPU (Medium)](https://medium.com/@devbyankit/run-llms-on-a-6-gb-laptop-gpu-cpu-smollm-3-quantization-nim-kernels-6c9cbb233930)

---

## What's in this repo

| File | Format | Approx. size | Use case |
|------|--------|--------------|----------|
| smolLM3-q4_k_m.gguf | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| smolLM3-q8_0.gguf | Q8_0 | ~3.3 GB | Near FP16 quality, fits 6 GB |

The 4-bit file reduces the memory footprint by roughly 75% relative to the FP16 baseline.
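
The reduction can be sanity-checked from the file sizes in the table. The sketch below is illustrative only: it assumes an FP16 baseline of 2 bytes per parameter and roughly 3 billion parameters (inferred from the model name, not measured), and both function names are hypothetical.

```cpp
#include <cassert>

// Rough FP16 size in GB: 2 bytes per parameter, ignoring metadata overhead.
// ASSUMPTION: a value of about 3.0 for paramsBillions is inferred from the
// model name, not taken from a measurement.
double fp16SizeGb(double paramsBillions) {
    return paramsBillions * 2.0;  // 1e9 params * 2 bytes = 2 GB per billion
}

// Percentage by which a quantized file shrinks relative to a baseline.
// Both sizes must be expressed in the same unit.
double reductionPercent(double baselineGb, double quantizedGb) {
    assert(baselineGb > 0.0);
    return (1.0 - quantizedGb / baselineGb) * 100.0;
}
```

Under these assumptions, `reductionPercent(fp16SizeGb(3.0), 1.9)` comes out at roughly 68 and the same call with 3.3 at roughly 45. Landing somewhat below the nominal 16-bit to 4-bit ratio is expected, since Q4_K_M stores per-block scales alongside the 4-bit weights and keeps some tensors at higher precision.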

---

## Performance (512-token prompt, averaged over 3 runs)

| Configuration | Prompt Processing | Text Generation | VRAM / RAM |
|--------------|-------------------|-----------------|------------|
| 4-bit GPU | 11.25 tok/s | **14.60 tok/s** | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 tok/s | 12.65 tok/s | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 tok/s | 15.37 tok/s | ~3–4 GB RAM |
| 8-bit CPU | 7.79 tok/s | 11.83 tok/s | ~3–4 GB RAM |

Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.

---

## How to run

```bash
# 4-bit GPU
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit CPU
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```

Requires [llama.cpp](https://github.com/ggerganov/llama.cpp) 0.3.14+. GPU runs 
additionally need a build with CUDA support.

---

## Custom Nim Kernels

The repo ships two companion DLLs, `libsmolkernels_q4.dll` and `libsmolkernels_q8.dll`, 
written in Nim and compiled with `-O3` and ORC memory management. They implement a 
custom dequantization + matrix-multiply path (`mulQ4Mat` / `mulQ8Mat`) that a patched 
`main.cpp` loads dynamically at runtime, based on the model type detected from the filename:

- `llama-cli.exe` detects `q4_k_m` in the filename → loads `libsmolkernels_q4.dll` → calls `mulQ4Mat()`
- `llama-cli.exe` detects `q8_0` in the filename → loads `libsmolkernels_q8.dll` → calls `mulQ8Mat()`

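
The filename-based selection described above can be sketched as a small pure function. This is a hypothetical illustration, not the actual `main.cpp` patch: the name `selectKernelLibrary`, the case-insensitive matching, and the empty-string fallback are all assumptions of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Returns the kernel library to load for a given model path, or an empty
// string when no custom kernel applies (hypothetical fallback behaviour).
// ASSUMPTION: matching is case-insensitive, so "Q4_K_M" and "q4_k_m" are
// treated the same; the real patch may match differently.
std::string selectKernelLibrary(const std::string& modelPath) {
    assert(!modelPath.empty());  // caller is expected to pass a real path

    // Lower-case a copy of the path so the substring checks ignore case.
    std::string lower(modelPath);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    if (lower.find("q4_k_m") != std::string::npos) return "libsmolkernels_q4.dll";
    if (lower.find("q8_0") != std::string::npos) return "libsmolkernels_q8.dll";
    return "";
}
```

Keeping the decision separate from the platform call that loads the library makes the mapping easy to test, and a caller can fall back to the stock llama.cpp path when the result is empty.
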
Full build instructions, kernel source code, and the main.cpp patch are in the 
Medium article linked above.
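
For readers unfamiliar with the dequantize-then-multiply pattern, here is a minimal, unoptimized sketch of an 8-bit matrix-vector product. It is not the repo's `mulQ8Mat`: the struct and function names are hypothetical, and the block layout is simplified (GGUF Q8_0 groups 32 signed 8-bit values under one fp16 scale; a `float` scale is used here for brevity).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 32;  // values per block, as in GGUF Q8_0

// Simplified 8-bit block: one scale plus 32 signed 8-bit quants.
// ASSUMPTION: float scale instead of the fp16 scale used by real Q8_0.
struct BlockQ8 {
    float scale;
    std::int8_t quants[kBlockSize];
};

// Dequantize one block: out[i] = scale * quants[i].
void dequantizeBlock(const BlockQ8& block, float* out) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out[i] = block.scale * static_cast<float>(block.quants[i]);
    }
}

// Computes y = W * x, where W (rows x cols) is stored row-major as
// consecutive blocks of 32 quantized values. Hypothetical name.
std::vector<float> mulQ8MatVec(const std::vector<BlockQ8>& weights,
                               std::size_t rows, std::size_t cols,
                               const std::vector<float>& x) {
    assert(cols % kBlockSize == 0);
    assert(x.size() == cols);
    const std::size_t blocksPerRow = cols / kBlockSize;
    assert(weights.size() == rows * blocksPerRow);

    std::vector<float> y(rows, 0.0f);
    float tile[kBlockSize];  // scratch buffer for one dequantized block
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t b = 0; b < blocksPerRow; ++b) {
            dequantizeBlock(weights[r * blocksPerRow + b], tile);
            const float* xs = x.data() + b * kBlockSize;
            for (std::size_t i = 0; i < kBlockSize; ++i) {
                acc += tile[i] * xs[i];
            }
        }
        y[r] = acc;
    }
    return y;
}
```

Dequantizing one block at a time keeps the working set small, so the weights stay compressed in memory and only a 32-value tile is ever expanded to floats. A tuned kernel would typically vectorize the inner loop.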

---

## System requirements

- NVIDIA GPU with 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.14+
- Windows (the kernel loader is DLL-based); a Linux port should be straightforward using `.so` shared libraries

---

## Related project

This work is the research foundation for 
[QBench CLI](https://github.com/AnkitTsj/qbench), a C++ command-line tool that 
automates LLM quantization workflows, model selection, and hardware compatibility 
checks using the same quantization and kernel techniques developed here.