---
base_model: HuggingFaceTB/SmolLM3-3B
language:
- en
library_name: gguf
tags:
- quantization
- llama.cpp
- gguf
- smollm
- nim-kernels
- 4-bit
- 8-bit
- consumer-hardware
---
# SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels
Quantized GGUF builds of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
that run on consumer hardware in under 6 GB of VRAM, with custom Nim kernels for the
hot-path dequantize-and-matrix-multiply operations.

Full write-up: [Run LLMs on a 6 GB Laptop GPU & CPU](https://medium.com/@devbyankit/run-llms-on-a-6-gb-laptop-gpu-cpu-smollm-3-quantization-nim-kernels-6c9cbb233930) on Medium.
---
## What's in this repo
| File | Format | Approx. size | Use case |
|------|--------|--------------|----------|
| `smolLM3-q4_k_m.gguf` | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| `smolLM3-q8_0.gguf` | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |
The 4-bit build stores weights in roughly 4 bits instead of 16, cutting the memory
footprint ~75% from the FP16 baseline (≈3 B parameters × 2 bytes ≈ 6 GB).
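
As a rough sanity check on the sizes above, here is a back-of-envelope estimate; the
bits-per-weight figures (~4.85 for Q4_K_M, 8.5 for Q8_0) are typical llama.cpp values,
not exact per-tensor accounting, and the parameter count is approximate:

```cpp
// Back-of-envelope GGUF size estimate (a sketch, not exact accounting).
#include <cstdio>

int main() {
    const double params = 3.08e9;            // SmolLM3-3B parameters (approx.)
    auto gb = [&](double bits_per_weight) {  // model size in gigabytes
        return params * bits_per_weight / 8 / 1e9;
    };
    std::printf("FP16   : %.1f GB\n", gb(16.0)); // ~6.2 GB baseline
    std::printf("Q4_K_M : %.1f GB\n", gb(4.85)); // ~1.9 GB
    std::printf("Q8_0   : %.1f GB\n", gb(8.5));  // ~3.3 GB (32 int8 weights
    return 0;                                    //  + fp16 scale per block)
}
```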
---
## Performance (512-token prompt, averaged over 3 runs)
| Configuration | Prompt processing (tok/s) | Text generation (tok/s) | VRAM / RAM |
|---------------|---------------------------|-------------------------|------------|
| 4-bit GPU | 11.25 | **14.60** | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 | 12.65 | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 | 15.37 | ~3–4 GB RAM |
| 8-bit CPU | 7.79 | 11.83 | ~3–4 GB RAM |
Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.
---
## How to run
```bash
# 4-bit model, fully offloaded to the GPU (-ngl 36 offloads all layers)
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit model, CPU only (-ngl 0 keeps every layer on the CPU)
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```
Requires [llama.cpp](https://github.com/ggerganov/llama.cpp) 0.3.14 or newer; GPU runs
need a build with CUDA support.
---
## Custom Nim Kernels
The repo ships two companion DLLs, `libsmolkernels_q4.dll` and `libsmolkernels_q8.dll`,
written in Nim and compiled with `-O3` and ORC memory management. They implement a
custom dequantization + matrix-multiply path (`mulQ4Mat` / `mulQ8Mat`); a patched
`main.cpp` loads the matching DLL at runtime based on the detected model type:
```
llama-cli.exe detects "q4_k_m" in filename → loads libsmolkernels_q4.dll → mulQ4Mat()
llama-cli.exe detects "q8_0"   in filename → loads libsmolkernels_q8.dll → mulQ8Mat()
```
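
For illustration, a minimal C++ sketch of that dispatch on Windows; the `MulFn`
signature here is an assumption made for the sketch, not the DLLs' actual export
signature:

```cpp
// Minimal sketch of the runtime kernel dispatch (Windows).
// The kernel signature below is assumed, not the DLLs' real export.
#include <windows.h>
#include <cstdio>
#include <cstring>

// Assumed export: dequantize a quantized weight matrix and multiply it
// against an FP32 activation vector.
using MulFn = void (*)(const void* qweights, const float* x,
                       float* out, int rows, int cols);

int main(int argc, char** argv) {
    const char* model = (argc > 1) ? argv[1] : "smolLM3-q4_k_m.gguf";

    // Choose the companion DLL from the quantization tag in the filename.
    const bool  isQ4  = std::strstr(model, "q4_k_m") != nullptr;
    const char* dll   = isQ4 ? "libsmolkernels_q4.dll" : "libsmolkernels_q8.dll";
    const char* entry = isQ4 ? "mulQ4Mat" : "mulQ8Mat";

    HMODULE lib = LoadLibraryA(dll);
    if (!lib) { std::fprintf(stderr, "failed to load %s\n", dll); return 1; }

    auto mul = reinterpret_cast<MulFn>(GetProcAddress(lib, entry));
    if (!mul) { std::fprintf(stderr, "%s not found in %s\n", entry, dll); return 1; }

    std::printf("loaded %s -> %s\n", dll, entry);
    // ...the patched hot path would now call mul(...) instead of the
    // built-in matrix multiply for this tensor type.
    FreeLibrary(lib);
    return 0;
}
```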
Full build instructions, kernel source code, and the main.cpp patch are in the
Medium article linked above.
---
## System requirements
- NVIDIA GPU 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.x+
- Windows (the kernel loader uses DLLs); a Linux port via `.so` shared objects is straightforward
---
## Related project
This work is the research foundation for
[QBench CLI](https://github.com/AnkitTsj/qbench), a C++ command-line tool that
automates LLM quantization workflows, model selection, and hardware compatibility
checks using the same quantization and kernel techniques developed here.