SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels

Quantized builds of SmolLM3-3B that run on consumer hardware in under 6 GB of VRAM, with custom Nim kernels for the hot-path matrix-multiply operations.

Full write-up: Run LLMs on a 6 GB Laptop GPU & CPU (Medium)


What's in this repo

| File | Format | Size (approx.) | Use case |
|---|---|---|---|
| smolLM3-q4_k_m.gguf | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| smolLM3-q8_0.gguf | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |

Memory footprint is reduced ~75% from the FP16 baseline.
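For intuition: at 16 bits per weight, the 3B-parameter FP16 model needs roughly 6 GB for the weights alone, and ~4-bit Q4_K_M needs about a quarter of that, which lines up with the ~1.9 GB file once the per-block scales that K-quants store are added. (Back-of-the-envelope figures, not measurements from this repo.)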


Performance (512-token prompt, averaged over 3 runs)

| Configuration | Prompt processing | Text generation | VRAM / RAM |
|---|---|---|---|
| 4-bit GPU | 11.25 tok/s | 14.60 tok/s | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 tok/s | 12.65 tok/s | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 tok/s | 15.37 tok/s | ~3–4 GB RAM |
| 8-bit CPU | 7.79 tok/s | 11.83 tok/s | ~3–4 GB RAM |

Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.
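For an independent reproduction, llama.cpp's bundled llama-bench tool is the usual route. A sketch (assumed invocation, not the exact script behind these numbers; `-p` is prompt length, `-n` generation length, `-r` the repetition count to average over):

```bash
# Hypothetical benchmark run with llama.cpp's llama-bench
llama-bench.exe -m smolLM3-q4_k_m.gguf -ngl 36 -p 512 -n 128 -r 3
```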


How to run

```bash
# 4-bit GPU
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit CPU
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```

Requires llama.cpp 0.3.14+ built with CUDA support for GPU runs.
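If you are building llama.cpp yourself, current upstream versions enable CUDA through the GGML_CUDA CMake option (older releases used LLAMA_CUBLAS):

```bash
# Standard llama.cpp CUDA build (flag names per upstream llama.cpp docs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```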


Custom Nim Kernels

The repo ships two companion DLLs, libsmolkernels_q4.dll and libsmolkernels_q8.dll, written in Nim and compiled with -O3 and ORC memory management. They implement a custom dequantization + matrix-multiply path (mulQ4Mat / mulQ8Mat) that a patched main.cpp loads dynamically at runtime based on the detected model type:

  • llama-cli.exe detects "q4_k_m" in the filename → loads libsmolkernels_q4.dll → mulQ4Mat()
  • llama-cli.exe detects "q8_0" in the filename → loads libsmolkernels_q8.dll → mulQ8Mat()

Full build instructions, kernel source code, and the main.cpp patch are in the Medium article linked above.
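The actual kernel source lives in the article; purely to illustrate the shape of the exported entry points, here is a minimal Nim sketch of a Q4-style dequantize-and-multiply proc. The block layout (32 weights per block, one fp32 scale, two 4-bit values per byte) is a simplification of Q4_0 rather than the real Q4_K_M super-block format, and this mulQ4Mat signature is assumed, not copied from the repo:

```nim
# Illustrative only: simplified Q4 layout, not the real Q4_K_M format.
const blockSize = 32

type
  Q4Block = object
    scale: float32                          # per-block dequantization scale
    quants: array[blockSize div 2, uint8]   # two 4-bit weights per byte

proc mulQ4Mat(w: ptr UncheckedArray[Q4Block],
              x: ptr UncheckedArray[float32],
              y: ptr UncheckedArray[float32],
              rows, cols: int) {.exportc, dynlib, cdecl.} =
  ## y = W * x for a row-major matrix W stored as Q4 blocks.
  let blocksPerRow = cols div blockSize
  for r in 0 ..< rows:
    var acc = 0.0'f32
    for b in 0 ..< blocksPerRow:
      let blk = w[r * blocksPerRow + b]
      for i in 0 ..< blockSize div 2:
        let packed = blk.quants[i]
        # Unpack two 4-bit values, re-center around zero, scale, accumulate.
        let lo = float32(packed and 0x0F'u8) - 8.0'f32
        let hi = float32(packed shr 4) - 8.0'f32
        let col = b * blockSize + i * 2
        acc += blk.scale * (lo * x[col] + hi * x[col + 1])
    y[r] = acc
```

Compiled as a shared library (e.g. nim c --app:lib --mm:orc -d:release --passC:-O3 smolkernels_q4.nim, which yields a DLL on Windows and a shared object on Linux), this matches the -O3 / ORC build described above; the exact flags used for the shipped DLLs are in the article.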


System requirements

  • NVIDIA GPU 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
  • CUDA Toolkit 12.0+
  • llama.cpp 0.3.x+
  • Windows (DLL-based kernel loader); a Linux port is straightforward with .so builds

Related project

This work is the research foundation for QBench CLI, a C++ command-line tool that automates LLM quantization workflows, model selection, and hardware compatibility checks using the same quantization and kernel techniques developed here.
