SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels

Quantized builds of SmolLM3-3B that run on consumer hardware in under 6 GB of VRAM, with custom Nim kernels for the hot-path matrix-multiply operations.

Full write-up: Run LLMs on a 6 GB Laptop GPU & CPU (Medium)


What's in this repo

| File | Format | Size (approx.) | Use case |
|---|---|---|---|
| smolLM3-q4_k_m.gguf | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| smolLM3-q8_0.gguf | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |

Memory footprint is reduced ~75% from the FP16 baseline.
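For intuition: at 16 bits per weight, the 3B-parameter FP16 model needs roughly 6 GB for the weights alone, and ~4-bit Q4_K_M needs about a quarter of that, which lines up with the ~1.9 GB file once the per-block scales that K-quants store are added. (Back-of-the-envelope figures, not measurements from this repo.)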


Performance (512-token prompt, averaged over 3 runs)

| Configuration | Prompt processing | Text generation | VRAM / RAM |
|---|---|---|---|
| 4-bit GPU | 11.25 tok/s | 14.60 tok/s | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 tok/s | 12.65 tok/s | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 tok/s | 15.37 tok/s | ~3–4 GB RAM |
| 8-bit CPU | 7.79 tok/s | 11.83 tok/s | ~3–4 GB RAM |

Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.
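For an independent reproduction, llama.cpp's bundled llama-bench tool is the usual route. A sketch (assumed invocation, not the exact script behind these numbers; `-p` is prompt length, `-n` generation length, `-r` the repetition count to average over):

```bash
# Hypothetical benchmark run with llama.cpp's llama-bench
llama-bench.exe -m smolLM3-q4_k_m.gguf -ngl 36 -p 512 -n 128 -r 3
```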


How to run

```bash
# 4-bit GPU
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit CPU
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```

Requires llama.cpp 0.3.14+ built with CUDA support for GPU runs.
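If you are building llama.cpp yourself, current upstream versions enable CUDA through the GGML_CUDA CMake option (older releases used LLAMA_CUBLAS):

```bash
# Standard llama.cpp CUDA build (flag names per upstream llama.cpp docs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```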


Custom Nim Kernels

The repo ships two companion DLLs, libsmolkernels_q4.dll and libsmolkernels_q8.dll, written in Nim and compiled with -O3 and ORC memory management. They implement a custom dequantization + matrix-multiply path (mulQ4Mat / mulQ8Mat) that a patched main.cpp loads dynamically at runtime based on the detected model type:

  • llama-cli.exe detects "q4_k_m" in the filename → loads libsmolkernels_q4.dll → mulQ4Mat()
  • llama-cli.exe detects "q8_0" in the filename → loads libsmolkernels_q8.dll → mulQ8Mat()

Full build instructions, kernel source code, and the main.cpp patch are in the Medium article linked above.
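The actual kernel source lives in the article; purely to illustrate the shape of the exported entry points, here is a minimal Nim sketch of a Q4-style dequantize-and-multiply proc. The block layout (32 weights per block, one fp32 scale, two 4-bit values per byte) is a simplification of Q4_0 rather than the real Q4_K_M super-block format, and this mulQ4Mat signature is assumed, not copied from the repo:

```nim
# Illustrative only: simplified Q4 layout, not the real Q4_K_M format.
const blockSize = 32

type
  Q4Block = object
    scale: float32                          # per-block dequantization scale
    quants: array[blockSize div 2, uint8]   # two 4-bit weights per byte

proc mulQ4Mat(w: ptr UncheckedArray[Q4Block],
              x: ptr UncheckedArray[float32],
              y: ptr UncheckedArray[float32],
              rows, cols: int) {.exportc, dynlib, cdecl.} =
  ## y = W * x for a row-major matrix W stored as Q4 blocks.
  let blocksPerRow = cols div blockSize
  for r in 0 ..< rows:
    var acc = 0.0'f32
    for b in 0 ..< blocksPerRow:
      let blk = w[r * blocksPerRow + b]
      for i in 0 ..< blockSize div 2:
        let packed = blk.quants[i]
        # Unpack two 4-bit values, re-center around zero, scale, accumulate.
        let lo = float32(packed and 0x0F'u8) - 8.0'f32
        let hi = float32(packed shr 4) - 8.0'f32
        let col = b * blockSize + i * 2
        acc += blk.scale * (lo * x[col] + hi * x[col + 1])
    y[r] = acc
```

Compiled as a shared library (e.g. nim c --app:lib --mm:orc -d:release --passC:-O3 smolkernels_q4.nim, which yields a DLL on Windows and a shared object on Linux), this matches the -O3 / ORC build described above; the exact flags used for the shipped DLLs are in the article.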


System requirements

  • NVIDIA GPU 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
  • CUDA Toolkit 12.0+
  • llama.cpp 0.3.x+
  • Windows (DLL-based kernel loader); a Linux port is straightforward with .so builds

Related project

This work is the research foundation for QBench CLI, a C++ command-line tool that automates LLM quantization workflows, model selection, and hardware compatibility checks using the same quantization and kernel techniques developed here.
