---
base_model: HuggingFaceTB/SmolLM3-3B
language:
- en
library_name: gguf
tags:
- quantization
- llama.cpp
- gguf
- smollm
- nim-kernels
- 4-bit
- 8-bit
- consumer-hardware
---

# SmolLM3-3B: 4-bit and 8-bit GGUF with Custom Nim Kernels

Quantized builds of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) that run on consumer hardware with under 6 GB of VRAM, using custom hot-path matrix-multiply kernels written in Nim.

Full write-up: [Run LLMs on a 6 GB Laptop GPU & CPU — Medium](https://medium.com/@devbyankit/run-llms-on-a-6-gb-laptop-gpu-cpu-smollm-3-quantization-nim-kernels-6c9cbb233930)

---

## What's in this repo

| File | Format | Approx. size | Use case |
|------|--------|--------------|----------|
| `smolLM3-q4_k_m.gguf` | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| `smolLM3-q8_0.gguf` | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |

The Q4_K_M build cuts the memory footprint by roughly 75% versus the FP16 baseline, since 4-bit weights need a quarter of the storage of 16-bit weights; the Q8_0 build roughly halves it.

---

## Performance (512-token prompt, averaged over 3 runs)

| Configuration | Prompt processing (tok/s) | Text generation (tok/s) | VRAM / RAM |
|---------------|---------------------------|-------------------------|------------|
| 4-bit GPU | 11.25 | **14.60** | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 | 12.65 | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 | 15.37 | ~3–4 GB RAM |
| 8-bit CPU | 7.79 | 11.83 | ~3–4 GB RAM |

Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.

---

## How to run

```bash
# 4-bit GPU
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit CPU
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```

Requires [llama.cpp](https://github.com/ggerganov/llama.cpp) 0.3.14+ built with CUDA support for GPU runs.

---

## Custom Nim Kernels

The repo ships two companion DLLs, `libsmolkernels_q4.dll` and `libsmolkernels_q8.dll`, written in Nim and compiled with `-O3` and ORC memory management. They implement a custom dequantization + matrix-multiply path (`mulQ4Mat` / `mulQ8Mat`) that a patched `main.cpp` loads dynamically at runtime based on the detected model type:

```
llama-cli.exe detects "q4_k_m" in filename → loads libsmolkernels_q4.dll → mulQ4Mat()
llama-cli.exe detects "q8_0"   in filename → loads libsmolkernels_q8.dll → mulQ8Mat()
```

A simplified dequantization example and a loader sketch are included at the end of this card. Full build instructions, kernel source code, and the `main.cpp` patch are in the Medium article linked above.

---

## System requirements

- NVIDIA GPU with 4 GB+ VRAM (for GPU runs), or 8 GB+ RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.x+
- Windows (DLL-based kernel loader); a Linux port is straightforward with `.so` builds

---

## Related project

This work is the research foundation for [QBench CLI](https://github.com/AnkitTsj/qbench), a C++ command-line tool that automates LLM quantization workflows, model selection, and hardware compatibility checks using the same quantization and kernel techniques developed here.
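
---

## Appendix: what "dequantization + matrix-multiply" means

To make the kernel path above concrete, here is a simplified C++ illustration using llama.cpp's plain Q4_0 block layout (32 weights per block: one scale plus sixteen bytes of packed 4-bit quants, each offset by 8). The Q4_K_M format actually shipped here is more elaborate, using 256-weight super-blocks with per-sub-block scales, so treat this as a sketch of the idea rather than the repo's kernel.

```cpp
// Simplified Q4_0-style dequantize + dot product, for illustration only.
// Each block holds 32 weights: a scale `d` and 16 bytes of 4-bit quants.
// A weight is recovered as d * (q - 8).
#include <cstdint>
#include <cstddef>

struct BlockQ4 {
    float   d;        // block scale (stored as FP16 on disk; float here for brevity)
    uint8_t qs[16];   // 32 x 4-bit quants, two per byte
};

// Dot product of one quantized weight row (n/32 blocks) with a float vector.
float dotQ4Row(const BlockQ4* row, const float* vec, size_t n) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / 32; ++b) {
        const BlockQ4& blk = row[b];
        for (int j = 0; j < 16; ++j) {
            // Low nibble holds element j, high nibble holds element j + 16.
            const int q0 = (blk.qs[j] & 0x0F) - 8;
            const int q1 = (blk.qs[j] >> 4)   - 8;
            sum += blk.d * static_cast<float>(q0) * vec[b * 32 + j];
            sum += blk.d * static_cast<float>(q1) * vec[b * 32 + j + 16];
        }
    }
    return sum;
}
```

A full matrix-multiply is this dot product repeated per output row, which is why the 4-bit path wins on memory bandwidth: each row streams a quarter of the bytes an FP16 row would.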
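
---

## Appendix: loader sketch

The filename-based dispatch described under Custom Nim Kernels can be sketched as below. The DLL names and entry points (`mulQ4Mat` / `mulQ8Mat`) come from this repo; the kernel function signature is an assumed placeholder, and the real one is in the `main.cpp` patch from the Medium article.

```cpp
// Minimal sketch of the filename-based kernel dispatch (Windows).
// Assumption: the kernel signature below is illustrative, not the
// actual export signature of the shipped DLLs.
#include <windows.h>
#include <cstdio>
#include <string>

// Assumed shape: dequantize the quantized weights in `src`, multiply
// by `vec`, and write `rows` floats into `dst`.
using MulQMatFn = void (*)(const void* src, const float* vec,
                           float* dst, int rows, int cols);

MulQMatFn loadKernelFor(const std::string& modelPath) {
    // Pick the companion DLL from the quantization tag in the filename.
    const bool isQ4 = modelPath.find("q4_k_m") != std::string::npos;
    const char* dll  = isQ4 ? "libsmolkernels_q4.dll" : "libsmolkernels_q8.dll";
    const char* proc = isQ4 ? "mulQ4Mat" : "mulQ8Mat";

    HMODULE h = LoadLibraryA(dll);
    if (!h) {
        std::fprintf(stderr, "could not load %s\n", dll);
        return nullptr;
    }
    return reinterpret_cast<MulQMatFn>(GetProcAddress(h, proc));
}
```

On Linux the same dispatch would swap `LoadLibraryA`/`GetProcAddress` for `dlopen`/`dlsym` against `.so` builds, which is the port the System requirements section calls straightforward.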