---
base_model: HuggingFaceTB/SmolLM3-3B
language:
- en
library_name: gguf
tags:
- quantization
- llama.cpp
- gguf
- smollm
- nim-kernels
- 4-bit
- 8-bit
- consumer-hardware
---

# SmolLM3-3B: 4-bit GGUF with Custom Nim Kernels

Quantized versions of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) that run on consumer hardware in under 6 GB of VRAM, with custom Nim kernels for the hot-path matrix-multiply operations.

Full write-up: [Run LLMs on a 6 GB Laptop GPU & CPU (Medium)](https://medium.com/@devbyankit/run-llms-on-a-6-gb-laptop-gpu-cpu-smollm-3-quantization-nim-kernels-6c9cbb233930)

---

## What's in this repo

| File | Format | Approx. size | Use case |
|------|--------|--------------|----------|
| `smolLM3-q4_k_m.gguf` | Q4_K_M | ~1.9 GB | Best speed, GPU + CPU |
| `smolLM3-q8_0.gguf` | Q8_0 | ~3.3 GB | Near-FP16 quality, fits in 6 GB |

Memory footprint is reduced by roughly 75% from the FP16 baseline for the 4-bit file (and by roughly half for the 8-bit file).
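
These sizes follow from bits-per-weight arithmetic. As a rough sanity check (assuming ~3.1 B parameters for SmolLM3-3B, 8.5 bits/weight for Q8_0, and ~4.8 bits/weight as a typical Q4_K_M average; GGUF metadata and the few higher-precision tensors are ignored):

```cpp
#include <cstdio>

// Rough GGUF size estimate: billions of params * bits-per-weight / 8 gives GB.
// Treat the result as approximate; real files add metadata and mixed precision.
double gguf_gb(double n_params_billions, double bits_per_weight) {
    return n_params_billions * bits_per_weight / 8.0;
}

int main() {
    std::printf("FP16   ~%.1f GB\n", gguf_gb(3.1, 16.0)); // ~6.2 GB baseline
    std::printf("Q8_0   ~%.1f GB\n", gguf_gb(3.1, 8.5));  // ~3.3 GB
    std::printf("Q4_K_M ~%.1f GB\n", gguf_gb(3.1, 4.8));  // ~1.9 GB
}
```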

---

## Performance (512-token prompt, averaged over 3 runs)

| Configuration | Prompt processing | Text generation | VRAM / RAM |
|---------------|-------------------|-----------------|------------|
| 4-bit GPU | 11.25 tok/s | **14.60 tok/s** | ~5.5 GB VRAM |
| 8-bit GPU | 2.57 tok/s | 12.65 tok/s | ~5.8 GB VRAM |
| 4-bit CPU | 2.38 tok/s | 15.37 tok/s | ~3-4 GB RAM |
| 8-bit CPU | 7.79 tok/s | 11.83 tok/s | ~3-4 GB RAM |

Hardware used: NVIDIA GPU with 6 GB VRAM, Windows, CUDA 12.4.

---

## How to run

```bash
# 4-bit GPU
llama-cli.exe -m smolLM3-q4_k_m.gguf -ngl 36 -n 256 --temp 0.7 \
  --repeat-penalty 1.1 --color -sys "you are a helpful assistant"

# 8-bit CPU
llama-cli.exe -m smolLM3-q8_0.gguf -ngl 0 -n 256 --temp 0.7
```

Requires [llama.cpp](https://github.com/ggerganov/llama.cpp) 0.3.14+ built with CUDA support for GPU runs (e.g. compiled with the `GGML_CUDA=ON` CMake option).

---

## Custom Nim Kernels

The repo ships two companion DLLs, `libsmolkernels_q4.dll` and `libsmolkernels_q8.dll`, written in Nim and compiled with `-O3` and ORC memory management. They implement a custom dequantization + matrix-multiply path (`mulQ4Mat` / `mulQ8Mat`) that a patched `main.cpp` loads dynamically at runtime based on the detected model type.
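
Purely for orientation, here is a minimal C++ sketch of the kind of dequantize-and-multiply a `mulQ8Mat`-style routine performs over ggml's `block_q8_0` layout (one fp16 scale plus 32 int8 weights per block). The struct mirrors ggml's; the function name, signature, and simplified fp16 conversion are illustrative assumptions, not the DLL's actual exported interface.

```cpp
#include <cstdint>
#include <cstring>

// ggml's Q8_0 layout: one fp16 scale + 32 int8 weights per block
// (34 bytes per 32 weights = 8.5 bits/weight, matching the sizes above).
constexpr int QK8_0 = 32;
struct block_q8_0 {
    uint16_t d;          // block scale, stored as IEEE fp16
    int8_t   qs[QK8_0];  // quantized weights
};

// Minimal fp16 -> fp32 conversion (normals and zero only; enough for a sketch).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t man  = h & 0x3FFu;
    const uint32_t bits = exp == 0 ? sign // +/-0 (subnormals flushed)
                                   : sign | ((exp + 112u) << 23) | (man << 13);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// One output element: per block, dot the int8 weights with the activations,
// then apply the block scale once. "mulQ8Row" is a hypothetical name; the
// shipped mulQ8Mat is Nim code behind the DLL and covers the whole matrix.
float mulQ8Row(const block_q8_0* row, const float* x, int n_cols) {
    float acc = 0.0f;
    for (int b = 0; b < n_cols / QK8_0; ++b) {
        float sum = 0.0f;
        for (int i = 0; i < QK8_0; ++i)
            sum += row[b].qs[i] * x[b * QK8_0 + i]; // int8 weight * fp32 activation
        acc += fp16_to_fp32(row[b].d) * sum;
    }
    return acc;
}
```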

At runtime the kernel is selected from the model filename (a loader sketch follows):

- `llama-cli.exe` detects `q4_k_m` in the filename → loads `libsmolkernels_q4.dll` → `mulQ4Mat()`
- `llama-cli.exe` detects `q8_0` in the filename → loads `libsmolkernels_q8.dll` → `mulQ8Mat()`
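
A minimal sketch of such a filename-keyed loader, using `LoadLibraryA`/`GetProcAddress` on Windows with a `dlopen` fallback for the Linux port mentioned under system requirements. The exported names `mulQ4Mat`/`mulQ8Mat` come from this repo; `loadKernelFor`, the kernel signature, and the error handling are assumptions, and the actual `main.cpp` patch is in the Medium article.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
  #include <windows.h>
  using KernelLib = HMODULE;
  static KernelLib openLib(const char* path)            { return LoadLibraryA(path); }
  static void*     getSym(KernelLib lib, const char* s) { return (void*)GetProcAddress(lib, s); }
#else
  #include <dlfcn.h>
  using KernelLib = void*;
  static KernelLib openLib(const char* path)            { return dlopen(path, RTLD_NOW); }
  static void*     getSym(KernelLib lib, const char* s) { return dlsym(lib, s); }
#endif

// Assumed kernel signature, purely illustrative: dst = src0 * src1 (row-major).
using MulFn = void (*)(const void* src0, const float* src1, float* dst,
                       int rows, int cols);

// Pick the kernel library from the model filename, mirroring the flow above.
// On Linux the libraries would be libsmolkernels_q4.so / libsmolkernels_q8.so.
MulFn loadKernelFor(const std::string& modelPath) {
    const bool q4 = modelPath.find("q4_k_m") != std::string::npos;
    const char* lib = q4 ? "libsmolkernels_q4.dll" : "libsmolkernels_q8.dll";
    const char* sym = q4 ? "mulQ4Mat"              : "mulQ8Mat";

    KernelLib handle = openLib(lib);
    if (!handle) { std::fprintf(stderr, "failed to load %s\n", lib); return nullptr; }

    auto fn = (MulFn)getSym(handle, sym);
    if (!fn) std::fprintf(stderr, "%s not exported by %s\n", sym, lib);
    return fn; // nullptr -> caller falls back to the stock llama.cpp path
}
```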

Full build instructions, kernel source code, and the `main.cpp` patch are in the Medium article linked above.

---

## System requirements

- NVIDIA GPU with 4 GB+ VRAM (for GPU runs), or 8 GB+ system RAM (for CPU runs)
- CUDA Toolkit 12.0+
- llama.cpp 0.3.x+
- Windows (the kernel loader is DLL-based); a Linux port is straightforward with `.so` shared libraries

---

## Related project

This work is the research foundation for [QBench CLI](https://github.com/AnkitTsj/qbench), a C++ command-line tool that automates LLM quantization workflows, model selection, and hardware compatibility checks using the same quantization and kernel techniques developed here.