SmolLM-Smashed: Tiny Giants, Optimized for Speed
This article is a guest contribution by Parag Ekbote, an active member of the Pruna community. The views and opinions expressed are those of the author and do not necessarily reflect those of Pruna AI.
Socials: LinkedIn · Hugging Face · GitHub
Introduction
The SmolLM family, known for its efficiency-first design (ranging from 135M to 3B parameters), gets an extra performance boost with SmolLM-Smashed. We demonstrate how small, efficient models can achieve impressive performance with careful optimization. Pruna is a model optimization library designed for developers who want to deliver faster, more efficient AI models with minimal effort. It offers a complete suite of compression techniques, including caching, quantization, pruning, distillation, and compilation, to enhance model performance.
Using Pruna, we can optimize the SmolLM models to make them smaller, faster, and deployable on modest hardware without sacrificing much accuracy. You can try an interactive demo of this project on Replicate. The very first run may take a while to boot up, but once warm, responses will be much faster.
Why Optimize?
Model efficiency matters. The goal of this project was to explore how far we can push compact models like SmolLM-135M and SmolLM3-3B using quantization and `torch.compile`.
Setup and Environment
Tech stack: Python 3.11, PyTorch 2.7.0, Transformers ≥ 4.53.0, and Pruna 0.2.8.
You can also view the complete model smash script and evaluation script. A Colab notebook is provided for convenient usage and easy access.
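For convenience, a minimal environment sketch is shown below. It assumes Pruna is installed from PyPI as `pruna` and that each package exposes a standard `__version__` attribute; the pinned versions match the stack listed above.

```python
# Environment sketch for reproducing this post's setup.
# pip install "torch==2.7.0" "transformers>=4.53.0" "pruna==0.2.8"
import torch
import transformers
import pruna

# Quick sanity check of the installed versions.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("pruna:", pruna.__version__)
```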
Optimization Process
The optimization combined two key techniques:
- Quantization: Using Pruna’s HQQ quantizer to compress weights to 4-bit precision while maintaining numerical stability through bfloat16 computation.
- Compilation: Leveraging `torch.compile` modes (`max-autotune`, `reduce-overhead`, and `default`) for kernel fusion and execution graph optimization. Note that `torch.compile` typically requires a short warm-up phase during the first few inference runs. A minimal configuration sketch combining both techniques follows this list.
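To make the combination concrete, here is a minimal sketch of how such a setup could be expressed with Pruna's `SmashConfig`. The option keys (`hqq_weight_bits`, `hqq_group_size`, `torch_compile_mode`) are assumptions based on Pruna's dictionary-style configuration and may differ between versions; the linked smash script is the authoritative reference.

```python
# Minimal sketch: 4-bit HQQ quantization + torch.compile via Pruna.
# Option keys below are illustrative; check the linked smash script / Pruna
# docs for the exact names supported by your Pruna version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM2-360M"  # one of the SmolLM variants
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"              # Half-Quadratic Quantization
smash_config["hqq_weight_bits"] = 4            # assumed key: 4-bit weights
smash_config["hqq_group_size"] = 64            # assumed key: per-model values below
smash_config["compiler"] = "torch_compile"     # kernel fusion + graph optimization
smash_config["torch_compile_mode"] = "max-autotune"  # assumed key

smashed_model = smash(model=model, smash_config=smash_config)
```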
Model Smash Configurations
A brief overview of the model configuration choices and the impact they have on performance is provided below. These results come directly from the evaluation script, which reports metrics such as latency, throughput, and memory efficiency. Each configuration has been carefully tuned to match the model’s hidden dimension, ensuring stability while maximizing speed and quality. A breakdown of the evaluation results follows in the next section.
You can view the complete model collection on Hugging Face.
| Model | Quantizer | Bits | Group Size | Compile Mode | Fullgraph |
|---|---|---|---|---|---|
| SmolLM2-360M | HQQ | 4 | 64 | max-autotune | True |
| SmolLM2-1.7B | HQQ | 4 | 128 | default | True |
| SmolLM3-3B | HQQ | 4 | 128 | reduce-overhead | False |
For SmolLM2-360M, a 4-bit HQQ setup with a small group size (64) and max-autotune compilation extracts the maximum speedup without hurting accuracy.
For SmolLM2-1.7B, we use a slightly larger group size (128) with the default compile mode to reduce overhead and keep memory stable during inference.
SmolLM3-3B also uses 4-bit HQQ with group size 128 to maintain quality at scale, while reduce-overhead mode and disabling full-graph compilation help avoid slowdowns on this larger model.
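Translated into configuration values, the table above could look roughly like the following mapping. It builds on the earlier sketch (same `quantizer`/`compiler` choices), and the key names again follow the assumed dictionary-style convention rather than a guaranteed Pruna API.

```python
# Illustrative translation of the configuration table into per-model options.
# Key names (hqq_weight_bits, hqq_group_size, torch_compile_mode, ...) are
# assumptions; see the linked smash script for the exact names.
MODEL_CONFIGS = {
    "HuggingFaceTB/SmolLM2-360M": {
        "hqq_weight_bits": 4, "hqq_group_size": 64,
        "torch_compile_mode": "max-autotune", "torch_compile_fullgraph": True,
    },
    "HuggingFaceTB/SmolLM2-1.7B": {
        "hqq_weight_bits": 4, "hqq_group_size": 128,
        "torch_compile_mode": "default", "torch_compile_fullgraph": True,
    },
    "HuggingFaceTB/SmolLM3-3B": {
        "hqq_weight_bits": 4, "hqq_group_size": 128,
        "torch_compile_mode": "reduce-overhead", "torch_compile_fullgraph": False,
    },
}
```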
Evaluation
We evaluated several SmolLM model variants under identical optimization conditions to assess the impact of model scale on inference performance and resource utilization.
Our optimization strategy considers:
- Hidden dimension scaling: Tuned to preserve model stability across different sizes
- Quantization parameters: 4-bit precision carefully calibrated to minimize accuracy degradation.
- Compilation settings: Graph-level optimizations specifically tuned for inference patterns.
All models were quantized using Half-Quadratic Quantization (HQQ) at 4-bit precision and compiled with PyTorch's native compilation framework to maximize inference efficiency.
As seen in the diagram, the memory usage scales approximately linearly with parameter count, indicating efficient quantization without unexpected overhead. The 4-bit representation achieves approximately 75-80% memory reduction compared to FP16 baselines.
| Model Variant | Parameters | Memory Footprint | Compute (MACs) |
|---|---|---|---|
| SmolLM-135M | 81.4M | 311 MB | 72.3B |
| SmolLM-360M | 100.8M | 1.56 GB | 257.2B |
| SmolLM-1.7B | 262.8M | 3.15 GB | 671.1B |
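For intuition on where that reduction comes from: 4-bit weights alone are a 4x reduction over FP16 weights, and the group-wise scale/zero-point metadata only adds back a fraction of a bit per weight. A back-of-the-envelope sketch, under the assumption of one FP16 scale and one FP16 zero-point per group (the exact metadata layout depends on the HQQ implementation):

```python
# Back-of-the-envelope: effective bits per weight for group-wise 4-bit
# quantization, assuming one FP16 scale and one FP16 zero-point per group.
def effective_bits_per_weight(weight_bits: int, group_size: int, meta_bits: int = 2 * 16) -> float:
    return weight_bits + meta_bits / group_size

for group_size in (64, 128):
    bits = effective_bits_per_weight(4, group_size)
    print(f"group size {group_size}: ~{bits:.2f} effective bits per weight (vs. 16 for FP16)")
```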
`torch.compile` provides graph-level optimizations including:
- Kernel fusion: Reduces memory bandwidth requirements by combining operations
- Operator optimization: Specialized kernels for common operations
- Fine-grained compilation: Graph-level optimizations tuned to specific inference patterns
Note that `torch.compile` can take some time on the initial run due to JIT compilation, but subsequent runs are much faster. The evaluation demonstrates that HQQ at 4-bit precision provides a favorable balance between efficiency and quality, introducing minimal quality degradation while delivering substantial efficiency gains.
For applications where inference cost and latency are primary concerns, this trade-off proves favorable. These results establish a baseline for practical model deployment, demonstrating that modern techniques make language model inference accessible across diverse hardware environments.
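To make the warm-up note actionable when benchmarking, the first run (which triggers JIT compilation and autotuning) should be excluded from timings. A minimal sketch, assuming the smashed model from the earlier configuration sketch lives on a CUDA device and exposes the usual Transformers `generate` API through Pruna's wrapper:

```python
# Benchmarking sketch: exclude the torch.compile warm-up run from timing.
# Assumes `smashed_model` and `tokenizer` from the earlier sketch, on a GPU.
import time
import torch

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")

# Warm-up pass: triggers JIT compilation and autotuning, so it is slow.
with torch.inference_mode():
    smashed_model.generate(**inputs, max_new_tokens=32)

# Steady-state timing over the compiled graph.
torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    smashed_model.generate(**inputs, max_new_tokens=32)
torch.cuda.synchronize()
print(f"steady-state latency: {time.perf_counter() - start:.3f}s")
```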
Takeaways
- Pruna is powerful and versatile: It simplifies optimization without the need for multiple libraries or difficult configuration settings.
- Small models can go fast with less VRAM: Quantization and compilation deliver real-world performance improvements with minimal loss in capability. We achieved a 2-3x speedup through `torch.compile`, and combining it with HQQ 4-bit quantization reduced GPU memory usage by ~70% while maintaining model quality.
- Model-specific tuning matters: Matching group sizes and compile modes to architecture is key to stability.
- Deployment-ready: The smashed models can be loaded directly via `PrunaModel` or run interactively on a wide range of services such as Replicate, Docker, AWS AMI, etc. You can view the complete list here; a minimal loading sketch follows below.
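As an illustration, loading a smashed checkpoint could look roughly like the following. This is a sketch assuming `PrunaModel.from_pretrained` as the loading entry point and that the wrapper delegates `generate` to the underlying Transformers model; the path is illustrative, so pick an actual model from the Hugging Face collection linked above.

```python
# Sketch: load a saved smashed checkpoint and generate text with it.
# The local path below is illustrative; use a model from the linked collection.
from pruna import PrunaModel
from transformers import AutoTokenizer

model = PrunaModel.from_pretrained("path/to/SmolLM2-360M-smashed")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

inputs = tokenizer("Write a haiku about small models.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```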
Acknowledgments
Gratitude to the Pruna AI team for the Smash framework and insightful documentation, and to Hugging Face for providing the SmolLM models. Special thanks to David Berenstein, who offered guidance during evaluation and deployment testing.
