# SmolLM3-3B GLQ 3.5bpw
SmolLM3-3B quantized with GLQ (Golay-Leech Quantization) using mixed precision (2-4 bpw per layer, averaging 3.5 bpw).
**Note on effective bpw:** This model was quantized with power-of-2 FHT padding. Effective storage is ~4.7 bpw due to dimensional padding: `hidden_size=2048` is already a power of two, but `intermediate_size=11008` pads to 16384. The quality benchmarks below reflect this effective rate.
**Newer model available:** SmolLM3-3B-GLQ-6bpw uses block-diagonal FHT with zero padding waste and honest 6.0 bpw labeling, at 99.6% of bf16 quality.
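As a rough illustration of where the padding overhead comes from (a sketch only; the `next_pow2` helper and the overhead arithmetic below are illustrative, not part of the `glq` package):

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

hidden_size, intermediate_size = 2048, 11008

# hidden_size is already a power of two, so it pads to itself:
print(next_pow2(hidden_size))        # 2048
# intermediate_size is not, so power-of-2 FHT padding rounds it up:
print(next_pow2(intermediate_size))  # 16384
# Padded MLP weights therefore store ~1.49x as many values as needed:
print(round(next_pow2(intermediate_size) / intermediate_size, 2))  # 1.49
```

That ~1.49x blow-up on the MLP dimensions is what pushes the nominal 3.5 bpw to roughly 4.7 bpw of effective storage.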
## Quality
- WikiText-2 perplexity: 7.65 (bf16 baseline: 7.04; lower is better)
## Usage
```bash
pip install glq
```
```python
import glq.hf_integration  # side-effect import: enables loading GLQ checkpoints
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw",
    device_map="cuda",
    dtype="float16",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Requirements
- transformers >= 5.0 (4.x has a weight-loading bug for small GLQ models)
- torch >= 2.0
- glq >= 0.2.8 (`pip install glq`)
## License
Apache 2.0
## Model tree

- Base model: HuggingFaceTB/SmolLM3-3B-Base
- Fine-tuned: HuggingFaceTB/SmolLM3-3B
- Quantized: xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw (this model)