SmolLM3-3B GLQ 3.5bpw

SmolLM3-3B quantized with GLQ (Golay-Leech Quantization) using mixed precision (2-4 bpw per layer, averaging 3.5 bpw nominal).

Note on effective bpw: this model was quantized with power-of-2 FHT padding, so effective storage is ~4.7 bpw. hidden_size=2048 is already a power of 2, but intermediate_size=11008 pads to 16384. The quality benchmarks below reflect this effective rate.
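To make the overhead concrete, here is a back-of-the-envelope sketch. The dimensions come from SmolLM3-3B's config; the assumption that the padded dimension is stored at the same bit width as the real one is illustrative, the exact per-tensor treatment is up to the GLQ pipeline.

# Back-of-the-envelope padding overhead (illustrative only).
def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

hidden, intermediate = 2048, 11008                 # SmolLM3-3B config values
print(next_pow2(hidden) / hidden)                  # 1.0  -> no padding on 2048-wide dims
print(next_pow2(intermediate) / intermediate)      # ~1.49x on the 11008-wide MLP dim

# The 2048-wide attention dims carry no padding while the 11008-wide MLP dim
# carries ~1.49x, so the model-wide blend of roughly 4.7 / 3.5 = 1.34x storage
# overhead is consistent with the note above.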

Newer model available: SmolLM3-3B-GLQ-6bpw uses block-diagonal FHT, so there is no padding waste, the 6.0 bpw label is the effective rate, and it retains 99.6% of bf16 quality.
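For context, the idea behind the block-diagonal variant is to rotate a non-power-of-2 dimension in power-of-2 blocks (11008 = 43 x 256) instead of zero-padding it to 16384. The sketch below shows that general technique with a plain fast Walsh-Hadamard transform; it is not the exact transform GLQ uses.

import torch

def fwht(x):
    # Fast Walsh-Hadamard transform over the last dim (must be a power of 2),
    # with orthonormal scaling so norms are preserved.
    n = x.shape[-1]
    x = x.clone()
    h = 1
    while h < n:
        x = x.view(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape[:-3], n)
        h *= 2
    return x / n ** 0.5

def block_diag_fht(x, block=256):
    # Apply the FHT independently to contiguous power-of-2 blocks, so a
    # dimension like 11008 = 43 * 256 needs no zero padding at all.
    d = x.shape[-1]
    assert d % block == 0
    return fwht(x.view(*x.shape[:-1], d // block, block)).view(*x.shape)

x = torch.randn(2, 11008)
y = block_diag_fht(x)
print(y.shape)  # torch.Size([2, 11008]); per-block norms are preserved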

Quality

  • WikiText-2 perplexity: 7.65 (bf16: 7.04)
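For reference, a generic sliding-window evaluation along these lines (with model and tokenizer loaded as in Usage below) produces this kind of number; the 2048-token context and 512-token stride are assumptions rather than the exact settings used here, and it requires the datasets package.

# Generic WikiText-2 perplexity sketch (context length/stride are assumptions,
# not necessarily the settings behind the number above).
import torch
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 2048, 512
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                       # tokens not scored in a previous window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100                # only score the new tokens
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())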

Usage

pip install glq

import glq.hf_integration  # side-effect import: registers GLQ checkpoint loading with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw",
    device_map="cuda",
    dtype="float16",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
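The underlying SmolLM3-3B is an instruct-tuned model, and the tokenizer loaded above is the stock one, so chat-style prompts can go through its chat template. A brief example (the prompt text is arbitrary):

# Chat-style prompt via the base tokenizer's chat template.
messages = [{"role": "user", "content": "Give a one-sentence summary of the Eiffel Tower."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))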

Requirements

  • transformers >= 5.0 (4.x has a weight loading bug for small GLQ models)
  • torch >= 2.0
  • glq >= 0.2.8 (pip install glq)

License

Apache 2.0
