Paper: QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
SmolLM3 QLoRA is a lightweight, 3B-parameter open-source language model based on SmolLM3-3B, fine-tuned with QLoRA on the Open-Orca/SlimOrca dataset (500K examples). It is optimized for retrieval-augmented generation (RAG) use cases and delivers competitive benchmark scores against larger models such as LLaMA-2 7B.
SmolLM3 QLoRA is intended to serve as a fast and compact assistant model for RAG use cases.
The model has been evaluated with lm-evaluation-harness on a 500-sample subset of each benchmark below.
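An evaluation run of this kind can be reproduced with the lm-evaluation-harness CLI. A sketch, assuming the model id above; task names follow the harness's registry, and exact flags may vary by version:

```shell
# Evaluate on 500 samples per task with lm-evaluation-harness.
# --limit caps the number of examples per task.
lm_eval --model hf \
  --model_args pretrained=soupstick/smollm3-qlora-ft,dtype=auto \
  --tasks hellaswag,arc_challenge,boolq \
  --limit 500 \
  --batch_size 8
```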
| Task | Accuracy | Normalized Accuracy | LLaMA-2 7B (Acc / Norm. Acc) |
|---|---|---|---|
| HellaSwag | 51.2% | 66.4% | 56.7% / 73.2% |
| ARC-Challenge | 49.4% | 52.2% | 53.7% / 56.9% |
| BoolQ | 81.0% | — | 83.1% |
👉 The model reaches roughly 90–97% of LLaMA-2 7B's scores at less than half the parameter count.
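The relative-performance claim can be checked directly from the table; a quick calculation of the per-task ratios (numbers copied from the table above):

```python
# Per-task scores from the table: (SmolLM3 QLoRA, LLaMA-2 7B).
scores = {
    "hellaswag_acc": (51.2, 56.7),
    "hellaswag_acc_norm": (66.4, 73.2),
    "arc_challenge_acc": (49.4, 53.7),
    "arc_challenge_acc_norm": (52.2, 56.9),
    "boolq_acc": (81.0, 83.1),
}

# Relative performance of the 3B model vs. the 7B baseline;
# the ratios fall between roughly 90% and 98%.
for task, (small, large) in scores.items():
    print(f"{task}: {small / large:.1%}")
```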
Fine-tuned from SmolLM3-3B on Open-Orca/SlimOrca (500K samples).

Usage:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "soupstick/smollm3-qlora-ft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

inputs = tokenizer("Explain retrieval-augmented generation.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
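Since the model was fine-tuned with QLoRA, it can also be loaded in 4-bit NF4 via bitsandbytes to cut memory use. A minimal sketch; the quantization settings below are assumptions in the style of the QLoRA paper, not this card's exact training configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute,
# as described in the QLoRA paper (assumed settings, not the card's
# documented training config).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "soupstick/smollm3-qlora-ft",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Loading in 4-bit requires a CUDA GPU with bitsandbytes installed; the bf16 model above is the simpler option on CPU.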
Base model: HuggingFaceTB/SmolLM3-3B-Base