SmolVLM2-2.2B-chartqa-lora
I'm working through the Hugging Face Smol Fine-Tuning Language Models course. This is the artifact from Unit 3: a LoRA adapter that takes SmolVLM2-2.2B-Instruct and trains it to answer chart questions in the short, telegraphic style ChartQA expects, instead of the base model's chatty paragraphs. It's a modality breadth piece alongside the main text-LM portfolio arc in U1 and U2.
What the adapter learns
One thing, sharply: the chart-answer style. Given a chart and a question, respond with a single word, number, or short phrase. The base model knows it's looking at a chart and can describe one fluently, but its default register is a paragraph. SFT pulls that toward "Blue" or "2018" or "42".
The vision encoder is frozen. The adapter only touches the language model's attention and MLP projections (q/k/v/o/gate/up/down_proj), so the model can change how it talks about charts, not how it sees them. That deliberately bounds the work to the format shift, and it's the honest reason factual chart-reading accuracy is limited.
LoRA touches ~0.80% of parameters (18M of 2.26B). The rest is frozen.
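The 18M figure can be sanity-checked from LoRA shape arithmetic: a rank-r adapter on a d_in × d_out linear layer adds r × (d_in + d_out) weights. The sketch below assumes the text model has 24 layers, hidden size 2048, MLP width 8192, and full-width k/v projections. Those dimensions are assumptions, not something this card states; rank 16 comes from the hyperparameter table.

```python
# Assumed text-model dimensions (not stated in this card).
HIDDEN = 2048   # hidden size, assumed
MLP = 8192      # MLP intermediate size, assumed
LAYERS = 24     # number of decoder layers, assumed
RANK = 16       # LoRA rank, from the hyperparameter table


def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """Weights added by one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, MLP)    # gate_proj, up_proj
    + lora_params(MLP, HIDDEN)        # down_proj
)
trainable = per_layer * LAYERS
fraction = trainable / 2.26e9  # 2.26B total parameters, as quoted above

print(f"{trainable:,} trainable ({fraction:.2%} of 2.26B)")
# prints: 18,087,936 trainable (0.80% of 2.26B)
```

Under those assumed dimensions the count matches the trainable-parameter figure reported in the hyperparameter table.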
Before / after on a held-out chart
The question was "What does the blue line represent?" (a multi-line chart from ChartQA val). Reference answer: "Not too much/not at all".
Base model:
The blue line represents the share of people who say the U.S. takes the interests of their...
The base model treats it as a description task. It writes the start of a paragraph and runs to the generation budget.
Fine-tuned adapter:
Bush
Still wrong. But it's a single word and it stops. Same model, same prompt, same image. The LoRA is doing the format shift.
All 3 demo prompts are in generations_before.json and generations_after.json in this repo, with the reference labels.
How to use
import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image

# Load the base VLM in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct", dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolVLM2-2.2B-chartqa-lora")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

# Placeholder URL: point this at a real chart image.
image = load_image("https://path/to/chart.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the highest value in the chart?"},
]}]

# Render the chat template to text, then tokenize it together with the image.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

# Greedy decoding with a short budget; print only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
AutoProcessor.from_pretrained on a SmolVLM model needs both torchvision and num2words installed. Neither is pulled in as a dependency when you install transformers, so if you're not already in a VLM project you'll need to add them.
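A small pre-flight check can surface the missing packages before the processor load fails. This is a minimal sketch using only the standard library; the helper name missing_packages is made up for illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the packages from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# The two extras the processor needs, per the note above.
missing = missing_packages(["torchvision", "num2words"])
if missing:
    print("Install before loading the processor: pip install " + " ".join(missing))
else:
    print("Processor dependencies look present.")
```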
Training data
HuggingFaceM4/ChartQA, train[:10%] for training and val[:10%] for eval. That's 2,830 train pairs and 192 eval pairs of (chart image, question, short answer).
- Subsample. train[:10%], sized to fit a one-epoch cloud run in a sane budget. I'd train on more next time.
- System prompt. Frames the model as a chart-analysis specialist that answers in a single word, number, or short phrase. Matches the course's framing.
- Dataset format. prompt/completion, not messages. TRL forbids assistant_only_loss for VLMs, so the messages path computes loss over the whole sequence, image tokens included. The prompt/completion path builds a completion mask and trains only on the answer tokens. The smoke test caught this: messages-format initial loss was 10.33 (roughly uniform over the 49k vocab); prompt/completion initial loss was 0.78.
- Max sequence length. None. SmolVLM expands one image into ~81 image tokens (or up to ~1,400 with splitting), and any truncation would chop them. In TRL 1.2, the CLI --max_length -1 does not disable truncation; only max_length=None in Python does.
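The completion-only idea from the dataset-format point can be sketched in plain Python. This is not TRL's implementation, just the masking concept: prompt positions get the label -100, which PyTorch's cross-entropy ignores by default, so only answer tokens are scored. The token ids are toy values, and the vocab size is the rounded 49k quoted above.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def completion_only_labels(input_ids, prompt_len):
    """Copy input_ids as labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Toy sequence: 6 prompt tokens (image placeholders + question), 2 answer tokens.
input_ids = [101, 7, 7, 7, 55, 56, 900, 2]
labels = completion_only_labels(input_ids, prompt_len=6)
scored = sum(label != IGNORE_INDEX for label in labels)
print(labels, scored)
# prints: [-100, -100, -100, -100, -100, -100, 900, 2] 2

# Loss of a uniform guess over V tokens is ln(V); V is rounded to 49k here.
uniform_loss = math.log(49_000)
print(f"{uniform_loss:.2f}")
# prints: 10.80
```

An initial loss of 10.33 sits close to that uniform baseline, which is what you'd expect when most scored positions are image tokens the model was never trained to predict.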
Hyperparameters
LoRA on the language model only, vision encoder and connector frozen:
| Setting | Value |
|---|---|
| LoRA rank r | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (scoped to model.text_model.* only; vision encoder reuses the same suffixes, so a plain list would silently LoRA-adapt SigLIP too) |
| Trainable params | 18,087,936 (0.80% of 2.26B) |
| Effective batch | 8 (per_device=1 × grad_accum=8) |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Schedule | Cosine, 3% warmup |
| Epochs | 1 |
| Mixed precision | bf16 |
| Loss | Completion-only (prompt/completion dataset format) |
| Seed | 42 |
Training used TRL's SFTTrainer from a custom script. Source: github.com/tuggspeedman-ai/hf-smol-course, see notebooks/unit3/exercise_vlm_sft.py.
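The scoping note in the target-modules row can be illustrated with a regex. PEFT's LoraConfig accepts a string for target_modules and, as far as I know, full-matches it against module names. The module names below are hypothetical stand-ins for illustration, and the linked script may express the scoping differently.

```python
import re

SUFFIXES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Full-match pattern: anything under model.text_model ending in one of the suffixes.
TEXT_ONLY = r"model\.text_model\..*\.(" + "|".join(SUFFIXES) + r")"

# Hypothetical module names, for illustration only.
modules = [
    "model.text_model.layers.0.self_attn.q_proj",
    "model.text_model.layers.0.mlp.down_proj",
    "model.vision_model.encoder.layers.0.self_attn.q_proj",
    "model.connector.modality_projection.proj",
]

# A plain suffix list also catches the vision encoder's q_proj.
by_suffix = [m for m in modules if m.split(".")[-1] in SUFFIXES]
# The scoped regex keeps only the text-model projections.
by_regex = [m for m in modules if re.fullmatch(TEXT_ONLY, m)]

print(len(by_suffix), len(by_regex))
# prints: 3 2
```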
Results
| Metric | Start | End |
|---|---|---|
| train loss | 0.745 | 0.219 |
| train mean_token_accuracy | ~0.85 | ~0.92 |
| eval_loss | — | 0.692 |
| eval mean_token_accuracy | — | 0.817 |
Train-vs-eval gap (0.22 vs 0.69) is real. 1 epoch on 2,830 examples with rank-16 LoRA gets the answer style; it does not get strong factual chart reading.
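One way to read the gap is to convert each loss to a per-token perplexity with exp(loss). A rough sketch: the train number is the end-of-run logged value and the eval number is over the eval split, so the two aren't measured on identical footing.

```python
import math

train_loss, eval_loss = 0.219, 0.692  # end-of-run values from the table above

train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train {train_ppl:.2f}, eval {eval_ppl:.2f}")
# prints: train 1.24, eval 2.00
```

On held-out answers the model is, loosely, about as uncertain per answer token as a choice between two equally likely options.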
Hardware
| Item | Value |
|---|---|
| GPU | 1× NVIDIA A10G (HF Jobs flavor a10g-large) |
| Wall time | 78.9 min for 354 optimizer steps (~13.4 s/step) |
| Cost | ~$2 of A10G time |
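The step count and per-step time follow from numbers already in this card. A quick check, assuming the final partial batch counts as its own optimizer step:

```python
import math

train_examples = 2830        # train[:10%] of ChartQA
effective_batch = 1 * 8      # per_device batch x gradient accumulation
wall_minutes = 78.9

# Assumption: the last, smaller batch still triggers an optimizer step.
steps = math.ceil(train_examples / effective_batch)
seconds_per_step = wall_minutes * 60 / steps

print(steps, round(seconds_per_step, 1))
# prints: 354 13.4
```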
Honest limits
Scope was intentional. The text portfolio arc in U1 and U2 already covered the "published, preference-aligned small LM" goal I set for the course, so I treated U3 as a learning rep on the VLM stack rather than a serious chart-QA project. ~$2 of compute and ~80 minutes of wall time, enough to run the pipeline end-to-end, get a real behavioral change, and catch the recipe gotchas worth writing down. A half-day a100-large run at higher rank with the SigLIP encoder unfrozen would land a meaningfully better chart reader. I chose not to. The text arc is the portfolio; this is the breadth.
Within that scope: vision encoder frozen, one epoch on 2,830 examples. The model learned to answer in chart-QA style. It did not learn to read charts much better than the base model already could. On the held-out demos in this repo, 1 of 3 is factually right. The remaining gap would need either a longer run, more data, or unfreezing the SigLIP encoder (or a connector-tuning phase) so the visual representation itself can adapt.
Inherits the base model's biases and knowledge cutoff. Not safety-tuned for production use.
Links
- Code: github.com/tuggspeedman-ai/hf-smol-course, see notebooks/unit3/exercise_vlm_sft.py
- U1 (SFT, text): tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora
- U2 (DPO, text): tuggspeedman-ai/SmolLM3-3B-summarize-dpo-lora
- Course: HF smol fine-tuning course