SmolVLM2-2.2B-chartqa-lora

I'm working through the Hugging Face Smol Fine-Tuning Language Models course. This is the artifact from Unit 3: a LoRA adapter that trains SmolVLM2-2.2B-Instruct to answer chart questions in the short, telegraphic style ChartQA expects, instead of the base model's chatty paragraphs. It's a modality-breadth piece alongside the main text-LM portfolio arc in Units 1 and 2 (U1 and U2).

What the adapter learns

One thing, sharply: the chart-answer style. Given a chart and a question, respond with a single word, number, or short phrase. The base model knows it's looking at a chart and can describe one fluently, but its default register is a paragraph. SFT pulls that toward "Blue" or "2018" or "42".

The vision encoder is frozen. The adapter only touches the language model's attention and MLP projections (q/k/v/o/gate/up/down_proj), so the model can change how it talks about charts, not how it sees them. That bounds the work to the format shift on purpose, and it is the honest reason factual chart-reading accuracy is limited.

LoRA touches ~0.80% of parameters (18M of 2.26B). The rest is frozen.
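
The 18M figure can be reproduced by hand. A rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below assumes the text backbone has hidden size 2048, MLP width 8192, 24 decoder layers, and full multi-head attention (so k_proj and v_proj are also 2048 → 2048). Those dimensions are my reading of the SmolLM2-1.7B config, not something stated in this card; the rank of 16 is the one used for this adapter.

```python
# LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r).
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

R = 16          # LoRA rank used for this adapter
HIDDEN = 2048   # assumed hidden size of the text backbone
MLP = 8192      # assumed MLP intermediate size
LAYERS = 24     # assumed number of decoder layers

per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, R)  # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, MLP, R)   # gate_proj, up_proj
    + lora_params(MLP, HIDDEN, R)       # down_proj
)
trainable = per_layer * LAYERS
fraction = trainable / 2.26e9           # share of the 2.26B total parameters

print(trainable)          # 18087936
print(f"{fraction:.2%}")  # 0.80%
```

Under those assumptions the count lands exactly on the reported trainable-parameter figure, which is a useful check that no vision-encoder layers were adapted by accident.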

Before / after on a held-out chart

The question was "What does the blue line represent?" (a multi-line chart from ChartQA val). Reference answer: "Not too much/not at all".

Base model:

The blue line represents the share of people who say the U.S. takes the interests of their...

The base model treats it as a description task. It writes the start of a paragraph and runs to the generation budget.

Fine-tuned adapter:

Bush

Still wrong. But it's a single word and it stops. Same model, same prompt, same image. The LoRA is doing the format shift.

All 3 demo prompts are in generations_before.json and generations_after.json in this repo, with the reference labels.

How to use

import torch
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct", dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolVLM2-2.2B-chartqa-lora")
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

# One user turn: an image placeholder followed by the question.
image = load_image("https://path/to/chart.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the highest value in the chart?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)
# Greedy decoding; chart-style answers need only a few new tokens.
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])

AutoProcessor.from_pretrained on a SmolVLM model needs both torchvision and num2words installed. Neither is a transformers install dep, so if you're not already in a VLM project you'll need to add them.
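
A quick way to check for those two packages before loading the processor is sketched below. It uses only the standard library; the helper name `missing_packages` is my own and is not part of transformers.

```python
from importlib.util import find_spec

# Packages the SmolVLM processor needs beyond a plain transformers install.
REQUIRED = ("torchvision", "num2words")

def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages()
if missing:
    print("Install before loading the processor: pip install " + " ".join(missing))
```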

Training data

HuggingFaceM4/ChartQA, train[:10%] for training and val[:10%] for eval. That's 2,830 train pairs and 192 eval pairs of (chart image, question, short answer).

Subsample. train[:10%], sized to fit a one-epoch cloud run in a sane budget. I'd train on more next time.
System prompt. Frames the model as a chart-analysis specialist that answers in a single word, number, or short phrase. Matches the course's framing.
Dataset format. prompt / completion, not messages. TRL forbids assistant_only_loss for VLMs, so the messages path computes loss over the whole sequence, image tokens included. The prompt/completion path builds a completion mask and trains only on the answer tokens. The smoke test caught this: with the messages format the initial loss was 10.33 (close to the ~10.8 of a uniform guess over the 49k vocab); with prompt/completion it was 0.78.
Max sequence length. None. SmolVLM expands one image into ~81 image tokens (or up to ~1,400 with splitting), and any truncation would chop them. In TRL 1.2, the CLI --max_length -1 does not disable truncation; only max_length=None in Python does.
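
The completion-only idea from the dataset-format note can be sketched in a few lines. This is not TRL's implementation, only an illustration of the masking it produces: prompt positions get the label -100, which PyTorch's cross-entropy ignores by default, so only answer tokens contribute to the loss. The token ids are made up, and the helper name `completion_only_labels` is hypothetical. The last line computes the loss of a uniform guess over a 49k vocabulary for comparison with the smoke-test numbers.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default

def completion_only_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) where only completion tokens carry loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Made-up ids: 81 image placeholder tokens, a 3-token question, a 2-token answer.
prompt_ids = [7] * 81 + [101, 102, 103]
completion_ids = [555, 2]

input_ids, labels = completion_only_labels(prompt_ids, completion_ids)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(f"{supervised} of {len(labels)} positions carry loss")

# Cross-entropy of a uniform guess over a ~49k vocabulary, in nats.
uniform_loss = math.log(49_000)
print(f"uniform baseline: {uniform_loss:.2f}")
```

With the whole sequence supervised, most of the loss comes from image and prompt positions the model has no reason to predict, which is why the messages path started near the uniform baseline.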

Hyperparameters

LoRA on the language model only, vision encoder and connector frozen:


LoRA rank `r`	16
LoRA `alpha`	32
LoRA dropout	0.05
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` (scoped to `model.text_model.*` only; vision encoder reuses the same suffixes, so a plain list would silently LoRA-adapt SigLIP too)
Trainable params	18,087,936 (0.80% of 2.26B)
Effective batch	8 (per_device=1 × grad_accum=8)
Optimizer	AdamW
Learning rate	1e-4
Schedule	Cosine, 3% warmup
Epochs	1
Mixed precision	bf16
Loss	Completion-only (prompt/completion dataset format)
Seed	42
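
The scoping note in the Target modules row can be illustrated without loading the model. PEFT matches a plain list of target names by module-name suffix, whereas a single string is treated as a regex that must match the full module name. The module names below are illustrative placeholders (real names depend on the transformers version), and the regex is one possible way to scope to the text model, not necessarily the pattern used in the training script.

```python
import re

SUFFIXES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Illustrative module names only; check model.named_modules() for the real ones.
module_names = [
    "model.text_model.layers.0.self_attn.q_proj",
    "model.text_model.layers.0.mlp.down_proj",
    "model.vision_model.encoder.layers.0.self_attn.q_proj",
    "model.connector.modality_projection.proj",
]

def matches_suffix_list(name: str) -> bool:
    """Suffix matching, as applied to a plain list of target names."""
    return any(name == s or name.endswith("." + s) for s in SUFFIXES)

# Full-match regex restricted to modules under model.text_model.
SCOPED = re.compile(r"model\.text_model\..*\.(" + "|".join(SUFFIXES) + r")")

def matches_scoped(name: str) -> bool:
    return SCOPED.fullmatch(name) is not None

for name in module_names:
    print(f"{name}: list={matches_suffix_list(name)} scoped={matches_scoped(name)}")
```

The vision-encoder q_proj is picked up by the plain list but rejected by the scoped pattern, which is the failure mode the table row warns about.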

Training used TRL's SFTTrainer from a custom script. Source: github.com/tuggspeedman-ai/hf-smol-course, see notebooks/unit3/exercise_vlm_sft.py.
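
For a feel of what "Cosine, 3% warmup" means at this run length, here is a standalone sketch of a linear-warmup, cosine-decay schedule over the 354 optimizer steps. It mirrors the common shape of that schedule rather than calling the trainer's scheduler; rounding the warmup up to 11 steps and decaying all the way to zero are my assumptions.

```python
import math

PEAK_LR = 1e-4
TOTAL_STEPS = 354                             # optimizer steps in this run
WARMUP_STEPS = math.ceil(0.03 * TOTAL_STEPS)  # 11 steps; rounding up is an assumption

def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps (linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

With so few steps, the warmup covers only about the first 88 examples, so nearly the whole epoch runs on the decaying part of the curve.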

Results

Metric	Start	End
train loss	0.745	0.219
train mean_token_accuracy	~0.85	~0.92
eval_loss	—	0.692
eval mean_token_accuracy	—	0.817

The train-vs-eval loss gap (0.22 vs 0.69) is real. One epoch on 2,830 examples with rank-16 LoRA gets the answer style; it does not get strong factual chart reading.

Hardware


GPU	1× NVIDIA A10G (HF Jobs flavor `a10g-large`)
Wall time	78.9 min for 354 optimizer steps (~13.4 s/step)
Cost	~$2 of A10G time
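
The step count and per-step time follow directly from the numbers already given (2,830 training examples, effective batch 8, 78.9 minutes of wall time). A short check:

```python
import math

train_examples = 2830
effective_batch = 1 * 8   # per-device batch 1 x gradient accumulation 8

# One epoch; the last partial batch still counts as an optimizer step.
steps = math.ceil(train_examples / effective_batch)
wall_seconds = 78.9 * 60
sec_per_step = wall_seconds / steps

print(steps)                   # 354
print(f"{sec_per_step:.1f}")   # 13.4
```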

Honest limits

Scope was intentional. The text portfolio arc in U1 and U2 already covered the "published, preference-aligned small LM" goal I set for the course, so I treated U3 as a learning rep on the VLM stack rather than a serious chart-QA project. ~$2 of compute and ~80 minutes of wall time, enough to run the pipeline end-to-end, get a real behavioral change, and catch the recipe gotchas worth writing down. A half-day a100-large run at higher rank with the SigLIP encoder unfrozen would land a meaningfully better chart reader. I chose not to. The text arc is the portfolio; this is the breadth.

Within that scope: vision encoder frozen, one epoch on 2,830 examples. The model learned to answer in chart-QA style. It did not learn to read charts much better than the base model already could. On the held-out demos in this repo, 1 of 3 is factually right. Closing the remaining gap would need a longer run, more data, or unfreezing the SigLIP encoder (or a connector-tuning phase) so the visual representation itself can adapt.

Inherits the base model's biases and knowledge cutoff. Not safety-tuned for production use.
