	---
	language:
	- en
	license: apache-2.0
	base_model: Qwen/Qwen2-VL-2B-Instruct
	datasets:
	- HuggingFaceM4/ChartQA
	tags:
	- vision-language
	- chart-qa
	- qwen2-vl
	- unsloth
	- multimodal
	---

	# The Orange Problem — Qwen2-VL-2B ChartQA

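	A fine-tuned version of Qwen/Qwen2-VL-2B-Instruct, trained on ChartQA with LoRA via Unsloth. The base weights were kept in fp16 rather than 4-bit, so this is plain LoRA rather than QLoRA.
	The model answers questions about charts with a number or a short phrase.
	The LoRA weights are fully merged into the base model, so no adapter loading is required.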

	Training used Unsloth with the TRL SFTTrainer on two Kaggle T4 GPUs and took 155 minutes.

	---

	## Results (300 test samples)

	| Metric | Base Model | Fine-Tuned | Delta (percentage points) |
	|---|---|---|---|
	| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33 |
	| Exact Match | 50.00% | 54.67% | +4.67 |
	| ROUGE-L | 54.74% | 59.17% | +4.42 |

	Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct if it matches the reference string exactly or, for numeric answers, lies within 5% of the reference value.
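
The rule above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the evaluation script used for the table: the helper names `_to_float`, `relaxed_match`, and `relaxed_accuracy` are hypothetical, and the case-insensitive comparison and the stripping of commas and `%` before numeric parsing are assumptions.

```python
def _to_float(text):
    """Parse a numeric answer; return None if the text is not a number.

    Assumption: commas and a trailing '%' are ignored before parsing.
    """
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction, target, tolerance=0.05):
    """True on an exact string match, or a numeric match within `tolerance`."""
    # Assumption: string comparison ignores case and surrounding whitespace.
    if prediction.strip().lower() == target.strip().lower():
        return True
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is None or target_num is None:
        return False
    if target_num == 0.0:
        # Relative error is undefined for a zero target, so require equality.
        return pred_num == 0.0
    return abs(pred_num - target_num) / abs(target_num) <= tolerance


def relaxed_accuracy(predictions, targets):
    """Fraction of predictions that satisfy relaxed_match."""
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

Under these assumptions, a prediction of `"102"` against a reference of `"100"` counts as correct (2% relative error), while `"110"` does not.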

	---

	## Quickstart

	Install the dependencies (`accelerate` is required for `device_map="auto"`):

	pip install transformers accelerate qwen-vl-utils torch

	```python
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info
	from PIL import Image
	import torch

	MODEL_ID = "Sriram1806/NLP_Orange_Problem-How_I_Met_You"

	# The checkpoint is fully merged, so it loads like any Qwen2-VL model.
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    MODEL_ID,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained(MODEL_ID)


	def ask_chart(image, question):
	    """Ask a question about a chart image and return the extracted answer."""
	    msgs = [
	        {"role": "system", "content": [{"type": "text", "text": (
	            "You are a chart analysis assistant. "
	            "When given a chart and a question, briefly identify the relevant data, "
	            "then output your final answer after '<ANSWER>:'. "
	            "The answer must be a number, percentage, or short phrase only."
	        )}]},
	        {"role": "user", "content": [
	            {"type": "image", "image": image},
	            {"type": "text", "text": (
	                "Look at the chart carefully.\n"
	                f"Question: {question}\n\n"
	                "Briefly identify the relevant data point, then give your answer as:\n"
	                "<ANSWER>: [value]"
	            )},
	        ]},
	    ]
	    # Render the chat template as text and collect the image inputs.
	    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
	    image_inputs, _ = process_vision_info(msgs)
	    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)

	    # Greedy decoding; temperature/top_p are unset to avoid sampling-config warnings.
	    with torch.inference_mode():
	        out = model.generate(**inputs, max_new_tokens=96, do_sample=False,
	                             temperature=None, top_p=None)

	    # Drop the prompt tokens and decode only the newly generated ones.
	    gen = out[0][inputs["input_ids"].shape[-1]:]
	    raw = processor.decode(gen, skip_special_tokens=True).strip()

	    # Return the text after the answer marker, or the raw output if it is missing.
	    return raw.split("<ANSWER>:")[-1].strip() if "<ANSWER>:" in raw else raw


	image = Image.open("your_chart.png")
	print(ask_chart(image, "What is the value in 2020?"))
	```

	---

	## Training Details

	| Parameter | Value |
	|---|---|
	| Base model | Qwen/Qwen2-VL-2B-Instruct |
	| Dataset | HuggingFaceM4/ChartQA |
	| Training samples | 6,000 |
	| Epochs | 2 |
	| Steps completed | 750 |
	| Train time | 155.1 min |
	| LoRA rank | 32 |
	| LoRA alpha | 32 |
	| LoRA dropout | 0.0 |
	| LoRA targets | Language attention + MLP layers |
	| Learning rate | 2e-4 |
	| LR schedule | Cosine |
	| Effective batch size | 16 (2 per device × 8 grad accum) |
	| Precision | float16 |
	| Max sequence length | 640 tokens |
	| Optimizer | AdamW 8-bit |
	| Warmup steps | 50 |
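
As a sanity check, the step count in the table follows from the sample count, epochs, and effective batch size. The sketch below reproduces that arithmetic and shows the usual shape of a cosine schedule with linear warmup. It is a stand-alone illustration rather than the trainer's own code: the function name `lr_at` is hypothetical, and decaying all the way to zero at the final step is an assumption.

```python
import math

# Values taken from the table above.
TRAIN_SAMPLES = 6_000
EPOCHS = 2
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
PEAK_LR = 2e-4
WARMUP_STEPS = 50

# Effective batch size as stated in the table: 2 per device x 8 accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then half-cosine decay.

    Assumption: the rate decays to zero at `total_steps`.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

With these numbers, 6,000 samples at 16 per step give 375 steps per epoch and 750 steps over two epochs, which matches the "Steps completed" row.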

	### Key Decisions

	fp16 instead of 4-bit: Unsloth's compiled `VisionMlp_forward` triggers a bitsandbytes
	`Linear4bit` recursion error when the ViT MLP blocks are quantised. Loading the model in fp16 avoids this.
	Two T4s (31 GB total) hold the 2B model comfortably.
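
A rough estimate shows why fp16 weights fit easily. The sketch assumes a nominal 2 billion parameters (the exact count of the checkpoint is not stated here) and counts weights only, so activations, LoRA parameters, and optimizer state come on top. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NOMINAL_PARAMS = 2_000_000_000  # assumption: "2B" taken at face value

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 2)    # 2 bytes per fp16 weight
int4_gib = weight_memory_gib(NOMINAL_PARAMS, 0.5)  # 4 bits per quantised weight

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
```

Under that assumption the fp16 weights take under 4 GiB, a small fraction of the 31 GB available across the two GPUs.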

	`finetune_vision_layers=False`: ViT MLP blocks cannot be safely LoRA-wrapped alongside
	Unsloth gradient checkpointing when quantised. Language layers are fine-tuned instead.

	CoT prompt with `<ANSWER>:` marker: The model emits a brief reasoning step before a
	structured answer tag, which makes the final answer easier to extract reliably at inference time.

	`normalise_answer` before Exact Match (EM) scoring: Strips commas and % before comparison so
	"44%" matches "44" and "1,000" matches "1000".