Alan Joshua

	---
	license: mit
	base_model: unsloth/Qwen2-VL-2B-Instruct
	tags:
	- vision-language-model
	- chart-qa
	- qwen2-vl
	- lora
	- finetuned
	- unsloth
	datasets:
	- weijiezz/chartqa_split_test
	language:
	- en
	pipeline_tag: image-text-to-text
	---

	# 📊 alan-vlm — ChartQA Vision Language Model

	A finetuned version of Qwen2-VL 2B Instruct trained to answer natural language questions about charts and graphs.

	Finetuned on the [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) using [Unsloth](https://github.com/unslothai/unsloth) on a free Google Colab T4 GPU.

	---

	## 🧠 Model Details

	| Property | Value |
	|---|---|
	| Base Model | Qwen2-VL-2B-Instruct |
	| Finetuning Method | LoRA (r=8, alpha=8) |
	| Training Data | 2,000 chart QA pairs |
	| Training Steps | 500 |
	| Batch Size | 8 (2 per device × 4 gradient accumulation) |
	| Trainable Parameters | 9,232,384 (0.42% of total) |
	| Precision | fp16 |
	| Hardware | Google Colab T4 (15 GB VRAM) |
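The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and that "Training Steps" counts optimizer steps; the implied total parameter count is only approximate because the 0.42% figure is rounded.

```python
# Figures copied from the Model Details table.
per_device_batch = 2
grad_accum_steps = 4
training_steps = 500
dataset_size = 2_000
trainable_params = 9_232_384
trainable_fraction = 0.0042  # 0.42% as reported (rounded)

# Effective batch size: samples contributing to each optimizer step
# (assumption: one GPU, so no data-parallel multiplier).
effective_batch = per_device_batch * grad_accum_steps

# Samples processed over the whole run, and the number of passes over the data.
samples_seen = effective_batch * training_steps
epochs = samples_seen / dataset_size

# Total parameter count implied by the rounded percentage (approximate).
implied_total_params = trainable_params / trainable_fraction

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen} (~{epochs:g} epochs)")
print(f"implied total parameters: ~{implied_total_params / 1e9:.2f}B")
```

Under these assumptions the run covers the 2,000 pairs about twice.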

	---

	## 🚀 Quick Start

	```python
	from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
	from PIL import Image
	import torch

	# Load the model and processor
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "alanjoshua2005/alan-vlm",
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

	# Run inference on one image/question pair
	def ask(image_path, question):
	    image = Image.open(image_path).convert("RGB")

	    # The {"type": "image"} entry marks where the image goes in the prompt
	    messages = [{"role": "user", "content": [
	        {"type": "image"},
	        {"type": "text", "text": question},
	    ]}]

	    text_prompt = processor.apply_chat_template(
	        messages,
	        add_generation_prompt=True,
	        tokenize=False,
	    )

	    inputs = processor(
	        text=text_prompt,
	        images=image,
	        return_tensors="pt",
	    )
	    # Move inputs to the model's device instead of hard-coding "cuda"
	    inputs = {k: v.to(model.device) for k, v in inputs.items()}

	    with torch.no_grad():
	        output = model.generate(**inputs, max_new_tokens=64)

	    # Drop the prompt tokens so only the generated answer is decoded
	    input_len = inputs["input_ids"].shape[1]
	    return processor.decode(output[0][input_len:], skip_special_tokens=True)

	# Example
	answer = ask("chart.png", "What is the value of the highest bar?")
	print(answer)
	```

	---

	## 🎛️ Gradio Demo

	```python
	import gradio as gr
	from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
	from PIL import Image
	import torch

	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "alanjoshua2005/alan-vlm",
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

	def answer_chart_question(image, question):
	    # Guard against missing inputs (question may be None or blank)
	    if image is None or not question or not question.strip():
	        return "Please provide both an image and a question."
	    image = image.convert("RGB")
	    messages = [{"role": "user", "content": [
	        {"type": "image"},
	        {"type": "text", "text": question},
	    ]}]
	    text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
	    inputs = processor(text=text_prompt, images=image, return_tensors="pt")
	    # Move inputs to the model's device instead of hard-coding "cuda"
	    inputs = {k: v.to(model.device) for k, v in inputs.items()}
	    with torch.no_grad():
	        output = model.generate(**inputs, max_new_tokens=64)
	    # Decode only the newly generated tokens
	    input_len = inputs["input_ids"].shape[1]
	    return processor.decode(output[0][input_len:], skip_special_tokens=True)

	gr.Interface(
	    fn=answer_chart_question,
	    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
	    outputs=gr.Textbox(label="Answer"),
	    title="📊 ChartQA - alan-vlm",
	).launch()
	```

	---

	## 📦 Dataset

	Trained on [weijiezz/chartqa_split_test](https://huggingface.co/datasets/weijiezz/chartqa_split_test), a 2,000-row dataset of chart images paired with questions and answers. It contains two types of questions:

	- `human_test` — questions written by human annotators
	- `augmented_test` — questions generated via data augmentation
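Before finetuning, each row has to be turned into a chat-style exchange. The following is a minimal sketch of that conversion, not the exact preprocessing used for this model: the column names `image`, `query`, and `label` are assumptions for illustration (check the dataset viewer for the real schema), and the message layout mirrors the one used in the Quick Start prompt.

```python
def to_conversation(sample, image_key="image", question_key="query", answer_key="label"):
    """Turn one dataset row into a user/assistant chat exchange.

    The default column names are hypothetical; adjust them to the actual schema.
    """
    answer = sample[answer_key]
    # Some exports may store the answer as a one-element list; unwrap it if so.
    if isinstance(answer, (list, tuple)):
        answer = answer[0]
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": sample[image_key]},
                {"type": "text", "text": sample[question_key]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": str(answer)},
            ]},
        ]
    }

# Toy row; a real row would hold a PIL image rather than a placeholder string.
row = {"image": "<PIL.Image placeholder>", "query": "What is the highest value?", "label": ["42"]}
example = to_conversation(row)
print(example["messages"][1]["content"][0]["text"])
```

Keeping the conversion in a small pure function makes it easy to test on a handful of rows before mapping it over the whole dataset.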

	---

	## 🏋️ Training Details

	Training was done using Unsloth for optimized LoRA finetuning:

	```python
	from unsloth import FastVisionModel

	model, tokenizer = FastVisionModel.from_pretrained(
	    "unsloth/Qwen2-VL-2B-Instruct",
	    load_in_4bit=True,
	)

	model = FastVisionModel.get_peft_model(
	    model,
	    finetune_vision_layers=True,
	    finetune_language_layers=True,
	    finetune_attention_modules=True,
	    finetune_mlp_modules=True,
	    r=8,
	    lora_alpha=8,
	    lora_dropout=0,
	    bias="none",
	    use_gradient_checkpointing="unsloth",
	    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
	                    "gate_proj", "up_proj", "down_proj"],
	)
	```
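To give a feel for what `r=8` and `lora_alpha=8` mean, here is a small sketch of the standard LoRA bookkeeping: each adapted linear layer gains two low-rank matrices, and their product is scaled by `lora_alpha / r`. The 1024×1024 layer shape is a hypothetical value chosen for illustration, not a dimension taken from Qwen2-VL.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

r, lora_alpha = 8, 8  # values from the configuration above

# Hypothetical square projection, used only to illustrate the ratio.
d_in = d_out = 1024
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)

print(f"scaling: {lora_scaling(lora_alpha, r)}")
print(f"adapter params: {added} vs {full} frozen ({added / full:.2%})")
```

With `lora_alpha` equal to `r`, the scaling factor is 1, so the low-rank update is added to the frozen weights unscaled.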

	---

	## ⚠️ Limitations

	- Trained on only 2,000 samples — a learning/experimental project
	- May struggle with complex multi-series charts or heavily annotated graphs
	- Not evaluated on the full ChartQA benchmark yet
	- Best suited for simple bar, pie, and line chart questions
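For anyone who wants to run an evaluation, ChartQA results are usually reported as relaxed accuracy: numeric answers count as correct within a 5% tolerance, and other answers need an exact match. The sketch below is one possible implementation under those assumptions (the helper names and the case-insensitive string comparison are choices made here, not part of an official scorer), and it should be applied to examples that were not used for training.

```python
def _to_float(text):
    """Parse a number from an answer string, tolerating '%' and thousands separators."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None

def relaxed_match(prediction, target, tolerance=0.05):
    """True if a numeric prediction is within `tolerance` (relative) of the target,
    or if a non-numeric prediction matches the target case-insensitively."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()

def relaxed_accuracy(predictions, targets):
    """Fraction of prediction/target pairs that satisfy relaxed_match."""
    pairs = list(zip(predictions, targets))
    return sum(relaxed_match(p, t) for p, t in pairs) / len(pairs)

# Toy predictions and gold answers, made up for illustration.
preds = ["41", "Blue", "1,200", "7"]
golds = ["42", "blue", "1000", "7.0"]
print(relaxed_accuracy(preds, golds))
```

In the toy data, three of the four pairs match; `"1,200"` against `"1000"` is off by 20% and is counted as wrong.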

	---

	## 🙏 Acknowledgements

	- [Unsloth](https://github.com/unslothai/unsloth) for making VLM finetuning feasible on free Colab GPUs
	- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) by Alibaba Cloud
	- [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) by weijiezz