# Qwen3‑VL‑8B ChartQA (LoRA)
## Overview
This repository contains a **Qwen3‑VL‑8B‑Instruct** vision‑language model fine‑tuned to answer questions about charts and plots, focusing on concise numerical or short textual answers.
Fine‑tuning was performed via **LoRA** using the human‑annotated subset of the [HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset (train split, `human_or_machine = human`).
Typical behavior:
- Input: an image of a bar chart and the question `What is the value of the blue bar in 2018?`
  Output: `24`
- Input: an image of a line chart and the question `In which year does the orange line reach its maximum?`
  Output: `2015`
- Input: an image of a pie chart and the question `What percentage corresponds to Sales?`
  Output: `38%`
The LoRA adapter was trained with [LLaMA‑Factory](https://github.com/hiyouga/LLaMA-Factory) on top of `Qwen/Qwen3-VL-8B-Instruct` and can be loaded either as a standard Transformers adapter or merged into the base weights.
## Base model
- **Base**: `Qwen/Qwen3-VL-8B-Instruct`
- **Architecture**: multimodal vision‑language model, ~8.8B parameters
- **Intended use**: instruction following and visual question answering (images + text)
## Training details
- **Framework**: LLaMA‑Factory (supervised fine‑tuning with LoRA)
- **Fine‑tuning type**: LoRA on transformer linear layers; vision tower and projector frozen
- **Dataset**: `HuggingFaceM4/ChartQA` (train split, only human‑authored QA pairs)
- **Task**: single‑turn chart question answering (chart image + question → short answer)
- **Input format**: Qwen3‑VL chat template with `<|im_start|>user` / `<|im_start|>assistant` and `<|vision_start|>…<|vision_end|>` tokens; answers taken as the first label (`label[0]`) of each sample
- **Number of training examples**: 7,398 human‑annotated samples
- **Max sequence length**: 2048 tokens
- **Epochs**: 3
- **Batch / gradient accumulation**: effective batch size 64 (multi‑GPU + gradient accumulation)
- **Learning rate**: 5e‑5 (AdamW with scheduler)
- **Precision**: mixed precision (FP16 / bfloat16) with gradient checkpointing
- **Trainable parameters**: ~21.8M LoRA parameters (≈0.25% of 8.79B total)
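
The template described above can be sketched as a plain string. For real inference, prefer the processor's `apply_chat_template`, which inserts the image placeholder tokens itself; the `<|image_pad|>` token below follows Qwen's published VL templates and is illustrative only:

```python
def build_prompt(question: str) -> str:
    """Approximate the single-turn training prompt described in this card.

    Illustrative only: in practice use processor.apply_chat_template, which
    expands the image placeholder to the correct number of vision tokens.
    """
    return (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"  # the chart image goes here
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("What is the value of the blue bar in 2018?"))
```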
Final training loss was around **0.32** after 3 epochs (~10.6M tokens seen), indicating a strong fit to ChartQA while updating only a small set of LoRA weights.
For best results:
- Provide a single chart image and a clear question in one turn.
- Use `temperature=0.0–0.2` and `max_new_tokens` of around 16–64.
- Expect short answers (numbers, years, category names) rather than long explanations.
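
Those settings can be combined into a small helper. This is a sketch that assumes an already loaded model and processor, with greedy decoding standing in for temperature 0; the `answer_chart_question` name is hypothetical and not part of this repo:

```python
from PIL import Image


def answer_chart_question(model, processor, image_path: str, question: str) -> str:
    """Ask one question about one chart image and return the short answer."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # Deterministic decoding with a short budget, per the tips above
    out = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()
```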
## Limitations
- The model is specialized for **chart question answering** and is not a general‑purpose assistant.
- It may struggle with non‑chart images, highly stylized plots, or layouts very different from those in ChartQA.
- Numerical and logical reasoning quality is bounded by the underlying Qwen3‑VL‑8B base model; manually verify answers before using them in analytical or reporting workflows.