# Qwen3‑VL‑8B ChartQA (LoRA)
## Overview
This repository contains a **Qwen3‑VL‑8B‑Instruct** vision‑language model fine‑tuned to answer questions about charts and plots, focusing on concise numerical or short textual answers.
Fine‑tuning was performed via **LoRA** using the human‑annotated subset of the [HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset (train split, `human_or_machine = human`).
Typical behavior:
- Input: an image of a bar chart and the question `What is the value of the blue bar in 2018?`
  Output: `24`
- Input: an image of a line chart and the question `In which year does the orange line reach its maximum?`
  Output: `2015`
- Input: an image of a pie chart and the question `What percentage corresponds to Sales?`
  Output: `38%`
The LoRA adapter was trained with [LLaMA‑Factory](https://github.com/hiyouga/LLaMA-Factory) on top of `Qwen/Qwen3-VL-8B-Instruct` and can be loaded either as a standard Transformers adapter or merged into the base weights.
## Base model
- **Base**: `Qwen/Qwen3-VL-8B-Instruct`
- **Architecture**: multimodal vision‑language model, ~8.8B parameters
- **Intended use**: instruction following and visual question answering (images + text)
## Training details
- **Framework**: LLaMA‑Factory (supervised fine‑tuning with LoRA)
- **Fine‑tuning type**: LoRA on transformer linear layers; vision tower and projector frozen
- **Dataset**: `HuggingFaceM4/ChartQA` (train split, only human‑authored QA pairs)
- **Task**: single‑turn chart question answering (chart image + question → short answer)
- **Input format**: Qwen3‑VL chat template with `<|im_start|>user` / `<|im_start|>assistant` and `<|vision_start|>…<|vision_end|>` tokens; answers taken as the first label (`label[0]`) of each sample
- **Number of training examples**: 7,398 human‑annotated samples
- **Max sequence length**: 2048 tokens
- **Epochs**: 3
- **Batch / gradient accumulation**: effective batch size 64 (multi‑GPU + gradient accumulation)
- **Learning rate**: 5e‑5 (AdamW with scheduler)
- **Precision**: mixed precision (FP16 / bfloat16) with gradient checkpointing
- **Trainable parameters**: ~21.8M LoRA parameters (≈0.25% of 8.79B total)
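
The template described above can be sketched as a plain string. For real inference, prefer the processor's `apply_chat_template`, which inserts the image placeholder tokens itself; the `<|image_pad|>` token below follows Qwen's published VL templates and is illustrative only:

```python
def build_prompt(question: str) -> str:
    """Approximate the single-turn training prompt described in this card.

    Illustrative only: in practice use processor.apply_chat_template, which
    expands the image placeholder to the correct number of vision tokens.
    """
    return (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"  # the chart image goes here
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("What is the value of the blue bar in 2018?"))
```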
Final training loss was around **0.32** after 3 epochs (~10.6M tokens seen), indicating a strong fit to ChartQA while updating only a small set of LoRA weights.
For best results:
- Provide a single chart image and a clear question in one turn.
- Use `temperature=0.0–0.2` and `max_new_tokens` of around 16–64.
- Expect short answers (numbers, years, category names) rather than long explanations.
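
Those settings can be combined into a small helper. This is a sketch that assumes an already loaded model and processor, with greedy decoding standing in for temperature 0; the `answer_chart_question` name is hypothetical and not part of this repo:

```python
from PIL import Image


def answer_chart_question(model, processor, image_path: str, question: str) -> str:
    """Ask one question about one chart image and return the short answer."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # Deterministic decoding with a short budget, per the tips above
    out = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()
```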
## Limitations
- The model is specialized for **chart question answering** and is not a general‑purpose assistant.
- It may struggle with non‑chart images, highly stylized plots, or layouts very different from those in ChartQA.
- Numerical and logical reasoning quality is bounded by the underlying Qwen3‑VL‑8B base model; manually verify answers before using them in analytical or reporting workflows.