# Qwen3‑VL‑8B ChartQA (LoRA)

## Overview

This repository contains a **Qwen3‑VL‑8B‑Instruct** vision‑language model fine‑tuned to answer questions about charts and plots, with a focus on concise numerical or short textual answers.
Fine‑tuning was performed via **LoRA** using the human‑annotated subset of the [HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset (train split, `human_or_machine = human`).
7
+
8
+ Typical behavior:
9
+
10
+ - Input: an image of a bar chart and the question `What is the value of the blue bar in 2018?`
11
+ Output: `24`
12
+
13
+ - Input: an image of a line chart and the question `In which year does the orange line reach its maximum?`
14
+ Output: `2015`
15
+
16
+ - Input: an image of a pie chart and the question `What percentage corresponds to Sales?`
17
+ Output: `38%`
18
+
19
+ The LoRA adapter was trained with [LLaMA‑Factory](https://github.com/hiyouga/LLaMA-Factory) on top of `Qwen/Qwen3-VL-8B-Instruct` and can be loaded either as a standard Transformers adapter or merged into the base weights.
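Both loading modes can be sketched as below. This is a minimal sketch, not a verified recipe: the adapter repository id is a placeholder, and the `AutoModelForImageTextToText` / `AutoProcessor` classes assume a recent Transformers release with Qwen3‑VL support.

```python
BASE_ID = "Qwen/Qwen3-VL-8B-Instruct"
ADAPTER_ID = "<this-adapter-repo>"  # placeholder: replace with this repository's id


def load_chartqa_model(merge: bool = False):
    """Return (model, processor); merge=True folds the LoRA weights into the base."""
    # Imports are deferred so the helper can be imported without heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(BASE_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        BASE_ID, torch_dtype="auto", device_map="auto"
    )
    # Attach the LoRA adapter on top of the frozen base weights.
    model = PeftModel.from_pretrained(model, ADAPTER_ID)
    if merge:
        # merge_and_unload() bakes the adapter in and returns a plain model,
        # so PEFT is no longer needed at inference time.
        model = model.merge_and_unload()
    return model, processor
```

Merging is convenient for deployment (no PEFT dependency); keeping the adapter separate is convenient while iterating on fine‑tunes.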
## Base model

- **Base**: `Qwen/Qwen3-VL-8B-Instruct`
- **Architecture**: multimodal vision‑language model, ~8.8B parameters
- **Intended use**: instruction following and visual question answering (images + text)
## Training details

- **Framework**: LLaMA‑Factory (supervised fine‑tuning with LoRA)
- **Fine‑tuning type**: LoRA on the transformer's linear layers; vision tower and projector frozen
- **Dataset**: `HuggingFaceM4/ChartQA` (train split, human‑authored QA pairs only)
- **Task**: single‑turn chart question answering (chart image + question → short answer)
- **Input format**: Qwen3‑VL chat template with `<|im_start|>user` / `<|im_start|>assistant` and `<|vision_start|>…<|vision_end|>` tokens; the answer for each sample is its first label (`label[0]`)
- **Training examples**: 7,398 human‑annotated samples
- **Max sequence length**: 2048 tokens
- **Epochs**: 3
- **Batch / gradient accumulation**: effective batch size 64 (multi‑GPU + gradient accumulation)
- **Learning rate**: 5e‑5 (AdamW with a scheduler)
- **Precision**: mixed precision (FP16 / BF16) with gradient checkpointing
- **Trainable parameters**: ~21.8M LoRA parameters (≈0.25% of the 8.79B total)

The final training loss was around **0.32** after 3 epochs (~10.6M tokens seen), indicating a strong fit on ChartQA while updating only a small set of LoRA weights.
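The settings above roughly correspond to a LLaMA‑Factory SFT config like the following. This is a reconstruction, not the actual training file: key names follow LLaMA‑Factory's YAML conventions, but the template name, dataset name, and the per‑device/accumulation split (shown here as one way to reach an effective batch of 64) are assumptions.

```yaml
# Hypothetical LLaMA-Factory config approximating the run described above.
model_name_or_path: Qwen/Qwen3-VL-8B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all                      # LoRA on linear layers
freeze_vision_tower: true             # vision tower and projector stay frozen

dataset: chartqa_human                # assumed local dataset name (human subset)
cutoff_len: 2048
num_train_epochs: 3.0

per_device_train_batch_size: 8        # 8 x 8 accumulation = effective batch 64
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
bf16: true

output_dir: saves/qwen3-vl-8b-chartqa-lora
```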
For best results:

- Provide a single chart image and a clear question in one turn.
- Use a `temperature` in the 0.0–0.2 range and `max_new_tokens` around 16–64.
- Expect short answers (numbers, years, category names) rather than long explanations.
## Limitations

- The model is specialized for **chart question answering** and is not a general‑purpose assistant.
- It may struggle with non‑chart images, highly stylized plots, or layouts very different from those in ChartQA.
- Numerical and logical reasoning quality is bounded by the underlying Qwen3‑VL‑8B model; answers used in analytical or reporting workflows should be verified manually.