zhengshuli
Claude
committed on
Commit · ffaa307
1 Parent(s): b209339
Update README and add architecture diagram
- Update README.md with new content
- Add assets/arch.png via Git LFS
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- .gitattributes +1 -0
- README.md +6 -0
- assets/arch.png +3 -0
.gitattributes
CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
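The effect of the added `*.png` rule can be illustrated with a minimal Python sketch. It assumes a simplified matcher: only slash-free patterns are handled, and each is compared against the file name alone. Real gitattributes matching follows gitignore-style rules, so this is an approximation and not Git's implementation. The helper names `lfs_patterns` and `is_lfs_tracked` are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The four rules visible in this hunk after the change.
GITATTRIBUTES = """\
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Return the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue  # skip blank lines, comments, and bare patterns
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(path, patterns):
    """Simplified check (assumption): slash-free patterns match the file name only."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, p) for p in patterns if "/" not in p)


patterns = lfs_patterns(GITATTRIBUTES)
png_tracked = is_lfs_tracked("assets/arch.png", patterns)
readme_tracked = is_lfs_tracked("README.md", patterns)
```

Under these assumptions, `assets/arch.png` is routed through LFS while `README.md` is not, which is consistent with the commit adding the image as an LFS object.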
README.md
CHANGED
@@ -21,6 +21,12 @@ datasets:
 
 QZhou-Flowchart-VL-32B is a state-of-the-art multimodal large language model specifically designed for flowchart understanding and reasoning. This model is post-trained from Qwen2.5-VL-32B using reinforcement learning with our proposed **Tri-CoT** (Three-stage Chain-of-Thought) reasoning structure.
 
+We introduce a full training pipeline for multimodal large language models focused on flowchart understanding, covering data construction, structured reasoning, and reinforcement learning. The pipeline first synthesizes diverse flowcharts using automated topic generation, JSON-based structural representations, Graphviz-style drawing, and optional visual style control, followed by creating corresponding VQA pairs. To enable grounded reasoning, we propose Tri-CoT, a three-stage chain-of-thought format that separates diagram parsing, logical inference, and final answer generation. Unlike traditional CoT, the model is required to output a full JSON representation of the image before reasoning, ensuring accurate extraction of nodes, edges, and branching conditions. During post-training, we compare supervised fine-tuning with Group Relative Policy Optimization, showing that pure RL yields stronger structured reasoning and reduces hallucinations while preserving general capabilities.
+
+<div align="center">
+<img src="assets/arch.png" width="740" height="320"></img>
+</div>
+
 **Key Features:**
 - 🏆 Achieves **87.83%** on QZhou-Flowchart-QA-Benchmark, outperforming GPT-5, Gemini-2.5-Pro, and other SOTA models
 - 🧠 Structured reasoning with Tri-CoT: JSON extraction → Logical thinking → Final answer
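The three Tri-CoT stages described in the added README paragraph (JSON extraction, logical thinking, final answer) can be sketched roughly as follows. The JSON schema, the example flowchart, and the `next_node` helper are all hypothetical: the README only states that the JSON covers nodes, edges, and branching conditions. In the model, all three stages are generated text. Here stages two and three are plain Python, purely to show why an explicit structure makes the answer checkable.

```python
import json

# Stage 1 (hypothetical output): a JSON description of the diagram.
# The field names below are illustrative assumptions, not the model's schema.
flowchart_json = """
{
  "nodes": [
    {"id": "start", "label": "Start"},
    {"id": "check", "label": "Is stock available?"},
    {"id": "ship", "label": "Ship order"},
    {"id": "backorder", "label": "Place backorder"}
  ],
  "edges": [
    {"from": "start", "to": "check"},
    {"from": "check", "to": "ship", "condition": "Yes"},
    {"from": "check", "to": "backorder", "condition": "No"}
  ]
}
"""


def next_node(chart, node_id, condition=None):
    """Follow the edge leaving node_id whose condition equals the given one."""
    for edge in chart["edges"]:
        if edge["from"] == node_id and edge.get("condition") == condition:
            return edge["to"]
    return None  # no matching outgoing edge


chart = json.loads(flowchart_json)
labels = {node["id"]: node["label"] for node in chart["nodes"]}

# Stage 2: reason over the extracted structure.
# Example question: "What happens if stock is not available?"
target = next_node(chart, "check", "No")

# Stage 3: report the final answer as the label of the node reached.
answer = labels[target]
```

Because the answer is derived from the extracted edges and not from the image directly, an error in stage one shows up as a wrong or missing edge that can be inspected.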
assets/arch.png
ADDED