dung-vpt-uney committed on
Commit 78a732a · 1 Parent(s): 98b73d7

Update Visual-CoT demo - 2025-10-12 22:56:46


Fixes:
- Fix LLaVA config registration error (compatibility with newer transformers; see the sketch after this list)
- Update Gradio to latest version (security fixes)
- Auto-deployed via update script
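The registration error named in the first fix typically arises because newer transformers releases ship a built-in `llava` model type, so a LLaVA fork's own `AutoConfig.register("llava", ...)` call collides with it and raises `ValueError`. Below is a minimal sketch of the kind of guard that resolves this; the import path follows the upstream LLaVA repository layout and is an assumption, not code taken from this commit's diff:

```python
# Hypothetical guard; the Space's actual fix may differ from this sketch.
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed import path, mirroring the upstream LLaVA repository layout.
from llava.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM

try:
    # Newer transformers releases already register a built-in "llava" model
    # type, so an unconditional register() raises ValueError here.
    AutoConfig.register("llava", LlavaConfig)
    AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
except ValueError:
    # "llava" is already registered; keep the existing built-in mapping.
    pass
```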

Files changed (2)
  1. README.md +18 -25
  2. app.py +14 -14
README.md CHANGED
@@ -23,7 +23,7 @@ datasets:
 - deepcs233/Visual-CoT
 ---
 
-# 🌋 Visual-CoT: Chain-of-Thought Reasoning Demo
+# Visual-CoT: Chain-of-Thought Reasoning Demo
 
 <div align="center">
 
@@ -34,33 +34,33 @@ datasets:
 
 </div>
 
-## 📖 About
+## About
 
 **Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
 
-- 🎯 **Identify key regions** in images using bounding boxes
-- 💭 **Reason step-by-step** with visual grounding
-- 💡 **Answer complex questions** about visual content
+- **Identify key regions** in images using bounding boxes
+- **Reason step-by-step** with visual grounding
+- **Answer complex questions** about visual content
 
-## 🎯 Key Features
+## Key Features
 
-### 📊 Dataset
+### Dataset
 - **438K** question-answer pairs with bounding box annotations
 - **13 diverse benchmarks** spanning multiple visual reasoning tasks
 - **High-quality annotations** from expert annotators
 
-### 🏗️ Model Architecture
+### Model Architecture
 - Based on **LLaVA-1.5** with custom visual reasoning pipeline
 - **CLIP ViT-L/14** vision encoder
 - **Two-step reasoning**: ROI detection → Question answering
 
-### 🚀 Demo Features
+### Demo Features
 - **Interactive playground**: Upload your own images
 - **Benchmark explorer**: Browse evaluation examples
 - **Visual explanations**: See detected regions with bounding boxes
 - **Zero GPU**: Powered by Hugging Face's efficient GPU allocation
 
-## 🎨 How to Use
+## How to Use
 
 ### Interactive Demo
 1. **Upload an image** or choose an example
@@ -75,7 +75,7 @@ datasets:
 - See model performance on diverse visual reasoning tasks
 - Compare ground truth with model predictions
 
-## 📊 Performance
+## Performance
 
 | Benchmark | Detection Acc | Answer Acc | Overall |
 |-----------|--------------|------------|---------|
@@ -117,30 +117,23 @@ If you find our work useful, please cite:
 }
 ```
 
-## 🔗 Resources
+## Resources
 
-- 📄 **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
-- 💻 **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
-- 🤗 **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
-- 🌐 **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
+- **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
+- **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
+- **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
+- **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
 
-## ⚖️ License
+## License
 
 - **Code**: Apache License 2.0
 - **Dataset**: Research use only
 - **Model**: Subject to LLaMA model license
 
-## 🙏 Acknowledgements
+## Acknowledgements
 
 This work builds upon:
 - [LLaVA](https://github.com/haotian-liu/LLaVA)
 - [Shikra](https://github.com/shikras/shikra)
 - [Vicuna](https://github.com/lm-sys/FastChat)
 - [CLIP](https://github.com/openai/CLIP)
-
----
-
-<div align="center">
-Made with ❤️ by the Visual-CoT Team
-</div>
-
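For context on the README's "Two-step reasoning: ROI detection → Question answering" bullet, here is a minimal sketch of that loop. `generate` is a hypothetical stand-in for the demo's actual model call, and the ROI prompt wording follows the Visual-CoT paper rather than this Space's code:

```python
def viscot_answer(generate, image, question):
    """Two-step VisCoT inference sketch: localize first, then answer."""
    # Step 1 (ROI detection): ask the model which region of the image matters.
    roi_prompt = (
        f"{question} Please provide the bounding box coordinate of the "
        "region that can help you answer the question better."
    )
    bbox = generate(image, roi_prompt)  # e.g. "[0.12, 0.34, 0.56, 0.78]"

    # Step 2 (question answering): re-ask, grounded on the detected region.
    answer = generate(image, f"{question} Focus on the region {bbox}.")
    return bbox, answer
```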
app.py CHANGED
@@ -593,23 +593,23 @@ def create_demo():
 │         Visual-CoT Pipeline         │
 ├─────────────────────────────────────┤
 │                                     │
-│  📸 Image Input
+│  📸 Image Input                     │
 │        ↓                            │
 │  🔍 CLIP ViT-L/14 (Vision Encoder)  │
 │        ↓                            │
-│  🔗 MLP Projector (2-layer)
-│        ↓
-│  🧠 LLaMA/Vicuna (Language Model)
-│        ↓
-│  ┌──────────────┐
-│  │ Step 1: ROI  │ → Bounding Box
-│  └──────────────┘
-│        ↓
-│  ┌──────────────┐
-│  │ Step 2: QA   │ → Final Answer
-│  └──────────────┘
-
-└────────────────────────────────────┘
+│  🔗 MLP Projector (2-layer)         │
+│        ↓                            │
+│  🧠 LLaMA/Vicuna (Language Model)   │
+│        ↓                            │
+│  ┌──────────────┐                   │
+│  │ Step 1: ROI  │ → Bounding Box    │
+│  └──────────────┘                   │
+│        ↓                            │
+│  ┌──────────────┐                   │
+│  │ Step 2: QA   │ → Final Answer    │
+│  └──────────────┘                   │
+│                                     │
+└─────────────────────────────────────┘
 ```
 
 ---
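The repaired diagram also summarizes the model wiring. A hedged PyTorch sketch of that data flow follows; the 2-layer GELU projector and the 1024→4096 dimensions (CLIP ViT-L/14 features into a 7B Vicuna) mirror LLaVA-1.5 defaults and are assumptions, not code from app.py:

```python
import torch
import torch.nn as nn

class PipelineSketch(nn.Module):
    """Illustrative wiring of the diagram: image → CLIP → MLP projector → LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP ViT-L/14
        self.projector = nn.Sequential(       # the "MLP Projector (2-layer)"
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                        # e.g. LLaMA/Vicuna

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_encoder(pixel_values)  # image → patch features
        visual_tokens = self.projector(patch_feats)      # project into LLM space
        # The LLM consumes visual tokens ahead of the text, then emits
        # Step 1 (bounding box) and Step 2 (final answer) autoregressively.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```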