Spaces: Running on Zero

dung-vpt-uney committed · Commit 78a732a · Parent(s): 98b73d7

Update Visual-CoT demo - 2025-10-12 22:56:46

Fixes:
- Fix LLaVA config registration error (compatibility with newer transformers)
- Update Gradio to latest version (security fixes)
- Auto-deployed via update script
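For context on the first fix: newer transformers releases ship a built-in `llava` model type, so a Space that vendors LLaVA and re-registers that name at import time hits a registration error. Below is a minimal sketch of the usual guard, assuming the upstream LLaVA module layout; the import path and class names are assumptions about this Space, not the commit's actual code.

```python
# Sketch of a guarded LLaVA registration for newer transformers, which
# already claim the "llava" model type. Import path and class names follow
# the upstream LLaVA repo and are assumptions, not a quote from the commit.
from transformers import AutoConfig, AutoModelForCausalLM

from llava.model.language_model.llava_llama import (
    LlavaConfig,
    LlavaLlamaForCausalLM,
)

def register_llava_safely() -> None:
    """Register the vendored LLaVA classes unless transformers already
    owns the "llava" name (the case with recent releases)."""
    try:
        AutoConfig.register("llava", LlavaConfig)
        AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
    except ValueError:
        # "llava" is already registered by transformers itself; skip the
        # re-registration and use the vendored classes directly.
        pass
```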
README.md CHANGED

@@ -23,7 +23,7 @@ datasets:
   - deepcs233/Visual-CoT
 ---
 
-# 
+# Visual-CoT: Chain-of-Thought Reasoning Demo
 
 <div align="center">

@@ -34,33 +34,33 @@ datasets:
 
 </div>
 
-## 
+## About
 
 **Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
 
-- 
-- 
-- 
+- **Identify key regions** in images using bounding boxes
+- **Reason step-by-step** with visual grounding
+- **Answer complex questions** about visual content
 
-## 
+## Key Features
 
-### 
+### Dataset
 - **438K** question-answer pairs with bounding box annotations
 - **13 diverse benchmarks** spanning multiple visual reasoning tasks
 - **High-quality annotations** from expert annotators
 
-### 
+### Model Architecture
 - Based on **LLaVA-1.5** with custom visual reasoning pipeline
 - **CLIP ViT-L/14** vision encoder
 - **Two-step reasoning**: ROI detection → Question answering
 
-### 
+### Demo Features
 - **Interactive playground**: Upload your own images
 - **Benchmark explorer**: Browse evaluation examples
 - **Visual explanations**: See detected regions with bounding boxes
 - **Zero GPU**: Powered by Hugging Face's efficient GPU allocation
 
-## 
+## How to Use
 
 ### Interactive Demo
 1. **Upload an image** or choose an example

@@ -75,7 +75,7 @@ datasets:
 - See model performance on diverse visual reasoning tasks
 - Compare ground truth with model predictions
 
-## 
+## Performance
 
 | Benchmark | Detection Acc | Answer Acc | Overall |
 |-----------|--------------|------------|---------|

@@ -117,30 +117,23 @@ If you find our work useful, please cite:
 }
 ```
 
-## 
+## Resources
 
-- 
-- 
-- 
-- 
+- **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
+- **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
+- **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
+- **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
 
-## 
+## License
 
 - **Code**: Apache License 2.0
 - **Dataset**: Research use only
 - **Model**: Subject to LLaMA model license
 
-## 
+## Acknowledgements
 
 This work builds upon:
 - [LLaVA](https://github.com/haotian-liu/LLaVA)
 - [Shikra](https://github.com/shikras/shikra)
 - [Vicuna](https://github.com/lm-sys/FastChat)
 - [CLIP](https://github.com/openai/CLIP)
-
----
-
-<div align="center">
-Made with ❤️ by the Visual-CoT Team
-</div>
-
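The updated README's "Two-step reasoning: ROI detection → Question answering" bullet describes inference as two chained generation calls: the model first proposes a bounding box for the relevant region, then answers the question with that region in context. The following is a rough sketch of that control flow, assuming a generic `model.generate_text` wrapper and normalized `[x0, y0, x1, y1]` boxes; both are assumptions, not the demo's actual API.

```python
import re
from PIL import Image

def viscot_answer(model, image: Image.Image, question: str) -> dict:
    """Two-step VisCoT-style inference: ask for a region, then answer
    with the crop in context. `model.generate_text` is a stand-in for
    whatever generation wrapper the demo actually uses."""
    # Step 1: ROI detection -- ask the model where to look.
    roi_prompt = (
        f"{question} Please provide the bounding box coordinate of the "
        "region that can help you answer the question better."
    )
    roi_text = model.generate_text(images=[image], prompt=roi_prompt)

    # Parse "[x0, y0, x1, y1]" (assumed normalized to 0-1) from the response;
    # fall back to the full image if no box is found.
    match = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", roi_text)
    x0, y0, x1, y1 = (
        tuple(float(v) for v in match.groups()) if match else (0.0, 0.0, 1.0, 1.0)
    )

    w, h = image.size
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

    # Step 2: QA -- answer using both the full image and the detected region.
    answer = model.generate_text(images=[image, crop], prompt=question)
    return {"bbox": (x0, y0, x1, y1), "answer": answer}
```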
app.py CHANGED

@@ -593,23 +593,23 @@ def create_demo():
 │         Visual-CoT Pipeline         │
 ├─────────────────────────────────────┤
 │                                     │
-│  📸 Image Input
+│  📸 Image Input                    \│
 │         ↓                           │
 │  🔍 CLIP ViT-L/14 (Vision Encoder)  │
 │         ↓                           │
-│  🔗 MLP Projector (2-layer)
-│         ↓
-│  🧠 LLaMA/Vicuna (Language Model)
-│         ↓
-│  ┌──────────────┐
-│  │ Step 1: ROI  │ → Bounding Box
-│  └──────────────┘
-│         ↓
-│  ┌──────────────┐
-│  │ Step 2: QA   │ → Final Answer
-│  └──────────────┘
-│
-
+│  🔗 MLP Projector (2-layer)         │
+│         ↓                           │
+│  🧠 LLaMA/Vicuna (Language Model)   │
+│         ↓                           │
+│  ┌──────────────┐                   │
+│  │ Step 1: ROI  │ → Bounding Box    │
+│  └──────────────┘                   │
+│         ↓                           │
+│  ┌──────────────┐                   │
+│  │ Step 2: QA   │ → Final Answer    │
+│  └──────────────┘                   │
+│                                     │
+└─────────────────────────────────────┘
 ```
 
 ---