Spaces: Running on Zero

dung-vpt-uney committed · Commit 78a732a · Parent(s): 98b73d7

Update Visual-CoT demo - 2025-10-12 22:56:46

Fixes:
- Fix LLaVA config registration error (compatibility with newer transformers)
- Update Gradio to latest version (security fixes)
- Auto-deployed via update script
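For context on the first fix: newer transformers releases ship a built-in `llava` model type, so a Space that vendors LLaVA and re-registers that name at import time hits a registration error. Below is a minimal sketch of the usual guard, assuming the upstream LLaVA module layout; the import path and class names are assumptions about this Space, not the commit's actual code.

```python
# Sketch of a guarded LLaVA registration for newer transformers, which
# already claim the "llava" model type. Import path and class names follow
# the upstream LLaVA repo and are assumptions, not a quote from the commit.
from transformers import AutoConfig, AutoModelForCausalLM

from llava.model.language_model.llava_llama import (
    LlavaConfig,
    LlavaLlamaForCausalLM,
)

def register_llava_safely() -> None:
    """Register the vendored LLaVA classes unless transformers already
    owns the "llava" name (the case with recent releases)."""
    try:
        AutoConfig.register("llava", LlavaConfig)
        AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
    except ValueError:
        # "llava" is already registered by transformers itself; skip the
        # re-registration and use the vendored classes directly.
        pass
```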
README.md CHANGED

@@ -23,7 +23,7 @@ datasets:
   - deepcs233/Visual-CoT
 ---
 
-# 
+# Visual-CoT: Chain-of-Thought Reasoning Demo
 
 <div align="center">

@@ -34,33 +34,33 @@ datasets:
 
 </div>
 
-## 
+## About
 
 **Visual Chain-of-Thought (VisCoT)** advances multi-modal language models by introducing a comprehensive dataset and benchmark for chain-of-thought reasoning. Our model can:
 
-- 
-- 
-- 
+- **Identify key regions** in images using bounding boxes
+- **Reason step-by-step** with visual grounding
+- **Answer complex questions** about visual content
 
-## 
+## Key Features
 
-### 
+### Dataset
 - **438K** question-answer pairs with bounding box annotations
 - **13 diverse benchmarks** spanning multiple visual reasoning tasks
 - **High-quality annotations** from expert annotators
 
-### 
+### Model Architecture
 - Based on **LLaVA-1.5** with custom visual reasoning pipeline
 - **CLIP ViT-L/14** vision encoder
 - **Two-step reasoning**: ROI detection → Question answering
 
-### 
+### Demo Features
 - **Interactive playground**: Upload your own images
 - **Benchmark explorer**: Browse evaluation examples
 - **Visual explanations**: See detected regions with bounding boxes
 - **Zero GPU**: Powered by Hugging Face's efficient GPU allocation
 
-## 
+## How to Use
 
 ### Interactive Demo
 1. **Upload an image** or choose an example

@@ -75,7 +75,7 @@ datasets:
 - See model performance on diverse visual reasoning tasks
 - Compare ground truth with model predictions
 
-## 
+## Performance
 
 | Benchmark | Detection Acc | Answer Acc | Overall |
 |-----------|--------------|------------|---------|

@@ -117,30 +117,23 @@ If you find our work useful, please cite:
 }
 ```
 
-## 
+## Resources
 
-- 
-- 
-- 
-- 
+- **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
+- **Code**: [GitHub Repository](https://github.com/deepcs233/Visual-CoT)
+- **Dataset**: [Hugging Face Dataset](https://huggingface.co/datasets/deepcs233/Visual-CoT)
+- **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
 
-## 
+## License
 
 - **Code**: Apache License 2.0
 - **Dataset**: Research use only
 - **Model**: Subject to LLaMA model license
 
-## 
+## Acknowledgements
 
 This work builds upon:
 - [LLaVA](https://github.com/haotian-liu/LLaVA)
 - [Shikra](https://github.com/shikras/shikra)
 - [Vicuna](https://github.com/lm-sys/FastChat)
 - [CLIP](https://github.com/openai/CLIP)
-
----
-
-<div align="center">
-Made with ❤️ by the Visual-CoT Team
-</div>
-
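The updated README's "Two-step reasoning: ROI detection → Question answering" bullet describes inference as two chained generation calls: the model first proposes a bounding box for the relevant region, then answers the question with that region in context. The following is a rough sketch of that control flow, assuming a generic `model.generate_text` wrapper and normalized `[x0, y0, x1, y1]` boxes; both are assumptions, not the demo's actual API.

```python
import re
from PIL import Image

def viscot_answer(model, image: Image.Image, question: str) -> dict:
    """Two-step VisCoT-style inference: ask for a region, then answer
    with the crop in context. `model.generate_text` is a stand-in for
    whatever generation wrapper the demo actually uses."""
    # Step 1: ROI detection -- ask the model where to look.
    roi_prompt = (
        f"{question} Please provide the bounding box coordinate of the "
        "region that can help you answer the question better."
    )
    roi_text = model.generate_text(images=[image], prompt=roi_prompt)

    # Parse "[x0, y0, x1, y1]" (assumed normalized to 0-1) from the response;
    # fall back to the full image if no box is found.
    match = re.search(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]", roi_text)
    x0, y0, x1, y1 = (
        tuple(float(v) for v in match.groups()) if match else (0.0, 0.0, 1.0, 1.0)
    )

    w, h = image.size
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

    # Step 2: QA -- answer using both the full image and the detected region.
    answer = model.generate_text(images=[image, crop], prompt=question)
    return {"bbox": (x0, y0, x1, y1), "answer": answer}
```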
app.py CHANGED

@@ -593,23 +593,23 @@ def create_demo():
 │         Visual-CoT Pipeline         │
 ├─────────────────────────────────────┤
 │                                     │
-│  📸 Image Input
+│  📸 Image Input                    \│
 │         ↓                           │
 │  🔍 CLIP ViT-L/14 (Vision Encoder)  │
 │         ↓                           │
-│  🔗 MLP Projector (2-layer)
-│         ↓
-│  🧠 LLaMA/Vicuna (Language Model)
-│         ↓
-│  ┌──────────────┐
-│  │ Step 1: ROI  │ → Bounding Box
-│  └──────────────┘
-│         ↓
-│  ┌──────────────┐
-│  │ Step 2: QA   │ → Final Answer
-│  └──────────────┘
-│
-
+│  🔗 MLP Projector (2-layer)         │
+│         ↓                           │
+│  🧠 LLaMA/Vicuna (Language Model)   │
+│         ↓                           │
+│  ┌──────────────┐                   │
+│  │ Step 1: ROI  │ → Bounding Box    │
+│  └──────────────┘                   │
+│         ↓                           │
+│  ┌──────────────┐                   │
+│  │ Step 2: QA   │ → Final Answer    │
+│  └──────────────┘                   │
+│                                     │
+└─────────────────────────────────────┘
 ```
 
 ---