Spaces: dung-vpt-uney
Running on Zero

Commit f39b78a (1 parent: ba64608), committed by dung-vpt-uney
Update Visual-CoT demo - 2025-10-12 23:23:25

Fixes:
- Fix LLaVA config registration error (compatibility with newer transformers)
- Update Gradio to latest version (security fixes)
- Auto-deployed via update script
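The registration fix itself is not visible in the hunks below. As a rough idea of the kind of guard that resolves this error, here is a minimal sketch, assuming the Space registers the upstream LLaVA classes by hand (the `llava.model` import path and class names come from the original LLaVA repository, not from this commit):

```python
# Hypothetical sketch (not the commit's actual code): newer transformers releases
# ship their own "llava" model type, so re-registering the original LLaVA classes
# raises ValueError. Catching it keeps the demo working on old and new versions.
from transformers import AutoConfig, AutoModelForCausalLM
from llava.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM

try:
    AutoConfig.register("llava", LlavaConfig)
    AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
except ValueError:
    # "llava" is already registered (built into newer transformers); safe to skip.
    pass
```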
app.py CHANGED

@@ -162,7 +162,7 @@ def load_benchmark_example(dataset_name, index=0):
 
     dataset_path = BENCHMARK_DATASETS.get(dataset_name)
    if not dataset_path:
-        return None, "Dataset not found", "", ""
+        return None, "Dataset not found", "", "", ""
 
     # Load dataset
     dataset = load_dataset(dataset_path, split="train")
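The extra empty string matches the five output components wired up later in this commit: in Gradio, every return path of a handler must yield exactly one value per component listed in `outputs`. A minimal, self-contained illustration of that contract (the component and function names below are invented for the example, not taken from app.py):

```python
import gradio as gr

def load_example(dataset_name, index):
    # Error path and success path both return 5 values: one per output component.
    if dataset_name not in ("GQA", "TextVQA"):   # hypothetical dataset registry
        return None, "Dataset not found", "", "", ""
    return None, f"Question #{int(index)}", "[0, 0, 1, 1]", "Answer", "Loaded"

with gr.Blocks() as demo:
    name = gr.Dropdown(["GQA", "TextVQA"], value="GQA", label="Dataset")
    idx = gr.Number(value=0, precision=0, label="Example Index")
    image = gr.Image(type="pil", label="Image")
    question = gr.Textbox(label="Question")
    bbox = gr.Textbox(label="Ground Truth Bounding Box")
    answer = gr.Textbox(label="Ground Truth Answer")
    status = gr.Textbox(label="Status")
    gr.Button("Load Example").click(
        fn=load_example,
        inputs=[name, idx],
        outputs=[image, question, bbox, answer, status],
    )
```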
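For context on the explorer added in the next hunk: it relies on a module-level `BENCHMARK_DATASETS` mapping and the `datasets` call shown in the first hunk. Neither the mapping's contents nor the example field names appear in this diff, so the repo IDs below are placeholders that only illustrate the loading pattern:

```python
from datasets import load_dataset

# Placeholder repo IDs -- the real mapping lives in app.py and is not shown in this diff.
BENCHMARK_DATASETS = {
    "GQA": "your-org/viscot-gqa",
    "TextVQA": "your-org/viscot-textvqa",
}

def peek_example(dataset_name, index=0):
    dataset_path = BENCHMARK_DATASETS.get(dataset_name)
    if not dataset_path:
        return None
    # Same call as in the first hunk; returns a datasets.Dataset indexable by row.
    dataset = load_dataset(dataset_path, split="train")
    return dataset[int(index)]
```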
@@ -640,45 +640,117 @@ def create_demo():
         gr.Markdown("""
         ### Explore Visual-CoT Benchmark Examples
 
+        Load and browse real examples from the Visual-CoT benchmark datasets.
+        Each example includes: image, question, ground-truth bounding box, and answer.
         """)
 
         with gr.Row():
+            with gr.Column(scale=2):
+                dataset_dropdown = gr.Dropdown(
+                    choices=list(BENCHMARK_DATASETS.keys()),
+                    value="GQA",
+                    label="Select Benchmark Dataset",
+                    info="Choose from 5 core benchmarks"
+                )
+            with gr.Column(scale=1):
+                example_index = gr.Number(
+                    value=0,
+                    label="Example Index",
+                    precision=0,
+                    minimum=0,
+                )
 
+        with gr.Row():
+            load_btn = gr.Button("Load Example", variant="primary")
+            prev_btn = gr.Button("◀ Previous")
+            next_btn = gr.Button("Next ▶")
+
+        benchmark_status = gr.Textbox(
+            label="Status",
+            value="Select a dataset and click 'Load Example'",
+            interactive=False,
         )
 
+        with gr.Row():
+            with gr.Column():
+                gr.Markdown("#### Image")
+                benchmark_image = gr.Image(
+                    label="Input Image",
+                    type="pil",
+                    height=400,
+                )
+
+            with gr.Column():
+                gr.Markdown("#### Annotations")
+                benchmark_question = gr.Textbox(
+                    label="Question",
+                    lines=2,
+                    interactive=False,
+                )
+                benchmark_bbox = gr.Textbox(
+                    label="Ground Truth Bounding Box",
+                    lines=1,
+                    interactive=False,
+                )
+                benchmark_answer = gr.Textbox(
+                    label="Ground Truth Answer",
+                    lines=3,
+                    interactive=False,
+                )
+
+        gr.Markdown("""
+        ---
+
+        ### Dataset Information
+
+        1. **GQA** - Scene graph question answering with compositional reasoning
+        2. **TextVQA** - Questions requiring reading and understanding text in images
+        3. **DocVQA** - Document understanding and information extraction
+        4. **Visual7W** - Visual question answering with pointing and telling tasks
+        5. **Flickr30k** - Image captioning and visual grounding
+
+        **Note:** Examples are loaded directly from the [Visual-CoT Hugging Face Collection](https://huggingface.co/collections/tuandunghcmut/visual-chain-of-thought-reasoning-benchmarks-68e25b22c3c095c6f87baba0).
         """)
 
+        # Event handlers
+        def load_and_update(dataset_name, index):
+            result = load_benchmark_example(dataset_name, int(index))
+            if len(result) == 5:
+                return result
+            else:
+                # Error case
+                return None, result, "", "", ""
+
+        def increment_index(current_index):
+            return int(current_index) + 1
+
+        def decrement_index(current_index):
+            return max(0, int(current_index) - 1)
+
+        load_btn.click(
+            fn=load_and_update,
+            inputs=[dataset_dropdown, example_index],
+            outputs=[benchmark_image, benchmark_question, benchmark_bbox, benchmark_answer, benchmark_status],
+        )
+
+        next_btn.click(
+            fn=increment_index,
+            inputs=[example_index],
+            outputs=[example_index],
+        ).then(
+            fn=load_and_update,
+            inputs=[dataset_dropdown, example_index],
+            outputs=[benchmark_image, benchmark_question, benchmark_bbox, benchmark_answer, benchmark_status],
+        )
+
+        prev_btn.click(
+            fn=decrement_index,
+            inputs=[example_index],
+            outputs=[example_index],
+        ).then(
+            fn=load_and_update,
+            inputs=[dataset_dropdown, example_index],
+            outputs=[benchmark_image, benchmark_question, benchmark_bbox, benchmark_answer, benchmark_status],
         )
 
         # ============================================================
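The Next/Previous wiring above relies on Gradio event chaining: the first event writes the new index back into the `example_index` component, and `.then()` runs a second event afterwards, which therefore reads the already-updated index. A stripped-down sketch of the same pattern (component and function names here are illustrative, not from app.py):

```python
import gradio as gr

def bump(index):
    # Step 1: advance the index; the returned value updates the Number component.
    return int(index) + 1

def load(index):
    # Step 2: runs only after step 1 finishes, so it sees the incremented index.
    return f"Loaded example {int(index)}"

with gr.Blocks() as demo:
    idx = gr.Number(value=0, precision=0, label="Example Index")
    status = gr.Textbox(label="Status")
    next_btn = gr.Button("Next ▶")
    next_btn.click(fn=bump, inputs=idx, outputs=idx).then(fn=load, inputs=idx, outputs=status)
```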
@@ -704,31 +776,137 @@ def create_demo():
 
 ## Model Architecture
 
-Visual-CoT Pipeline:
+### Components
 
+1. **Vision Encoder**: CLIP ViT-L/14
+   - Input resolution: 224px or 336px
+   - Output: 577 visual tokens (336px) or 196 tokens (224px)
+   - Feature dimension: 1024
+
+2. **Multi-modal Projector**: 2-layer MLP with GELU
+   - Maps vision features (1024D) to LLM embedding space (4096D)
+   - Trainable parameters: ~8.4M
+
+3. **Language Model**: Vicuna v1.5 (instruction-tuned LLaMA)
+   - Variants: 7B or 13B parameters
+   - Context length: 2048 tokens
+   - Base: LLaMA architecture
+
+### Multi-Turn Processing Pipeline
+
+```
+Image + Question
 ↓
+[Turn 1] ROI Detection
+  → Outputs: Bounding box coordinates [x1, y1, x2, y2]
+  → Purpose: Identify key regions for reasoning
 ↓
+[Turn 2] Question Answering
+  → Input: Image + Question + Detected bbox
+  → Output: Final answer grounded in visual evidence
 ```
 
 ---
 
+## Training Strategy
+
+### Stage 1: Feature Alignment (Pretrain)
+
+- **Dataset**: 558K LAION-CC-SBU subset with BLIP captions
+- **Objective**: Connect frozen CLIP encoder to frozen LLM
+- **Trainable**: Only the MLP projector (~8.4M params)
+- **Duration**: 3.5 hours (7B) to 5.5 hours (13B) on 8×A100 GPUs
+- **Hyperparameters**:
+  - Batch size: 256
+  - Learning rate: 1e-3
+  - Epochs: 1
+  - Max sequence length: 2048
+
+### Stage 2: Visual Instruction Tuning
+
+- **Dataset Mix**:
+  - 665K multimodal instruction-following (LLaVA-1.5)
+  - 1.4M positional annotation data (Shikra)
+  - 373K Visual-CoT data (ours)
+  - **Total**: ~2.4M training instances
+
+- **Training Details**:
+  - Duration: ~60 hours (7B-224) on 8×A100 GPUs
+  - Batch size: 128
+  - Learning rate: 2e-5 (backbone), 2e-6 (vision encoder)
+  - Epochs: 1
+  - DeepSpeed ZeRO-3 for memory efficiency
+
+---
+
+## Dataset Construction
+
+### Visual-CoT Dataset (438K examples)
+
+**13 Diverse Benchmarks:**
+
+1. **Document Understanding** (4 datasets):
+   - DocVQA: Document visual QA
+   - InfographicsVQA: Infographic comprehension
+   - DUDE: Document understanding
+   - SROIE: Scanned receipt information extraction
+
+2. **Scene Understanding** (3 datasets):
+   - GQA: Scene graph compositional reasoning
+   - Visual7W: Pointing and telling tasks
+   - VSR: Visual spatial reasoning
+
+3. **Text in Images** (2 datasets):
+   - TextVQA: Reading text in natural images
+   - OCR-VQA: OCR-based question answering
+
+4. **General VQA** (2 datasets):
+   - Visual Genome: Dense annotations
+   - COCO: Common objects in context
+
+5. **Specialized** (2 datasets):
+   - CUB: Fine-grained bird classification
+   - Flickr30k: Image captioning & grounding
+
+**Annotation Details:**
+- Each example includes: image, question, answer, bounding box
+- Bounding boxes highlight key regions essential for reasoning
+- 98K examples have detailed reasoning steps
+- Train/val splits maintained from original benchmarks
+
+---
+
+## Evaluation & Results
+
+### Visual-CoT Benchmark Metrics
+
+1. **Answer Accuracy**: GPT-3.5-based evaluation
+   - Compares generated answer with ground truth
+   - Accounts for semantic equivalence
+   - Results: 82.7% average accuracy
+
+2. **Detection Accuracy**: IoU-based bounding box evaluation
+   - IoU > 0.5 threshold for correct detection
+   - Results: 75.3% detection accuracy
+   - Validates spatial grounding ability
+
+3. **Reasoning Quality**: Chain-of-thought coherence
+   - Multi-turn consistency
+   - Interpretability of intermediate steps
+
+### Model Comparison
+
+| Model | Resolution | Params | Answer Acc | Detection Acc |
+|-------|-----------|---------|-----------|---------------|
+| VisCoT-7B-224 | 224px | 7B | 80.1% | 72.5% |
+| VisCoT-7B-336 | 336px | 7B | 81.8% | 74.2% |
+| VisCoT-13B-224 | 224px | 13B | 81.5% | 73.8% |
+| VisCoT-13B-336 | 336px | 13B | 82.7% | 75.3% |
 
-- **Resolutions**: 224px and 336px
+**Trade-offs:**
+- Higher resolution → Better detail recognition, slower inference
+- Larger model → Better reasoning, more memory
+- 336px + 13B = Best quality but highest compute cost
 
 ---
 
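The Detection Accuracy metric described above counts a predicted box as correct when its IoU with the ground-truth box exceeds 0.5. A small illustrative implementation of that criterion (not the project's evaluation script; boxes are assumed to be `[x1, y1, x2, y2]`):

```python
# Intersection-over-Union and the IoU > 0.5 hit criterion, as a standalone sketch.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    # Fraction of examples whose predicted box overlaps the ground truth enough.
    hits = sum(iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

print(detection_accuracy([[0, 0, 10, 10]], [[1, 1, 9, 9]]))  # 1.0 (IoU is 0.64)
```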