---
title: "DAM vs DAM-QA Comparison Demo"
emoji: "🤖"
colorFrom: "blue"
colorTo: "red"
sdk: "gradio"
sdk_version: "5.38.0"
app_file: "app.py"
pinned: false
---

# 🤖 DAM vs DAM-QA Visual Question Answering Demo

An interactive demo that compares DAM (Original) and DAM-QA (Sliding Window) models on Visual Question Answering tasks for text-rich images.


## 🚀 Quick Start

### Local Installation
```bash
git clone <repository-url>
cd DAM-QA-Demo
pip install -r requirements.txt
python app.py
```

### Usage
1. **Check your hardware**: a CUDA-compatible GPU with 8GB+ memory is recommended; CPU inference works as a fallback but is much slower
2. Launch the app: `python app.py`
3. Wait for models to load (status will update automatically)
4. Choose a sample from the dropdown or upload your own image
5. Enter a question about the image (or use the auto-filled sample question)
6. Click "Compare Models" to see both DAM Original and DAM-QA results
7. Analyze the detailed voting breakdown for DAM-QA's sliding window approach

### ⚠️ Hardware Requirements
- **GPU**: CUDA-compatible with 8GB+ VRAM recommended
- **CPU**: Multi-core processor for fallback (much slower)
- **RAM**: 16GB+ system memory recommended
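
To check which of these paths a machine would take before launching the app, a small helper along the following lines can be used. This is a minimal sketch, not the demo's actual code: `pick_device` is a hypothetical name, and it assumes the app uses the usual PyTorch check for CUDA availability.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable through PyTorch, else "cpu".

    Hypothetical helper for illustration; the demo's own device
    selection logic may differ.
    """
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {pick_device()}")
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely lacks CUDA support.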

## 🧠 Technical Highlights

- **DAM Original**: Uses the full image with NVIDIA's DAM-3B-Self-Contained model
- **DAM-QA Sliding Window**: Implements a sliding-window approach with weighted voting aggregation
- **Model Architecture**: Transformer-based visual language model with attention mechanisms
- **Inference**: Supports both GPU and CPU inference with automatic device selection
- **UI Framework**: Built with Gradio and custom VLAI template for professional presentation
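
The sliding-window idea with weighted voting can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the DAM-QA implementation: the function names, the window size, the stride, and the per-answer weights are all hypothetical, and the model call that would produce an answer for each crop is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _starts(length: int, window: int, stride: int) -> List[int]:
    """Window start offsets along one axis, always covering the far edge."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # extra window flush with the edge
    return starts


def sliding_windows(width: int, height: int,
                    window: int = 512, stride: int = 256) -> List[Box]:
    """Crop boxes that tile the image with overlapping windows (assumed sizes)."""
    return [
        (left, top, min(left + window, width), min(top + window, height))
        for top in _starts(height, window, stride)
        for left in _starts(width, window, stride)
    ]


def weighted_vote(answers: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Sum the weights per normalized answer; return the winner and the breakdown."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, weight in answers:
        key = answer.strip().lower()
        if key:  # ignore empty answers
            scores[key] += weight
    if not scores:
        return "", {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    boxes = sliding_windows(1024, 512)
    print(len(boxes), "windows:", boxes)
    # One (answer, weight) pair per view; the weights here are made up.
    winner, breakdown = weighted_vote([("Paris", 1.0), ("paris ", 0.5), ("London", 1.2)])
    print(winner, breakdown)
```

Normalizing answers before summing lets windows that agree up to case or whitespace reinforce each other, and the returned breakdown is the kind of per-answer tally a voting view could display.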

## 📋 Requirements

- Python 3.10+
- PyTorch 2.0+
- Transformers 4.30+
- Gradio 5.38+
- CUDA-compatible GPU (recommended)
- 8GB+ GPU memory for optimal performance

## 🎨 Theming & Branding

The UI is powered by `vlai_template.py` and can be customized programmatically:

```python
import vlai_template as vt

vt.configure(
    project_name="DAM vs DAM-QA Comparison Demo",
    year="2025",
    module="DAM",
    description=(
        "Compare DAM (Original) and DAM-QA (Sliding Window) performance "
        "on Visual Question Answering tasks"
    ),
    colors={
        "primary": "#0F6CBD",
        "accent": "#C4314B",
        "bg1": "#F0F7FF",
        "bg2": "#E8F0FA",
        "bg3": "#DDE7F8",
    },
    font_family=(
        "'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, "
        "'Helvetica Neue', Arial, 'Noto Sans', 'Liberation Sans', sans-serif"
    ),
    meta_items=[
        ("Original DAM", "Full image processing"),
        ("DAM-QA", "Sliding window + voting"),
        ("Datasets", "DocVQA, InfographicVQA, TextVQA, ChartQA, VQAv2"),
    ],
)
```

## 📊 Datasets Used

This demo includes sample images and questions from:

- **DocVQA**: Document visual question answering
- **InfographicVQA**: Infographic-based questions
- **TextVQA**: Scene text visual question answering
- **ChartQA**: Chart and graph question answering
- **VQAv2**: General visual question answering