Spaces:
Sleeping
Sleeping
Commit
·
3fd9d26
1
Parent(s):
f403b46
Initial commit
Browse files- .gitattributes +3 -0
- .gitignore +4 -0
- README.md +97 -8
- app.py +381 -0
- packages.txt +2 -0
- requirements.txt +10 -0
- sample_images/chartqa_sample1.jpeg +3 -0
- sample_images/docvqa_sample1.png +3 -0
- sample_images/docvqa_sample2.png +3 -0
- sample_images/infovqa_sample1.jpeg +3 -0
- sample_images/textvqa_sample1.jpg +3 -0
- sample_images/vqav2_sample1.png +3 -0
- samples.json +44 -0
- src/__init__.py +1 -0
- src/dam_models.py +260 -0
- static/aivn_logo.png +3 -0
- static/vlai_logo.png +3 -0
- vlai_template.py +240 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
*.png filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
*.jpg filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
*.jpeg filter=lfs diff=lfs merge=lfs -text
|
.gitignore
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
__pycache__/
|
| 2 |
+
__MACOSX/
|
| 3 |
+
|
| 4 |
+
.DS_Store
|
README.md
CHANGED
|
@@ -1,12 +1,101 @@
|
|
| 1 |
---
|
| 2 |
-
title: DAM-QA Demo
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom: blue
|
| 5 |
-
colorTo:
|
| 6 |
-
sdk: gradio
|
| 7 |
-
sdk_version: 5.
|
| 8 |
-
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: "DAM vs DAM-QA Comparison Demo"
|
| 3 |
+
emoji: "🤖"
|
| 4 |
+
colorFrom: "blue"
|
| 5 |
+
colorTo: "red"
|
| 6 |
+
sdk: "gradio"
|
| 7 |
+
sdk_version: "5.38.0"
|
| 8 |
+
app_file: "app.py"
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# 🤖 DAM vs DAM-QA Visual Question Answering Demo
|
| 13 |
+
|
| 14 |
+
An interactive demo that compares DAM (Original) and DAM-QA (Sliding Window) models on Visual Question Answering tasks for text-rich images.
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
## 🚀 Quick Start
|
| 18 |
+
|
| 19 |
+
### Local Installation
|
| 20 |
+
```bash
|
| 21 |
+
git clone <repository-url>
|
| 22 |
+
cd DAM-QA-Demo
|
| 23 |
+
pip install -r requirements.txt
|
| 24 |
+
python app.py
|
| 25 |
+
```
|
| 26 |
+
|
| 27 |
+
### Usage
|
| 28 |
+
1. **Ensure GPU**: Models require CUDA-compatible GPU with 8GB+ memory
|
| 29 |
+
2. Launch the app: `python app.py`
|
| 30 |
+
3. Wait for models to load (status will update automatically)
|
| 31 |
+
4. Choose a sample from dropdown OR upload your own image
|
| 32 |
+
5. Enter a question about the image (or use auto-filled sample question)
|
| 33 |
+
6. Click "Compare Models" to see both DAM Original and DAM-QA results
|
| 34 |
+
7. Analyze the detailed voting breakdown for DAM-QA's sliding window approach
|
| 35 |
+
|
| 36 |
+
### ⚠️ Hardware Requirements
|
| 37 |
+
- **GPU**: CUDA-compatible with 8GB+ VRAM recommended
|
| 38 |
+
- **CPU**: Multi-core processor for fallback (much slower)
|
| 39 |
+
- **RAM**: 16GB+ system memory recommended
|
| 40 |
+
|
| 41 |
+
## 🧠 Technical Highlights
|
| 42 |
+
|
| 43 |
+
- **DAM Original**: Uses the full image with NVIDIA's DAM-3B-Self-Contained model
|
| 44 |
+
- **DAM-QA Sliding Window**: Implements sliding window approach with weighted voting aggregation
|
| 45 |
+
- **Model Architecture**: Transformer-based visual language model with attention mechanisms
|
| 46 |
+
- **Inference**: Supports both GPU and CPU inference with automatic device selection
|
| 47 |
+
- **UI Framework**: Built with Gradio and custom VLAI template for professional presentation
|
| 48 |
+
|
| 49 |
+
## 📋 Requirements
|
| 50 |
+
|
| 51 |
+
- Python 3.10+
|
| 52 |
+
- PyTorch 2.0+
|
| 53 |
+
- Transformers 4.30+
|
| 54 |
+
- Gradio 5.38+
|
| 55 |
+
- CUDA-compatible GPU (recommended)
|
| 56 |
+
- 8GB+ GPU memory for optimal performance
|
| 57 |
+
|
| 58 |
+
## 🎨 Theming & Branding
|
| 59 |
+
|
| 60 |
+
The UI is powered by `vlai_template.py` and can be customized programmatically:
|
| 61 |
+
|
| 62 |
+
```python
|
| 63 |
+
import vlai_template as vt
|
| 64 |
+
|
| 65 |
+
vt.configure(
|
| 66 |
+
project_name="DAM vs DAM-QA Comparison Demo",
|
| 67 |
+
year="2025",
|
| 68 |
+
module="DAM",
|
| 69 |
+
description=(
|
| 70 |
+
"Compare DAM (Original) and DAM-QA (Sliding Window) performance "
|
| 71 |
+
"on Visual Question Answering tasks"
|
| 72 |
+
),
|
| 73 |
+
colors={
|
| 74 |
+
"primary": "#0F6CBD",
|
| 75 |
+
"accent": "#C4314B",
|
| 76 |
+
"bg1": "#F0F7FF",
|
| 77 |
+
"bg2": "#E8F0FA",
|
| 78 |
+
"bg3": "#DDE7F8",
|
| 79 |
+
},
|
| 80 |
+
font_family=(
|
| 81 |
+
"'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, "
|
| 82 |
+
"'Helvetica Neue', Arial, 'Noto Sans', 'Liberation Sans', sans-serif"
|
| 83 |
+
),
|
| 84 |
+
meta_items=[
|
| 85 |
+
("Original DAM", "Full image processing"),
|
| 86 |
+
("DAM-QA", "Sliding window + voting"),
|
| 87 |
+
("Datasets", "DocVQA, InfographicVQA, TextVQA, ChartQA, VQAv2"),
|
| 88 |
+
],
|
| 89 |
+
)
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
## 📊 Datasets Used
|
| 93 |
+
|
| 94 |
+
This demo includes sample images and questions from:
|
| 95 |
+
|
| 96 |
+
- **DocVQA**: Document visual question answering
|
| 97 |
+
- **InfographicVQA**: Infographic-based questions
|
| 98 |
+
- **TextVQA**: Scene text visual question answering
|
| 99 |
+
- **ChartQA**: Chart and graph question answering
|
| 100 |
+
- **VQAv2**: General visual question answering
|
| 101 |
+
|
app.py
ADDED
|
@@ -0,0 +1,381 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import json
|
| 3 |
+
import gradio as gr
|
| 4 |
+
import plotly.graph_objects as go
|
| 5 |
+
import pandas as pd
|
| 6 |
+
import time
|
| 7 |
+
from PIL import Image
|
| 8 |
+
import vlai_template
|
| 9 |
+
|
| 10 |
+
from src.dam_models import get_dam_original, get_dam_sliding
|
| 11 |
+
|
| 12 |
+
# App configuration
|
| 13 |
+
vlai_template.set_meta(
|
| 14 |
+
project_name="DAM-QA Demo",
|
| 15 |
+
year="2025",
|
| 16 |
+
module="DAM",
|
| 17 |
+
description="DAM-QA performance on Visual Question Answering tasks",
|
| 18 |
+
meta_items=[
|
| 19 |
+
("Original DAM", "Full image processing"),
|
| 20 |
+
("DAM-QA", "Sliding window + voting"),
|
| 21 |
+
("Datasets", "DocVQA, InfographicVQA, TextVQA, ChartQA, VQAv2"),
|
| 22 |
+
],
|
| 23 |
+
)
|
| 24 |
+
|
| 25 |
+
# Global state for models
|
| 26 |
+
STATE = {
|
| 27 |
+
"dam_original": None,
|
| 28 |
+
"dam_sliding": None,
|
| 29 |
+
"samples": []
|
| 30 |
+
}
|
| 31 |
+
|
| 32 |
+
# Load sample data
|
| 33 |
+
def load_samples():
|
| 34 |
+
"""Load sample questions and images."""
|
| 35 |
+
try:
|
| 36 |
+
with open("samples.json", "r") as f:
|
| 37 |
+
samples = json.load(f)
|
| 38 |
+
STATE["samples"] = samples
|
| 39 |
+
return samples
|
| 40 |
+
except Exception as e:
|
| 41 |
+
print(f"Error loading samples: {e}")
|
| 42 |
+
return []
|
| 43 |
+
|
| 44 |
+
def init_models():
|
| 45 |
+
"""Initialize both DAM models."""
|
| 46 |
+
try:
|
| 47 |
+
STATE["dam_original"] = get_dam_original()
|
| 48 |
+
STATE["dam_sliding"] = get_dam_sliding()
|
| 49 |
+
return "✅ Both DAM models loaded successfully!"
|
| 50 |
+
except Exception as e:
|
| 51 |
+
error_msg = f"❌ Error loading models: {str(e)}"
|
| 52 |
+
print(error_msg)
|
| 53 |
+
return error_msg
|
| 54 |
+
|
| 55 |
+
def get_sample_choices():
|
| 56 |
+
"""Get list of sample choices for dropdown."""
|
| 57 |
+
samples = STATE["samples"]
|
| 58 |
+
choices = []
|
| 59 |
+
for i, sample in enumerate(samples):
|
| 60 |
+
label = f"{sample['dataset']}: {sample['question'][:50]}..."
|
| 61 |
+
choices.append((label, i))
|
| 62 |
+
return choices
|
| 63 |
+
|
| 64 |
+
def fill_from_sample(sample_idx):
|
| 65 |
+
"""Fill inputs from selected sample."""
|
| 66 |
+
if not STATE["samples"] or sample_idx is None or sample_idx >= len(STATE["samples"]):
|
| 67 |
+
return None, "", "", None, ""
|
| 68 |
+
|
| 69 |
+
sample = STATE["samples"][sample_idx]
|
| 70 |
+
# Load the sample image
|
| 71 |
+
try:
|
| 72 |
+
sample_img = Image.open(sample["image"])
|
| 73 |
+
return (
|
| 74 |
+
sample_img, # sample_image_display
|
| 75 |
+
sample["ground_truth"], # ground_truth_display
|
| 76 |
+
f"Dataset: {sample['dataset']}\nDescription: {sample['description']}", # sample_info_display
|
| 77 |
+
sample_img, # image_input (copy to main input)
|
| 78 |
+
sample["question"] # question_input (copy to main input)
|
| 79 |
+
)
|
| 80 |
+
except Exception as e:
|
| 81 |
+
print(f"Error loading sample image {sample['image']}: {e}")
|
| 82 |
+
return None, sample["ground_truth"], f"Error loading image: {e}", None, sample["question"]
|
| 83 |
+
|
| 84 |
+
def compare_models(image, question, max_tokens):
|
| 85 |
+
"""Compare both models on the same input."""
|
| 86 |
+
if STATE["dam_original"] is None or STATE["dam_sliding"] is None:
|
| 87 |
+
return "❌ Models not loaded. Please wait for models to initialize.", "", "", None, ""
|
| 88 |
+
|
| 89 |
+
if image is None:
|
| 90 |
+
return "❌ Please provide an image", "", "", None, ""
|
| 91 |
+
|
| 92 |
+
if not question or not question.strip():
|
| 93 |
+
return "❌ Please provide a question", "", "", None, ""
|
| 94 |
+
|
| 95 |
+
try:
|
| 96 |
+
# Convert to PIL Image if needed
|
| 97 |
+
if isinstance(image, str):
|
| 98 |
+
img = Image.open(image)
|
| 99 |
+
elif hasattr(image, 'save'): # PIL Image
|
| 100 |
+
img = image
|
| 101 |
+
else:
|
| 102 |
+
return "❌ Invalid image format", "", "", None, ""
|
| 103 |
+
|
| 104 |
+
# DAM Original prediction
|
| 105 |
+
original_answer, original_time = STATE["dam_original"].predict(
|
| 106 |
+
img, question, max_tokens
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
# DAM Sliding Window prediction
|
| 110 |
+
sliding_answer, sliding_time, voting_details = STATE["dam_sliding"].predict(
|
| 111 |
+
img, question, max_tokens
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
# Format results
|
| 115 |
+
original_result = f"""
|
| 116 |
+
### 🔍 DAM Original (Full Image)
|
| 117 |
+
**Answer:** {original_answer}
|
| 118 |
+
**Inference Time:** {original_time:.2f}s
|
| 119 |
+
**Method:** Processes the entire image at once
|
| 120 |
+
"""
|
| 121 |
+
|
| 122 |
+
sliding_result = f"""
|
| 123 |
+
### 🧩 DAM-QA (Sliding Window + Voting)
|
| 124 |
+
**Answer:** {sliding_answer}
|
| 125 |
+
**Inference Time:** {sliding_time:.2f}s
|
| 126 |
+
**Method:** Sliding windows with weighted voting
|
| 127 |
+
**Total Windows:** {voting_details.get('total_windows', 'N/A')}
|
| 128 |
+
"""
|
| 129 |
+
|
| 130 |
+
# Create comparison summary
|
| 131 |
+
comparison = f"""
|
| 132 |
+
## 📊 Comparison Summary
|
| 133 |
+
|
| 134 |
+
| Method | Answer | Time (s) | Approach |
|
| 135 |
+
|--------|--------|----------|----------|
|
| 136 |
+
| DAM Original | {original_answer} | {original_time:.2f} | Full image |
|
| 137 |
+
| DAM-QA Sliding | {sliding_answer} | {sliding_time:.2f} | Window + voting |
|
| 138 |
+
|
| 139 |
+
**Speed Difference:** {abs(original_time - sliding_time):.2f}s
|
| 140 |
+
**Faster Method:** {'DAM Original' if original_time < sliding_time else 'DAM-QA'}
|
| 141 |
+
"""
|
| 142 |
+
|
| 143 |
+
# Create voting visualization
|
| 144 |
+
vote_fig = create_voting_chart(voting_details)
|
| 145 |
+
|
| 146 |
+
# Detailed voting info
|
| 147 |
+
voting_info = format_voting_details(voting_details)
|
| 148 |
+
|
| 149 |
+
return comparison, original_result, sliding_result, vote_fig, voting_info
|
| 150 |
+
|
| 151 |
+
except Exception as e:
|
| 152 |
+
error_msg = f"❌ Error during inference: {str(e)}"
|
| 153 |
+
return error_msg, "", "", None, ""
|
| 154 |
+
|
| 155 |
+
def create_voting_chart(voting_details):
|
| 156 |
+
"""Create a visualization of the voting process."""
|
| 157 |
+
if not voting_details or "vote_summary" not in voting_details:
|
| 158 |
+
return None
|
| 159 |
+
|
| 160 |
+
votes = voting_details["vote_summary"]
|
| 161 |
+
if not votes:
|
| 162 |
+
return None
|
| 163 |
+
|
| 164 |
+
answers = list(votes.keys())
|
| 165 |
+
weights = list(votes.values())
|
| 166 |
+
|
| 167 |
+
# Create bar chart
|
| 168 |
+
fig = go.Figure(data=[
|
| 169 |
+
go.Bar(
|
| 170 |
+
x=answers,
|
| 171 |
+
y=weights,
|
| 172 |
+
text=[f"{w:.3f}" for w in weights],
|
| 173 |
+
textposition='auto',
|
| 174 |
+
marker_color=['#C4314B' if ans == voting_details.get('final_answer', '') else '#0F6CBD' for ans in answers]
|
| 175 |
+
)
|
| 176 |
+
])
|
| 177 |
+
|
| 178 |
+
fig.update_layout(
|
| 179 |
+
title="DAM-QA Voting Results",
|
| 180 |
+
xaxis_title="Answers",
|
| 181 |
+
yaxis_title="Vote Weight",
|
| 182 |
+
plot_bgcolor="white",
|
| 183 |
+
paper_bgcolor="white",
|
| 184 |
+
font=dict(color="black", size=12),
|
| 185 |
+
height=400,
|
| 186 |
+
margin=dict(l=30, r=20, t=60, b=40)
|
| 187 |
+
)
|
| 188 |
+
|
| 189 |
+
return fig
|
| 190 |
+
|
| 191 |
+
def format_voting_details(voting_details):
|
| 192 |
+
"""Format detailed voting information."""
|
| 193 |
+
if not voting_details:
|
| 194 |
+
return "No voting details available."
|
| 195 |
+
|
| 196 |
+
details = []
|
| 197 |
+
|
| 198 |
+
# Full image vote
|
| 199 |
+
if "full_image" in voting_details and voting_details["full_image"]:
|
| 200 |
+
full_vote = voting_details["full_image"]
|
| 201 |
+
details.append(f"**Full Image Vote:**")
|
| 202 |
+
details.append(f"- Answer: {full_vote['answer']}")
|
| 203 |
+
details.append(f"- Weight: {full_vote['weight']:.3f}")
|
| 204 |
+
details.append("")
|
| 205 |
+
|
| 206 |
+
# Window votes summary
|
| 207 |
+
if "windows" in voting_details:
|
| 208 |
+
windows = voting_details["windows"]
|
| 209 |
+
details.append(f"**Window Votes:** {len(windows)} windows processed")
|
| 210 |
+
|
| 211 |
+
# Group by answer
|
| 212 |
+
answer_groups = {}
|
| 213 |
+
for window in windows:
|
| 214 |
+
ans = window["answer"]
|
| 215 |
+
if ans not in answer_groups:
|
| 216 |
+
answer_groups[ans] = []
|
| 217 |
+
answer_groups[ans].append(window)
|
| 218 |
+
|
| 219 |
+
for answer, windows_for_ans in answer_groups.items():
|
| 220 |
+
total_weight = sum(w["weight"] for w in windows_for_ans)
|
| 221 |
+
details.append(f"- **{answer}**: {len(windows_for_ans)} windows, total weight: {total_weight:.3f}")
|
| 222 |
+
details.append("")
|
| 223 |
+
|
| 224 |
+
# Final summary
|
| 225 |
+
if "vote_summary" in voting_details:
|
| 226 |
+
details.append("**Final Vote Tally:**")
|
| 227 |
+
for answer, weight in voting_details["vote_summary"].items():
|
| 228 |
+
marker = "🏆" if answer == voting_details.get("final_answer", "") else " "
|
| 229 |
+
details.append(f"{marker} {answer}: {weight:.3f}")
|
| 230 |
+
|
| 231 |
+
return "\n".join(details)
|
| 232 |
+
|
| 233 |
+
# Force light theme
|
| 234 |
+
force_light_theme_js = """
|
| 235 |
+
() => {
|
| 236 |
+
const params = new URLSearchParams(window.location.search);
|
| 237 |
+
if (!params.has('__theme')) {
|
| 238 |
+
params.set('__theme', 'light');
|
| 239 |
+
window.location.search = params.toString();
|
| 240 |
+
}
|
| 241 |
+
}
|
| 242 |
+
"""
|
| 243 |
+
|
| 244 |
+
# Main Gradio interface
|
| 245 |
+
with gr.Blocks(theme="gstaff/sketch", css=vlai_template.custom_css, fill_width=True, js=force_light_theme_js) as demo:
|
| 246 |
+
vlai_template.create_header()
|
| 247 |
+
|
| 248 |
+
gr.HTML(vlai_template.render_info_card(
|
| 249 |
+
icon="🤖",
|
| 250 |
+
title="About this Demo",
|
| 251 |
+
description="This demo compares two approaches for Visual Question Answering: DAM (original) processes the full image, while DAM-QA uses a sliding window approach with weighted voting to better handle text-rich images."
|
| 252 |
+
))
|
| 253 |
+
|
| 254 |
+
gr.HTML(vlai_template.render_disclaimer(
|
| 255 |
+
text=(
|
| 256 |
+
"This demo is for research and educational purposes only. "
|
| 257 |
+
"The models are designed for visual question answering on text-rich images. "
|
| 258 |
+
"Results may vary based on image quality and question complexity."
|
| 259 |
+
)
|
| 260 |
+
))
|
| 261 |
+
|
| 262 |
+
gr.Markdown("### 🎯 **How to Use**: Select a sample or upload your image → Ask a question → Compare both models → Analyze the voting results!")
|
| 263 |
+
|
| 264 |
+
# Model Status at top
|
| 265 |
+
with gr.Accordion("🤖 Model Status", open=True):
|
| 266 |
+
with gr.Row():
|
| 267 |
+
status_display = gr.Markdown("Loading models...")
|
| 268 |
+
refresh_btn = gr.Button("🔄 Refresh Status", variant="secondary", scale=1)
|
| 269 |
+
|
| 270 |
+
with gr.Row(equal_height=False, variant="panel"):
|
| 271 |
+
# LEFT: Input Section
|
| 272 |
+
with gr.Column(scale=35):
|
| 273 |
+
with gr.Accordion("📤 Upload Image & Question", open=True):
|
| 274 |
+
image_input = gr.Image(label="Upload Image", type="pil", height=300)
|
| 275 |
+
question_input = gr.Textbox(
|
| 276 |
+
label="Your Question",
|
| 277 |
+
placeholder="Ask a question about the image...",
|
| 278 |
+
lines=3
|
| 279 |
+
)
|
| 280 |
+
with gr.Row():
|
| 281 |
+
max_tokens_slider = gr.Slider(
|
| 282 |
+
minimum=10, maximum=200, value=100, step=10,
|
| 283 |
+
label="Max Tokens", scale=2
|
| 284 |
+
)
|
| 285 |
+
compare_btn = gr.Button("🔍 Compare Models", variant="primary", size="lg", scale=1)
|
| 286 |
+
|
| 287 |
+
with gr.Accordion("📋 Try Sample Images", open=True):
|
| 288 |
+
sample_dropdown = gr.Dropdown(
|
| 289 |
+
label="Select Sample Dataset",
|
| 290 |
+
choices=[],
|
| 291 |
+
value=None,
|
| 292 |
+
info="Choose a sample to auto-fill the inputs above"
|
| 293 |
+
)
|
| 294 |
+
sample_image_display = gr.Image(label="Sample Preview", interactive=False, height=200)
|
| 295 |
+
with gr.Row():
|
| 296 |
+
ground_truth_display = gr.Textbox(label="Expected Answer", interactive=False, scale=2)
|
| 297 |
+
sample_info_display = gr.Textbox(label="Dataset Info", interactive=False, lines=3, scale=1)
|
| 298 |
+
|
| 299 |
+
# MIDDLE: Results Comparison
|
| 300 |
+
with gr.Column(scale=40):
|
| 301 |
+
with gr.Accordion("📊 Model Comparison Results", open=True):
|
| 302 |
+
comparison_output = gr.Markdown("Click 'Compare Models' to see results...")
|
| 303 |
+
|
| 304 |
+
with gr.Row():
|
| 305 |
+
with gr.Column():
|
| 306 |
+
gr.Markdown("#### 🔍 DAM Original")
|
| 307 |
+
original_output = gr.Markdown("Results will appear here...")
|
| 308 |
+
with gr.Column():
|
| 309 |
+
gr.Markdown("#### 🧩 DAM-QA Sliding Window")
|
| 310 |
+
sliding_output = gr.Markdown("Results will appear here...")
|
| 311 |
+
|
| 312 |
+
# RIGHT: Voting Analysis
|
| 313 |
+
with gr.Column(scale=25):
|
| 314 |
+
with gr.Accordion("🗳️ DAM-QA Voting Analysis", open=True):
|
| 315 |
+
voting_chart = gr.Plot(label="Vote Weights")
|
| 316 |
+
voting_details = gr.Markdown("Voting details will appear here...", max_height=200)
|
| 317 |
+
|
| 318 |
+
gr.Markdown("""
|
| 319 |
+
## 📋 **Key Differences**
|
| 320 |
+
|
| 321 |
+
- **DAM Original**: Processes the entire image at once, faster but may miss fine details
|
| 322 |
+
- **DAM-QA Sliding Window**: Divides image into overlapping windows, slower but better for text-rich images
|
| 323 |
+
- **Voting Mechanism**: DAM-QA aggregates predictions from multiple windows using weighted voting
|
| 324 |
+
- **Use Cases**: DAM-QA typically performs better on documents, charts, and infographics
|
| 325 |
+
""")
|
| 326 |
+
|
| 327 |
+
vlai_template.create_footer()
|
| 328 |
+
|
| 329 |
+
# Event handlers
|
| 330 |
+
def on_load():
|
| 331 |
+
# Load samples first
|
| 332 |
+
samples = load_samples()
|
| 333 |
+
choices = [(f"{s['dataset']}: {s['question'][:50]}...", i) for i, s in enumerate(samples)]
|
| 334 |
+
|
| 335 |
+
# Load models immediately (this will take time but ensures they're ready)
|
| 336 |
+
print("Loading DAM models...")
|
| 337 |
+
status = init_models()
|
| 338 |
+
print(f"Model initialization complete: {status}")
|
| 339 |
+
|
| 340 |
+
return status, gr.Dropdown(choices=choices, value=0 if choices else None)
|
| 341 |
+
|
| 342 |
+
def refresh_status():
|
| 343 |
+
"""Check current model status."""
|
| 344 |
+
if STATE["dam_original"] is not None and STATE["dam_sliding"] is not None:
|
| 345 |
+
return "✅ Both DAM models loaded successfully!"
|
| 346 |
+
else:
|
| 347 |
+
return "🔄 Models not loaded. Click to retry."
|
| 348 |
+
|
| 349 |
+
def retry_loading():
|
| 350 |
+
"""Retry loading models."""
|
| 351 |
+
return init_models()
|
| 352 |
+
|
| 353 |
+
demo.load(
|
| 354 |
+
fn=on_load,
|
| 355 |
+
outputs=[status_display, sample_dropdown]
|
| 356 |
+
)
|
| 357 |
+
|
| 358 |
+
# Add refresh button functionality
|
| 359 |
+
refresh_btn.click(
|
| 360 |
+
fn=refresh_status,
|
| 361 |
+
outputs=[status_display]
|
| 362 |
+
)
|
| 363 |
+
|
| 364 |
+
sample_dropdown.change(
|
| 365 |
+
fn=fill_from_sample,
|
| 366 |
+
inputs=[sample_dropdown],
|
| 367 |
+
outputs=[sample_image_display, ground_truth_display, sample_info_display, image_input, question_input]
|
| 368 |
+
)
|
| 369 |
+
|
| 370 |
+
compare_btn.click(
|
| 371 |
+
fn=compare_models,
|
| 372 |
+
inputs=[image_input, question_input, max_tokens_slider],
|
| 373 |
+
outputs=[comparison_output, original_output, sliding_output, voting_chart, voting_details]
|
| 374 |
+
)
|
| 375 |
+
|
| 376 |
+
if __name__ == "__main__":
|
| 377 |
+
demo.launch(
|
| 378 |
+
share=False,
|
| 379 |
+
show_error=True,
|
| 380 |
+
allowed_paths=["sample_images", "static"]
|
| 381 |
+
)
|
packages.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
graphviz
|
| 2 |
+
fonts-liberation
|
requirements.txt
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio==5.38.0
|
| 2 |
+
pandas>=1.5.0
|
| 3 |
+
numpy>=1.24.0
|
| 4 |
+
plotly>=5.15.0
|
| 5 |
+
torch>=2.0.0
|
| 6 |
+
transformers>=4.30.0
|
| 7 |
+
pillow>=10.0.0
|
| 8 |
+
accelerate>=0.20.0
|
| 9 |
+
opencv-python
|
| 10 |
+
sentencepiece
|
sample_images/chartqa_sample1.jpeg
ADDED
|
Git LFS Details
|
sample_images/docvqa_sample1.png
ADDED
|
Git LFS Details
|
sample_images/docvqa_sample2.png
ADDED
|
Git LFS Details
|
sample_images/infovqa_sample1.jpeg
ADDED
|
Git LFS Details
|
sample_images/textvqa_sample1.jpg
ADDED
|
Git LFS Details
|
sample_images/vqav2_sample1.png
ADDED
|
Git LFS Details
|
samples.json
ADDED
|
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"dataset": "DocVQA",
|
| 4 |
+
"image": "sample_images/docvqa_sample1.png",
|
| 5 |
+
"question": "What is the 'actual' value per 1000, during the year 1975?",
|
| 6 |
+
"ground_truth": "0.28",
|
| 7 |
+
"description": "Document question answering about statistical data"
|
| 8 |
+
},
|
| 9 |
+
{
|
| 10 |
+
"dataset": "DocVQA",
|
| 11 |
+
"image": "sample_images/docvqa_sample2.png",
|
| 12 |
+
"question": "What is name of university?",
|
| 13 |
+
"ground_truth": "University of California",
|
| 14 |
+
"description": "Document question answering about institutional information"
|
| 15 |
+
},
|
| 16 |
+
{
|
| 17 |
+
"dataset": "InfographicVQA",
|
| 18 |
+
"image": "sample_images/infovqa_sample1.jpeg",
|
| 19 |
+
"question": "Which social platform has heavy female audience?",
|
| 20 |
+
"ground_truth": "Pinterest",
|
| 21 |
+
"description": "Infographic question answering about social media demographics"
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"dataset": "ChartQA",
|
| 25 |
+
"image": "sample_images/chartqa_sample1.jpeg",
|
| 26 |
+
"question": "What is the highest value in the chart?",
|
| 27 |
+
"ground_truth": "Unknown (sample chart)",
|
| 28 |
+
"description": "Chart question answering about data visualization"
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"dataset": "TextVQA",
|
| 32 |
+
"image": "sample_images/textvqa_sample1.jpg",
|
| 33 |
+
"question": "What text is visible in the image?",
|
| 34 |
+
"ground_truth": "Various text (sample image)",
|
| 35 |
+
"description": "Text-based visual question answering"
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"dataset": "VQAv2",
|
| 39 |
+
"image": "sample_images/vqav2_sample1.png",
|
| 40 |
+
"question": "What is in the image?",
|
| 41 |
+
"ground_truth": "Various objects (sample image)",
|
| 42 |
+
"description": "General visual question answering"
|
| 43 |
+
}
|
| 44 |
+
]
|
src/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
# DAM Demo Package
|
src/dam_models.py
ADDED
|
@@ -0,0 +1,260 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
DAM Model Classes for Demo
|
| 3 |
+
Simplified versions of DAM inference for Hugging Face Space deployment
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import os
|
| 7 |
+
import torch
|
| 8 |
+
import time
|
| 9 |
+
from PIL import Image
|
| 10 |
+
from collections import defaultdict
|
| 11 |
+
from typing import Dict, Tuple, Optional
|
| 12 |
+
from transformers import AutoModel
|
| 13 |
+
|
| 14 |
+
# Simplified utility functions
|
| 15 |
+
def resize_keep_aspect(img: Image.Image, max_size: int = 1024) -> Image.Image:
|
| 16 |
+
"""Resize image while keeping aspect ratio."""
|
| 17 |
+
W, H = img.size
|
| 18 |
+
if max(W, H) <= max_size:
|
| 19 |
+
return img
|
| 20 |
+
|
| 21 |
+
if W > H:
|
| 22 |
+
new_W, new_H = max_size, int(H * max_size / W)
|
| 23 |
+
else:
|
| 24 |
+
new_W, new_H = int(W * max_size / H), max_size
|
| 25 |
+
|
| 26 |
+
return img.resize((new_W, new_H), Image.LANCZOS)
|
| 27 |
+
|
| 28 |
+
def create_full_image_mask(width: int, height: int) -> Image.Image:
|
| 29 |
+
"""Create a full white mask for the entire image."""
|
| 30 |
+
return Image.new("L", (width, height), 255)
|
| 31 |
+
|
| 32 |
+
def get_windows(width: int, height: int, window_size: int, stride: int):
|
| 33 |
+
"""Generate sliding window coordinates."""
|
| 34 |
+
windows = []
|
| 35 |
+
for y in range(0, height - window_size + 1, stride):
|
| 36 |
+
for x in range(0, width - window_size + 1, stride):
|
| 37 |
+
windows.append((x, y, min(x + window_size, width), min(y + window_size, height)))
|
| 38 |
+
|
| 39 |
+
# Add remaining edge windows
|
| 40 |
+
if width % stride != 0:
|
| 41 |
+
for y in range(0, height - window_size + 1, stride):
|
| 42 |
+
windows.append((width - window_size, y, width, min(y + window_size, height)))
|
| 43 |
+
|
| 44 |
+
if height % stride != 0:
|
| 45 |
+
for x in range(0, width - window_size + 1, stride):
|
| 46 |
+
windows.append((x, height - window_size, min(x + window_size, width), height))
|
| 47 |
+
|
| 48 |
+
return windows
|
| 49 |
+
|
| 50 |
+
def aggregate_votes(votes: Dict[str, float]) -> str:
|
| 51 |
+
"""Aggregate votes and return the answer with highest weight."""
|
| 52 |
+
if not votes:
|
| 53 |
+
return ""
|
| 54 |
+
return max(votes.items(), key=lambda x: x[1])[0]
|
| 55 |
+
|
| 56 |
+
class DAMOriginal:
    """Baseline DAM VQA: answers a question from the whole image.

    Wraps NVIDIA's self-contained DAM-3B checkpoint; the full image is
    submitted together with an all-white mask so the model attends to
    every pixel when generating an answer.
    """

    def __init__(self, device: str = "auto"):
        """Load DAM-3B onto the target device ("auto" picks CUDA when available)."""
        target = ("cuda" if torch.cuda.is_available() else "cpu") if device == "auto" else device
        self.device = torch.device(target)

        print(f"Loading DAM model on {self.device}...")
        self.dam_model = AutoModel.from_pretrained(
            "nvidia/DAM-3B-Self-Contained",
            trust_remote_code=True,
        ).to(self.device)

        self.dam = self.dam_model.init_dam(conv_mode="v1", prompt_mode="full+focal_crop")
        print("DAM Original model loaded successfully!")

    def predict(self, img: Image.Image, question: str, max_new_tokens: int = 100) -> Tuple[str, float]:
        """Answer `question` using the full image.

        Args:
            img: Input image; resized so its longest side is at most 1024.
            question: Free-form question about the image.
            max_new_tokens: Generation budget for the answer.

        Returns:
            Tuple of (answer, inference_time in seconds). On failure the
            answer is an "Error: ..." string rather than a raised exception.
        """
        # Normalize resolution before inference.
        img = resize_keep_aspect(img, 1024)
        width, height = img.size

        # An all-white mask tells DAM to attend to the entire image.
        mask = create_full_image_mask(width, height)

        header = (
            "<image>\n"
            "Answer each question concisely in a single word or short phrase, "
            "without any lengthy descriptions or explanations.\n"
            "Rely only on information that is clearly visible in the provided image.\n"
            "If the answer cannot be determined from the image, respond with \"unanswerable\".\n"
        )
        prompt = header + f"Question: {question}\nAnswer:"

        # Near-zero temperature keeps decoding effectively deterministic.
        generation_kwargs = {
            "streaming": False,
            "temperature": 1e-7,
            "top_p": 0.5,
            "num_beams": 1,
            "max_new_tokens": max_new_tokens,
        }

        start = time.time()
        try:
            raw = self.dam.get_description(img, mask, prompt, **generation_kwargs)
            elapsed = time.time() - start

            # get_description may return a plain string or an iterable of tokens.
            text = raw if isinstance(raw, str) else "".join(raw)
            return text.strip(), elapsed

        except Exception as e:
            elapsed = time.time() - start
            print(f"Error in DAM Original prediction: {e}")
            return f"Error: {str(e)}", elapsed
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
class DAMSlidingWindow:
    """DAM VQA with a sliding-window voting strategy.

    The question is asked once on the full image and once per window crop;
    each non-empty answer casts a vote weighted by the fraction of the image
    area it saw, and the highest-weighted answer is returned.
    """

    # Instruction header shared by the full-image query and every crop query.
    _INSTRUCTIONS = (
        "<image>\n"
        "Answer each question concisely in a single word or short phrase, "
        "without any lengthy descriptions or explanations.\n"
        "Rely only on information that is clearly visible in the provided image.\n"
        "If the answer cannot be determined from the image, respond with \"unanswerable\".\n"
    )

    def __init__(self, device: str = "auto", window_size: int = 512, stride: int = 256):
        """Load DAM-3B and remember the sliding-window geometry."""
        if device == "auto":
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        else:
            self.device = torch.device(device)

        self.window_size = window_size
        self.stride = stride

        print(f"Loading DAM model on {self.device}...")
        self.dam_model = AutoModel.from_pretrained(
            "nvidia/DAM-3B-Self-Contained",
            trust_remote_code=True,
        ).to(self.device)

        self.dam = self.dam_model.init_dam(conv_mode="v1", prompt_mode="full+focal_crop")
        print(f"DAM Sliding Window model loaded successfully! (window_size={window_size}, stride={stride})")

    @staticmethod
    def _to_text(result) -> str:
        """Normalize get_description output (a string or token iterable) to stripped text."""
        return (result if isinstance(result, str) else "".join(result)).strip()

    def predict(self, img: Image.Image, question: str, max_new_tokens: int = 100,
                unanswerable_weight: float = 1.0) -> Tuple[str, float, Dict]:
        """Answer `question` via area-weighted voting over the full image and crops.

        Args:
            img: Input image; resized so its longest side is at most 1024.
            question: Free-form question about the image.
            max_new_tokens: Generation budget per query.
            unanswerable_weight: Multiplier applied to "unanswerable" votes,
                letting callers down-weight (or boost) that response.

        Returns:
            Tuple of (answer, inference_time, voting_details). On failure the
            answer is an "Error: ..." string and the details hold the error.
        """
        img = resize_keep_aspect(img, 1024)
        W, H = img.size

        prompt = self._INSTRUCTIONS + f"Question: {question}\nAnswer:"

        # Near-zero temperature keeps decoding effectively deterministic.
        params = {
            "streaming": False,
            "temperature": 1e-7,
            "top_p": 0.5,
            "num_beams": 1,
            "max_new_tokens": max_new_tokens,
        }

        start_time = time.time()
        votes = defaultdict(float)
        voting_details = {"full_image": None, "windows": []}

        try:
            # Full-image vote: it saw everything, so its base weight is 1.0.
            ans_full = self._to_text(
                self.dam.get_description(img, create_full_image_mask(W, H), prompt, **params)
            )
            if ans_full:
                weight = unanswerable_weight if ans_full.lower() == "unanswerable" else 1.0
                votes[ans_full] += weight
                voting_details["full_image"] = {"answer": ans_full, "weight": weight}

            # One vote per window, weighted by the window's share of image area.
            windows = get_windows(W, H, self.window_size, self.stride)
            for i, (x0, y0, x1, y1) in enumerate(windows):
                crop = img.crop((x0, y0, x1, y1))
                mask_crop = Image.new("L", (x1 - x0, y1 - y0), 255)

                ans = self._to_text(self.dam.get_description(crop, mask_crop, prompt, **params))

                # Weight is always computed so the per-window record below never
                # refers to a stale value from a previous iteration.
                weight = ((x1 - x0) * (y1 - y0)) / (W * H)
                if ans.lower() == "unanswerable":
                    weight *= unanswerable_weight
                if ans:
                    votes[ans] += weight

                voting_details["windows"].append({
                    "window_id": i,
                    "coords": (x0, y0, x1, y1),
                    "answer": ans,
                    "weight": weight,
                })

            # Fall back to the full-image answer (then a placeholder) when voting
            # produced nothing. Replaces the old `'ans_full' in locals()` check,
            # which was always true on this code path.
            prediction = aggregate_votes(votes) or ans_full or "No answer"

            inference_time = time.time() - start_time

            voting_details["vote_summary"] = dict(votes)
            voting_details["final_answer"] = prediction
            voting_details["total_windows"] = len(windows)

            return prediction, inference_time, voting_details

        except Exception as e:
            inference_time = time.time() - start_time
            print(f"Error in DAM Sliding Window prediction: {e}")
            return f"Error: {str(e)}", inference_time, {"error": str(e)}
|
| 242 |
+
|
| 243 |
+
|
| 244 |
+
# Module-level singletons; the models are expensive to load, so they are
# constructed lazily on first request and then reused.
_dam_original = None
_dam_sliding = None


def get_dam_original(device: str = "auto"):
    """Return the shared DAMOriginal instance, constructing it on first call."""
    global _dam_original
    if _dam_original is not None:
        return _dam_original
    _dam_original = DAMOriginal(device)
    return _dam_original


def get_dam_sliding(device: str = "auto", window_size: int = 512, stride: int = 256):
    """Return the shared DAMSlidingWindow instance, constructing it on first call.

    Note: window_size and stride only take effect when the singleton is first
    built; subsequent calls return the existing instance unchanged.
    """
    global _dam_sliding
    if _dam_sliding is not None:
        return _dam_sliding
    _dam_sliding = DAMSlidingWindow(device, window_size, stride)
    return _dam_sliding
|
static/aivn_logo.png
ADDED
|
Git LFS Details
|
static/vlai_logo.png
ADDED
|
Git LFS Details
|
vlai_template.py
ADDED
|
@@ -0,0 +1,240 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os, base64
|
| 2 |
+
import gradio as gr
|
| 3 |
+
|
| 4 |
+
# Theming (can be overridden by the host app)
|
| 5 |
+
PRIMARY_COLOR = "#0F6CBD" # medical calm blue
|
| 6 |
+
ACCENT_COLOR = "#C4314B" # medical alert red
|
| 7 |
+
SUCCESS_COLOR = "#2E7D32" # positive/ok
|
| 8 |
+
BG1 = "#F0F7FF"
|
| 9 |
+
BG2 = "#E8F0FA"
|
| 10 |
+
BG3 = "#DDE7F8"
|
| 11 |
+
FONT_FAMILY = "'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, 'Noto Sans', 'Liberation Sans', sans-serif"
|
| 12 |
+
|
| 13 |
+
PROJECT_DESCRIPTION = ""
|
| 14 |
+
META_INFO = [] # list of (label, value)
|
| 15 |
+
|
| 16 |
+
def set_colors(primary: str = None, accent: str = None, bg1: str = None, bg2: str = None, bg3: str = None):
    """Allow host app to set theme colors dynamically.

    Only truthy arguments are applied; afterwards the module-level CSS is
    rebuilt so the new palette takes effect.
    """
    global custom_css
    overrides = {
        "PRIMARY_COLOR": primary,
        "ACCENT_COLOR": accent,
        "BG1": bg1,
        "BG2": bg2,
        "BG3": bg3,
    }
    for name, value in overrides.items():
        if value:
            globals()[name] = value
    # Rebuild CSS so it reflects the updated palette.
    custom_css = _build_custom_css()
|
| 31 |
+
|
| 32 |
+
def set_font(font_family: str):
    """Allow host app to set a custom font stack (e.g. 'Inter' with fallbacks).

    Empty or non-string values are ignored; on success the module-level CSS
    is rebuilt so the new font applies everywhere.
    """
    global FONT_FAMILY, custom_css
    if not font_family or not isinstance(font_family, str):
        return
    FONT_FAMILY = font_family
    custom_css = _build_custom_css()
|
| 38 |
+
|
| 39 |
+
def set_meta(project_name: str = None, year: str = None, module: str = None, description: str = None, meta_items: list = None):
    """Set project metadata used across the header and info sections.

    Args:
        project_name: accepted for API compatibility; currently unused here.
        year: accepted for API compatibility; currently unused here.
        module: accepted for API compatibility; currently unused here.
        description: replaces the module-level PROJECT_DESCRIPTION when not None.
        meta_items: list of (label, value) pairs; replaces META_INFO when not None.

    NOTE(review): project_name/year/module are ignored by this implementation —
    confirm whether they were meant to feed the header before relying on them.
    """
    global PROJECT_DESCRIPTION, META_INFO
    if description is not None:
        PROJECT_DESCRIPTION = description
    if meta_items is not None:
        META_INFO = meta_items
|
| 46 |
+
|
| 47 |
+
def configure(project_name: str = None, year: str = None, module: str = None, description: str = None,
              colors: dict = None, font_family: str = None, meta_items: list = None):
    """One-call configuration for meta, theme, and font.

    Applies the `colors` dict (keys: primary, accent, bg1, bg2, bg3) and
    `font_family` when provided, then forwards the metadata to set_meta().
    """
    if colors:
        palette = {key: colors.get(key) for key in ("primary", "accent", "bg1", "bg2", "bg3")}
        set_colors(**palette)
    if font_family:
        set_font(font_family)
    set_meta(project_name, year, module, description, meta_items)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def image_to_base64(image_path: str):
    """Read a file (path relative to this module) and return it base64-encoded.

    Resolving against the module directory keeps bundled assets loadable
    regardless of the process working directory.
    """
    module_dir = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(module_dir, image_path), "rb") as handle:
        raw = handle.read()
    return base64.b64encode(raw).decode("utf-8")
|
| 69 |
+
|
| 70 |
+
def create_header():
    """Render the two-column page header: AIVN logo left, demo title right.

    Must be called inside a gr.Blocks() context; emits Gradio components as
    a side effect and returns nothing.
    """
    with gr.Row():
        with gr.Column(scale=2):
            # Inline the logo as base64 so the header has no external asset URL.
            logo_base64 = image_to_base64("static/aivn_logo.png")
            gr.HTML(
                f"""<img src="data:image/png;base64,{logo_base64}"
                alt="Logo"
                style="height:120px;width:auto;margin:0 auto;margin-bottom:16px; display:block;">"""
            )
        with gr.Column(scale=2):
            gr.HTML(f"""
            <div style="display:flex;justify-content:flex-start;align-items:center;gap:30px;">
                <div>
                    <h1 style="margin-bottom:0; color: {PRIMARY_COLOR}; font-size: 2.5em; font-weight: bold;">DAM-QA Demo </h1>
                    <h3 style="color: #888; font-style: italic">Describe Anything Model for Visual Question Answering on Text-rich Images</h3>
                </div>
            </div>
            """)
|
| 88 |
+
|
| 89 |
+
def create_footer():
    """Render the fixed bottom credit bar (VLAI logo + AI VIET NAM link).

    Returns the gr.HTML component so the caller can place it in a layout.
    """
    logo_base64_vlai = image_to_base64("static/vlai_logo.png")
    # Plain string for the <style> part (its CSS braces must not be treated as
    # f-string fields), concatenated with an f-string for the HTML that needs
    # the base64 logo interpolated.
    footer_html = """
    <style>
    .sticky-footer{position:fixed;bottom:0px;left:0;width:100%;background:#E8F5E8;
    padding:10px;box-shadow:0 -2px 10px rgba(0,0,0,0.1);z-index:1000;}
    .content-wrap{padding-bottom:60px;}
    </style>""" + f"""
    <div class="sticky-footer">
        <div style="text-align:center;font-size:18px; color: #888">
            Created by
            <a href="https://vlai.work" target="_blank" style="color:#465C88;text-decoration:none;font-weight:bold; display:inline-flex; align-items:center;"> VLAI
            <img src="data:image/png;base64,{logo_base64_vlai}" alt="Logo" style="height:20px; width:auto;">
            </a> from <a href="https://aivietnam.edu.vn/" target="_blank" style="color:#355724;text-decoration:none;font-weight:bold">AI VIET NAM</a>
        </div>
    </div>
    """
    return gr.HTML(footer_html)
|
| 107 |
+
|
| 108 |
+
def _build_custom_css() -> str:
    """Build the app-wide CSS from the current theme globals.

    Re-invoked by set_colors()/set_font() so the stylesheet always reflects
    the latest BG1-BG3 palette and FONT_FAMILY stack. Doubled braces escape
    literal CSS braces inside the f-string.
    """
    return f"""
    @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap');

    .gradio-container {{
        min-height: 100vh !important;
        width: 100vw !important;
        margin: 0 !important;
        padding: 0px !important;
        background: linear-gradient(135deg, {BG1} 0%, {BG2} 50%, {BG3} 100%);
        background-size: 600% 600%;
        animation: gradientBG 7s ease infinite;
    }}

    /* Global font setup */
    body, .gradio-container, .gr-block, .gr-markdown, .gr-button, .gr-input,
    .gr-dropdown, .gr-number, .gr-plot, .gr-dataframe, .gr-accordion, .gr-form,
    .gr-textbox, .gr-html, table, th, td, label, h1, h2, h3, h4, h5, h6, p, span, div {{
        font-family: {FONT_FAMILY} !important;
    }}

    @keyframes gradientBG {{
        0% {{background-position: 0% 50%;}}
        50% {{background-position: 100% 50%;}}
        100% {{background-position: 0% 50%;}}
    }}

    /* Minimize spacing and padding */
    .content-wrap {{
        padding: 2px !important;
        margin: 0 !important;
    }}

    /* Reduce component spacing */
    .gr-row {{
        gap: 5px !important;
        margin: 2px 0 !important;
    }}

    .gr-column {{
        gap: 4px !important;
        padding: 4px !important;
    }}

    /* Accordion optimization */
    .gr-accordion {{
        margin: 4px 0 !important;
    }}

    .gr-accordion .gr-accordion-content {{
        padding: 2px !important;
    }}

    /* Form elements spacing */
    .gr-form {{
        gap: 2px !important;
    }}

    /* Button styling */
    .gr-button {{
        margin: 2px 0 !important;
    }}

    /* DataFrame optimization */
    .gr-dataframe {{
        margin: 4px 0 !important;
    }}

    /* Remove horizontal scroll from data preview */
    .gr-dataframe .wrap {{
        overflow-x: auto !important;
        max-width: 100% !important;
    }}

    /* Plot optimization */
    .gr-plot {{
        margin: 4px 0 !important;
    }}

    /* Reduce markdown margins */
    .gr-markdown {{
        margin: 2px 0 !important;
    }}

    /* Footer positioning */
    .sticky-footer {{
        position: fixed;
        bottom: 0px;
        left: 0;
        width: 100%;
        background: {BG1};
        padding: 6px !important;
        box-shadow: 0 -2px 10px rgba(0,0,0,0.1);
        z-index: 1000;
    }}
    """

# Initialize the module-level CSS once at import time using the default theme.
custom_css = _build_custom_css()
|
| 207 |
+
|
| 208 |
+
def render_info_card(description: str = None, meta_items: list = None, icon: str = "🧠", title: str = "About this demo") -> str:
    """Render a blue-accented "about" card as an HTML snippet.

    Args:
        description: body text; falls back to the module-level PROJECT_DESCRIPTION.
        meta_items: list of (label, value) pairs; falls back to META_INFO.
        icon: emoji shown on the left of the card.
        title: bold heading line.

    Returns:
        An HTML string suitable for gr.HTML().
    """
    desc = description if description is not None else PROJECT_DESCRIPTION
    items = meta_items if meta_items is not None else META_INFO
    # Join metadata as "label: value" chunks separated by a middle dot.
    meta_html = " · ".join([f"<span><strong>{k}</strong>: {v}</span>" for k, v in items]) if items else ""
    return f"""
    <div style="margin: 8px 0 8px 0;">
      <div style="background:#F5F9FF;border-left:6px solid {PRIMARY_COLOR};padding:14px 16px;border-radius:10px;box-shadow:0 1px 3px rgba(0,0,0,0.06);">
        <div style="display:flex;gap:14px;align-items:flex-start;">
          <div style="font-size:22px;">{icon}</div>
          <div>
            <div style="font-weight:700;color:{PRIMARY_COLOR};margin-bottom:4px;">{title}</div>
            <div style="color:#000;font-size:14px;line-height:1.5;">{desc}</div>
            <div style="margin-top:8px;color:#000;font-size:13px;">{meta_html}</div>
          </div>
        </div>
      </div>
    </div>
    """
|
| 226 |
+
|
| 227 |
+
def render_disclaimer(text: str, icon: str = "⚠️", title: str = "Educational Use Only") -> str:
    """Render a red-accented warning card as an HTML snippet.

    Args:
        text: body text of the warning.
        icon: emoji shown on the left of the card.
        title: bold heading line.

    Returns:
        An HTML string suitable for gr.HTML().
    """
    # Triple-quoted f-string: the inner double quotes need no escaping, unlike
    # the original's backslash-escaped variant (output is byte-identical).
    return f"""
    <div style="margin: 8px 0 6px 0;">
      <div style="background:#FFF4F4;border-left:6px solid {ACCENT_COLOR};padding:12px 16px;border-radius:8px;box-shadow:0 1px 3px rgba(0,0,0,0.06);">
        <div style="display:flex;gap:10px;align-items:flex-start;color:#000;">
          <span style="font-size:20px">{icon}</span>
          <div>
            <div style="font-weight:700; margin-bottom:4px;">{title}</div>
            <div style="font-size:14px; line-height:1.4;">{text}</div>
          </div>
        </div>
      </div>
    </div>
    """
|