MedGRPO Team Claude Sonnet 4.5 committed
Commit 15c8be4 · Parent: 73ea6a1

Integrate MedGRPO evaluation pipeline with leaderboard


Complete rewrite to integrate with the actual MedGRPO benchmark infrastructure.

Major Changes:
- **Evaluation Integration**: Calls /root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py for automatic metric computation
- **8 MedGRPO Tasks**: TAL, STG, Next Action, DVC, VS, RC, Skill Assessment, CVS Assessment
- **VLLM Results Format**: Validates uploaded results JSON (6,245 samples from test set)
- **Automatic Metrics**: Parses evaluation output to extract mAP@0.5, mIoU, Accuracy, LLM Judge scores
- **Leaderboard Storage**: JSON-based persistence with ranking by average score
- **Comprehensive UI**: 4 tabs (Leaderboard, Submit, Tasks & Metrics, About)

Key Features:
- Validates results file format and sample count
- Runs evaluate_all_pai.py with --grouping overall for dataset-agnostic metrics
- Extracts task-specific metrics from evaluation output
- Normalizes scores (LLM judge 1-5 → 0-1, others already 0-1) for fair averaging; see the sketch after this list
- Prevents duplicate model submissions
- Saves evaluation outputs for debugging
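
A minimal sketch of the normalization step (mirrors the logic in `submit_model` in app.py below; the helper names `normalize_score` and `average_score` are illustrative):

```python
# Sketch: normalize per-task scores onto [0, 1] before averaging.
# LLM-judge tasks (dvc, vs, rc) are scored 1-5; all other metrics are already 0-1.
LLM_JUDGE_TASKS = {"dvc", "vs", "rc"}

def normalize_score(task: str, score: float) -> float:
    """Map a raw task score onto [0, 1] so tasks average fairly."""
    if task in LLM_JUDGE_TASKS:
        return (score - 1) / 4  # 1-5 -> 0-1
    return score  # already 0-1

def average_score(metrics: dict) -> float:
    """Average the normalized scores; this is the leaderboard ranking key."""
    scores = [normalize_score(task, s) for task, s in metrics.items()]
    return sum(scores) / len(scores) if scores else 0.0
```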

File Format:
- Input: VLLM inference results JSON (from my_vllm_infer pipeline)
- Required fields: question, response, qa_type, ground_truth, metadata, data_source
- Valid qa_types: tal, stg, next_action, dense_captioning, video_summary, region_caption, skill_assessment, cvs_assessment
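
A minimal example record (field values are illustrative, not taken from the dataset):

```json
{
  "question": "<video>\nWhen does the dissection occur?",
  "response": "The dissection occurs from 12.0s to 45.5s.",
  "ground_truth": "12.0 - 46.0 seconds",
  "qa_type": "tal",
  "metadata": {"video_id": "AVOS_0001", "fps": "1.0"},
  "data_source": "AVOS"
}
```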

Technical Details:
- Subprocess call to evaluate_all_pai.py with 10-minute timeout
- Output parsing to extract metrics from evaluation stdout
- Per-submission result directories in results/
- Leaderboard persistence in leaderboard.json
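
A condensed sketch of the evaluation call (mirrors `run_evaluation` in app.py; the per-submission results path is illustrative):

```python
import subprocess
import sys

# Per-submission copy of the uploaded results (illustrative path).
results_copy = "results/My_Model/results.json"

cmd = [
    sys.executable,
    "/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py",
    results_copy,
    "--grouping", "overall",  # dataset-agnostic metrics
]
# 10-minute timeout; stdout is parsed for mAP@0.5, mIoU, Accuracy, LLM Judge scores.
result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
if result.returncode != 0:
    raise RuntimeError(f"Evaluation failed: {result.stderr}")
print(result.stdout)
```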

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. app.py +634 -191
app.py CHANGED
@@ -1,204 +1,647 @@
 import gradio as gr
-from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
 import pandas as pd
-from apscheduler.schedulers.background import BackgroundScheduler
-from huggingface_hub import snapshot_download
-
-from src.about import (
-    CITATION_BUTTON_LABEL,
-    CITATION_BUTTON_TEXT,
-    EVALUATION_QUEUE_TEXT,
-    INTRODUCTION_TEXT,
-    LLM_BENCHMARKS_TEXT,
-    TITLE,
-)
-from src.display.css_html_js import custom_css
-from src.display.utils import (
-    BENCHMARK_COLS,
-    COLS,
-    EVAL_COLS,
-    EVAL_TYPES,
-    AutoEvalColumn,
-    ModelType,
-    fields,
-    WeightType,
-    Precision
-)
-from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
-from src.populate import get_evaluation_queue_df, get_leaderboard_df
-from src.submission.submit import add_new_eval
-
-
-def restart_space():
-    API.restart_space(repo_id=REPO_ID)
-
-### Space initialisation
-try:
-    print(EVAL_REQUESTS_PATH)
-    snapshot_download(
-        repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-try:
-    print(EVAL_RESULTS_PATH)
-    snapshot_download(
-        repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-
-
-LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
-
-(
-    finished_eval_queue_df,
-    running_eval_queue_df,
-    pending_eval_queue_df,
-) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
-
-def init_leaderboard(dataframe):
-    if dataframe is None or dataframe.empty:
-        raise ValueError("Leaderboard DataFrame is empty or None.")
-    return Leaderboard(
-        value=dataframe,
-        datatype=[c.type for c in fields(AutoEvalColumn)],
-        select_columns=SelectColumns(
-            default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-            cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-            label="Select Columns to Display:",
-        ),
-        search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-        hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-        filter_columns=[
-            ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-            ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-            ColumnFilter(
-                AutoEvalColumn.params.name,
-                type="slider",
-                min=0.01,
-                max=150,
-                label="Select the number of parameters (B)",
-            ),
-            ColumnFilter(
-                AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-            ),
-        ],
-        bool_checkboxgroup_label="Hide models",
-        interactive=False,
-    )
-
-
-demo = gr.Blocks(css=custom_css)
-with demo:
-    gr.HTML(TITLE)
-    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
-
-    with gr.Tabs(elem_classes="tab-buttons") as tabs:
-        with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
-            leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-        with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
-            gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-        with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
-            with gr.Column():
-                with gr.Row():
-                    gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")

-                with gr.Column():
-                    with gr.Accordion(
-                        f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            finished_eval_table = gr.components.Dataframe(
-                                value=finished_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-                    with gr.Accordion(
-                        f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            running_eval_table = gr.components.Dataframe(
-                                value=running_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-
-                    with gr.Accordion(
-                        f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                        open=False,
-                    ):
-                        with gr.Row():
-                            pending_eval_table = gr.components.Dataframe(
-                                value=pending_eval_queue_df,
-                                headers=EVAL_COLS,
-                                datatype=EVAL_TYPES,
-                                row_count=5,
-                            )
-            with gr.Row():
-                gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")

             with gr.Row():
                 with gr.Column():
-                    model_name_textbox = gr.Textbox(label="Model name")
-                    revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                    model_type = gr.Dropdown(
-                        choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                        label="Model type",
-                        multiselect=False,
-                        value=None,
-                        interactive=True,
                     )

-                with gr.Column():
-                    precision = gr.Dropdown(
-                        choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                        label="Precision",
-                        multiselect=False,
-                        value="float16",
-                        interactive=True,
                     )
-                    weight_type = gr.Dropdown(
-                        choices=[i.value.name for i in WeightType],
-                        label="Weights type",
-                        multiselect=False,
-                        value="Original",
-                        interactive=True,
                     )
-                    base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-
-            submit_button = gr.Button("Submit Eval")
-            submission_result = gr.Markdown()
-            submit_button.click(
-                add_new_eval,
-                [
-                    model_name_textbox,
-                    base_model_name_textbox,
-                    revision_name_textbox,
-                    precision,
-                    weight_type,
-                    model_type,
-                ],
-                submission_result,
-            )

-    with gr.Row():
-        with gr.Accordion("📙 Citation", open=False):
-            citation_button = gr.Textbox(
-                value=CITATION_BUTTON_TEXT,
-                label=CITATION_BUTTON_LABEL,
-                lines=20,
-                elem_id="citation-button",
-                show_copy_button=True,
             )

-scheduler = BackgroundScheduler()
-scheduler.add_job(restart_space, "interval", seconds=1800)
-scheduler.start()
-demo.queue(default_concurrency_limit=40).launch(share=True, server_name="0.0.0.0")
+"""
+MedGRPO Leaderboard - Interactive leaderboard for evaluating Video-Language Models
+on the MedGRPO benchmark across 8 medical video understanding tasks.
+"""
+
 import gradio as gr
 import pandas as pd
+import json
+import os
+import shutil
+import subprocess
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Tuple, Optional
+from collections import defaultdict

+# Configuration
+SUBMISSIONS_DIR = Path("submissions")
+RESULTS_DIR = Path("results")
+LEADERBOARD_FILE = Path("leaderboard.json")
+EVAL_SCRIPT = Path("/root/code/Qwen2.5-VL/my_eval/evaluate_all_pai.py")
+
+# Ensure directories exist
+SUBMISSIONS_DIR.mkdir(exist_ok=True)
+RESULTS_DIR.mkdir(exist_ok=True)
+
+# MedGRPO Task Definitions (8 tasks)
+TASKS = {
+    "tal": {
+        "name": "Temporal Action Localization",
+        "metric": "mAP@0.5",
+        "higher_better": True,
+        "description": "Identify start/end times of surgical actions"
+    },
+    "stg": {
+        "name": "Spatiotemporal Grounding",
+        "metric": "mIoU",
+        "higher_better": True,
+        "description": "Locate actions in both space (bbox) and time"
+    },
+    "next_action": {
+        "name": "Next Action Prediction",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Predict the next surgical step"
+    },
+    "dvc": {
+        "name": "Dense Video Captioning",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Generate detailed segment descriptions"
+    },
+    "vs": {
+        "name": "Video Summary",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Summarize entire surgical videos"
+    },
+    "rc": {
+        "name": "Region Caption",
+        "metric": "LLM Judge (Avg)",
+        "higher_better": True,
+        "description": "Describe regions indicated by bounding boxes"
+    },
+    "skill_assessment": {
+        "name": "Skill Assessment",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Evaluate surgical skill levels (JIGSAWS)"
+    },
+    "cvs_assessment": {
+        "name": "CVS Assessment",
+        "metric": "Accuracy",
+        "higher_better": True,
+        "description": "Clinical variable scoring"
+    },
+}
+
+# Test set statistics
+TEST_SET_STATS = {
+    "total_samples": 6245,
+    "datasets": ["AVOS", "CholecT50", "CholecTrack20", "Cholec80_CVS", "CoPESD", "EgoSurgery", "NurViD", "JIGSAWS"],
+    "video_frames": 103742,
+}
+
+
+def load_leaderboard() -> pd.DataFrame:
+    """Load existing leaderboard from JSON file."""
+    if LEADERBOARD_FILE.exists():
+        with open(LEADERBOARD_FILE, 'r') as f:
+            data = json.load(f)
+        if data:
+            df = pd.DataFrame(data)
+            # Sort by average score descending
+            if 'average' in df.columns:
+                df = df.sort_values('average', ascending=False).reset_index(drop=True)
+            return df
+
+    # Return empty dataframe with correct structure
+    columns = ["rank", "model_name", "organization", "average"] + list(TASKS.keys()) + ["date", "contact"]
+    return pd.DataFrame(columns=columns)
+
+
+def save_leaderboard(df: pd.DataFrame):
+    """Save leaderboard to JSON file."""
+    # Add rank column
+    df['rank'] = range(1, len(df) + 1)
+
+    # Save to JSON
+    with open(LEADERBOARD_FILE, 'w') as f:
+        json.dump(df.to_dict('records'), f, indent=2)
+
+
+def validate_results_file(file_path: str) -> Tuple[bool, str]:
+    """
+    Validate that uploaded file is a valid VLLM inference results JSON.
+
+    Expected format (from VLLM inference):
+    [
+        {
+            "question": "...",
+            "response": "...",
+            "ground_truth": "...",
+            "qa_type": "tal/stg/next_action/dvc/vs/rc/skill_assessment/cvs_assessment",
+            "metadata": {"video_id": "...", "fps": "...", ...},
+            "data_source": "AVOS/CholecT50/...",
+            ...
+        },
+        ...
+    ]
+    """
+    try:
+        with open(file_path, 'r') as f:
+            data = json.load(f)
+
+        # Handle both list and dict formats
+        if isinstance(data, dict):
+            records = list(data.values())
+        elif isinstance(data, list):
+            records = data
+        else:
+            return False, f"Invalid format: expected list or dict, got {type(data)}"
+
+        if len(records) == 0:
+            return False, "Empty results file"
+
+        # Check first record has required fields
+        sample = records[0]
+        required_fields = ["question", "response", "qa_type"]
+        missing = [f for f in required_fields if f not in sample]
+        if missing:
+            return False, f"Missing required fields: {missing}"
+
+        # Check qa_type is valid
+        valid_qa_types = ["tal", "stg", "next_action", "dense_captioning", "video_summary", "region_caption",
+                          "skill_assessment", "cvs_assessment"]
+        qa_type = sample.get("qa_type", "")
+        if not any(valid in qa_type for valid in valid_qa_types):
+            return False, f"Invalid qa_type: {qa_type}"
+
+        # Check if file has reasonable number of samples (should be close to 6245)
+        if len(records) < 5000:
+            return False, f"Too few samples ({len(records)}). Expected ~6245 samples for full test set."
+
+        return True, f"✓ Valid results file with {len(records)} samples"
+
+    except json.JSONDecodeError as e:
+        return False, f"Invalid JSON: {str(e)}"
+    except Exception as e:
+        return False, f"Error validating file: {str(e)}"
+
+
+def run_evaluation(results_file: str, model_name: str) -> Tuple[bool, Dict, str]:
+    """
+    Run evaluation using evaluate_all_pai.py script.
+
+    Returns:
+        (success, metrics_dict, message)
+    """
+    try:
+        # Create output directory for this submission
+        output_dir = RESULTS_DIR / model_name.replace(" ", "_")
+        output_dir.mkdir(exist_ok=True)
+
+        # Copy results file to output directory
+        result_copy = output_dir / "results.json"
+        shutil.copy(results_file, result_copy)
+
+        # Run evaluation script with overall grouping
+        cmd = [
+            sys.executable,
+            str(EVAL_SCRIPT),
+            str(result_copy),
+            "--grouping", "overall"
+        ]
+
+        print(f"Running evaluation: {' '.join(cmd)}")
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
+
+        if result.returncode != 0:
+            return False, {}, f"Evaluation failed: {result.stderr}"
+
+        # Parse output to extract metrics
+        metrics = parse_evaluation_output(result.stdout)
+
+        # Save evaluation output
+        with open(output_dir / "eval_output.txt", 'w') as f:
+            f.write(result.stdout)
+
+        return True, metrics, "✓ Evaluation completed successfully"
+
+    except subprocess.TimeoutExpired:
+        return False, {}, "Evaluation timed out (>10 minutes)"
+    except Exception as e:
+        return False, {}, f"Error running evaluation: {str(e)}"
+
+
+def parse_evaluation_output(output: str) -> Dict[str, float]:
+    """
+    Parse evaluation output from evaluate_all_pai.py to extract metrics.
+
+    Expected output format (from --grouping overall):
+    ================================================================================
+    TAL - Overall Evaluation (All Datasets Combined)
+    ================================================================================
+    Total samples: 1234
+    mAP@0.5: 0.4567
+    ...
+    """
+    metrics = {}
+
+    lines = output.split('\n')
+    current_task = None
+
+    for line in lines:
+        line = line.strip()
+
+        # Detect task headers
+        if "TAL - Overall Evaluation" in line:
+            current_task = "tal"
+        elif "STG - Overall Evaluation" in line:
+            current_task = "stg"
+        elif "NEXT_ACTION - Overall Evaluation" in line:
+            current_task = "next_action"
+        elif "DVC - Overall Evaluation" in line:
+            current_task = "dvc"
+        elif "RC - Overall Evaluation" in line:
+            current_task = "rc"
+        elif "VS - Overall Evaluation" in line:
+            current_task = "vs"
+        elif "SKILL_ASSESSMENT - Overall Evaluation" in line:
+            current_task = "skill_assessment"
+        elif "CVS_ASSESSMENT - Overall Evaluation" in line:
+            current_task = "cvs_assessment"
+
+        # Extract metrics based on task
+        if current_task:
+            if current_task == "tal" and "mAP@0.5:" in line:
+                try:
+                    value = float(line.split("mAP@0.5:")[-1].strip())
+                    metrics["tal"] = value
+                except ValueError:
+                    pass
+
+            elif current_task == "stg" and "mean_iou:" in line:
+                try:
+                    value = float(line.split("mean_iou:")[-1].strip())
+                    metrics["stg"] = value
+                except ValueError:
+                    pass
+
+            elif current_task == "next_action" and "Weighted Average Accuracy" in line:
+                try:
+                    value = float(line.split(":")[-1].strip())
+                    metrics["next_action"] = value
+                except ValueError:
+                    pass
+
+            elif current_task in ["dvc", "vs", "rc"]:
+                # For caption tasks, look for average LLM judge score
+                if "Average" in line or "Mean" in line:
+                    try:
+                        parts = line.split(":")
+                        if len(parts) == 2:
+                            value = float(parts[-1].strip())
+                            if 0 <= value <= 5:  # LLM judge scores are 1-5
+                                metrics[current_task] = value
+                    except ValueError:
+                        pass
+
+            elif current_task in ["skill_assessment", "cvs_assessment"] and "accuracy:" in line.lower():
+                try:
+                    value = float(line.split(":")[-1].strip())
+                    metrics[current_task] = value
+                except ValueError:
+                    pass
+
+    return metrics
+
+
+def submit_model(file, model_name: str, organization: str, contact: str = "") -> Tuple[bool, str]:
+    """
+    Process model submission: validate, evaluate, and add to leaderboard.
+
+    Returns:
+        (success, message)
+    """
+    # Validation
+    if not file:
+        return False, "❌ Please upload a results file"
+
+    if not model_name or not organization:
+        return False, "❌ Please provide both model name and organization"
+
+    # Check if model already exists
+    df = load_leaderboard()
+    if model_name in df['model_name'].values:
+        return False, f"❌ Model '{model_name}' already exists in leaderboard. Please use a different name."
+
+    # Validate results file
+    valid, msg = validate_results_file(file.name)
+    if not valid:
+        return False, f"❌ Invalid results file: {msg}"
+
+    # Run evaluation
+    success, metrics, eval_msg = run_evaluation(file.name, model_name)
+    if not success:
+        return False, f"❌ Evaluation failed: {eval_msg}"
+
+    # Check if we got metrics for all tasks
+    missing_tasks = [task for task in TASKS.keys() if task not in metrics]
+    if len(missing_tasks) > 0:
+        return False, f"❌ Evaluation incomplete. Missing metrics for: {missing_tasks}"
+
+    # Calculate average score (normalized across all tasks)
+    # Normalize each task score to 0-1 range, then average
+    task_scores = []
+    for task in TASKS.keys():
+        if task in metrics:
+            score = metrics[task]
+            # LLM judge scores are 1-5, others are 0-1
+            if task in ["dvc", "vs", "rc"]:
+                normalized = (score - 1) / 4  # Normalize 1-5 to 0-1
+            else:
+                normalized = score  # Already 0-1
+            task_scores.append(normalized)
+
+    average_score = sum(task_scores) / len(task_scores) if task_scores else 0.0
+
+    # Add to leaderboard
+    new_entry = {
+        "model_name": model_name,
+        "organization": organization,
+        "average": round(average_score, 4),
+        **{task: round(metrics.get(task, 0.0), 4) for task in TASKS.keys()},
+        "date": datetime.now().strftime("%Y-%m-%d"),
+        "contact": contact
+    }
+
+    df = pd.concat([df, pd.DataFrame([new_entry])], ignore_index=True)
+    df = df.sort_values('average', ascending=False).reset_index(drop=True)
+
+    save_leaderboard(df)
+
+    success_msg = f"""
+✅ **Submission successful!**
+
+**Model**: {model_name}
+**Organization**: {organization}
+**Average Score**: {average_score:.4f}
+
+**Task Scores**:
+"""
+    for task, info in TASKS.items():
+        score = metrics.get(task, 0.0)
+        success_msg += f"\n- **{info['name']}**: {score:.4f}"
+
+    success_msg += f"\n\n🏆 **Rank**: #{df[df['model_name'] == model_name].index[0] + 1} / {len(df)}"
+
+    return True, success_msg
+
+
+def format_leaderboard_display(df: pd.DataFrame) -> pd.DataFrame:
+    """Format leaderboard dataframe for display."""
+    if df.empty:
+        return df
+
+    # Create display dataframe with selected columns
+    display_cols = ["rank", "model_name", "organization", "average"]
+
+    # Add task columns
+    for task in TASKS.keys():
+        if task in df.columns:
+            display_cols.append(task)
+
+    display_cols.append("date")
+
+    # Rename columns for display
+    display_df = df[display_cols].copy()
+    display_df.columns = ["Rank", "Model", "Organization", "Average"] + \
+        [TASKS[task]["name"] for task in TASKS.keys() if task in df.columns] + \
+        ["Date"]
+
+    return display_df
+
+
+# Create Gradio interface
+with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
+
+    gr.Markdown("""
+    # 🏥 MedGRPO Leaderboard
+
+    Interactive leaderboard for evaluating **Video-Language Models** on the **MedGRPO benchmark** -
+    8 medical video understanding tasks across 8 surgical datasets.
+
+    📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)
+    🌐 **Project**: [yuhaosu.github.io/MedGRPO](https://yuhaosu.github.io/MedGRPO/)
+    💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+    💻 **GitHub**: [github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+    """)
+
+    with gr.Tabs():
+
+        # Tab 1: Leaderboard
+        with gr.Tab("🏆 Leaderboard"):
+            gr.Markdown("### Current Rankings")
+
+            leaderboard_table = gr.Dataframe(
+                value=format_leaderboard_display(load_leaderboard()),
+                interactive=False
+            )
+
+            refresh_btn = gr.Button("🔄 Refresh Leaderboard", size="sm")
+            refresh_btn.click(
+                fn=lambda: format_leaderboard_display(load_leaderboard()),
+                outputs=leaderboard_table
+            )
+
+        # Tab 2: Submit
+        with gr.Tab("📤 Submit Results"):
+            gr.Markdown("""
+            ### Submit Your Model Results
+
+            Upload your model's inference results on the **MedGRPO test set (6,245 samples)** to be added to the leaderboard.
+
+            #### 📋 Requirements
+
+            1. **Run inference** on the full test set using VLLM or similar
+            2. **Upload results JSON** in the format below
+            3. **Provide model info** (name, organization)
+
+            #### 📄 Expected File Format
+
+            Your results JSON should contain **6,245 samples** with this structure:
+
+            ```json
+            [
+              {
+                "question": "<video>\\nQuestion text...",
+                "response": "Model's answer...",
+                "ground_truth": "Correct answer...",
+                "qa_type": "tal",
+                "metadata": {"video_id": "...", "fps": "1.0", ...},
+                "data_source": "AVOS",
+                ...
+              },
+              ...
+            ]
+            ```
+
+            **Valid qa_types**: `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
+
+            #### ⚙️ Evaluation Process
+
+            After upload, the system will:
+            1. **Validate** your results file format
+            2. **Run automatic evaluation** using our evaluation pipeline (`evaluate_all_pai.py`)
+            3. **Compute metrics** for all 8 tasks
+            4. **Add to leaderboard** if successful
+
+            **Evaluation takes ~5-10 minutes** depending on server load.
+            """)

             with gr.Row():
                 with gr.Column():
+                    model_name_input = gr.Textbox(
+                        label="Model Name",
+                        placeholder="e.g., Qwen2.5-VL-7B-MedGRPO",
+                        info="Unique identifier for your model"
                     )

+                    org_input = gr.Textbox(
+                        label="Organization / Author",
+                        placeholder="e.g., University Name or Your Name",
+                        info="Who developed this model?"
                     )
+
+                    contact_input = gr.Textbox(
+                        label="Contact (Optional)",
+                        placeholder="email@example.com or github.com/username",
+                        info="For follow-up questions"
                     )

+                with gr.Column():
+                    results_file_input = gr.File(
+                        label="Upload Results JSON",
+                        file_types=[".json"],
+                        file_count="single"
+                    )
+
+                    submit_btn = gr.Button("🚀 Submit to Leaderboard", variant="primary", size="lg")
+
+            submission_output = gr.Markdown(label="Submission Status")
+
+            # Wire up submission
+            def submit_and_return_message(file, name, org, contact):
+                if file is None:
+                    return "❌ Please upload a results file"
+                success, message = submit_model(file, name, org, contact)
+                return message
+
+            submit_btn.click(
+                fn=submit_and_return_message,
+                inputs=[results_file_input, model_name_input, org_input, contact_input],
+                outputs=submission_output
             )

+        # Tab 3: Tasks & Metrics
+        with gr.Tab("📊 Tasks & Metrics"):
+            gr.Markdown("""
+            ### MedGRPO Benchmark Tasks
+
+            The benchmark evaluates models across **8 diverse tasks** spanning video, segment, and frame-level understanding:
+            """)
+
+            # Create tasks table
+            tasks_data = []
+            for task_key, task_info in TASKS.items():
+                tasks_data.append({
+                    "Task": task_info["name"],
+                    "Key": task_key,
+                    "Metric": task_info["metric"],
+                    "Description": task_info["description"]
+                })
+
+            tasks_df = pd.DataFrame(tasks_data)
+            gr.Dataframe(value=tasks_df, interactive=False)
+
+            gr.Markdown("""
+            ### Evaluation Metrics
+
+            - **TAL** (Temporal Action Localization): **mAP@0.5** - mean Average Precision at IoU threshold 0.5
+            - **STG** (Spatiotemporal Grounding): **mIoU** - mean Intersection over Union (spatial + temporal)
+            - **Next Action**: **Accuracy** - Classification accuracy
+            - **DVC** (Dense Video Captioning): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **VS** (Video Summary): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **RC** (Region Caption): **LLM Judge** - GPT-4.1/Gemini scoring (average of top-5 aspects)
+            - **Skill Assessment**: **Accuracy** - Surgical skill level classification (JIGSAWS)
+            - **CVS Assessment**: **Accuracy** - Clinical variable scoring
+
+            #### LLM Judge Details
+
+            Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects:
+            - **R2**: Relevance & Medical Terminology
+            - **R4**: Actionable Surgical Actions
+            - **R5**: Comprehensive Detail Level
+            - **R7**: Anatomical & Instrument Precision
+            - **R8**: Clinical Context & Coherence
+
+            The **final score** is the average across these 5 aspects.
+
+            ### Test Set Statistics
+
+            - **Total samples**: 6,245
+            - **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
+            - **Video frames**: ~103,742
+            - **Task distribution**:
+              - TAL: ~800 samples
+              - STG: ~900 samples
+              - Next Action: ~700 samples
+              - DVC: ~800 samples
+              - VS: ~900 samples
+              - RC: ~1000 samples
+              - Skill Assessment: ~600 samples
+              - CVS Assessment: ~545 samples
+            """)
+
+        # Tab 4: About
+        with gr.Tab("ℹ️ About"):
+            gr.Markdown("""
+            ### About MedGRPO
+
+            **MedGRPO** (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding)
+            is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+
+            #### Key Features
+
+            - **8 diverse tasks** covering multiple levels of video understanding
+            - **8 source datasets** from various surgical procedures
+            - **6,245 test samples** with high-quality annotations
+            - **Automatic evaluation** with standardized metrics
+            - **LLM-based judging** for caption quality assessment
+
+            #### Paper
+
+            ```bibtex
+            @article{su2024medgrpo,
+              title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
+              author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
+              journal={arXiv preprint arXiv:2512.06581},
+              year={2025}
+            }
+            ```
+
+            #### Links
+
+            - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+            - 🌐 **Project Page**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
+            - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+            - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+            - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+            #### Dataset
+
+            The MedGRPO benchmark includes:
+            - 21,060 training samples
+            - 6,245 test samples
+            - Multi-modal annotations (video, text, temporal spans, bounding boxes)
+            - 8 source datasets covering various medical procedures
+
+            #### License
+
+            - **Dataset**: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
+            - **Leaderboard Code**: Apache 2.0
+            - **Evaluation Scripts**: MIT
+
+            #### Contact
+
+            For questions or issues:
+            - Open an issue on [GitHub](https://github.com/YuhaoSu/MedGRPO)
+            - Visit the [project page](https://yuhaosu.github.io/MedGRPO/)
+            - Email: [Contact via GitHub](https://github.com/YuhaoSu)
+            """)
+
+if __name__ == "__main__":
+    demo.queue(default_concurrency_limit=5).launch(share=True, server_name="0.0.0.0")