MedGRPO Team committed
Commit · 04f5f37 · Parent(s): 73fd321

update name

Files changed:
- .gitignore +1 -1
- README.md +12 -12
- app.py +18 -18
.gitignore CHANGED

@@ -12,5 +12,5 @@ eval-queue-bk/
 eval-results-bk/
 logs/
 .gradio/
-data
+data/*.json
 cache/
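The `.gitignore` hunk above narrows the rule from ignoring the entire `data` path to ignoring only JSON files directly under `data/`. A minimal sketch of that scope change, using Python's `fnmatch` as a rough stand-in for gitignore matching (real gitignore semantics are richer - e.g. a bare `data` pattern also matches that name at any directory depth); the file paths are hypothetical:

```python
from fnmatch import fnmatch

# After the change, only JSON files directly under data/ are ignored.
new_rule = "data/*.json"

# Hypothetical working-tree paths, for illustration only.
paths = ["data/results.json", "data/videos/clip.mp4", "cache/tmp.json"]

ignored = [p for p in paths if fnmatch(p, new_rule)]
print(ignored)  # ['data/results.json']
```

Under the old bare `data` rule, everything in that directory (including non-JSON assets) was excluded from version control.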
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title:
+title: MedVidBench Leaderboard
 emoji: 🏥
 colorFrom: blue
 colorTo: purple
@@ -7,7 +7,7 @@ sdk: gradio
 app_file: app.py
 pinned: true
 license: apache-2.0
-short_description:
+short_description: MedVidBench Benchmark Leaderboard - 8 medical video tasks
 sdk_version: 5.50.0
 tags:
 - leaderboard
@@ -16,18 +16,18 @@ tags:
 - surgical-ai
 ---

-#
+# MedVidBench Leaderboard

-Interactive leaderboard for evaluating Video-Language Models on the **
+Interactive leaderboard for evaluating Video-Language Models on the **MedVidBench benchmark** - 8 medical video understanding tasks across 8 surgical datasets.

-🚀 **Live Demo**: [huggingface.co/spaces/UIIAmerica/
+🚀 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)

 ## Overview

 This leaderboard provides a centralized platform for researchers to:
-- **Submit** inference results on the
+- **Submit** inference results on the MedVidBench test set
 - **Automatically evaluate** across 8 diverse tasks
 - **Compare** model performance on standardized metrics
 - **Track** state-of-the-art progress in medical video understanding
@@ -49,7 +49,7 @@ This leaderboard provides a centralized platform for researchers to:

 ### ⚙️ Automatic Evaluation

-The leaderboard integrates directly with the
+The leaderboard integrates directly with the MedVidBench evaluation pipeline:
 - **Validation**: Checks results file format and sample count
 - **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
 - **Parsing**: Extracts task-specific metrics from evaluation output
@@ -65,7 +65,7 @@ The leaderboard integrates directly with the MedGRPO evaluation pipeline:

 ### 1. Run Inference

-Run your model on the
+Run your model on the MedVidBench test set (6,245 samples) to generate predictions for all 8 tasks.

 ### 2. Expected Results Format

@@ -102,10 +102,10 @@ Your results file should be a JSON with this structure:

 ### 3. Upload to Leaderboard

-1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/
+1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
 2. Go to the **Submit Results** tab
 3. Fill in:
-   - **Model Name** (e.g., "Qwen2.5-VL-7B-
+   - **Model Name** (e.g., "Qwen2.5-VL-7B-MedVidBench")
    - **Organization** (e.g., "Your University")
    - **Contact** (optional)
 4. Upload your results JSON file
@@ -153,9 +153,9 @@ To compute the **average score** fairly across tasks:

 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
 - 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/
+- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/
+- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 ## Citation
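The README's submission flow above (run inference on the 6,245-sample test set, upload a predictions-only JSON, automatic validation of format and sample count) can be sketched as a local pre-check before uploading. The record schema used here (a top-level list of `{"id", "prediction"}` entries) is an assumption for illustration; the authoritative schema is the README's "Expected Results Format" section, which this diff does not show:

```python
import json

EXPECTED_TEST_SAMPLES = 6245  # MedVidBench test-set size stated in the README


def validate_predictions(path):
    """Hypothetical pre-submission check: file format plus sample count."""
    errors = []
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        return ["top-level JSON value must be a list of prediction records"]
    if len(data) != EXPECTED_TEST_SAMPLES:
        errors.append(f"expected {EXPECTED_TEST_SAMPLES} samples, got {len(data)}")
    for i, rec in enumerate(data):
        # Assumed record shape; adjust to the leaderboard's real schema.
        if not isinstance(rec, dict) or "id" not in rec or "prediction" not in rec:
            errors.append(f"record {i} is missing 'id' or 'prediction'")
            break
    return errors
```

An empty list from `validate_predictions` would mean the file passes both checks.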
app.py CHANGED

@@ -1,6 +1,6 @@
 """
-
-on the
+MedVidBench Leaderboard - Interactive leaderboard for evaluating Video-Language Models
+on the MedVidBench benchmark across 8 medical video understanding tasks.
 """

 import gradio as gr
@@ -32,7 +32,7 @@ def load_ground_truth():
 # Download from private repository
 print("⏳ Downloading ground truth from private repository...")
 gt_file = hf_hub_download(
-    repo_id="UIIAmerica/
+    repo_id="UIIAmerica/MedVidBench-GroundTruth",
     filename="ground_truth.json",
     repo_type="dataset",
     token=token,
@@ -78,7 +78,7 @@ print("=" * 60)
 GROUND_TRUTH = load_ground_truth()
 print("=" * 60)

-#
+# MedVidBench Metrics Definitions (10 metrics from 8 tasks)
 # Note: TAL has 2 metrics, DVC has 2 metrics, others have 1 metric each
 METRICS = {
     "cvs_acc": {
@@ -831,17 +831,17 @@ def format_leaderboard_display(df: pd.DataFrame) -> pd.DataFrame:


 # Create Gradio interface
-with gr.Blocks(title="
+with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:

 gr.Markdown("""
-# 🏥
+# 🏥 MedVidBench Leaderboard

-Interactive leaderboard for evaluating **Video-Language Models** on the **
+Interactive leaderboard for evaluating **Video-Language Models** on the **MedVidBench benchmark** -
 8 medical video understanding tasks across 8 surgical datasets.

-📄 **Paper**: [
+📄 **Paper**: [MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding](https://arxiv.org/abs/2512.06581)
 🌐 **Project**: [yuhaosu.github.io/MedGRPO](https://yuhaosu.github.io/MedGRPO/)
-💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/
+💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 💻 **GitHub**: [github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
 """)

@@ -867,11 +867,11 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 gr.Markdown("""
 ### Submit Your Model Results

-Upload your model's **predictions only** on the **
+Upload your model's **predictions only** on the **MedVidBench test set (6,245 samples)** to be added to the leaderboard.

 #### 📋 Requirements

-1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/
+1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/MedVidBench))
 2. **Upload predictions JSON** in the format below (NO ground truth needed)
 3. **Provide model info** (name, organization)

@@ -965,7 +965,7 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 # Tab 3: Tasks & Metrics
 with gr.Tab("📊 Tasks & Metrics"):
     gr.Markdown("""
-    ###
+    ### MedVidBench Benchmark Tasks

     The benchmark evaluates models across **8 diverse tasks** spanning video, segment, and frame-level understanding:
     """)
@@ -1025,10 +1025,10 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 # Tab 4: About
 with gr.Tab("ℹ️ About"):
     gr.Markdown("""
-    ### About
+    ### About MedVidBench

-    **
-
+    **MedVidBench** is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+    It was introduced in the **MedGRPO** paper (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding).

     #### Key Features

@@ -1053,13 +1053,13 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:

 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
 - 🌐 **Project Page**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/
+- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/
+- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 #### Dataset

-The
+The MedVidBench benchmark includes:
 - 21,060 training samples
 - 6,245 test samples
 - Multi-modal annotations (video, text, temporal spans, bounding boxes)
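A README hunk header above mentions computing the average score "fairly across tasks", and the `METRICS` comment notes 10 metrics from 8 tasks (TAL and DVC contribute two each). One plausible reading - an assumption, since the actual aggregation code is outside this diff - is to average within each task first so that two-metric tasks are not double-weighted. All metric names except `cvs_acc`, and all score values, are made up for illustration:

```python
# Hypothetical per-metric scores on a common 0-100 scale.
scores = {"cvs_acc": 70.0, "tal_metric_a": 40.0, "tal_metric_b": 60.0}
task_of = {"cvs_acc": "cvs", "tal_metric_a": "tal", "tal_metric_b": "tal"}

# Group metrics by task, average within each task, then across tasks,
# so a two-metric task (e.g. TAL) counts once in the overall average.
by_task = {}
for metric, value in scores.items():
    by_task.setdefault(task_of[metric], []).append(value)

task_means = {t: sum(vs) / len(vs) for t, vs in by_task.items()}
overall = sum(task_means.values()) / len(task_means)
print(overall)  # 60.0: cvs stays 70.0, tal averages to 50.0
```

A plain mean over all 10 metrics would instead weight TAL and DVC twice as heavily as single-metric tasks.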
|