MedGRPO Team committed
Commit · 04f5f37 · Parent(s): 73fd321

update name

Files changed:
- .gitignore +1 -1
- README.md +12 -12
- app.py +18 -18
.gitignore CHANGED

@@ -12,5 +12,5 @@ eval-queue-bk/
 eval-results-bk/
 logs/
 .gradio/
-data
+data/*.json
 cache/
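The `.gitignore` hunk above narrows the rule from ignoring the entire `data` path to ignoring only JSON files directly under `data/`. A minimal sketch of that scope change, using Python's `fnmatch` as a rough stand-in for gitignore matching (real gitignore semantics are richer - e.g. a bare `data` pattern also matches that name at any directory depth); the file paths are hypothetical:

```python
from fnmatch import fnmatch

# After the change, only JSON files directly under data/ are ignored.
new_rule = "data/*.json"

# Hypothetical working-tree paths, for illustration only.
paths = ["data/results.json", "data/videos/clip.mp4", "cache/tmp.json"]

ignored = [p for p in paths if fnmatch(p, new_rule)]
print(ignored)  # ['data/results.json']
```

Under the old bare `data` rule, everything in that directory (including non-JSON assets) was excluded from version control.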
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title:
+title: MedVidBench Leaderboard
 emoji: 🏥
 colorFrom: blue
 colorTo: purple
@@ -7,7 +7,7 @@ sdk: gradio
 app_file: app.py
 pinned: true
 license: apache-2.0
-short_description:
+short_description: MedVidBench Benchmark Leaderboard - 8 medical video tasks
 sdk_version: 5.50.0
 tags:
 - leaderboard
@@ -16,18 +16,18 @@ tags:
 - surgical-ai
 ---

-#
+# MedVidBench Leaderboard

-Interactive leaderboard for evaluating Video-Language Models on the **
+Interactive leaderboard for evaluating Video-Language Models on the **MedVidBench benchmark** - 8 medical video understanding tasks across 8 surgical datasets.

-🚀 **Live Demo**: [huggingface.co/spaces/UIIAmerica/
+🚀 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)

 ## Overview

 This leaderboard provides a centralized platform for researchers to:
-- **Submit** inference results on the
+- **Submit** inference results on the MedVidBench test set
 - **Automatically evaluate** across 8 diverse tasks
 - **Compare** model performance on standardized metrics
 - **Track** state-of-the-art progress in medical video understanding
@@ -49,7 +49,7 @@ This leaderboard provides a centralized platform for researchers to:

 ### ⚙️ Automatic Evaluation

-The leaderboard integrates directly with the
+The leaderboard integrates directly with the MedVidBench evaluation pipeline:
 - **Validation**: Checks results file format and sample count
 - **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
 - **Parsing**: Extracts task-specific metrics from evaluation output
@@ -65,7 +65,7 @@ The leaderboard integrates directly with the MedGRPO evaluation pipeline:

 ### 1. Run Inference

-Run your model on the
+Run your model on the MedVidBench test set (6,245 samples) to generate predictions for all 8 tasks.

 ### 2. Expected Results Format

@@ -102,10 +102,10 @@ Your results file should be a JSON with this structure:

 ### 3. Upload to Leaderboard

-1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/
+1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
 2. Go to the **Submit Results** tab
 3. Fill in:
-   - **Model Name** (e.g., "Qwen2.5-VL-7B-
+   - **Model Name** (e.g., "Qwen2.5-VL-7B-MedVidBench")
    - **Organization** (e.g., "Your University")
    - **Contact** (optional)
 4. Upload your results JSON file
@@ -153,9 +153,9 @@ To compute the **average score** fairly across tasks:

 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
 - 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/
+- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/
+- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 ## Citation
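The README's submission flow above (run inference on the 6,245-sample test set, upload a predictions-only JSON, automatic validation of format and sample count) can be sketched as a local pre-check before uploading. The record schema used here (a top-level list of `{"id", "prediction"}` entries) is an assumption for illustration; the authoritative schema is the README's "Expected Results Format" section, which this diff does not show:

```python
import json

EXPECTED_TEST_SAMPLES = 6245  # MedVidBench test-set size stated in the README


def validate_predictions(path):
    """Hypothetical pre-submission check: file format plus sample count."""
    errors = []
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        return ["top-level JSON value must be a list of prediction records"]
    if len(data) != EXPECTED_TEST_SAMPLES:
        errors.append(f"expected {EXPECTED_TEST_SAMPLES} samples, got {len(data)}")
    for i, rec in enumerate(data):
        # Assumed record shape; adjust to the leaderboard's real schema.
        if not isinstance(rec, dict) or "id" not in rec or "prediction" not in rec:
            errors.append(f"record {i} is missing 'id' or 'prediction'")
            break
    return errors
```

An empty list from `validate_predictions` would mean the file passes both checks.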
app.py CHANGED

@@ -1,6 +1,6 @@
 """
-
-on the
+MedVidBench Leaderboard - Interactive leaderboard for evaluating Video-Language Models
+on the MedVidBench benchmark across 8 medical video understanding tasks.
 """

 import gradio as gr
@@ -32,7 +32,7 @@ def load_ground_truth():
 # Download from private repository
 print("⏳ Downloading ground truth from private repository...")
 gt_file = hf_hub_download(
-    repo_id="UIIAmerica/
+    repo_id="UIIAmerica/MedVidBench-GroundTruth",
     filename="ground_truth.json",
     repo_type="dataset",
     token=token,
@@ -78,7 +78,7 @@ print("=" * 60)
 GROUND_TRUTH = load_ground_truth()
 print("=" * 60)

-#
+# MedVidBench Metrics Definitions (10 metrics from 8 tasks)
 # Note: TAL has 2 metrics, DVC has 2 metrics, others have 1 metric each
 METRICS = {
     "cvs_acc": {
@@ -831,17 +831,17 @@ def format_leaderboard_display(df: pd.DataFrame) -> pd.DataFrame:


 # Create Gradio interface
-with gr.Blocks(title="
+with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:

 gr.Markdown("""
-# 🏥
+# 🏥 MedVidBench Leaderboard

-Interactive leaderboard for evaluating **Video-Language Models** on the **
+Interactive leaderboard for evaluating **Video-Language Models** on the **MedVidBench benchmark** -
 8 medical video understanding tasks across 8 surgical datasets.

-📄 **Paper**: [
+📄 **Paper**: [MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding](https://arxiv.org/abs/2512.06581)
 🌐 **Project**: [yuhaosu.github.io/MedGRPO](https://yuhaosu.github.io/MedGRPO/)
-💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/
+💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 💻 **GitHub**: [github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
 """)

@@ -867,11 +867,11 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 gr.Markdown("""
 ### Submit Your Model Results

-Upload your model's **predictions only** on the **
+Upload your model's **predictions only** on the **MedVidBench test set (6,245 samples)** to be added to the leaderboard.

 #### 📋 Requirements

-1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/
+1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/MedVidBench))
 2. **Upload predictions JSON** in the format below (NO ground truth needed)
 3. **Provide model info** (name, organization)

@@ -965,7 +965,7 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 # Tab 3: Tasks & Metrics
 with gr.Tab("📊 Tasks & Metrics"):
     gr.Markdown("""
-    ###
+    ### MedVidBench Benchmark Tasks

     The benchmark evaluates models across **8 diverse tasks** spanning video, segment, and frame-level understanding:
     """)
@@ -1025,10 +1025,10 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
 # Tab 4: About
 with gr.Tab("ℹ️ About"):
     gr.Markdown("""
-    ### About
+    ### About MedVidBench

-    **
-
+    **MedVidBench** is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+    It was introduced in the **MedGRPO** paper (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding).

     #### Key Features

@@ -1053,13 +1053,13 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:

 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
 - 🌐 **Project Page**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/
+- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/
+- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 #### Dataset

-The
+The MedVidBench benchmark includes:
 - 21,060 training samples
 - 6,245 test samples
 - Multi-modal annotations (video, text, temporal spans, bounding boxes)
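A README hunk header above mentions computing the average score "fairly across tasks", and the `METRICS` comment notes 10 metrics from 8 tasks (TAL and DVC contribute two each). One plausible reading - an assumption, since the actual aggregation code is outside this diff - is to average within each task first so that two-metric tasks are not double-weighted. All metric names except `cvs_acc`, and all score values, are made up for illustration:

```python
# Hypothetical per-metric scores on a common 0-100 scale.
scores = {"cvs_acc": 70.0, "tal_metric_a": 40.0, "tal_metric_b": 60.0}
task_of = {"cvs_acc": "cvs", "tal_metric_a": "tal", "tal_metric_b": "tal"}

# Group metrics by task, average within each task, then across tasks,
# so a two-metric task (e.g. TAL) counts once in the overall average.
by_task = {}
for metric, value in scores.items():
    by_task.setdefault(task_of[metric], []).append(value)

task_means = {t: sum(vs) / len(vs) for t, vs in by_task.items()}
overall = sum(task_means.values()) / len(task_means)
print(overall)  # 60.0: cvs stays 70.0, tal averages to 50.0
```

A plain mean over all 10 metrics would instead weight TAL and DVC twice as heavily as single-metric tasks.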
|