MedGRPO Team committed on
Commit 04f5f37 · 1 Parent(s): 73fd321

update name

Files changed (3)
  1. .gitignore +1 -1
  2. README.md +12 -12
  3. app.py +18 -18
.gitignore CHANGED
@@ -12,5 +12,5 @@ eval-queue-bk/
 eval-results-bk/
 logs/
 .gradio/
-data/ground_truth.json
+data/*.json
 cache/
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: MedGRPO Leaderboard
+title: MedVidBench Leaderboard
 emoji: 🏥
 colorFrom: blue
 colorTo: purple
@@ -7,7 +7,7 @@ sdk: gradio
 app_file: app.py
 pinned: true
 license: apache-2.0
-short_description: MedGRPO Benchmark Leaderboard - 8 medical video tasks
+short_description: MedVidBench Benchmark Leaderboard - 8 medical video tasks
 sdk_version: 5.50.0
 tags:
 - leaderboard
@@ -16,18 +16,18 @@ tags:
 - surgical-ai
 ---

-# MedGRPO Leaderboard
+# MedVidBench Leaderboard

-Interactive leaderboard for evaluating Video-Language Models on the **MedGRPO benchmark** - 8 medical video understanding tasks across 8 surgical datasets.
+Interactive leaderboard for evaluating Video-Language Models on the **MedVidBench benchmark** - 8 medical video understanding tasks across 8 surgical datasets.

-🏆 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+🏆 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)

 ## Overview

 This leaderboard provides a centralized platform for researchers to:
-- **Submit** inference results on the MedGRPO test set
+- **Submit** inference results on the MedVidBench test set
 - **Automatically evaluate** across 8 diverse tasks
 - **Compare** model performance on standardized metrics
 - **Track** state-of-the-art progress in medical video understanding
@@ -49,7 +49,7 @@ This leaderboard provides a centralized platform for researchers to:

 ### ⚙️ Automatic Evaluation

-The leaderboard integrates directly with the MedGRPO evaluation pipeline:
+The leaderboard integrates directly with the MedVidBench evaluation pipeline:
 - **Validation**: Checks results file format and sample count
 - **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
 - **Parsing**: Extracts task-specific metrics from evaluation output
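The **Validation** step described in this hunk is easy to picture: the upload must parse as JSON and contain one prediction per test-set sample. The sketch below is illustrative only, not the leaderboard's actual code; it assumes the results file is a JSON list or object of per-sample predictions and uses the 6,245-sample count stated elsewhere in the README.

```python
# Illustrative sketch of the "Validation" step, not the leaderboard's actual code.
# Assumption: the uploaded results file is a JSON list (or dict) of per-sample predictions.
import json

EXPECTED_SAMPLES = 6245  # MedVidBench test-set size stated in the README


def validate_results_file(path: str) -> None:
    """Raise ValueError if the file is not valid JSON or has the wrong sample count."""
    with open(path, "r", encoding="utf-8") as f:
        results = json.load(f)  # json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(results, (list, dict)):
        raise ValueError("Results file must be a JSON list or object of predictions")
    if len(results) != EXPECTED_SAMPLES:
        raise ValueError(f"Expected {EXPECTED_SAMPLES} predictions, found {len(results)}")
```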
@@ -65,7 +65,7 @@ The leaderboard integrates directly with the MedGRPO evaluation pipeline:

 ### 1. Run Inference

-Run your model on the MedGRPO test set (6,245 samples) to generate predictions for all 8 tasks.
+Run your model on the MedVidBench test set (6,245 samples) to generate predictions for all 8 tasks.

 ### 2. Expected Results Format

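For the **1. Run Inference** step in the hunk above, the loop below gives a rough picture. It is a sketch under stated assumptions: the test split is assumed to load via `datasets.load_dataset` (check the dataset card for the actual packaging), and the `run_model` stub, the `id` field, and the output record layout are placeholders; the README's "Expected Results Format" section defines the real structure.

```python
# Rough sketch of the inference step; run_model(), the "id" field, and the output
# record layout are placeholders. Follow the README's "Expected Results Format"
# section for the structure the leaderboard actually expects.
import json

from datasets import load_dataset


def run_model(sample) -> str:
    """Placeholder: replace with your Video-Language Model's inference call."""
    return ""


# Assumes the test split loads via `datasets`; see the dataset card if it is packaged differently.
test_set = load_dataset("UIIAmerica/MedVidBench", split="test")  # 6,245 samples expected

predictions = []
for sample in test_set:
    predictions.append({"sample_id": sample.get("id"), "prediction": run_model(sample)})

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```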
@@ -102,10 +102,10 @@ Your results file should be a JSON with this structure:

 ### 3. Upload to Leaderboard

-1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)
 2. Go to the **Submit Results** tab
 3. Fill in:
-   - **Model Name** (e.g., "Qwen2.5-VL-7B-MedGRPO")
+   - **Model Name** (e.g., "Qwen2.5-VL-7B-MedVidBench")
    - **Organization** (e.g., "Your University")
    - **Contact** (optional)
 4. Upload your results JSON file
@@ -153,9 +153,9 @@ To compute the **average score** fairly across tasks:

 - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
 - 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
 - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

 ## Citation

app.py CHANGED
@@ -1,6 +1,6 @@
 """
-MedGRPO Leaderboard - Interactive leaderboard for evaluating Video-Language Models
-on the MedGRPO benchmark across 8 medical video understanding tasks.
+MedVidBench Leaderboard - Interactive leaderboard for evaluating Video-Language Models
+on the MedVidBench benchmark across 8 medical video understanding tasks.
 """

 import gradio as gr
@@ -32,7 +32,7 @@ def load_ground_truth():
     # Download from private repository
     print("⏳ Downloading ground truth from private repository...")
     gt_file = hf_hub_download(
-        repo_id="UIIAmerica/MedGRPO-GroundTruth",
+        repo_id="UIIAmerica/MedVidBench-GroundTruth",
         filename="ground_truth.json",
         repo_type="dataset",
         token=token,
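For context on the `hf_hub_download` call above: outside the Space, the same private ground-truth file can be fetched as sketched below. This is illustrative; the `HF_TOKEN` environment-variable name and the post-load handling are assumptions, and because the repository is private the call only succeeds with an authorized token.

```python
# Illustrative sketch of the ground-truth download shown in the hunk above.
# Assumption: an access token for the private dataset repo is exposed as HF_TOKEN.
import json
import os

from huggingface_hub import hf_hub_download

gt_file = hf_hub_download(
    repo_id="UIIAmerica/MedVidBench-GroundTruth",  # repo id introduced by this commit
    filename="ground_truth.json",
    repo_type="dataset",
    token=os.environ.get("HF_TOKEN"),  # env-var name is an assumption
)

with open(gt_file, "r", encoding="utf-8") as f:
    ground_truth = json.load(f)
print(f"Loaded {len(ground_truth)} ground-truth entries")
```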
@@ -78,7 +78,7 @@ print("=" * 60)
 GROUND_TRUTH = load_ground_truth()
 print("=" * 60)

-# MedGRPO Metrics Definitions (10 metrics from 8 tasks)
+# MedVidBench Metrics Definitions (10 metrics from 8 tasks)
 # Note: TAL has 2 metrics, DVC has 2 metrics, others have 1 metric each
 METRICS = {
     "cvs_acc": {
@@ -831,17 +831,17 @@ def format_leaderboard_display(df: pd.DataFrame) -> pd.DataFrame:


 # Create Gradio interface
-with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
+with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:

     gr.Markdown("""
-    # 🏥 MedGRPO Leaderboard
+    # 🏥 MedVidBench Leaderboard

-    Interactive leaderboard for evaluating **Video-Language Models** on the **MedGRPO benchmark** -
+    Interactive leaderboard for evaluating **Video-Language Models** on the **MedVidBench benchmark** -
     8 medical video understanding tasks across 8 surgical datasets.

-    📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)
+    📄 **Paper**: [MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding](https://arxiv.org/abs/2512.06581)
     🌐 **Project**: [yuhaosu.github.io/MedGRPO](https://yuhaosu.github.io/MedGRPO/)
-    💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+    💾 **Dataset**: [huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
     💻 **GitHub**: [github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
     """)

@@ -867,11 +867,11 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
         gr.Markdown("""
         ### Submit Your Model Results

-        Upload your model's **predictions only** on the **MedGRPO test set (6,245 samples)** to be added to the leaderboard.
+        Upload your model's **predictions only** on the **MedVidBench test set (6,245 samples)** to be added to the leaderboard.

         #### 📋 Requirements

-        1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/MedGRPO))
+        1. **Run inference** on the full test set (download from [HuggingFace](https://huggingface.co/datasets/UIIAmerica/MedVidBench))
         2. **Upload predictions JSON** in the format below (NO ground truth needed)
         3. **Provide model info** (name, organization)

@@ -965,7 +965,7 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
     # Tab 3: Tasks & Metrics
     with gr.Tab("📊 Tasks & Metrics"):
         gr.Markdown("""
-        ### MedGRPO Benchmark Tasks
+        ### MedVidBench Benchmark Tasks

         The benchmark evaluates models across **8 diverse tasks** spanning video, segment, and frame-level understanding:
         """)
@@ -1025,10 +1025,10 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:
     # Tab 4: About
     with gr.Tab("ℹ️ About"):
         gr.Markdown("""
-        ### About MedGRPO
+        ### About MedVidBench

-        **MedGRPO** (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding)
-        is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+        **MedVidBench** is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding.
+        It was introduced in the **MedGRPO** paper (Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding).

         #### Key Features

@@ -1053,13 +1053,13 @@ with gr.Blocks(title="MedGRPO Leaderboard", theme=gr.themes.Soft()) as demo:

         - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
         - 🌐 **Project Page**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
-        - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+        - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedVidBench](https://huggingface.co/datasets/UIIAmerica/MedVidBench)
         - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
-        - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+        - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedVidBench-Leaderboard)

         #### Dataset

-        The MedGRPO benchmark includes:
+        The MedVidBench benchmark includes:
         - 21,060 training samples
         - 6,245 test samples
         - Multi-modal annotations (video, text, temporal spans, bounding boxes)
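Taken together, the app.py hunks in this commit only touch strings inside an interface whose scaffolding looks roughly like the sketch below. It is a minimal reconstruction for orientation, not the Space's actual code: only the "📊 Tasks & Metrics" and "ℹ️ About" tab labels are visible in the diff, so the other labels and the launch call are assumptions.

```python
# Minimal Gradio scaffolding in the spirit of app.py; illustrative only.
# Tab labels other than "Tasks & Metrics" and "About" are assumptions based on the README.
import gradio as gr

with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:
    gr.Markdown("# 🏥 MedVidBench Leaderboard")

    with gr.Tab("🏆 Leaderboard"):        # assumed label
        gr.Markdown("Ranked results table goes here.")

    with gr.Tab("📤 Submit Results"):     # tab referenced in the README; emoji assumed
        gr.Markdown("Upload a predictions JSON plus model name and organization here.")

    with gr.Tab("📊 Tasks & Metrics"):    # label visible in the diff
        gr.Markdown("Descriptions of the 8 tasks and their metrics go here.")

    with gr.Tab("ℹ️ About"):              # label visible in the diff
        gr.Markdown("Benchmark background and resource links go here.")

if __name__ == "__main__":
    demo.launch()
```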