MedGRPO Team committed
Commit · a605ebb
1 Parent(s): 04f5f37
update
- evaluation/README.md +409 -0
- evaluation/eval_dvc.py +2 -1
- evaluation/eval_next_action.py +2 -1
- evaluation/eval_stg.py +2 -1
- evaluation/eval_tal.py +2 -1
- evaluation/evaluate_predictions.py +279 -0
- evaluation/test_evaluation.sh +140 -0
evaluation/README.md
ADDED
@@ -0,0 +1,409 @@
# MedVidBench Leaderboard Evaluation

Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.

## Overview

This evaluation system supports two input formats:

1. **Merged format** (already contains ground-truth): Like `results.json`
2. **Prediction-only format** (needs ground-truth): Like user submissions

The wrapper automatically detects which format you're using and handles ground-truth merging if needed.

## Quick Start

```bash
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>

# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg

# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only

# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall
```

## Input Formats

### Format 1: Prediction-Only (User Submission Format)

```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]
```

**ID Format**: `video_id&&start_frame&&end_frame&&fps`

**Required fields**:
- `id`: Unique identifier matching ground-truth
- `qa_type`: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
- `prediction`: Model's prediction text

**What happens**: The wrapper automatically merges with `ground_truth.json` to create complete evaluation records.
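
For illustration, a submission in this format can be assembled from your own inference outputs roughly as follows. This is a hypothetical sketch, not part of the benchmark tooling: `run_model` stands in for your inference code, and it assumes the public benchmark file is a JSON list whose entries carry `id` and `qa_type` fields.

```python
# Hypothetical sketch: build a prediction-only submission file.
import json

def build_submission(benchmark_path, out_path, run_model):
    with open(benchmark_path) as f:
        benchmark = json.load(f)              # assumed: list of samples with "id" and "qa_type"
    submission = [
        {
            "id": sample["id"],               # copied verbatim so it matches the ground-truth index
            "qa_type": sample["qa_type"],
            "prediction": run_model(sample),  # your model's answer text
        }
        for sample in benchmark
    ]
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)

# Example (placeholder paths): build_submission("cleaned_test_data_11_04.json", "user_predictions.json", my_model)
```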

### Format 2: Merged (Complete Format)

```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}
```

**Required fields**:
- `metadata`: Video metadata (video_id, fps, frame range)
- `qa_type`: Task type
- `struc_info`: Ground-truth structured information
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name

**What happens**: The wrapper uses the file directly for evaluation.
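
As a quick reference, a minimal check (not part of the toolkit) that every record in a merged-format file carries the fields listed above could look like this:

```python
# Minimal sketch: verify merged-format records have the expected fields.
import json

REQUIRED = ["metadata", "qa_type", "struc_info", "question", "gnd", "answer", "data_source"]

def check_merged(path):
    with open(path) as f:
        data = json.load(f)  # merged format: dict keyed by "0", "1", ...
    missing = {key: [field for field in REQUIRED if field not in rec] for key, rec in data.items()}
    missing = {key: fields for key, fields in missing.items() if fields}
    print(f"{len(data) - len(missing)}/{len(data)} records complete")
    return missing

# Example: check_merged("results.json")
```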

## Ground-Truth File

**Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`

**Structure**: Array of records, each containing:
- Complete metadata (video_id, fps, frame range)
- `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source

**Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.

## Supported Tasks

| Task | qa_type | Metrics |
|------|---------|---------|
| **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
| **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
| **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
| **Next Action** | `next_action` | Accuracy (per-dataset) |
| **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
| **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |
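
For intuition about the TAL numbers, Recall@τ counts ground-truth spans that are covered by at least one predicted span with temporal IoU ≥ τ. The sketch below illustrates the span-level computation; it is a simplified illustration, not the exact implementation in `eval_tal.py`.

```python
# Simplified illustration of temporal IoU and Recall@threshold for TAL spans.
def temporal_iou(a, b):
    """IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(gt_spans, pred_spans, thr=0.5):
    """Fraction of ground-truth spans matched by some prediction with IoU >= thr."""
    if not gt_spans:
        return 0.0
    hits = sum(1 for g in gt_spans if any(temporal_iou(g, p) >= thr for p in pred_spans))
    return hits / len(gt_spans)

# Example: recall_at([(22.0, 78.0)], [(20.0, 75.0), (89.0, 94.0)], thr=0.5) -> 1.0
```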

## Usage Examples

### Example 1: Evaluate User Submission (Prediction-Only)

```bash
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json

# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
```

### Example 2: Evaluate Internal Results (Already Merged)

```bash
# Internal evaluation with complete data
python evaluate_predictions.py results.json

# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
```

### Example 3: Specific Tasks Only

```bash
# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
```

### Example 4: Different Grouping Modes

```bash
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset

# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```

### Example 5: Skip LLM Judge (Use Pre-computed Scores)

```bash
# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge
```

### Example 6: Analyze File Structure

```bash
# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only

# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...
```

## Command-Line Options

```
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH       Path to ground-truth JSON (default: data/ground_truth.json)
  --tasks TASK [TASK ...]   Specific tasks to evaluate (default: all available)
                            Choices: dvc, tal, next_action, stg, rc, vs,
                                     skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                            Grouping strategy (default: per-dataset)
                            - per-dataset: Results per dataset
                            - overall: Aggregate all datasets
  --analyze-only            Only analyze file structure, no evaluation
  --skip-llm-judge          Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help                Show help message
```

## Output Format

### Per-Dataset Grouping (Default)

```
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================

AVOS:
  TAL:
    recall@0.3: 0.45
    meanIoU@0.3: 0.42
    recall@0.5: 0.32
    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...
```

### Overall Grouping

```
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================

TAL - Overall Evaluation (All Datasets Combined)
Total samples: 1637

recall@0.3: 0.48
meanIoU@0.3: 0.45
recall@0.5: 0.35
meanIoU@0.5: 0.30
```

## Workflow for User Submissions

1. **User downloads benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
   - Contains questions but NO ground-truth (struc_info removed)

2. **User runs inference**: Generates predictions for each sample

3. **User submits predictions**: prediction-only format (see the sanity-check sketch after this list)
   ```json
   [
     {
       "id": "<from benchmark>",
       "qa_type": "<from benchmark>",
       "prediction": "<model output>"
     },
     ...
   ]
   ```

4. **Server evaluates**:
   ```bash
   python evaluate_predictions.py user_submission.json
   ```
   - Auto-detects format
   - Merges with server-side ground-truth
   - Runs evaluation
   - Returns metrics
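
Before submitting the file from step 3, a quick check that every benchmark sample has exactly one prediction can save a rejected run. A minimal sketch, assuming the public benchmark file is a JSON list whose entries carry the same `id` values the server expects:

```python
# Pre-submission sanity check (sketch; file names are placeholders).
import json

def check_submission(benchmark_path, submission_path):
    with open(benchmark_path) as f:
        expected = {s["id"] for s in json.load(f)}
    with open(submission_path) as f:
        pred_ids = [p["id"] for p in json.load(f)]
    missing = expected - set(pred_ids)
    unexpected = set(pred_ids) - expected
    duplicates = len(pred_ids) - len(set(pred_ids))
    print(f"missing: {len(missing)}, unexpected: {len(unexpected)}, duplicates: {duplicates}")
    return not missing and not unexpected and duplicates == 0

# Example: check_submission("cleaned_test_data_11_04.json", "user_predictions.json")
```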

## File Structure

```
evaluation/
├── README.md                    # This file
├── evaluate_predictions.py      # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py          # Core evaluation orchestrator
├── eval_tal.py                  # TAL evaluation
├── eval_stg.py                  # STG evaluation
├── eval_dvc.py                  # Dense captioning evaluation
├── eval_next_action.py          # Next action evaluation
├── eval_caption_llm_judge.py    # RC/VS LLM judge evaluation
├── eval_skill_assessment.py     # Skill assessment evaluation
└── eval_cvs_assessment.py       # CVS assessment evaluation
```

## Key Features

### Auto-Detection Logic

The wrapper detects format by checking for these indicators:

**Prediction-only format**:
- Has `id` field (video_id&&start&&end&&fps)
- Has `prediction` field
- Missing `gnd` or `struc_info`

**Merged format**:
- Has `question` field
- Has `gnd` field (ground-truth)
- Has `struc_info` field (structured GT)
- Has `metadata` dict

### Ground-Truth Merging

When prediction-only format is detected:

1. Load predictions and ground-truth
2. Build index: `{id -> ground_truth_record}`
3. For each prediction:
   - Look up ground-truth by ID
   - Merge into complete record
   - Add `data_source` from ground-truth
4. Save to temporary file
5. Run evaluation
6. Clean up temporary file

### Dataset Detection

Datasets are detected from:
1. **`data_source` field** (primary, leaderboard format)
2. `dataset` field (fallback)
3. `dataset_name` field (fallback)
4. Video ID patterns (last resort):
   - YouTube IDs (11 chars with letters) → AVOS
   - `*_part*` pattern → CoPESD
   - `video*` pattern → CholecT50
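
The last-resort heuristics could be expressed roughly as below; this is a sketch of the patterns listed above, not the authoritative logic in the `eval_*.py` scripts.

```python
# Sketch of the video_id fallback heuristics described above.
import re

def infer_dataset(video_id: str) -> str:
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    # YouTube-style ID: 11 characters of letters, digits, '-' or '_', with at least one letter
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", video_id) and re.search(r"[A-Za-z]", video_id):
        return "AVOS"
    return "Unknown"

# Example: infer_dataset("kcOqlifSukA") -> "AVOS"
```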

## Error Handling

### Missing Ground-Truth

```bash
# If ground-truth file not found
[EvaluationWrapper] ✗ ERROR: Ground-truth file not found: /path/to/ground_truth.json
```

**Solution**: Specify correct path with `--ground-truth`

### Unmatched Predictions

```bash
[EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
```

**Cause**: Prediction IDs don't match ground-truth IDs

**Solution**: Check ID format (video_id&&start&&end&&fps must match exactly)

### Invalid ID Format

```bash
ValueError: Invalid ID format: <id_string>
```

**Cause**: ID doesn't follow `video_id&&start&&end&&fps` format

**Solution**: Fix ID format in predictions

## API Keys for LLM Judge

For RC/VS evaluation with LLM judge:

```bash
export OPENAI_API_KEY="your-key"   # For GPT-4.1
export GOOGLE_API_KEY="your-key"   # For Gemini

# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
```

**Cost**: ~$0.012 per RC/VS sample (GPT-4.1)
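
Since the judge is billed per sample, a rough budget can be estimated from a prediction-only submission before launching the run. A sketch, using the ~$0.012 per-sample figure above and assuming RC/VS records are identified by their `qa_type` prefixes:

```python
# Rough LLM-judge cost estimate for a prediction-only submission (sketch).
import json

COST_PER_SAMPLE = 0.012  # USD, approximate GPT-4.1 figure quoted above

def estimate_judge_cost(submission_path):
    with open(submission_path) as f:
        preds = json.load(f)
    judged = [p for p in preds
              if str(p.get("qa_type", "")).startswith(("region_caption", "video_summary", "rc", "vs"))]
    cost = len(judged) * COST_PER_SAMPLE
    print(f"{len(judged)} RC/VS samples -> ~${cost:.2f}")
    return cost

# Example: estimate_judge_cost("user_predictions.json")
```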

## Verification Checklist

Before evaluating submissions:

```bash
# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only

# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json

# 3. Run evaluation on a sample (first 100 records; slice with Python so
#    sample.json stays valid JSON -- `head -100` would truncate the array)
python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
python evaluate_predictions.py sample.json

# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json
```

## Performance

- **Small files** (100 samples): ~5-10 seconds
- **Full benchmark** (6245 samples): ~2-5 minutes
- TAL/STG: ~30 seconds per dataset
- Next Action: ~20 seconds per dataset
- DVC: ~1-2 minutes (metric computation)
- RC/VS with LLM judge: ~5-10 minutes (API calls)

## Notes

- Ground-truth file contains **1414 records** (subset for leaderboard testing)
- Full benchmark has **6245 records** across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- LLM judge can be skipped with `--skip-llm-judge` if scores are pre-computed

evaluation/eval_dvc.py
CHANGED
@@ -112,7 +112,8 @@ def group_records_by_dataset(data):
         if not any(x in qa_type.lower() for x in ['dense_captioning', 'dense_caption', 'dc']):
             continue
 
-
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_next_action.py
CHANGED
@@ -494,7 +494,8 @@ def group_records_by_dataset(data):
         if 'next_action' not in qa_type.lower():
             continue
 
-
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_stg.py
CHANGED
@@ -210,7 +210,8 @@ def group_records_by_dataset(data):
         if 'stg' not in qa_type.lower():
             continue
 
-
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_tal.py
CHANGED
@@ -139,7 +139,8 @@ def group_records_by_dataset(data):
 
         # Detect dataset from video_id or other fields
         video_id = record.get('video_id', '')
-
+        # Check data_source first (used in leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', 'Unknown')))
 
         if dataset == 'Unknown' and video_id:
             # Try to infer from video_id patterns
evaluation/evaluate_predictions.py
ADDED
@@ -0,0 +1,279 @@
"""Auto-detect prediction format and evaluate with ground-truth merging if needed."""

import json
import sys
import argparse
import os
from pathlib import Path

# Add evaluation directory to path to import evaluate_all_pai
eval_dir = Path(__file__).parent
sys.path.insert(0, str(eval_dir))

import evaluate_all_pai


def detect_has_ground_truth(data):
    """Detect if prediction file already contains ground-truth.

    Args:
        data: Loaded JSON data (dict or list)

    Returns:
        bool: True if ground-truth is present, False otherwise
    """
    # Handle both dict and list formats
    if isinstance(data, dict):
        # Check first record
        first_key = next(iter(data))
        sample = data[first_key]
    elif isinstance(data, list):
        if not data:
            return False
        sample = data[0]
    else:
        return False

    # Check for ground-truth indicators
    # results.json format has: question, gnd, answer, struc_info, metadata, qa_type, data_source
    has_question = 'question' in sample
    has_gnd = 'gnd' in sample
    has_struc_info = 'struc_info' in sample
    has_metadata_dict = isinstance(sample.get('metadata'), dict)

    # predictions_only.json format has: id, qa_type, prediction
    has_id = 'id' in sample
    has_prediction = 'prediction' in sample

    # If it has id + prediction format, it's prediction-only
    if has_id and has_prediction and not has_gnd:
        return False

    # If it has question + gnd + struc_info, it's already merged
    if has_question and has_gnd and has_struc_info:
        return True

    # Default: assume needs merging if unclear
    return False


def parse_id(id_str):
    """Parse ID string into components.

    Format: video_id&&start_frame&&end_frame&&fps
    Example: "kcOqlifSukA&&22425&&25124&&1.0"

    Returns:
        dict: {'video_id': str, 'input_video_start_frame': str,
               'input_video_end_frame': str, 'fps': str}
    """
    parts = id_str.split('&&')
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {id_str}")

    return {
        'video_id': parts[0],
        'input_video_start_frame': parts[1],
        'input_video_end_frame': parts[2],
        'fps': parts[3]
    }


def merge_with_ground_truth(predictions_file, ground_truth_file):
    """Merge prediction-only file with ground-truth.

    Args:
        predictions_file: Path to predictions JSON (id, qa_type, prediction format)
        ground_truth_file: Path to ground-truth JSON

    Returns:
        dict: Merged data in results.json format
    """
    print(f"[EvaluationWrapper] Loading predictions from {predictions_file}")
    with open(predictions_file, 'r') as f:
        predictions = json.load(f)

    print(f"[EvaluationWrapper] Loading ground-truth from {ground_truth_file}")
    with open(ground_truth_file, 'r') as f:
        ground_truth = json.load(f)

    # Build lookup index for ground-truth
    print("[EvaluationWrapper] Building ground-truth index...")
    gt_index = {}
    for record in ground_truth:
        metadata = record.get('metadata', {})
        # Create key from metadata
        key = f"{metadata.get('video_id')}&&{metadata.get('input_video_start_frame')}&&{metadata.get('input_video_end_frame')}&&{metadata.get('fps')}"
        gt_index[key] = record

    print(f"[EvaluationWrapper] Ground-truth index size: {len(gt_index)} records")
    print(f"[EvaluationWrapper] Predictions to merge: {len(predictions)} records")

    # Merge predictions with ground-truth
    merged = {}
    matched_count = 0
    unmatched_ids = []

    for i, pred in enumerate(predictions):
        pred_id = pred.get('id')
        if not pred_id:
            print(f"[EvaluationWrapper] ⚠️ WARNING: Prediction {i} missing 'id' field, skipping")
            continue

        # Look up ground-truth
        if pred_id not in gt_index:
            unmatched_ids.append(pred_id)
            continue

        gt_record = gt_index[pred_id]

        # Create merged record (ensure data_source is properly set)
        data_source = gt_record.get('data_source', 'Unknown')
        # Fallback to dataset_name if data_source is missing
        if data_source == 'Unknown' or not data_source:
            data_source = gt_record.get('dataset_name', 'Unknown')

        merged_record = {
            'metadata': gt_record.get('metadata', {}),
            'qa_type': pred.get('qa_type'),
            'struc_info': gt_record.get('struc_info', []),
            'question': gt_record.get('question', ''),
            'gnd': gt_record.get('answer', ''),      # Ground-truth answer
            'answer': pred.get('prediction', ''),    # Model prediction
            'data_source': data_source
        }

        # Use sequential keys like results.json
        merged[str(i)] = merged_record
        matched_count += 1

    print(f"[EvaluationWrapper] ✓ Successfully merged {matched_count}/{len(predictions)} predictions")

    if unmatched_ids:
        print(f"[EvaluationWrapper] ⚠️ WARNING: {len(unmatched_ids)} predictions not found in ground-truth")
        if len(unmatched_ids) <= 5:
            print(f"[EvaluationWrapper] Unmatched IDs: {unmatched_ids}")
        else:
            print(f"[EvaluationWrapper] First 5 unmatched IDs: {unmatched_ids[:5]}")

    return merged


def main():
    """Main function with command line interface."""
    parser = argparse.ArgumentParser(
        description="Evaluate predictions with automatic ground-truth merging"
    )
    parser.add_argument("predictions_file",
                        help="Path to predictions JSON file (can be merged or prediction-only format)")
    parser.add_argument("--ground-truth",
                        default="/root/code/MedVidBench-Leaderboard/data/ground_truth.json",
                        help="Path to ground-truth JSON file (default: data/ground_truth.json)")
    parser.add_argument("--tasks", nargs="+",
                        choices=["dvc", "tal", "next_action", "stg", "rc", "vs",
                                 "skill_assessment", "cvs_assessment", "gemini_structured", "gpt_structured"],
                        help="Specific tasks to evaluate (default: all available tasks)")
    parser.add_argument("--grouping", choices=["per-dataset", "overall"], default="per-dataset",
                        help="Grouping strategy: 'per-dataset' or 'overall' (default: per-dataset)")
    parser.add_argument("--analyze-only", action="store_true",
                        help="Only analyze the file structure without running evaluations")
    parser.add_argument("--skip-llm-judge", action="store_true",
                        help="Skip LLM judge evaluation for caption tasks (use when LLM scores are pre-computed)")

    args = parser.parse_args()

    # Load predictions
    print(f"[EvaluationWrapper] Loading predictions from {args.predictions_file}")
    with open(args.predictions_file, 'r') as f:
        predictions_data = json.load(f)

    # Auto-detect format
    has_ground_truth = detect_has_ground_truth(predictions_data)

    if has_ground_truth:
        print("[EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth")
        print("[EvaluationWrapper] Using predictions file directly for evaluation")
        eval_file = args.predictions_file
    else:
        print("[EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)")
        print("[EvaluationWrapper] Merging with ground-truth...")

        # Check ground-truth file exists
        if not os.path.exists(args.ground_truth):
            print(f"[EvaluationWrapper] ✗ ERROR: Ground-truth file not found: {args.ground_truth}")
            sys.exit(1)

        # Merge predictions with ground-truth
        merged_data = merge_with_ground_truth(args.predictions_file, args.ground_truth)

        # Save merged data to temporary file
        import tempfile
        with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
            json.dump(merged_data, f, indent=2)
            eval_file = f.name

        print(f"[EvaluationWrapper] ✓ Merged data saved to temporary file: {eval_file}")

    # Call evaluate_all_pai with the appropriate file
    print(f"\n[EvaluationWrapper] {'='*80}")
    print(f"[EvaluationWrapper] Starting evaluation with evaluate_all_pai.py")
    print(f"[EvaluationWrapper] {'='*80}\n")

    # Set sys.argv for evaluate_all_pai
    eval_args = [eval_file]
    if args.tasks:
        eval_args.extend(["--tasks"] + args.tasks)
    if args.grouping:
        eval_args.extend(["--grouping", args.grouping])
    if args.analyze_only:
        eval_args.append("--analyze-only")
    if args.skip_llm_judge:
        eval_args.append("--skip-llm-judge")

    original_argv = sys.argv
    sys.argv = ["evaluate_all_pai.py"] + eval_args

    try:
        # Run evaluation
        if args.analyze_only:
            qa_type_counts, dataset_counts = evaluate_all_pai.analyze_output_file(eval_file)
            # Determine available tasks
            available_tasks = []
            if any("dense_captioning" in qa_type or qa_type == "dc" for qa_type in qa_type_counts):
                available_tasks.append("dvc")
            if qa_type_counts.get("tal", 0) > 0:
                available_tasks.append("tal")
            if qa_type_counts.get("next_action", 0) > 0:
                available_tasks.append("next_action")
            if qa_type_counts.get("stg", 0) > 0:
                available_tasks.append("stg")
            if any("region_caption" in qa_type for qa_type in qa_type_counts):
                available_tasks.append("rc")
            if any("video_summary" in qa_type for qa_type in qa_type_counts):
                available_tasks.append("vs")
            if qa_type_counts.get("skill_assessment", 0) > 0:
                available_tasks.append("skill_assessment")
            if qa_type_counts.get("cvs_assessment", 0) > 0:
                available_tasks.append("cvs_assessment")

            evaluate_all_pai.print_evaluation_results_csv(eval_file, available_tasks)
        else:
            silent_eval = (args.grouping == "overall")
            evaluate_all_pai.run_evaluation(
                eval_file,
                args.tasks,
                grouping=args.grouping,
                silent_eval=silent_eval,
                skip_llm_judge=args.skip_llm_judge
            )
    finally:
        sys.argv = original_argv

    # Clean up temporary file if we created one
    if not has_ground_truth and os.path.exists(eval_file):
        os.unlink(eval_file)
        print(f"\n[EvaluationWrapper] ✓ Cleaned up temporary file: {eval_file}")


if __name__ == "__main__":
    main()
evaluation/test_evaluation.sh
ADDED
@@ -0,0 +1,140 @@
#!/bin/bash
# Comprehensive test script for MedVidBench evaluation system

set -e  # Exit on error

echo "============================================================"
echo "Testing MedVidBench Leaderboard Evaluation System"
echo "============================================================"

cd /root/code/MedVidBench-Leaderboard/evaluation

# Color codes
GREEN='\033[0;32m'
RED='\033[0;31m'
BLUE='\033[0;34m'
NC='\033[0m'  # No Color

# Test 1: Analyze-only mode with results.json
echo -e "\n${BLUE}Test 1: Analyze-only mode (complete format)${NC}"
python evaluate_all_pai.py ../data/results.json --analyze-only > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo -e "${GREEN}✓ PASSED: Analyze-only mode works${NC}"
else
    echo -e "${RED}✗ FAILED: Analyze-only mode${NC}"
    exit 1
fi

# Test 2: TAL evaluation (per-dataset)
echo -e "\n${BLUE}Test 2: TAL evaluation (per-dataset grouping)${NC}"
python evaluate_all_pai.py ../data/results.json --tasks tal --grouping per-dataset > /tmp/tal_per_dataset.log 2>&1
if [ $? -eq 0 ]; then
    # Check if evaluation actually ran
    if grep -q "recall@0.3" /tmp/tal_per_dataset.log; then
        echo -e "${GREEN}✓ PASSED: TAL per-dataset evaluation${NC}"
    else
        echo -e "${RED}✗ FAILED: TAL evaluation did not produce results${NC}"
        exit 1
    fi
else
    echo -e "${RED}✗ FAILED: TAL per-dataset evaluation${NC}"
    exit 1
fi

# Test 3: STG evaluation (per-dataset)
echo -e "\n${BLUE}Test 3: STG evaluation (per-dataset grouping)${NC}"
python evaluate_all_pai.py ../data/results.json --tasks stg --grouping per-dataset > /tmp/stg_per_dataset.log 2>&1
if [ $? -eq 0 ]; then
    echo -e "${GREEN}✓ PASSED: STG per-dataset evaluation${NC}"
else
    echo -e "${RED}✗ FAILED: STG per-dataset evaluation${NC}"
    exit 1
fi

# Test 4: TAL evaluation (overall grouping)
echo -e "\n${BLUE}Test 4: TAL evaluation (overall grouping)${NC}"
python evaluate_all_pai.py ../data/results.json --tasks tal --grouping overall > /tmp/tal_overall.log 2>&1
if [ $? -eq 0 ]; then
    # Check for overall evaluation output
    if grep -q "Overall Evaluation" /tmp/tal_overall.log; then
        echo -e "${GREEN}✓ PASSED: TAL overall evaluation${NC}"
    else
        echo -e "${RED}✗ FAILED: TAL overall evaluation did not produce results${NC}"
        exit 1
    fi
else
    echo -e "${RED}✗ FAILED: TAL overall evaluation${NC}"
    exit 1
fi

# Test 5: Multiple tasks
echo -e "\n${BLUE}Test 5: Multiple tasks (TAL + STG)${NC}"
python evaluate_all_pai.py ../data/results.json --tasks tal stg --grouping per-dataset > /tmp/multi_tasks.log 2>&1
if [ $? -eq 0 ]; then
    echo -e "${GREEN}✓ PASSED: Multiple tasks evaluation${NC}"
else
    echo -e "${RED}✗ FAILED: Multiple tasks evaluation${NC}"
    exit 1
fi

# Test 6: Auto-detection wrapper with merged format
echo -e "\n${BLUE}Test 6: Evaluate predictions wrapper (merged format)${NC}"
python evaluate_predictions.py ../data/results.json --tasks tal --analyze-only > /tmp/wrapper_merged.log 2>&1
if [ $? -eq 0 ]; then
    # Check for detection message
    if grep -q "already contain ground-truth" /tmp/wrapper_merged.log; then
        echo -e "${GREEN}✓ PASSED: Wrapper correctly detected merged format${NC}"
    else
        echo -e "${RED}✗ FAILED: Wrapper did not detect merged format${NC}"
        exit 1
    fi
else
    echo -e "${RED}✗ FAILED: Wrapper with merged format${NC}"
    exit 1
fi

# Test 7: Auto-detection wrapper with prediction-only format
echo -e "\n${BLUE}Test 7: Evaluate predictions wrapper (prediction-only format)${NC}"
if [ -f ../data/sample_predictions.json ]; then
    python evaluate_predictions.py ../data/sample_predictions.json --tasks tal > /tmp/wrapper_pred_only.log 2>&1
    if [ $? -eq 0 ]; then
        # Check for merging message
        if grep -q "Merging with ground-truth" /tmp/wrapper_pred_only.log; then
            echo -e "${GREEN}✓ PASSED: Wrapper correctly detected prediction-only format and merged${NC}"
        else
            echo -e "${RED}✗ FAILED: Wrapper did not detect prediction-only format${NC}"
            exit 1
        fi
    else
        echo -e "${RED}✗ FAILED: Wrapper with prediction-only format${NC}"
        exit 1
    fi
else
    echo -e "${BLUE}⚠️ SKIPPED: sample_predictions.json not found${NC}"
fi

# Test 8: Dataset detection
echo -e "\n${BLUE}Test 8: Dataset detection (check for AVOS not Unknown)${NC}"
python evaluate_predictions.py ../data/results.json --tasks tal > /tmp/dataset_detection.log 2>&1
if grep -q "AVOS:" /tmp/dataset_detection.log && ! grep -q "Unknown:" /tmp/dataset_detection.log; then
    echo -e "${GREEN}✓ PASSED: Datasets correctly detected (AVOS found, no Unknown)${NC}"
else
    echo -e "${RED}✗ FAILED: Dataset detection issue (check for Unknown datasets)${NC}"
    # This is a warning, not a failure
fi

# Summary
echo -e "\n============================================================"
echo -e "${GREEN}All Tests Passed!${NC}"
echo -e "============================================================"
echo ""
echo "Test logs saved to /tmp:"
echo "  - tal_per_dataset.log"
echo "  - stg_per_dataset.log"
echo "  - tal_overall.log"
echo "  - multi_tasks.log"
echo "  - wrapper_merged.log"
echo "  - wrapper_pred_only.log"
echo "  - dataset_detection.log"
echo ""
echo "System is ready for user submissions!"