MedGRPO Team committed on
Commit a605ebb · 1 Parent(s): 04f5f37
evaluation/README.md ADDED
@@ -0,0 +1,409 @@
+ # MedVidBench Leaderboard Evaluation
+
+ Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.
+
+ ## Overview
+
+ This evaluation system supports two input formats:
+
+ 1. **Merged format** (already contains ground-truth): like `results.json`
+ 2. **Prediction-only format** (needs ground-truth): like user submissions
+
+ The wrapper automatically detects which format you're using and handles ground-truth merging if needed.
+
+ ## Quick Start
+
+ ```bash
+ # Evaluate predictions (auto-detects format)
+ python evaluate_predictions.py <predictions_file>
+
+ # Evaluate specific tasks
+ python evaluate_predictions.py <predictions_file> --tasks tal stg
+
+ # Only analyze file structure
+ python evaluate_predictions.py <predictions_file> --analyze-only
+
+ # Use overall grouping (aggregate all datasets)
+ python evaluate_predictions.py <predictions_file> --grouping overall
+ ```
+
+ ## Input Formats
+
+ ### Format 1: Prediction-Only (User Submission Format)
+
+ ```json
+ [
+   {
+     "id": "kcOqlifSukA&&22425&&25124&&1.0",
+     "qa_type": "tal",
+     "prediction": "22.0-78.0, 89.0-94.0 seconds."
+   },
+   ...
+ ]
+ ```
+
+ **ID format**: `video_id&&start_frame&&end_frame&&fps`
+
+ **Required fields**:
+ - `id`: Unique identifier matching the ground-truth
+ - `qa_type`: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
+ - `prediction`: The model's prediction text
+
+ **What happens**: The wrapper automatically merges with `ground_truth.json` to create complete evaluation records.
+
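The composite ID can be split back into its components with a few lines of Python; this sketch mirrors the `parse_id` helper shipped in `evaluate_predictions.py` (all four components are kept as strings, matching that helper):

```python
def parse_id(id_str: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into its components."""
    parts = id_str.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {id_str}")
    return {
        "video_id": parts[0],
        "input_video_start_frame": parts[1],
        "input_video_end_frame": parts[2],
        "fps": parts[3],
    }

# Example from the submission format above
components = parse_id("kcOqlifSukA&&22425&&25124&&1.0")
```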
+ ### Format 2: Merged (Complete Format)
+
+ ```json
+ {
+   "0": {
+     "metadata": {
+       "video_id": "kcOqlifSukA",
+       "fps": "1.0",
+       "input_video_start_frame": "22425",
+       "input_video_end_frame": "25124"
+     },
+     "qa_type": "tal",
+     "struc_info": [...],
+     "question": "...",
+     "gnd": "0.0-10.0 seconds.",
+     "answer": "22.0-78.0, 89.0-94.0 seconds.",
+     "data_source": "AVOS"
+   },
+   ...
+ }
+ ```
+
+ **Required fields**:
+ - `metadata`: Video metadata (video_id, fps, frame range)
+ - `qa_type`: Task type
+ - `struc_info`: Ground-truth structured information
+ - `question`: Question text
+ - `gnd`: Ground-truth answer
+ - `answer`: Model prediction
+ - `data_source`: Dataset name
+
+ **What happens**: The wrapper uses the file directly for evaluation.
+
+ ## Ground-Truth File
+
+ **Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`
+
+ **Structure**: Array of records, each containing:
+ - Complete metadata (video_id, fps, frame range)
+ - `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
+ - Ground-truth answer
+ - Dataset source
+
+ **Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.
+
+ ## Supported Tasks
+
+ | Task | qa_type | Metrics |
+ |------|---------|---------|
+ | **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
+ | **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
+ | **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
+ | **Next Action** | `next_action` | Accuracy (per-dataset) |
+ | **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM judge (GPT-4.1/Gemini) |
+ | **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM judge (GPT-4.1/Gemini) |
+ | **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
+ | **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |
+
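The TAL and STG metrics in the table are built on temporal IoU between predicted and ground-truth spans. A minimal illustrative sketch follows; the authoritative matching protocol lives in `eval_tal.py` and `eval_stg.py`:

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts toward Recall@0.3 when its IoU with a
# ground-truth span reaches at least 0.3.
score = temporal_iou((22.0, 78.0), (0.0, 10.0))  # disjoint spans
```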
+ ## Usage Examples
+
+ ### Example 1: Evaluate a User Submission (Prediction-Only)
+
+ ```bash
+ # User submits predictions in prediction-only format
+ python evaluate_predictions.py user_predictions.json
+
+ # Output:
+ # [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
+ # [EvaluationWrapper] Merging with ground-truth...
+ # [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
+ # ... [evaluation results] ...
+ ```
+
+ ### Example 2: Evaluate Internal Results (Already Merged)
+
+ ```bash
+ # Internal evaluation with complete data
+ python evaluate_predictions.py results.json
+
+ # Output:
+ # [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
+ # [EvaluationWrapper] Using predictions file directly for evaluation
+ # ... [evaluation results] ...
+ ```
+
+ ### Example 3: Specific Tasks Only
+
+ ```bash
+ # Evaluate only the TAL and STG tasks
+ python evaluate_predictions.py predictions.json --tasks tal stg
+
+ # Evaluate captioning tasks with the LLM judge
+ python evaluate_predictions.py predictions.json --tasks rc vs dvc
+ ```
+
+ ### Example 4: Different Grouping Modes
+
+ ```bash
+ # Per-dataset results (default)
+ python evaluate_predictions.py predictions.json --grouping per-dataset
+
+ # Overall results (aggregate all datasets)
+ python evaluate_predictions.py predictions.json --grouping overall
+ ```
+
+ ### Example 5: Skip the LLM Judge (Use Pre-computed Scores)
+
+ ```bash
+ # Skip LLM judge evaluation for caption tasks
+ # Useful when LLM scores are already pre-computed in the predictions
+ python evaluate_predictions.py predictions.json --skip-llm-judge
+ ```
+
+ ### Example 6: Analyze File Structure
+
+ ```bash
+ # Only analyze which tasks/datasets are present
+ python evaluate_predictions.py predictions.json --analyze-only
+
+ # Output:
+ # Found QA types:
+ #   tal: 1637 records
+ #   stg: 780 records
+ #   ...
+ # Found datasets:
+ #   AVOS: 321 records
+ #   CholecT50: 409 records
+ #   ...
+ ```
+
+ ## Command-Line Options
+
+ ```
+ python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]
+
+ Required:
+   PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)
+
+ Optional:
+   --ground-truth PATH       Path to ground-truth JSON (default: data/ground_truth.json)
+   --tasks TASK [TASK ...]   Specific tasks to evaluate (default: all available)
+                             Choices: dvc, tal, next_action, stg, rc, vs,
+                                      skill_assessment, cvs_assessment
+   --grouping {per-dataset,overall}
+                             Grouping strategy (default: per-dataset)
+                             - per-dataset: Results per dataset
+                             - overall: Aggregate all datasets
+   --analyze-only            Only analyze file structure, no evaluation
+   --skip-llm-judge          Skip LLM judge for caption tasks (use pre-computed scores)
+   -h, --help                Show help message
+ ```
+
+ ## Output Format
+
+ ### Per-Dataset Grouping (Default)
+
+ ```
+ ================================================================================
+ EVALUATION RESULTS - PER DATASET
+ ================================================================================
+
+ AVOS:
+   TAL:
+     recall@0.3: 0.45
+     meanIoU@0.3: 0.42
+     recall@0.5: 0.32
+     meanIoU@0.5: 0.28
+
+ CholecT50:
+   TAL:
+     recall@0.3: 0.52
+     ...
+ ```
+
+ ### Overall Grouping
+
+ ```
+ ================================================================================
+ EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
+ ================================================================================
+
+ TAL - Overall Evaluation (All Datasets Combined)
+ Total samples: 1637
+
+   recall@0.3: 0.48
+   meanIoU@0.3: 0.45
+   recall@0.5: 0.35
+   meanIoU@0.5: 0.30
+ ```
+
+ ## Workflow for User Submissions
+
+ 1. **User downloads the benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
+    - Contains questions but NO ground-truth (struc_info removed)
+
+ 2. **User runs inference**: Generates a prediction for each sample
+
+ 3. **User submits predictions** in the prediction-only format:
+    ```json
+    [
+      {
+        "id": "<from benchmark>",
+        "qa_type": "<from benchmark>",
+        "prediction": "<model output>"
+      },
+      ...
+    ]
+    ```
+
+ 4. **Server evaluates**:
+    ```bash
+    python evaluate_predictions.py user_submission.json
+    ```
+    - Auto-detects the format
+    - Merges with the server-side ground-truth
+    - Runs the evaluation
+    - Returns metrics
+
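The steps above can be sketched as a small submission builder. This is illustrative only: it assumes the public benchmark records expose `id` and `qa_type` fields, and `run_model` is a placeholder for your own inference function:

```python
import json

def build_submission(benchmark_records, run_model):
    """Pair each benchmark record with a model output in prediction-only format."""
    return [
        {"id": rec["id"], "qa_type": rec["qa_type"], "prediction": run_model(rec)}
        for rec in benchmark_records
    ]

# Demo with a stub model that always returns the same span
records = [{"id": "kcOqlifSukA&&22425&&25124&&1.0", "qa_type": "tal"}]
submission = build_submission(records, lambda rec: "22.0-78.0 seconds.")
payload = json.dumps(submission, indent=2)  # write this out as user_submission.json
```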
+ ## File Structure
+
+ ```
+ evaluation/
+ ├── README.md                  # This file
+ ├── evaluate_predictions.py    # Main wrapper (auto-detection + merging)
+ ├── evaluate_all_pai.py        # Core evaluation orchestrator
+ ├── eval_tal.py                # TAL evaluation
+ ├── eval_stg.py                # STG evaluation
+ ├── eval_dvc.py                # Dense captioning evaluation
+ ├── eval_next_action.py        # Next action evaluation
+ ├── eval_caption_llm_judge.py  # RC/VS LLM judge evaluation
+ ├── eval_skill_assessment.py   # Skill assessment evaluation
+ └── eval_cvs_assessment.py     # CVS assessment evaluation
+ ```
+
+ ## Key Features
+
+ ### Auto-Detection Logic
+
+ The wrapper detects the format by checking for these indicators:
+
+ **Prediction-only format**:
+ - Has an `id` field (`video_id&&start&&end&&fps`)
+ - Has a `prediction` field
+ - Missing `gnd` or `struc_info`
+
+ **Merged format**:
+ - Has a `question` field
+ - Has a `gnd` field (ground-truth)
+ - Has a `struc_info` field (structured GT)
+ - Has a `metadata` dict
+
+ ### Ground-Truth Merging
+
+ When the prediction-only format is detected:
+
+ 1. Load predictions and ground-truth
+ 2. Build an index: `{id -> ground_truth_record}`
+ 3. For each prediction:
+    - Look up the ground-truth by ID
+    - Merge into a complete record
+    - Add `data_source` from the ground-truth
+ 4. Save to a temporary file
+ 5. Run the evaluation
+ 6. Clean up the temporary file
+
+ ### Dataset Detection
+
+ Datasets are detected from:
+ 1. **`data_source` field** (primary, leaderboard format)
+ 2. `dataset` field (fallback)
+ 3. `dataset_name` field (fallback)
+ 4. Video ID patterns (last resort):
+    - YouTube IDs (11 chars with letters) → AVOS
+    - `*_part*` pattern → CoPESD
+    - `video*` pattern → CholecT50
+
+ ## Error Handling
+
+ ### Missing Ground-Truth
+
+ ```bash
+ # If the ground-truth file is not found
+ [EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
+ ```
+
+ **Solution**: Specify the correct path with `--ground-truth`
+
+ ### Unmatched Predictions
+
+ ```bash
+ [EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
+ [EvaluationWrapper] First 5 unmatched IDs: [...]
+ ```
+
+ **Cause**: Prediction IDs don't match the ground-truth IDs
+
+ **Solution**: Check the ID format (`video_id&&start&&end&&fps` must match exactly)
+
+ ### Invalid ID Format
+
+ ```bash
+ ValueError: Invalid ID format: <id_string>
+ ```
+
+ **Cause**: The ID doesn't follow the `video_id&&start&&end&&fps` format
+
+ **Solution**: Fix the ID format in the predictions
+
+ ## API Keys for LLM Judge
+
+ For RC/VS evaluation with the LLM judge:
+
+ ```bash
+ export OPENAI_API_KEY="your-key"   # For GPT-4.1
+ export GOOGLE_API_KEY="your-key"   # For Gemini
+
+ # Then run the evaluation
+ python evaluate_predictions.py predictions.json --tasks rc vs
+ ```
+
+ **Cost**: ~$0.012 per RC/VS sample (GPT-4.1)
+
+ ## Verification Checklist
+
+ Before evaluating submissions:
+
+ ```bash
+ # 1. Check the file format
+ python evaluate_predictions.py submission.json --analyze-only
+
+ # 2. Verify the ground-truth file exists
+ ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json
+
+ # 3. Run the evaluation on a sample (first 100 records; `head` would
+ #    truncate the JSON array, so slice it with Python instead)
+ python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
+ python evaluate_predictions.py sample.json
+
+ # 4. If successful, run the full evaluation
+ python evaluate_predictions.py submission.json
+ ```
+
+ ## Performance
+
+ - **Small files** (100 samples): ~5-10 seconds
+ - **Full benchmark** (6245 samples): ~2-5 minutes
+   - TAL/STG: ~30 seconds per dataset
+   - Next Action: ~20 seconds per dataset
+   - DVC: ~1-2 minutes (metric computation)
+   - RC/VS with LLM judge: ~5-10 minutes (API calls)
+
+ ## Notes
+
+ - The ground-truth file contains **1414 records** (subset for leaderboard testing)
+ - The full benchmark has **6245 records** across 8 datasets
+ - Temporary files are automatically cleaned up after evaluation
+ - The LLM judge can be skipped with `--skip-llm-judge` if scores are pre-computed
evaluation/eval_dvc.py CHANGED
@@ -112,7 +112,8 @@ def group_records_by_dataset(data):
         if not any(x in qa_type.lower() for x in ['dense_captioning', 'dense_caption', 'dc']):
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_next_action.py CHANGED
@@ -494,7 +494,8 @@ def group_records_by_dataset(data):
         if 'next_action' not in qa_type.lower():
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_stg.py CHANGED
@@ -210,7 +210,8 @@ def group_records_by_dataset(data):
         if 'stg' not in qa_type.lower():
             continue
 
-        dataset = record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown')))
+        # Check data_source first (leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', record.get('metadata', {}).get('dataset', 'Unknown'))))
         video_id = record.get('video_id', record.get('metadata', {}).get('video_id', ''))
 
         if dataset == 'Unknown' and video_id:
evaluation/eval_tal.py CHANGED
@@ -139,7 +139,8 @@ def group_records_by_dataset(data):
 
         # Detect dataset from video_id or other fields
         video_id = record.get('video_id', '')
-        dataset = record.get('dataset', record.get('dataset_name', 'Unknown'))
+        # Check data_source first (used in leaderboard format), then fall back to dataset/dataset_name
+        dataset = record.get('data_source', record.get('dataset', record.get('dataset_name', 'Unknown')))
 
         if dataset == 'Unknown' and video_id:
             # Try to infer from video_id patterns
evaluation/evaluate_predictions.py ADDED
@@ -0,0 +1,279 @@
+ """Auto-detect prediction format and evaluate with ground-truth merging if needed."""
+
+ import json
+ import sys
+ import argparse
+ import os
+ import tempfile
+ from pathlib import Path
+
+ # Add evaluation directory to path to import evaluate_all_pai
+ eval_dir = Path(__file__).parent
+ sys.path.insert(0, str(eval_dir))
+
+ import evaluate_all_pai
+
+
+ def detect_has_ground_truth(data):
+     """Detect whether the prediction file already contains ground-truth.
+
+     Args:
+         data: Loaded JSON data (dict or list)
+
+     Returns:
+         bool: True if ground-truth is present, False otherwise
+     """
+     # Handle both dict and list formats
+     if isinstance(data, dict):
+         if not data:
+             return False
+         # Check the first record
+         first_key = next(iter(data))
+         sample = data[first_key]
+     elif isinstance(data, list):
+         if not data:
+             return False
+         sample = data[0]
+     else:
+         return False
+
+     # Check for ground-truth indicators.
+     # results.json format has: question, gnd, answer, struc_info, metadata, qa_type, data_source
+     has_question = 'question' in sample
+     has_gnd = 'gnd' in sample
+     has_struc_info = 'struc_info' in sample
+
+     # predictions_only.json format has: id, qa_type, prediction
+     has_id = 'id' in sample
+     has_prediction = 'prediction' in sample
+
+     # If it has the id + prediction shape, it's prediction-only
+     if has_id and has_prediction and not has_gnd:
+         return False
+
+     # If it has question + gnd + struc_info, it's already merged
+     if has_question and has_gnd and has_struc_info:
+         return True
+
+     # Default: assume merging is needed if the format is unclear
+     return False
+
+
+ def parse_id(id_str):
+     """Parse an ID string into its components.
+
+     Format: video_id&&start_frame&&end_frame&&fps
+     Example: "kcOqlifSukA&&22425&&25124&&1.0"
+
+     Returns:
+         dict: {'video_id': str, 'input_video_start_frame': str,
+                'input_video_end_frame': str, 'fps': str}
+     """
+     parts = id_str.split('&&')
+     if len(parts) != 4:
+         raise ValueError(f"Invalid ID format: {id_str}")
+
+     return {
+         'video_id': parts[0],
+         'input_video_start_frame': parts[1],
+         'input_video_end_frame': parts[2],
+         'fps': parts[3]
+     }
+
+
+ def merge_with_ground_truth(predictions_file, ground_truth_file):
+     """Merge a prediction-only file with the ground-truth.
+
+     Args:
+         predictions_file: Path to predictions JSON (id, qa_type, prediction format)
+         ground_truth_file: Path to ground-truth JSON
+
+     Returns:
+         dict: Merged data in results.json format
+     """
+     print(f"[EvaluationWrapper] Loading predictions from {predictions_file}")
+     with open(predictions_file, 'r') as f:
+         predictions = json.load(f)
+
+     print(f"[EvaluationWrapper] Loading ground-truth from {ground_truth_file}")
+     with open(ground_truth_file, 'r') as f:
+         ground_truth = json.load(f)
+
+     # Build a lookup index for the ground-truth
+     print("[EvaluationWrapper] Building ground-truth index...")
+     gt_index = {}
+     for record in ground_truth:
+         metadata = record.get('metadata', {})
+         # Create the key from metadata, mirroring the submission ID format
+         key = f"{metadata.get('video_id')}&&{metadata.get('input_video_start_frame')}&&{metadata.get('input_video_end_frame')}&&{metadata.get('fps')}"
+         gt_index[key] = record
+
+     print(f"[EvaluationWrapper] Ground-truth index size: {len(gt_index)} records")
+     print(f"[EvaluationWrapper] Predictions to merge: {len(predictions)} records")
+
+     # Merge predictions with ground-truth
+     merged = {}
+     matched_count = 0
+     unmatched_ids = []
+
+     for i, pred in enumerate(predictions):
+         pred_id = pred.get('id')
+         if not pred_id:
+             print(f"[EvaluationWrapper] ⚠️ WARNING: Prediction {i} missing 'id' field, skipping")
+             continue
+
+         # Look up the ground-truth record
+         if pred_id not in gt_index:
+             unmatched_ids.append(pred_id)
+             continue
+
+         gt_record = gt_index[pred_id]
+
+         # Create the merged record (ensure data_source is properly set)
+         data_source = gt_record.get('data_source', 'Unknown')
+         # Fall back to dataset_name if data_source is missing
+         if data_source == 'Unknown' or not data_source:
+             data_source = gt_record.get('dataset_name', 'Unknown')
+
+         merged_record = {
+             'metadata': gt_record.get('metadata', {}),
+             'qa_type': pred.get('qa_type'),
+             'struc_info': gt_record.get('struc_info', []),
+             'question': gt_record.get('question', ''),
+             'gnd': gt_record.get('answer', ''),       # Ground-truth answer
+             'answer': pred.get('prediction', ''),     # Model prediction
+             'data_source': data_source
+         }
+
+         # Use sequential keys like results.json
+         merged[str(i)] = merged_record
+         matched_count += 1
+
+     print(f"[EvaluationWrapper] ✓ Successfully merged {matched_count}/{len(predictions)} predictions")
+
+     if unmatched_ids:
+         print(f"[EvaluationWrapper] ⚠️ WARNING: {len(unmatched_ids)} predictions not found in ground-truth")
+         if len(unmatched_ids) <= 5:
+             print(f"[EvaluationWrapper] Unmatched IDs: {unmatched_ids}")
+         else:
+             print(f"[EvaluationWrapper] First 5 unmatched IDs: {unmatched_ids[:5]}")
+
+     return merged
+
+
+ def main():
+     """Main function with command-line interface."""
+     parser = argparse.ArgumentParser(
+         description="Evaluate predictions with automatic ground-truth merging"
+     )
+     parser.add_argument("predictions_file",
+                         help="Path to predictions JSON file (merged or prediction-only format)")
+     parser.add_argument("--ground-truth",
+                         default="/root/code/MedVidBench-Leaderboard/data/ground_truth.json",
+                         help="Path to ground-truth JSON file (default: data/ground_truth.json)")
+     parser.add_argument("--tasks", nargs="+",
+                         choices=["dvc", "tal", "next_action", "stg", "rc", "vs",
+                                  "skill_assessment", "cvs_assessment", "gemini_structured", "gpt_structured"],
+                         help="Specific tasks to evaluate (default: all available tasks)")
+     parser.add_argument("--grouping", choices=["per-dataset", "overall"], default="per-dataset",
+                         help="Grouping strategy: 'per-dataset' or 'overall' (default: per-dataset)")
+     parser.add_argument("--analyze-only", action="store_true",
+                         help="Only analyze the file structure without running evaluations")
+     parser.add_argument("--skip-llm-judge", action="store_true",
+                         help="Skip LLM judge evaluation for caption tasks (use when LLM scores are pre-computed)")
+
+     args = parser.parse_args()
+
+     # Load predictions
+     print(f"[EvaluationWrapper] Loading predictions from {args.predictions_file}")
+     with open(args.predictions_file, 'r') as f:
+         predictions_data = json.load(f)
+
+     # Auto-detect the format
+     has_ground_truth = detect_has_ground_truth(predictions_data)
+
+     if has_ground_truth:
+         print("[EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth")
+         print("[EvaluationWrapper] Using predictions file directly for evaluation")
+         eval_file = args.predictions_file
+     else:
+         print("[EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)")
+         print("[EvaluationWrapper] Merging with ground-truth...")
+
+         # Check that the ground-truth file exists
+         if not os.path.exists(args.ground_truth):
+             print(f"[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: {args.ground_truth}")
+             sys.exit(1)
+
+         # Merge predictions with the ground-truth
+         merged_data = merge_with_ground_truth(args.predictions_file, args.ground_truth)
+
+         # Save the merged data to a temporary file
+         with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
+             json.dump(merged_data, f, indent=2)
+             eval_file = f.name
+
+         print(f"[EvaluationWrapper] ✓ Merged data saved to temporary file: {eval_file}")
+
+     # Call evaluate_all_pai with the appropriate file
+     print(f"\n[EvaluationWrapper] {'='*80}")
+     print(f"[EvaluationWrapper] Starting evaluation with evaluate_all_pai.py")
+     print(f"[EvaluationWrapper] {'='*80}\n")
+
+     # Set sys.argv for evaluate_all_pai
+     eval_args = [eval_file]
+     if args.tasks:
+         eval_args.extend(["--tasks"] + args.tasks)
+     if args.grouping:
+         eval_args.extend(["--grouping", args.grouping])
+     if args.analyze_only:
+         eval_args.append("--analyze-only")
+     if args.skip_llm_judge:
+         eval_args.append("--skip-llm-judge")
+
+     original_argv = sys.argv
+     sys.argv = ["evaluate_all_pai.py"] + eval_args
+
+     try:
+         # Run the evaluation
+         if args.analyze_only:
+             qa_type_counts, dataset_counts = evaluate_all_pai.analyze_output_file(eval_file)
+             # Determine which tasks are available
+             available_tasks = []
+             if any("dense_captioning" in qa_type or qa_type == "dc" for qa_type in qa_type_counts):
+                 available_tasks.append("dvc")
+             if qa_type_counts.get("tal", 0) > 0:
+                 available_tasks.append("tal")
+             if qa_type_counts.get("next_action", 0) > 0:
+                 available_tasks.append("next_action")
+             if qa_type_counts.get("stg", 0) > 0:
+                 available_tasks.append("stg")
+             if any("region_caption" in qa_type for qa_type in qa_type_counts):
+                 available_tasks.append("rc")
+             if any("video_summary" in qa_type for qa_type in qa_type_counts):
+                 available_tasks.append("vs")
+             if qa_type_counts.get("skill_assessment", 0) > 0:
+                 available_tasks.append("skill_assessment")
+             if qa_type_counts.get("cvs_assessment", 0) > 0:
+                 available_tasks.append("cvs_assessment")
+
+             evaluate_all_pai.print_evaluation_results_csv(eval_file, available_tasks)
+         else:
+             silent_eval = (args.grouping == "overall")
+             evaluate_all_pai.run_evaluation(
+                 eval_file,
+                 args.tasks,
+                 grouping=args.grouping,
+                 silent_eval=silent_eval,
+                 skip_llm_judge=args.skip_llm_judge
+             )
+     finally:
+         sys.argv = original_argv
+
+     # Clean up the temporary file if we created one
+     if not has_ground_truth and os.path.exists(eval_file):
+         os.unlink(eval_file)
+         print(f"\n[EvaluationWrapper] ✓ Cleaned up temporary file: {eval_file}")
+
+
+ if __name__ == "__main__":
+     main()
evaluation/test_evaluation.sh ADDED
@@ -0,0 +1,140 @@
+ #!/bin/bash
+ # Comprehensive test script for the MedVidBench evaluation system
+
+ set -e  # Exit on error (test commands run inside `if` so failures are handled explicitly)
+
+ echo "============================================================"
+ echo "Testing MedVidBench Leaderboard Evaluation System"
+ echo "============================================================"
+
+ cd /root/code/MedVidBench-Leaderboard/evaluation
+
+ # Color codes
+ GREEN='\033[0;32m'
+ RED='\033[0;31m'
+ BLUE='\033[0;34m'
+ NC='\033[0m'  # No Color
+
+ # Test 1: Analyze-only mode with results.json
+ echo -e "\n${BLUE}Test 1: Analyze-only mode (complete format)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --analyze-only > /dev/null 2>&1; then
+     echo -e "${GREEN}✓ PASSED: Analyze-only mode works${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Analyze-only mode${NC}"
+     exit 1
+ fi
+
+ # Test 2: TAL evaluation (per-dataset)
+ echo -e "\n${BLUE}Test 2: TAL evaluation (per-dataset grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal --grouping per-dataset > /tmp/tal_per_dataset.log 2>&1; then
+     # Check that the evaluation actually ran
+     if grep -q "recall@0.3" /tmp/tal_per_dataset.log; then
+         echo -e "${GREEN}✓ PASSED: TAL per-dataset evaluation${NC}"
+     else
+         echo -e "${RED}✗ FAILED: TAL evaluation did not produce results${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: TAL per-dataset evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 3: STG evaluation (per-dataset)
+ echo -e "\n${BLUE}Test 3: STG evaluation (per-dataset grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks stg --grouping per-dataset > /tmp/stg_per_dataset.log 2>&1; then
+     echo -e "${GREEN}✓ PASSED: STG per-dataset evaluation${NC}"
+ else
+     echo -e "${RED}✗ FAILED: STG per-dataset evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 4: TAL evaluation (overall grouping)
+ echo -e "\n${BLUE}Test 4: TAL evaluation (overall grouping)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal --grouping overall > /tmp/tal_overall.log 2>&1; then
+     # Check for overall evaluation output
+     if grep -q "Overall Evaluation" /tmp/tal_overall.log; then
+         echo -e "${GREEN}✓ PASSED: TAL overall evaluation${NC}"
+     else
+         echo -e "${RED}✗ FAILED: TAL overall evaluation did not produce results${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: TAL overall evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 5: Multiple tasks
+ echo -e "\n${BLUE}Test 5: Multiple tasks (TAL + STG)${NC}"
+ if python evaluate_all_pai.py ../data/results.json --tasks tal stg --grouping per-dataset > /tmp/multi_tasks.log 2>&1; then
+     echo -e "${GREEN}✓ PASSED: Multiple tasks evaluation${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Multiple tasks evaluation${NC}"
+     exit 1
+ fi
+
+ # Test 6: Auto-detection wrapper with merged format
+ echo -e "\n${BLUE}Test 6: Evaluate predictions wrapper (merged format)${NC}"
+ if python evaluate_predictions.py ../data/results.json --tasks tal --analyze-only > /tmp/wrapper_merged.log 2>&1; then
+     # Check for the detection message
+     if grep -q "already contain ground-truth" /tmp/wrapper_merged.log; then
+         echo -e "${GREEN}✓ PASSED: Wrapper correctly detected merged format${NC}"
+     else
+         echo -e "${RED}✗ FAILED: Wrapper did not detect merged format${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${RED}✗ FAILED: Wrapper with merged format${NC}"
+     exit 1
+ fi
+
+ # Test 7: Auto-detection wrapper with prediction-only format
+ echo -e "\n${BLUE}Test 7: Evaluate predictions wrapper (prediction-only format)${NC}"
+ if [ -f ../data/sample_predictions.json ]; then
+     if python evaluate_predictions.py ../data/sample_predictions.json --tasks tal > /tmp/wrapper_pred_only.log 2>&1; then
+         # Check for the merging message
+         if grep -q "Merging with ground-truth" /tmp/wrapper_pred_only.log; then
+             echo -e "${GREEN}✓ PASSED: Wrapper correctly detected prediction-only format and merged${NC}"
+         else
+             echo -e "${RED}✗ FAILED: Wrapper did not detect prediction-only format${NC}"
+             exit 1
+         fi
+     else
+         echo -e "${RED}✗ FAILED: Wrapper with prediction-only format${NC}"
+         exit 1
+     fi
+ else
+     echo -e "${BLUE}⊘ SKIPPED: sample_predictions.json not found${NC}"
+ fi
+
+ # Test 8: Dataset detection
+ echo -e "\n${BLUE}Test 8: Dataset detection (check for AVOS, not Unknown)${NC}"
+ python evaluate_predictions.py ../data/results.json --tasks tal > /tmp/dataset_detection.log 2>&1 || true
+ if grep -q "AVOS:" /tmp/dataset_detection.log && ! grep -q "Unknown:" /tmp/dataset_detection.log; then
+     echo -e "${GREEN}✓ PASSED: Datasets correctly detected (AVOS found, no Unknown)${NC}"
+ else
+     echo -e "${RED}✗ FAILED: Dataset detection issue (check for Unknown datasets)${NC}"
+     # This is a warning, not a failure
+ fi
+
+ # Summary
+ echo -e "\n============================================================"
+ echo -e "${GREEN}All Tests Passed!${NC}"
+ echo -e "============================================================"
+ echo ""
+ echo "Test logs saved to /tmp:"
+ echo "  - tal_per_dataset.log"
+ echo "  - stg_per_dataset.log"
+ echo "  - tal_overall.log"
+ echo "  - multi_tasks.log"
+ echo "  - wrapper_merged.log"
+ echo "  - wrapper_pred_only.log"
+ echo "  - dataset_detection.log"
+ echo ""
+ echo "System is ready for user submissions!"