# MedVidBench Leaderboard Evaluation
Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.
## Overview
This evaluation system supports two input formats:
1. **Merged format** (already contains ground-truth): Like `results.json`
2. **Prediction-only format** (needs ground-truth): Like user submissions
The wrapper automatically detects which format you're using and handles ground-truth merging if needed.
## Quick Start
```bash
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>
# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg
# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only
# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall
```
## Input Formats
### Format 1: Prediction-Only (User Submission Format)
```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]
```
**ID Format**: `video_id&&start_frame&&end_frame&&fps`
**Required fields**:
- `id`: Unique identifier matching ground-truth
- `qa_type`: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
- `prediction`: Model's prediction text
**What happens**: The wrapper automatically merges with `ground_truth.json` to create complete evaluation records.
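For reference, a minimal sketch of splitting a submission `id` into its components (the helper name `parse_id` is illustrative, not part of the toolkit):
```python
def parse_id(sample_id: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into typed parts."""
    video_id, start_frame, end_frame, fps = sample_id.split("&&")
    return {
        "video_id": video_id,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "fps": float(fps),
    }

# parse_id("kcOqlifSukA&&22425&&25124&&1.0")
# -> {'video_id': 'kcOqlifSukA', 'start_frame': 22425, 'end_frame': 25124, 'fps': 1.0}
```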
### Format 2: Merged (Complete Format)
```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}
```
**Required fields**:
- `metadata`: Video metadata (video_id, fps, frame range)
- `qa_type`: Task type
- `struc_info`: Ground-truth structured information
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name
**What happens**: The wrapper uses the file directly for evaluation.
## Ground-Truth File
**Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`
**Structure**: Array of records, each containing:
- Complete metadata (video_id, fps, frame range)
- `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source
**Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.
## Supported Tasks
| Task | qa_type | Metrics |
|------|---------|---------|
| **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
| **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
| **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
| **Next Action** | `next_action` | Accuracy (per-dataset) |
| **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
| **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |
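Several tasks accept multiple `qa_type` values. One way to normalize them before routing records to an evaluator is a simple lookup table derived from the table above (a sketch; the wrapper may organize this differently):
```python
# Map every accepted qa_type value to its canonical task key.
QA_TYPE_TO_TASK = {
    "tal": "tal",
    "stg": "stg",
    "dense_captioning": "dvc", "dense_captioning_gpt": "dvc",
    "dense_captioning_gemini": "dvc", "dc": "dvc",
    "next_action": "next_action",
    "region_caption": "rc", "region_caption_gpt": "rc", "region_caption_gemini": "rc",
    "video_summary": "vs", "video_summary_gpt": "vs", "video_summary_gemini": "vs",
    "skill_assessment": "skill_assessment",
    "cvs_assessment": "cvs_assessment",
}

def task_for(qa_type: str) -> str:
    """Return the canonical task key for a record's qa_type."""
    return QA_TYPE_TO_TASK[qa_type]
```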
## Usage Examples
### Example 1: Evaluate User Submission (Prediction-Only)
```bash
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json
# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
```
### Example 2: Evaluate Internal Results (Already Merged)
```bash
# Internal evaluation with complete data
python evaluate_predictions.py results.json
# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
```
### Example 3: Specific Tasks Only
```bash
# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg
# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
```
### Example 4: Different Grouping Modes
```bash
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset
# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```
### Example 5: Skip LLM Judge (Use Pre-computed Scores)
```bash
# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge
```
### Example 6: Analyze File Structure
```bash
# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only
# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...
```
## Command-Line Options
```
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE            Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH         Path to ground-truth JSON (default: data/ground_truth.json)
  --tasks TASK [TASK ...]     Specific tasks to evaluate (default: all available)
                              Choices: dvc, tal, next_action, stg, rc, vs,
                              skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                              Grouping strategy (default: per-dataset)
                              - per-dataset: Results per dataset
                              - overall: Aggregate all datasets
  --analyze-only              Only analyze file structure, no evaluation
  --skip-llm-judge            Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help                  Show help message
```
## Output Format
### Per-Dataset Grouping (Default)
```
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================
AVOS:
  TAL:
    recall@0.3: 0.45
    meanIoU@0.3: 0.42
    recall@0.5: 0.32
    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...
```
### Overall Grouping
```
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================
TAL - Overall Evaluation (All Datasets Combined)
  Total samples: 1637
  recall@0.3: 0.48
  meanIoU@0.3: 0.45
  recall@0.5: 0.35
  meanIoU@0.5: 0.30
```
## Workflow for User Submissions
1. **User downloads benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
- Contains questions but NO ground-truth (struc_info removed)
2. **User runs inference**: Generates predictions for each sample (a minimal build sketch follows step 4)
3. **User submits predictions** in the prediction-only format:
```json
[
  {
    "id": "<from benchmark>",
    "qa_type": "<from benchmark>",
    "prediction": "<model output>"
  },
  ...
]
```
4. **Server evaluates**:
```bash
python evaluate_predictions.py user_submission.json
```
- Auto-detects format
- Merges with server-side ground-truth
- Runs evaluation
- Returns metrics
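To illustrate steps 2 and 3, a minimal sketch of producing a submission file. The benchmark field names (`id`, `qa_type`, `question`) and the list layout are assumptions here; `run_model` is a placeholder for your own inference call:
```python
import json

def build_submission(benchmark_path: str, out_path: str, run_model) -> None:
    """Create a prediction-only submission from the downloaded benchmark (assumed layout)."""
    with open(benchmark_path) as f:
        benchmark = json.load(f)

    submission = []
    for record in benchmark:
        submission.append({
            "id": record["id"],                            # copied from the benchmark
            "qa_type": record["qa_type"],                  # copied from the benchmark
            "prediction": run_model(record["question"]),   # your model's output text
        })

    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)
```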
## File Structure
```
evaluation/
├── README.md # This file
├── evaluate_predictions.py # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py # Core evaluation orchestrator
├── eval_tal.py # TAL evaluation
├── eval_stg.py # STG evaluation
├── eval_dvc.py # Dense captioning evaluation
├── eval_next_action.py # Next action evaluation
├── eval_caption_llm_judge.py # RC/VS LLM judge evaluation
├── eval_skill_assessment.py # Skill assessment evaluation
└── eval_cvs_assessment.py # CVS assessment evaluation
```
## Key Features
### Auto-Detection Logic
The wrapper detects the input format by checking for these indicators (a code sketch follows the lists):
**Prediction-only format**:
- Has `id` field (video_id&&start&&end&&fps)
- Has `prediction` field
- Missing `gnd` or `struc_info`
**Merged format**:
- Has `question` field
- Has `gnd` field (ground-truth)
- Has `struc_info` field (structured GT)
- Has `metadata` dict
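A condensed sketch of that check, using the field names from the two formats above (the wrapper's actual implementation may differ):
```python
def looks_merged(record: dict) -> bool:
    """Heuristic check on a single record, mirroring the indicators above."""
    has_gt = "gnd" in record or "struc_info" in record
    return has_gt and "question" in record and "metadata" in record

def detect_format(records) -> str:
    # Prediction-only files are lists of {id, qa_type, prediction};
    # merged files are dicts of complete records keyed by index.
    sample = next(iter(records.values())) if isinstance(records, dict) else records[0]
    return "merged" if looks_merged(sample) else "prediction-only"
```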
### Ground-Truth Merging
When the prediction-only format is detected, the wrapper does the following (sketched in code after the steps):
1. Load predictions and ground-truth
2. Build index: `{id -> ground_truth_record}`
3. For each prediction:
- Look up ground-truth by ID
- Merge into complete record
- Add `data_source` from ground-truth
4. Save to temporary file
5. Run evaluation
6. Clean up temporary file
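In code, that flow looks roughly like the sketch below. The submission `id` is assumed to be reconstructable from the ground-truth `metadata` fields shown in Format 2; the real wrapper may index differently:
```python
def merge_predictions(predictions: list, ground_truth: list):
    """Steps 1-3 above: index ground-truth by id, then attach each prediction."""
    def gt_id(rec: dict) -> str:
        # Rebuild 'video_id&&start_frame&&end_frame&&fps' from the metadata (assumed layout).
        m = rec["metadata"]
        return "&&".join([m["video_id"], m["input_video_start_frame"],
                          m["input_video_end_frame"], m["fps"]])

    index = {gt_id(rec): rec for rec in ground_truth}
    merged, unmatched = {}, []
    for i, pred in enumerate(predictions):
        gt = index.get(pred["id"])
        if gt is None:
            unmatched.append(pred["id"])
            continue
        record = dict(gt)                       # complete ground-truth record
        record["answer"] = pred["prediction"]   # model output goes into the "answer" field
        merged[str(i)] = record
    return merged, unmatched
```
The merged records are then written to a temporary file, evaluated, and cleaned up (steps 4-6).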
### Dataset Detection
Datasets are detected from the following sources, in priority order (see the sketch below):
1. **`data_source` field** (primary, leaderboard format)
2. `dataset` field (fallback)
3. `dataset_name` field (fallback)
4. Video ID patterns (last resort):
- YouTube IDs (11 chars with letters) → AVOS
- `*_part*` pattern → CoPESD
- `video*` pattern → CholecT50
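A sketch of that fallback chain (patterns as listed above; treat as illustrative):
```python
import re

def detect_dataset(record: dict) -> str:
    """Resolve the dataset name using the priority order listed above."""
    for key in ("data_source", "dataset", "dataset_name"):
        if record.get(key):
            return record[key]

    # Last resort: infer the dataset from video-id patterns.
    video_id = record.get("metadata", {}).get("video_id", "")
    if len(video_id) == 11 and re.search(r"[A-Za-z]", video_id):
        return "AVOS"        # YouTube-style 11-char id containing letters
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    return "unknown"
```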
## Error Handling
### Missing Ground-Truth
```bash
# If ground-truth file not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
```
**Solution**: Specify correct path with `--ground-truth`
### Unmatched Predictions
```bash
[EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
```
**Cause**: Prediction IDs don't match ground-truth IDs
**Solution**: Check ID format (video_id&&start&&end&&fps must match exactly)
### Invalid ID Format
```bash
ValueError: Invalid ID format: <id_string>
```
**Cause**: ID doesn't follow `video_id&&start&&end&&fps` format
**Solution**: Fix ID format in predictions
## API Keys for LLM Judge
For RC/VS evaluation with LLM judge:
```bash
export OPENAI_API_KEY="your-key" # For GPT-4.1
export GOOGLE_API_KEY="your-key" # For Gemini
# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
```
**Cost**: ~$0.012 per RC/VS sample (GPT-4.1)
## Verification Checklist
Before evaluating submissions:
```bash
# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only
# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json
# 3. Run evaluation on a small sample (first 100 records; plain `head` would truncate the JSON)
python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
python evaluate_predictions.py sample.json
# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json
```
## Performance
- **Small files** (100 samples): ~5-10 seconds
- **Full benchmark** (6245 samples): ~2-5 minutes
- TAL/STG: ~30 seconds per dataset
- Next Action: ~20 seconds per dataset
- DVC: ~1-2 minutes (metric computation)
- RC/VS with LLM judge: ~5-10 minutes (API calls)
## Notes
- Ground-truth file contains **1414 records** (subset for leaderboard testing)
- Full benchmark has **6245 records** across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- LLM judge can be skipped with `--skip-llm-judge` if scores pre-computed