# MedVidBench Leaderboard Evaluation
Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.
## Overview
This evaluation system supports two input formats:
1. **Merged format** (already contains ground-truth): Like `results.json`
2. **Prediction-only format** (needs ground-truth): Like user submissions
The wrapper automatically detects which format you're using and handles ground-truth merging if needed.
## Quick Start
```bash
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py <predictions_file>
# Evaluate specific tasks
python evaluate_predictions.py <predictions_file> --tasks tal stg
# Only analyze file structure
python evaluate_predictions.py <predictions_file> --analyze-only
# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py <predictions_file> --grouping overall
```
## Input Formats
### Format 1: Prediction-Only (User Submission Format)
```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]
```
**ID Format**: `video_id&&start_frame&&end_frame&&fps`
**Required fields**:
- `id`: Unique identifier matching ground-truth
- `qa_type`: Task type (tal, stg, dvc, next_action, rc, vs, skill_assessment, cvs_assessment)
- `prediction`: Model's prediction text
**What happens**: The wrapper automatically merges with `ground_truth.json` to create complete evaluation records.
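For reference, a minimal sketch of splitting a submission `id` into its components (the helper name `parse_id` is illustrative, not part of the toolkit):
```python
def parse_id(sample_id: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into typed parts."""
    video_id, start_frame, end_frame, fps = sample_id.split("&&")
    return {
        "video_id": video_id,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "fps": float(fps),
    }

# parse_id("kcOqlifSukA&&22425&&25124&&1.0")
# -> {'video_id': 'kcOqlifSukA', 'start_frame': 22425, 'end_frame': 25124, 'fps': 1.0}
```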
### Format 2: Merged (Complete Format)
```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}
```
**Required fields**:
- `metadata`: Video metadata (video_id, fps, frame range)
- `qa_type`: Task type
- `struc_info`: Ground-truth structured information
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name
**What happens**: The wrapper uses the file directly for evaluation.
## Ground-Truth File
**Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`
**Structure**: Array of records, each containing:
- Complete metadata (video_id, fps, frame range)
- `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source
**Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.
## Supported Tasks
| Task | qa_type | Metrics |
|------|---------|---------|
| **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
| **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
| **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
| **Next Action** | `next_action` | Accuracy (per-dataset) |
| **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM Judge (GPT-4.1/Gemini) |
| **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
| **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |
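Several tasks accept multiple `qa_type` values. One way to normalize them before routing records to an evaluator is a simple lookup table derived from the table above (a sketch; the wrapper may organize this differently):
```python
# Map every accepted qa_type value to its canonical task key.
QA_TYPE_TO_TASK = {
    "tal": "tal",
    "stg": "stg",
    "dense_captioning": "dvc", "dense_captioning_gpt": "dvc",
    "dense_captioning_gemini": "dvc", "dc": "dvc",
    "next_action": "next_action",
    "region_caption": "rc", "region_caption_gpt": "rc", "region_caption_gemini": "rc",
    "video_summary": "vs", "video_summary_gpt": "vs", "video_summary_gemini": "vs",
    "skill_assessment": "skill_assessment",
    "cvs_assessment": "cvs_assessment",
}

def task_for(qa_type: str) -> str:
    """Return the canonical task key for a record's qa_type."""
    return QA_TYPE_TO_TASK[qa_type]
```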
## Usage Examples
### Example 1: Evaluate User Submission (Prediction-Only)
```bash
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json
# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
```
### Example 2: Evaluate Internal Results (Already Merged)
```bash
# Internal evaluation with complete data
python evaluate_predictions.py results.json
# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
```
### Example 3: Specific Tasks Only
```bash
# Evaluate only TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg
# Evaluate captioning tasks with LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
```
### Example 4: Different Grouping Modes
```bash
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset
# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```
### Example 5: Skip LLM Judge (Use Pre-computed Scores)
```bash
# Skip LLM judge evaluation for caption tasks
# Useful when LLM scores are already pre-computed in the predictions
python evaluate_predictions.py predictions.json --skip-llm-judge
```
### Example 6: Analyze File Structure
```bash
# Only analyze what tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only
# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...
```
## Command-Line Options
```
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE            Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH         Path to ground-truth JSON (default: data/ground_truth.json)
  --tasks TASK [TASK ...]     Specific tasks to evaluate (default: all available)
                              Choices: dvc, tal, next_action, stg, rc, vs,
                              skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                              Grouping strategy (default: per-dataset)
                              - per-dataset: Results per dataset
                              - overall: Aggregate all datasets
  --analyze-only              Only analyze file structure, no evaluation
  --skip-llm-judge            Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help                  Show help message
```
## Output Format
### Per-Dataset Grouping (Default)
```
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================
AVOS:
  TAL:
    recall@0.3: 0.45
    meanIoU@0.3: 0.42
    recall@0.5: 0.32
    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...
```
### Overall Grouping
```
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================
TAL - Overall Evaluation (All Datasets Combined)
  Total samples: 1637
  recall@0.3: 0.48
  meanIoU@0.3: 0.45
  recall@0.5: 0.35
  meanIoU@0.5: 0.30
```
## Workflow for User Submissions
1. **User downloads benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
- Contains questions but NO ground-truth (struc_info removed)
2. **User runs inference**: Generates predictions for each sample (a minimal build sketch follows step 4)
3. **User submits predictions** in the prediction-only format:
```json
[
  {
    "id": "<from benchmark>",
    "qa_type": "<from benchmark>",
    "prediction": "<model output>"
  },
  ...
]
```
4. **Server evaluates**:
```bash
python evaluate_predictions.py user_submission.json
```
- Auto-detects format
- Merges with server-side ground-truth
- Runs evaluation
- Returns metrics
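To illustrate steps 2 and 3, a minimal sketch of producing a submission file. The benchmark field names (`id`, `qa_type`, `question`) and the list layout are assumptions here; `run_model` is a placeholder for your own inference call:
```python
import json

def build_submission(benchmark_path: str, out_path: str, run_model) -> None:
    """Create a prediction-only submission from the downloaded benchmark (assumed layout)."""
    with open(benchmark_path) as f:
        benchmark = json.load(f)

    submission = []
    for record in benchmark:
        submission.append({
            "id": record["id"],                            # copied from the benchmark
            "qa_type": record["qa_type"],                  # copied from the benchmark
            "prediction": run_model(record["question"]),   # your model's output text
        })

    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)
```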
## File Structure
```
evaluation/
├── README.md # This file
├── evaluate_predictions.py # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py # Core evaluation orchestrator
├── eval_tal.py # TAL evaluation
├── eval_stg.py # STG evaluation
├── eval_dvc.py # Dense captioning evaluation
├── eval_next_action.py # Next action evaluation
├── eval_caption_llm_judge.py # RC/VS LLM judge evaluation
├── eval_skill_assessment.py # Skill assessment evaluation
└── eval_cvs_assessment.py # CVS assessment evaluation
```
## Key Features
### Auto-Detection Logic
The wrapper detects the input format by checking for these indicators (a code sketch follows the lists):
**Prediction-only format**:
- Has `id` field (video_id&&start&&end&&fps)
- Has `prediction` field
- Missing `gnd` or `struc_info`
**Merged format**:
- Has `question` field
- Has `gnd` field (ground-truth)
- Has `struc_info` field (structured GT)
- Has `metadata` dict
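A condensed sketch of that check, using the field names from the two formats above (the wrapper's actual implementation may differ):
```python
def looks_merged(record: dict) -> bool:
    """Heuristic check on a single record, mirroring the indicators above."""
    has_gt = "gnd" in record or "struc_info" in record
    return has_gt and "question" in record and "metadata" in record

def detect_format(records) -> str:
    # Prediction-only files are lists of {id, qa_type, prediction};
    # merged files are dicts of complete records keyed by index.
    sample = next(iter(records.values())) if isinstance(records, dict) else records[0]
    return "merged" if looks_merged(sample) else "prediction-only"
```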
### Ground-Truth Merging
When the prediction-only format is detected, the wrapper does the following (sketched in code after the steps):
1. Load predictions and ground-truth
2. Build index: `{id -> ground_truth_record}`
3. For each prediction:
- Look up ground-truth by ID
- Merge into complete record
- Add `data_source` from ground-truth
4. Save to temporary file
5. Run evaluation
6. Clean up temporary file
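In code, that flow looks roughly like the sketch below. The submission `id` is assumed to be reconstructable from the ground-truth `metadata` fields shown in Format 2; the real wrapper may index differently:
```python
def merge_predictions(predictions: list, ground_truth: list):
    """Steps 1-3 above: index ground-truth by id, then attach each prediction."""
    def gt_id(rec: dict) -> str:
        # Rebuild 'video_id&&start_frame&&end_frame&&fps' from the metadata (assumed layout).
        m = rec["metadata"]
        return "&&".join([m["video_id"], m["input_video_start_frame"],
                          m["input_video_end_frame"], m["fps"]])

    index = {gt_id(rec): rec for rec in ground_truth}
    merged, unmatched = {}, []
    for i, pred in enumerate(predictions):
        gt = index.get(pred["id"])
        if gt is None:
            unmatched.append(pred["id"])
            continue
        record = dict(gt)                       # complete ground-truth record
        record["answer"] = pred["prediction"]   # model output goes into the "answer" field
        merged[str(i)] = record
    return merged, unmatched
```
The merged records are then written to a temporary file, evaluated, and cleaned up (steps 4-6).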
### Dataset Detection
Datasets are detected from the following sources, in priority order (see the sketch below):
1. **`data_source` field** (primary, leaderboard format)
2. `dataset` field (fallback)
3. `dataset_name` field (fallback)
4. Video ID patterns (last resort):
- YouTube IDs (11 chars with letters) → AVOS
- `*_part*` pattern → CoPESD
- `video*` pattern → CholecT50
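A sketch of that fallback chain (patterns as listed above; treat as illustrative):
```python
import re

def detect_dataset(record: dict) -> str:
    """Resolve the dataset name using the priority order listed above."""
    for key in ("data_source", "dataset", "dataset_name"):
        if record.get(key):
            return record[key]

    # Last resort: infer the dataset from video-id patterns.
    video_id = record.get("metadata", {}).get("video_id", "")
    if len(video_id) == 11 and re.search(r"[A-Za-z]", video_id):
        return "AVOS"        # YouTube-style 11-char id containing letters
    if "_part" in video_id:
        return "CoPESD"
    if video_id.startswith("video"):
        return "CholecT50"
    return "unknown"
```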
## Error Handling
### Missing Ground-Truth
```bash
# If ground-truth file not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
```
**Solution**: Specify correct path with `--ground-truth`
### Unmatched Predictions
```bash
[EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
```
**Cause**: Prediction IDs don't match ground-truth IDs
**Solution**: Check ID format (video_id&&start&&end&&fps must match exactly)
### Invalid ID Format
```bash
ValueError: Invalid ID format: <id_string>
```
**Cause**: ID doesn't follow `video_id&&start&&end&&fps` format
**Solution**: Fix ID format in predictions
## API Keys for LLM Judge
For RC/VS evaluation with LLM judge:
```bash
export OPENAI_API_KEY="your-key" # For GPT-4.1
export GOOGLE_API_KEY="your-key" # For Gemini
# Then run evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
```
**Cost**: ~$0.012 per RC/VS sample (GPT-4.1)
## Verification Checklist
Before evaluating submissions:
```bash
# 1. Check file format
python evaluate_predictions.py submission.json --analyze-only
# 2. Verify ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json
# 3. Run evaluation on a small sample (first 100 records; plain `head` would truncate the JSON)
python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
python evaluate_predictions.py sample.json
# 4. If successful, run full evaluation
python evaluate_predictions.py submission.json
```
## Performance
- **Small files** (100 samples): ~5-10 seconds
- **Full benchmark** (6245 samples): ~2-5 minutes
- TAL/STG: ~30 seconds per dataset
- Next Action: ~20 seconds per dataset
- DVC: ~1-2 minutes (metric computation)
- RC/VS with LLM judge: ~5-10 minutes (API calls)
## Notes
- Ground-truth file contains **1414 records** (subset for leaderboard testing)
- Full benchmark has **6245 records** across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- LLM judge can be skipped with `--skip-llm-judge` if scores pre-computed