# MedVidBench Leaderboard Evaluation

Auto-detection wrapper for evaluating predictions with automatic ground-truth merging.

## Overview

This evaluation system supports two input formats:

1. **Merged format** (already contains ground-truth): like `results.json`
2. **Prediction-only format** (needs ground-truth): like user submissions

The wrapper automatically detects which format you're using and handles ground-truth merging if needed.

## Quick Start

```bash
# Evaluate predictions (auto-detects format)
python evaluate_predictions.py predictions.json

# Evaluate specific tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Only analyze file structure
python evaluate_predictions.py predictions.json --analyze-only

# Use overall grouping (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```

## Input Formats

### Format 1: Prediction-Only (User Submission Format)

```json
[
  {
    "id": "kcOqlifSukA&&22425&&25124&&1.0",
    "qa_type": "tal",
    "prediction": "22.0-78.0, 89.0-94.0 seconds."
  },
  ...
]
```

**ID format**: `video_id&&start_frame&&end_frame&&fps`

**Required fields**:

- `id`: Unique identifier matching the ground-truth
- `qa_type`: Task type (`tal`, `stg`, `dvc`, `next_action`, `rc`, `vs`, `skill_assessment`, `cvs_assessment`)
- `prediction`: The model's prediction text

**What happens**: The wrapper automatically merges the predictions with `ground_truth.json` to create complete evaluation records.
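Since mismatched IDs are the failure mode the wrapper warns about (see Error Handling below), it can help to sanity-check the composite ID format before submitting. A minimal sketch (`parse_submission_id` is a hypothetical helper, not part of the wrapper):

```python
# Minimal sketch: validate a prediction-only record's composite ID.
# (Illustrative only -- not part of evaluate_predictions.py.)
def parse_submission_id(record_id: str) -> dict:
    """Split 'video_id&&start_frame&&end_frame&&fps' into its parts."""
    parts = record_id.split("&&")
    if len(parts) != 4:
        raise ValueError(f"Invalid ID format: {record_id}")
    video_id, start_frame, end_frame, fps = parts
    return {
        "video_id": video_id,
        "start_frame": int(start_frame),
        "end_frame": int(end_frame),
        "fps": float(fps),
    }

print(parse_submission_id("kcOqlifSukA&&22425&&25124&&1.0"))
# {'video_id': 'kcOqlifSukA', 'start_frame': 22425, 'end_frame': 25124, 'fps': 1.0}
```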
### Format 2: Merged (Complete Format)

```json
{
  "0": {
    "metadata": {
      "video_id": "kcOqlifSukA",
      "fps": "1.0",
      "input_video_start_frame": "22425",
      "input_video_end_frame": "25124"
    },
    "qa_type": "tal",
    "struc_info": [...],
    "question": "...",
    "gnd": "0.0-10.0 seconds.",
    "answer": "22.0-78.0, 89.0-94.0 seconds.",
    "data_source": "AVOS"
  },
  ...
}
```

**Required fields**:

- `metadata`: Video metadata (video_id, fps, frame range)
- `qa_type`: Task type
- `struc_info`: Ground-truth structured information
- `question`: Question text
- `gnd`: Ground-truth answer
- `answer`: Model prediction
- `data_source`: Dataset name

**What happens**: The wrapper uses the file directly for evaluation.

## Ground-Truth File

**Location**: `/root/code/MedVidBench-Leaderboard/data/ground_truth.json`

**Structure**: Array of records, each containing:

- Complete metadata (video_id, fps, frame range)
- `struc_info`: Structured ground-truth (spans for TAL/STG, boxes for RC, etc.)
- Ground-truth answer
- Dataset source

**Note**: This file is NOT public. Users submit prediction-only files, which are merged server-side.

## Supported Tasks

| Task | qa_type | Metrics |
|------|---------|---------|
| **TAL** | `tal` | Recall@0.3/0.5, mIoU@0.3/0.5 |
| **STG** | `stg` | IoU@0.3/0.5/0.7, mIoU |
| **DVC** | `dense_captioning`, `dense_captioning_gpt`, `dense_captioning_gemini`, `dc` | CIDEr, METEOR, Precision, Recall, F1, SODA_c |
| **Next Action** | `next_action` | Accuracy (per dataset) |
| **RC** | `region_caption`, `region_caption_gpt`, `region_caption_gemini` | LLM judge (GPT-4.1/Gemini) |
| **VS** | `video_summary`, `video_summary_gpt`, `video_summary_gemini` | LLM judge (GPT-4.1/Gemini) |
| **Skill Assessment** | `skill_assessment` | Accuracy, Macro F1, Weighted F1 |
| **CVS Assessment** | `cvs_assessment` | Accuracy, Precision, Recall, F1 |

## Usage Examples

### Example 1: Evaluate a User Submission (Prediction-Only)

```bash
# User submits predictions in prediction-only format
python evaluate_predictions.py user_predictions.json

# Output:
# [EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)
# [EvaluationWrapper] Merging with ground-truth...
# [EvaluationWrapper] ✓ Successfully merged 6245/6245 predictions
# ... [evaluation results] ...
```

### Example 2: Evaluate Internal Results (Already Merged)

```bash
# Internal evaluation with complete data
python evaluate_predictions.py results.json

# Output:
# [EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
# [EvaluationWrapper] Using predictions file directly for evaluation
# ... [evaluation results] ...
```

### Example 3: Specific Tasks Only

```bash
# Evaluate only the TAL and STG tasks
python evaluate_predictions.py predictions.json --tasks tal stg

# Evaluate the captioning tasks with the LLM judge
python evaluate_predictions.py predictions.json --tasks rc vs dvc
```

### Example 4: Different Grouping Modes

```bash
# Per-dataset results (default)
python evaluate_predictions.py predictions.json --grouping per-dataset

# Overall results (aggregate all datasets)
python evaluate_predictions.py predictions.json --grouping overall
```

### Example 5: Skip the LLM Judge (Use Pre-computed Scores)

```bash
# Skip LLM judge evaluation for caption tasks
# (useful when LLM scores are already pre-computed in the predictions)
python evaluate_predictions.py predictions.json --skip-llm-judge
```

### Example 6: Analyze File Structure

```bash
# Only analyze which tasks/datasets are present
python evaluate_predictions.py predictions.json --analyze-only

# Output:
# Found QA types:
#   tal: 1637 records
#   stg: 780 records
#   ...
# Found datasets:
#   AVOS: 321 records
#   CholecT50: 409 records
#   ...
```

## Command-Line Options

```
python evaluate_predictions.py PREDICTIONS_FILE [OPTIONS]

Required:
  PREDICTIONS_FILE          Path to predictions JSON (merged or prediction-only format)

Optional:
  --ground-truth PATH       Path to ground-truth JSON
                            (default: data/ground_truth.json)
  --tasks TASK [TASK ...]   Specific tasks to evaluate (default: all available)
                            Choices: dvc, tal, next_action, stg, rc, vs,
                            skill_assessment, cvs_assessment
  --grouping {per-dataset,overall}
                            Grouping strategy (default: per-dataset)
                            - per-dataset: results per dataset
                            - overall: aggregate all datasets
  --analyze-only            Only analyze file structure; no evaluation
  --skip-llm-judge          Skip LLM judge for caption tasks (use pre-computed scores)
  -h, --help                Show help message
```

## Output Format

### Per-Dataset Grouping (Default)

```
================================================================================
EVALUATION RESULTS - PER DATASET
================================================================================

AVOS:
  TAL:
    recall@0.3: 0.45    meanIoU@0.3: 0.42
    recall@0.5: 0.32    meanIoU@0.5: 0.28

CholecT50:
  TAL:
    recall@0.3: 0.52
    ...
```

### Overall Grouping

```
================================================================================
EVALUATION RESULTS - OVERALL (Dataset-Agnostic)
================================================================================

TAL - Overall Evaluation (All Datasets Combined)
  Total samples: 1637
  recall@0.3: 0.48    meanIoU@0.3: 0.45
  recall@0.5: 0.35    meanIoU@0.5: 0.30
```
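The two grouping modes differ only in how records are bucketed before metrics are computed. A minimal sketch of that bucketing (`group_records` is a hypothetical helper, not the wrapper's actual API):

```python
# Minimal sketch: how --grouping might bucket merged records before scoring.
# (Illustrative only; the actual logic lives in evaluate_all_pai.py.)
from collections import defaultdict

def group_records(records: list[dict], grouping: str = "per-dataset") -> dict:
    """Bucket merged records by (dataset, task) or by task alone."""
    buckets = defaultdict(list)
    for rec in records:
        if grouping == "per-dataset":
            key = (rec["data_source"], rec["qa_type"])  # e.g. ("AVOS", "tal")
        else:  # "overall": all datasets combined per task
            key = rec["qa_type"]
        buckets[key].append(rec)
    return buckets
```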
## Workflow for User Submissions

1. **User downloads the benchmark**: `/root/code/MedVidBench/cleaned_test_data_11_04.json`
   - Contains questions but NO ground-truth (`struc_info` removed)
2. **User runs inference**: generates a prediction for each sample
3. **User submits predictions** in the prediction-only format:
   ```json
   [
     {
       "id": "",
       "qa_type": "",
       "prediction": ""
     },
     ...
   ]
   ```
4. **Server evaluates**:
   ```bash
   python evaluate_predictions.py user_submission.json
   ```
   - Auto-detects the format
   - Merges with the server-side ground-truth
   - Runs the evaluation
   - Returns metrics

## File Structure

```
evaluation/
├── README.md                   # This file
├── evaluate_predictions.py     # Main wrapper (auto-detection + merging)
├── evaluate_all_pai.py         # Core evaluation orchestrator
├── eval_tal.py                 # TAL evaluation
├── eval_stg.py                 # STG evaluation
├── eval_dvc.py                 # Dense captioning evaluation
├── eval_next_action.py         # Next action evaluation
├── eval_caption_llm_judge.py   # RC/VS LLM judge evaluation
├── eval_skill_assessment.py    # Skill assessment evaluation
└── eval_cvs_assessment.py      # CVS assessment evaluation
```

## Key Features

### Auto-Detection Logic

The wrapper detects the format by checking for these indicators:

**Prediction-only format**:
- Has an `id` field (`video_id&&start&&end&&fps`)
- Has a `prediction` field
- Missing `gnd` or `struc_info`

**Merged format**:
- Has a `question` field
- Has a `gnd` field (ground-truth)
- Has a `struc_info` field (structured GT)
- Has a `metadata` dict

### Ground-Truth Merging

When the prediction-only format is detected, the wrapper:

1. Loads the predictions and the ground-truth
2. Builds an index: `{id -> ground_truth_record}`
3. For each prediction:
   - Looks up the ground-truth by ID
   - Merges it into a complete record
   - Adds `data_source` from the ground-truth
4. Saves the result to a temporary file
5. Runs the evaluation
6. Cleans up the temporary file

### Dataset Detection

Datasets are detected from, in order:

1. The `data_source` field (primary, leaderboard format)
2. The `dataset` field (fallback)
3. The `dataset_name` field (fallback)
4. Video ID patterns (last resort):
   - YouTube IDs (11 characters with letters) → AVOS
   - `*_part*` pattern → CoPESD
   - `video*` pattern → CholecT50
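Conceptually, the merging step above reduces to an index lookup plus a dictionary merge. A minimal sketch, assuming each ground-truth record carries the same composite `id` field and that the model's prediction becomes the merged record's `answer` (illustrative only; the real implementation lives in `evaluate_predictions.py`):

```python
# Minimal sketch of the ground-truth merge described above (illustrative only).
import json

def merge_with_ground_truth(pred_path: str, gt_path: str) -> list[dict]:
    """Merge prediction-only records with ground-truth records by ID."""
    with open(pred_path) as f:
        predictions = json.load(f)
    with open(gt_path) as f:
        ground_truth = json.load(f)

    # Step 2: build the {id -> ground_truth_record} index
    # (assumes each ground-truth record exposes the composite id).
    gt_index = {rec["id"]: rec for rec in ground_truth}

    merged, unmatched = [], []
    for pred in predictions:
        gt = gt_index.get(pred["id"])
        if gt is None:
            unmatched.append(pred["id"])
            continue
        # Step 3: merge into a complete record; the prediction text
        # fills the `answer` field of the merged format.
        merged.append({**gt, "answer": pred["prediction"]})

    if unmatched:
        print(f"WARNING: {len(unmatched)} predictions not found in ground-truth")
    return merged
```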
## Error Handling

### Missing Ground-Truth

```bash
# If the ground-truth file is not found
[EvaluationWrapper] ❌ ERROR: Ground-truth file not found: /path/to/ground_truth.json
```

**Solution**: Specify the correct path with `--ground-truth`.

### Unmatched Predictions

```bash
[EvaluationWrapper] ⚠️ WARNING: 10 predictions not found in ground-truth
[EvaluationWrapper] First 5 unmatched IDs: [...]
```

**Cause**: Prediction IDs don't match the ground-truth IDs.
**Solution**: Check the ID format (`video_id&&start&&end&&fps` must match exactly).

### Invalid ID Format

```bash
ValueError: Invalid ID format:
```

**Cause**: The ID doesn't follow the `video_id&&start&&end&&fps` format.
**Solution**: Fix the ID format in the predictions.

## API Keys for LLM Judge

For RC/VS evaluation with the LLM judge:

```bash
export OPENAI_API_KEY="your-key"   # For GPT-4.1
export GOOGLE_API_KEY="your-key"   # For Gemini

# Then run the evaluation
python evaluate_predictions.py predictions.json --tasks rc vs
```

**Cost**: ~$0.012 per RC/VS sample (GPT-4.1)

## Verification Checklist

Before evaluating submissions:

```bash
# 1. Check the file format
python evaluate_predictions.py submission.json --analyze-only

# 2. Verify the ground-truth file exists
ls -lh /root/code/MedVidBench-Leaderboard/data/ground_truth.json

# 3. Run the evaluation on a sample (first 100 records; note that
#    `head` would cut the JSON array mid-record, so slice it properly)
python -c "import json; json.dump(json.load(open('submission.json'))[:100], open('sample.json', 'w'))"
python evaluate_predictions.py sample.json

# 4. If successful, run the full evaluation
python evaluate_predictions.py submission.json
```

## Performance

- **Small files** (100 samples): ~5-10 seconds
- **Full benchmark** (6245 samples): ~2-5 minutes
  - TAL/STG: ~30 seconds per dataset
  - Next Action: ~20 seconds per dataset
  - DVC: ~1-2 minutes (metric computation)
  - RC/VS with LLM judge: ~5-10 minutes (API calls)

## Notes

- The ground-truth file contains **1414 records** (a subset for leaderboard testing)
- The full benchmark has **6245 records** across 8 datasets
- Temporary files are automatically cleaned up after evaluation
- The LLM judge can be skipped with `--skip-llm-judge` if scores are pre-computed
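As a final pre-upload check against the counts above, a short script can confirm that every record is well-formed and that the totals look right. A minimal sketch, assuming the prediction-only format and a hypothetical `user_predictions.json`:

```python
# Minimal pre-submission sanity check (illustrative, not part of the repo).
import json
from collections import Counter

with open("user_predictions.json") as f:  # hypothetical submission file
    preds = json.load(f)

# Every record needs the three required prediction-only fields.
missing = [r for r in preds if not {"id", "qa_type", "prediction"} <= r.keys()]
print(f"{len(preds)} records, {len(missing)} with missing fields")
# Expect 6245 records for the full benchmark.
print(Counter(r["qa_type"] for r in preds))  # per-task record counts
```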