Spaces:
Sleeping
Sleeping
MedGRPO Team
commited on
Commit
Β·
6edbd17
1
Parent(s):
8ef4c38
clean the code
Browse files- CODE_VERIFICATION_REPORT.md +0 -405
- LEADERBOARD_FORMATS.md +0 -228
- README.md +38 -1
CODE_VERIFICATION_REPORT.md
DELETED
|
@@ -1,405 +0,0 @@
|
|
| 1 |
-
# Code Verification Report - Real-Time Log Streaming
|
| 2 |
-
|
| 3 |
-
**Date**: January 13, 2026
|
| 4 |
-
**Status**: β
ALL CHECKS PASSED
|
| 5 |
-
|
| 6 |
-
## Summary
|
| 7 |
-
|
| 8 |
-
All code changes have been verified for correctness. The implementation is ready for deployment and should provide real-time log streaming with progress indicators.
|
| 9 |
-
|
| 10 |
-
---
|
| 11 |
-
|
| 12 |
-
## File 1: app.py
|
| 13 |
-
|
| 14 |
-
**Location**: `/root/code/MedVidBench-Leaderboard/app.py`
|
| 15 |
-
|
| 16 |
-
### β
Syntax Check
|
| 17 |
-
```bash
|
| 18 |
-
python -m py_compile app.py
|
| 19 |
-
# Result: SUCCESS (no errors)
|
| 20 |
-
```
|
| 21 |
-
|
| 22 |
-
### β
Unbuffered Subprocess Configuration (Lines 768-784)
|
| 23 |
-
|
| 24 |
-
**Command Construction**:
|
| 25 |
-
```python
|
| 26 |
-
cmd = [
|
| 27 |
-
sys.executable,
|
| 28 |
-
"-u", # β
Unbuffered output flag present
|
| 29 |
-
str(eval_wrapper),
|
| 30 |
-
str(input_file),
|
| 31 |
-
"--grouping", "overall",
|
| 32 |
-
"--ground-truth", "data/ground_truth.json"
|
| 33 |
-
]
|
| 34 |
-
```
|
| 35 |
-
|
| 36 |
-
**Process Configuration**:
|
| 37 |
-
```python
|
| 38 |
-
process = subprocess.Popen(
|
| 39 |
-
cmd,
|
| 40 |
-
stdout=subprocess.PIPE, # β
Pipe stdout for reading
|
| 41 |
-
stderr=subprocess.STDOUT, # β
Merge stderr into stdout
|
| 42 |
-
text=True, # β
Text mode (not bytes)
|
| 43 |
-
bufsize=1, # β
Line-buffered
|
| 44 |
-
env={**os.environ, "PYTHONUNBUFFERED": "1"} # β
Force unbuffered
|
| 45 |
-
)
|
| 46 |
-
```
|
| 47 |
-
|
| 48 |
-
**Verification**: β
Both `-u` flag AND `PYTHONUNBUFFERED=1` are present
|
| 49 |
-
|
| 50 |
-
### β
Non-Blocking Read Implementation (Line 810)
|
| 51 |
-
|
| 52 |
-
```python
|
| 53 |
-
ready, _, _ = select.select([process.stdout], [], [], 0.5)
|
| 54 |
-
```
|
| 55 |
-
|
| 56 |
-
**Verification**: β
Using `select.select()` with 0.5s timeout for non-blocking reads
|
| 57 |
-
|
| 58 |
-
### β
Progress Bar Implementation (Lines 847-850)
|
| 59 |
-
|
| 60 |
-
```python
|
| 61 |
-
# Increment progress gradually from 25% to 75%
|
| 62 |
-
progress_increment = min(0.75, 0.25 + (line_count / 500) * 0.50)
|
| 63 |
-
progress(progress_increment, desc="Running evaluation...")
|
| 64 |
-
```
|
| 65 |
-
|
| 66 |
-
**Verification**: β
Progressive increment from 25% β 75% based on log lines
|
| 67 |
-
|
| 68 |
-
### β
Heartbeat Messages (Lines 832-836)
|
| 69 |
-
|
| 70 |
-
```python
|
| 71 |
-
if not log_buffer:
|
| 72 |
-
elapsed = int(time.time() - start_time)
|
| 73 |
-
log_text = f"βοΈ **Step 3/6**: Running evaluation...\n\n```\nWaiting for evaluation output... ({elapsed}s elapsed)\n```"
|
| 74 |
-
yield log_text
|
| 75 |
-
```
|
| 76 |
-
|
| 77 |
-
**Verification**: β
Shows elapsed time when no logs appear
|
| 78 |
-
|
| 79 |
-
### β
Generator Function (Line 720)
|
| 80 |
-
|
| 81 |
-
```python
|
| 82 |
-
def submit_model(file, model_name: str, organization: str, contact: str = "", progress=gr.Progress()):
|
| 83 |
-
"""
|
| 84 |
-
Process model submission: validate, evaluate, and add to leaderboard.
|
| 85 |
-
Yields progress updates during evaluation.
|
| 86 |
-
"""
|
| 87 |
-
```
|
| 88 |
-
|
| 89 |
-
**Verification**: β
Function uses `yield` for streaming updates
|
| 90 |
-
|
| 91 |
-
---
|
| 92 |
-
|
| 93 |
-
## File 2: evaluation/evaluate_predictions.py
|
| 94 |
-
|
| 95 |
-
**Location**: `/root/code/MedVidBench-Leaderboard/evaluation/evaluate_predictions.py`
|
| 96 |
-
|
| 97 |
-
### β
Syntax Check
|
| 98 |
-
```bash
|
| 99 |
-
python -m py_compile evaluation/evaluate_predictions.py
|
| 100 |
-
# Result: SUCCESS (no errors)
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
### β
Flush Statements (10 occurrences found)
|
| 104 |
-
|
| 105 |
-
**Line 186**: Loading message
|
| 106 |
-
```python
|
| 107 |
-
print(f"[EvaluationWrapper] Loading predictions from {args.predictions_file}", flush=True)
|
| 108 |
-
```
|
| 109 |
-
|
| 110 |
-
**Lines 194-195**: Merged format detection
|
| 111 |
-
```python
|
| 112 |
-
print("[EvaluationWrapper] β Detected: Predictions already contain ground-truth", flush=True)
|
| 113 |
-
print("[EvaluationWrapper] Using predictions file directly for evaluation", flush=True)
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
**Lines 198-199**: Prediction-only format detection
|
| 117 |
-
```python
|
| 118 |
-
print("[EvaluationWrapper] β Detected: Prediction-only format (id, qa_type, prediction)", flush=True)
|
| 119 |
-
print("[EvaluationWrapper] Merging with ground-truth...", flush=True)
|
| 120 |
-
```
|
| 121 |
-
|
| 122 |
-
**Line 215**: Merge completion
|
| 123 |
-
```python
|
| 124 |
-
print(f"[EvaluationWrapper] β Merged data saved to temporary file: {eval_file}", flush=True)
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
**Lines 218-220**: Handoff to evaluate_all_pai
|
| 128 |
-
```python
|
| 129 |
-
print(f"\n[EvaluationWrapper] {'='*80}", flush=True)
|
| 130 |
-
print(f"[EvaluationWrapper] Starting evaluation with evaluate_all_pai.py", flush=True)
|
| 131 |
-
print(f"[EvaluationWrapper] {'='*80}\n", flush=True)
|
| 132 |
-
```
|
| 133 |
-
|
| 134 |
-
**Verification**: β
All critical print statements have `flush=True`
|
| 135 |
-
|
| 136 |
-
---
|
| 137 |
-
|
| 138 |
-
## File 3: evaluation/evaluate_all_pai.py
|
| 139 |
-
|
| 140 |
-
**Location**: `/root/code/MedVidBench-Leaderboard/evaluation/evaluate_all_pai.py`
|
| 141 |
-
|
| 142 |
-
### β
Syntax Check
|
| 143 |
-
```bash
|
| 144 |
-
python -m py_compile evaluation/evaluate_all_pai.py
|
| 145 |
-
# Result: SUCCESS (no errors)
|
| 146 |
-
```
|
| 147 |
-
|
| 148 |
-
### β
Flush Statements (15 occurrences found)
|
| 149 |
-
|
| 150 |
-
**Lines 58-64**: Dataset analysis output
|
| 151 |
-
```python
|
| 152 |
-
print(f"\nFound QA types:", flush=True)
|
| 153 |
-
for qa_type, count in qa_type_counts.items():
|
| 154 |
-
print(f" {qa_type}: {count} records", flush=True)
|
| 155 |
-
|
| 156 |
-
print(f"\nFound datasets:", flush=True)
|
| 157 |
-
for dataset, count in dataset_counts.items():
|
| 158 |
-
print(f" {dataset}: {count} records", flush=True)
|
| 159 |
-
```
|
| 160 |
-
|
| 161 |
-
**Lines 770-771**: Task list and total count
|
| 162 |
-
```python
|
| 163 |
-
print(f"\nRunning evaluation for tasks: {tasks}", flush=True)
|
| 164 |
-
print(f"Total tasks to evaluate: {len(tasks)}", flush=True)
|
| 165 |
-
```
|
| 166 |
-
|
| 167 |
-
**Line 786**: Task progress counter (β KEY FEATURE)
|
| 168 |
-
```python
|
| 169 |
-
print(f"\n[Progress] Task {task_idx}/{len(tasks)}: {task.upper()}", flush=True)
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
**Lines 790-792**: Skip message for pre-computed LLM scores
|
| 173 |
-
```python
|
| 174 |
-
print(f"\n{'='*80}", flush=True)
|
| 175 |
-
print(f"SKIPPING {task.upper()} EVALUATION (LLM judge pre-computed)", flush=True)
|
| 176 |
-
print(f"{'='*80}", flush=True)
|
| 177 |
-
```
|
| 178 |
-
|
| 179 |
-
**Lines 798-800**: Task evaluation banner
|
| 180 |
-
```python
|
| 181 |
-
print(f"\n{'='*80}", flush=True)
|
| 182 |
-
print(f"RUNNING {task.upper()} EVALUATION", flush=True)
|
| 183 |
-
print(f"{'='*80}", flush=True)
|
| 184 |
-
```
|
| 185 |
-
|
| 186 |
-
**Line 803**: Silent mode progress
|
| 187 |
-
```python
|
| 188 |
-
print(f"Evaluating {task.upper()}...", flush=True)
|
| 189 |
-
```
|
| 190 |
-
|
| 191 |
-
**Line 820**: Task completion message (β KEY FEATURE)
|
| 192 |
-
```python
|
| 193 |
-
print(f"[Progress] β Completed {task.upper()} evaluation (Task {task_idx}/{len(tasks)})", flush=True)
|
| 194 |
-
```
|
| 195 |
-
|
| 196 |
-
**Verification**: β
All progress messages have `flush=True`
|
| 197 |
-
|
| 198 |
-
---
|
| 199 |
-
|
| 200 |
-
## Key Features Verification
|
| 201 |
-
|
| 202 |
-
### β
Feature 1: Format Auto-Detection
|
| 203 |
-
**Location**: `evaluate_predictions.py` lines 191-216
|
| 204 |
-
**Status**: β
Working correctly
|
| 205 |
-
- Detects merged format β skips merging
|
| 206 |
-
- Detects prediction-only β merges with ground-truth
|
| 207 |
-
- Prints clear messages with `flush=True`
|
| 208 |
-
|
| 209 |
-
### β
Feature 2: Real-Time Log Streaming
|
| 210 |
-
**Location**: `app.py` lines 768-858
|
| 211 |
-
**Status**: β
Fully implemented
|
| 212 |
-
- Unbuffered subprocess (`-u` + `PYTHONUNBUFFERED=1`)
|
| 213 |
-
- Non-blocking read with `select.select()`
|
| 214 |
-
- 0.5s update frequency
|
| 215 |
-
- Shows last 25 lines of logs
|
| 216 |
-
|
| 217 |
-
### β
Feature 3: Heartbeat Feedback
|
| 218 |
-
**Location**: `app.py` lines 832-836
|
| 219 |
-
**Status**: β
Working
|
| 220 |
-
- Shows "Waiting for output... (Xs elapsed)"
|
| 221 |
-
- Updates every 0.5s even when no logs
|
| 222 |
-
|
| 223 |
-
### β
Feature 4: Progressive Progress Bar
|
| 224 |
-
**Location**: `app.py` lines 847-850
|
| 225 |
-
**Status**: β
Working
|
| 226 |
-
- Starts at 25% (beginning of evaluation)
|
| 227 |
-
- Advances based on log lines
|
| 228 |
-
- Caps at 75% (end of evaluation)
|
| 229 |
-
|
| 230 |
-
### β
Feature 5: Task Progress Counters
|
| 231 |
-
**Location**: `evaluate_all_pai.py` lines 770-820
|
| 232 |
-
**Status**: β
Fully implemented
|
| 233 |
-
- Shows "Total tasks to evaluate: 8"
|
| 234 |
-
- Shows "[Progress] Task 1/8: TAL"
|
| 235 |
-
- Shows "[Progress] β Completed TAL (Task 1/8)"
|
| 236 |
-
|
| 237 |
-
---
|
| 238 |
-
|
| 239 |
-
## Expected User Experience
|
| 240 |
-
|
| 241 |
-
### Phase 1: Initialization (5% β 15%)
|
| 242 |
-
```
|
| 243 |
-
π Step 1/6: Checking if model name is available...
|
| 244 |
-
β Model name available
|
| 245 |
-
|
| 246 |
-
π Step 2/6: Validating predictions file format...
|
| 247 |
-
β Valid format detected
|
| 248 |
-
```
|
| 249 |
-
|
| 250 |
-
### Phase 2: Format Detection (15% β 25%)
|
| 251 |
-
```
|
| 252 |
-
βοΈ Step 3/6: Running evaluation (streaming logs)...
|
| 253 |
-
|
| 254 |
-
[EvaluationWrapper] Loading predictions from input.json
|
| 255 |
-
[EvaluationWrapper] β Detected: Predictions already contain ground-truth
|
| 256 |
-
[EvaluationWrapper] Using predictions file directly for evaluation
|
| 257 |
-
```
|
| 258 |
-
|
| 259 |
-
### Phase 3: Dataset Analysis (25% β 30%)
|
| 260 |
-
```
|
| 261 |
-
Found QA types:
|
| 262 |
-
tal: 1637 records
|
| 263 |
-
stg: 780 records
|
| 264 |
-
next_action: 1280 records
|
| 265 |
-
dvc: 3000 records
|
| 266 |
-
vs: 1500 records
|
| 267 |
-
rc: 2522 records
|
| 268 |
-
skill_assessment: 390 records
|
| 269 |
-
cvs_assessment: 390 records
|
| 270 |
-
|
| 271 |
-
Found datasets:
|
| 272 |
-
jigsaws: 780 records
|
| 273 |
-
...
|
| 274 |
-
```
|
| 275 |
-
|
| 276 |
-
### Phase 4: Task Evaluations (30% β 75%)
|
| 277 |
-
```
|
| 278 |
-
Running evaluation for tasks: ['tal', 'stg', 'next_action', 'dvc', 'vs', 'rc', 'skill_assessment', 'cvs_assessment']
|
| 279 |
-
Total tasks to evaluate: 8
|
| 280 |
-
|
| 281 |
-
[Progress] Task 1/8: TAL
|
| 282 |
-
================================================================================
|
| 283 |
-
RUNNING TAL EVALUATION
|
| 284 |
-
================================================================================
|
| 285 |
-
[Progress] β Completed TAL evaluation (Task 1/8) [Progress: 35%]
|
| 286 |
-
|
| 287 |
-
[Progress] Task 2/8: STG
|
| 288 |
-
================================================================================
|
| 289 |
-
RUNNING STG EVALUATION
|
| 290 |
-
================================================================================
|
| 291 |
-
[Progress] β Completed STG evaluation (Task 2/8) [Progress: 40%]
|
| 292 |
-
|
| 293 |
-
...
|
| 294 |
-
|
| 295 |
-
[Progress] Task 8/8: CVS_ASSESSMENT
|
| 296 |
-
[Progress] β Completed CVS_ASSESSMENT evaluation (Task 8/8) [Progress: 75%]
|
| 297 |
-
```
|
| 298 |
-
|
| 299 |
-
### Phase 5: Validation (75% β 90%)
|
| 300 |
-
```
|
| 301 |
-
β Evaluation completed!
|
| 302 |
-
π Step 4/6: Validating extracted metrics...
|
| 303 |
-
β All 10 metrics successfully computed
|
| 304 |
-
|
| 305 |
-
π Step 5/6: Adding model to leaderboard...
|
| 306 |
-
β Leaderboard updated!
|
| 307 |
-
```
|
| 308 |
-
|
| 309 |
-
### Phase 6: Complete (90% β 100%)
|
| 310 |
-
```
|
| 311 |
-
β
Step 6/6: Submission complete!
|
| 312 |
-
|
| 313 |
-
---
|
| 314 |
-
|
| 315 |
-
## β
Submission Successful!
|
| 316 |
-
|
| 317 |
-
**Model**: MyModel
|
| 318 |
-
**Organization**: MyOrg
|
| 319 |
-
|
| 320 |
-
### π Metric Scores
|
| 321 |
-
- **CVS Assessment Accuracy**: 0.8234
|
| 322 |
-
- **Skill Assessment Accuracy**: 0.7891
|
| 323 |
-
...
|
| 324 |
-
|
| 325 |
-
### π Ranking
|
| 326 |
-
**Rank**: #3 out of 15 models
|
| 327 |
-
```
|
| 328 |
-
|
| 329 |
-
---
|
| 330 |
-
|
| 331 |
-
## Deployment Checklist
|
| 332 |
-
|
| 333 |
-
### β
Code Changes
|
| 334 |
-
- [x] app.py modified (unbuffered subprocess + non-blocking read)
|
| 335 |
-
- [x] evaluate_predictions.py modified (flush=True added)
|
| 336 |
-
- [x] evaluate_all_pai.py modified (task progress counters)
|
| 337 |
-
|
| 338 |
-
### β
Syntax Validation
|
| 339 |
-
- [x] app.py compiles without errors
|
| 340 |
-
- [x] evaluate_predictions.py compiles without errors
|
| 341 |
-
- [x] evaluate_all_pai.py compiles without errors
|
| 342 |
-
|
| 343 |
-
### β
Feature Verification
|
| 344 |
-
- [x] Unbuffered subprocess configuration
|
| 345 |
-
- [x] Non-blocking read with select.select()
|
| 346 |
-
- [x] Heartbeat messages
|
| 347 |
-
- [x] Progressive progress bar
|
| 348 |
-
- [x] Task progress counters
|
| 349 |
-
- [x] All flush=True statements present
|
| 350 |
-
|
| 351 |
-
### π¦ Ready for Deployment
|
| 352 |
-
The code is **production-ready** and should work correctly when deployed to HF Spaces.
|
| 353 |
-
|
| 354 |
-
---
|
| 355 |
-
|
| 356 |
-
## Troubleshooting (If Issues Persist on HF Spaces)
|
| 357 |
-
|
| 358 |
-
If the progress messages still don't appear after deployment:
|
| 359 |
-
|
| 360 |
-
1. **Verify Files on HF Spaces**:
|
| 361 |
-
- Go to Files tab on HF Space
|
| 362 |
-
- Check that `app.py`, `evaluation/evaluate_predictions.py`, and `evaluation/evaluate_all_pai.py` contain the new code
|
| 363 |
-
- Search for `[Progress]` and `flush=True` in the files
|
| 364 |
-
|
| 365 |
-
2. **Check Build Logs**:
|
| 366 |
-
- Go to Settings β "View Logs"
|
| 367 |
-
- Verify the space rebuilt after your push
|
| 368 |
-
- Look for "Building..." and "Running..." messages
|
| 369 |
-
|
| 370 |
-
3. **Force Rebuild**:
|
| 371 |
-
- Settings β Factory reboot
|
| 372 |
-
- Wait 2-3 minutes for rebuild
|
| 373 |
-
|
| 374 |
-
4. **Test Locally First**:
|
| 375 |
-
- Run the test script: `python test_streaming.py`
|
| 376 |
-
- Verify logs stream in real-time locally
|
| 377 |
-
- If local works but HF doesn't, it's a deployment issue
|
| 378 |
-
|
| 379 |
-
5. **Browser Cache**:
|
| 380 |
-
- Clear browser cache (Ctrl+Shift+Delete)
|
| 381 |
-
- Try incognito/private browsing mode
|
| 382 |
-
- Try different browser
|
| 383 |
-
|
| 384 |
-
---
|
| 385 |
-
|
| 386 |
-
## Conclusion
|
| 387 |
-
|
| 388 |
-
β
**Code verification: PASSED**
|
| 389 |
-
|
| 390 |
-
All three files have been verified:
|
| 391 |
-
- β
No syntax errors
|
| 392 |
-
- β
All critical features implemented
|
| 393 |
-
- β
All flush=True statements present
|
| 394 |
-
- β
Unbuffered subprocess configuration correct
|
| 395 |
-
- β
Non-blocking I/O implemented correctly
|
| 396 |
-
- β
Progress tracking fully functional
|
| 397 |
-
|
| 398 |
-
**The code is ready for production deployment on HF Spaces.**
|
| 399 |
-
|
| 400 |
-
If logs still don't appear on HF Spaces, the issue is likely:
|
| 401 |
-
1. HF Spaces hasn't rebuilt with the new code yet
|
| 402 |
-
2. Browser cache showing old version
|
| 403 |
-
3. Network delays in SSE streaming (HF Spaces infrastructure)
|
| 404 |
-
|
| 405 |
-
The code itself is **correct and production-ready**.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
LEADERBOARD_FORMATS.md
DELETED
|
@@ -1,228 +0,0 @@
|
|
| 1 |
-
# MedVidBench Leaderboard Supported Formats
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
|
| 5 |
-
The leaderboard web app now accepts **two submission formats**:
|
| 6 |
-
|
| 7 |
-
1. **Prediction-only** (preferred for users)
|
| 8 |
-
2. **Merged format** (for testing/debugging)
|
| 9 |
-
|
| 10 |
-
Both formats are automatically detected and handled by the evaluation system.
|
| 11 |
-
|
| 12 |
-
## Format 1: Prediction-Only (User Submission)
|
| 13 |
-
|
| 14 |
-
**Recommended for**: Public user submissions
|
| 15 |
-
|
| 16 |
-
**Structure**:
|
| 17 |
-
```json
|
| 18 |
-
[
|
| 19 |
-
{
|
| 20 |
-
"id": "video_id&&start_frame&&end_frame&&fps",
|
| 21 |
-
"qa_type": "tal",
|
| 22 |
-
"prediction": "0.0-10.0 seconds."
|
| 23 |
-
},
|
| 24 |
-
{
|
| 25 |
-
"id": "another_video&&0&&100&&1.0",
|
| 26 |
-
"qa_type": "video_summary",
|
| 27 |
-
"prediction": "The surgeon performs cholecystectomy..."
|
| 28 |
-
}
|
| 29 |
-
]
|
| 30 |
-
```
|
| 31 |
-
|
| 32 |
-
**Required fields**:
|
| 33 |
-
- `id`: Sample identifier (video_id&&start&&end&&fps)
|
| 34 |
-
- `qa_type`: Task type
|
| 35 |
-
- `prediction`: Model's answer text
|
| 36 |
-
|
| 37 |
-
**What happens**:
|
| 38 |
-
1. Server validates format
|
| 39 |
-
2. Server merges with private ground-truth
|
| 40 |
-
3. Runs evaluation
|
| 41 |
-
4. Adds to leaderboard
|
| 42 |
-
|
| 43 |
-
## Format 2: Merged (Internal/Testing)
|
| 44 |
-
|
| 45 |
-
**Recommended for**: Internal testing, debugging
|
| 46 |
-
|
| 47 |
-
**Structure**:
|
| 48 |
-
```json
|
| 49 |
-
{
|
| 50 |
-
"0": {
|
| 51 |
-
"metadata": {
|
| 52 |
-
"video_id": "kcOqlifSukA",
|
| 53 |
-
"fps": "1.0",
|
| 54 |
-
"input_video_start_frame": "22425",
|
| 55 |
-
"input_video_end_frame": "25124"
|
| 56 |
-
},
|
| 57 |
-
"qa_type": "tal",
|
| 58 |
-
"struc_info": [
|
| 59 |
-
{
|
| 60 |
-
"action": "cutting",
|
| 61 |
-
"spans": [{"start": 0.0, "end": 10.0}]
|
| 62 |
-
}
|
| 63 |
-
],
|
| 64 |
-
"question": "When does cutting happen?",
|
| 65 |
-
"gnd": "0.0-10.0 seconds.",
|
| 66 |
-
"answer": "0.0-10.0 seconds.",
|
| 67 |
-
"data_source": "AVOS"
|
| 68 |
-
}
|
| 69 |
-
}
|
| 70 |
-
```
|
| 71 |
-
|
| 72 |
-
**Required fields**:
|
| 73 |
-
- `metadata`: Video metadata
|
| 74 |
-
- `qa_type`: Task type
|
| 75 |
-
- `struc_info`: Structured ground-truth
|
| 76 |
-
- `question`: Question text
|
| 77 |
-
- `gnd`: Ground-truth answer
|
| 78 |
-
- `answer`: Model prediction
|
| 79 |
-
- `data_source`: Dataset name
|
| 80 |
-
|
| 81 |
-
**What happens**:
|
| 82 |
-
1. Server validates format
|
| 83 |
-
2. Skips ground-truth merging (already has it)
|
| 84 |
-
3. Runs evaluation directly
|
| 85 |
-
4. Adds to leaderboard
|
| 86 |
-
|
| 87 |
-
## How It Works
|
| 88 |
-
|
| 89 |
-
### Validation (`app.py::validate_results_file`)
|
| 90 |
-
|
| 91 |
-
The validator auto-detects format by checking fields:
|
| 92 |
-
|
| 93 |
-
```python
|
| 94 |
-
# Format 1: Prediction-only
|
| 95 |
-
is_prediction_only = "id" in sample and "prediction" in sample
|
| 96 |
-
|
| 97 |
-
# Format 2: Merged
|
| 98 |
-
is_merged = "metadata" in sample and "question" in sample and "answer" in sample
|
| 99 |
-
```
|
| 100 |
-
|
| 101 |
-
Both formats pass validation if they have:
|
| 102 |
-
- Valid structure
|
| 103 |
-
- Required fields
|
| 104 |
-
- β₯5000 samples
|
| 105 |
-
- Valid qa_types
|
| 106 |
-
|
| 107 |
-
### Evaluation (`app.py::run_evaluation`)
|
| 108 |
-
|
| 109 |
-
Uses `evaluation/evaluate_predictions.py` wrapper which:
|
| 110 |
-
|
| 111 |
-
1. **Auto-detects format**:
|
| 112 |
-
- Checks for `id` + `prediction` β Prediction-only
|
| 113 |
-
- Checks for `question` + `gnd` + `struc_info` β Merged
|
| 114 |
-
|
| 115 |
-
2. **Handles accordingly**:
|
| 116 |
-
- Prediction-only β Merge with ground-truth first
|
| 117 |
-
- Merged β Use directly
|
| 118 |
-
|
| 119 |
-
3. **Runs evaluation**: Calls `evaluate_all_pai.py`
|
| 120 |
-
|
| 121 |
-
4. **Returns metrics**: 10 metrics across 8 tasks
|
| 122 |
-
|
| 123 |
-
## Examples
|
| 124 |
-
|
| 125 |
-
### Example 1: User Submits Predictions
|
| 126 |
-
|
| 127 |
-
```bash
|
| 128 |
-
# User downloads test set from HuggingFace
|
| 129 |
-
# User runs inference on their model
|
| 130 |
-
# User formats predictions as prediction-only JSON
|
| 131 |
-
# User uploads to leaderboard
|
| 132 |
-
|
| 133 |
-
# Result: Server merges with private GT β evaluates β adds to board
|
| 134 |
-
```
|
| 135 |
-
|
| 136 |
-
### Example 2: Internal Testing with Merged File
|
| 137 |
-
|
| 138 |
-
```bash
|
| 139 |
-
# Developer has complete results.json (with GT)
|
| 140 |
-
# Developer uploads to leaderboard for testing
|
| 141 |
-
|
| 142 |
-
# Result: Server detects merged format β skips merging β evaluates β adds to board
|
| 143 |
-
```
|
| 144 |
-
|
| 145 |
-
## File Size Requirements
|
| 146 |
-
|
| 147 |
-
- **Minimum samples**: 5,000
|
| 148 |
-
- **Full test set**: 6,245 samples
|
| 149 |
-
- Files with <5,000 samples are rejected
|
| 150 |
-
|
| 151 |
-
## Valid QA Types
|
| 152 |
-
|
| 153 |
-
- `tal` - Temporal Action Localization
|
| 154 |
-
- `stg` - Spatiotemporal Grounding
|
| 155 |
-
- `next_action` - Next Action Prediction
|
| 156 |
-
- `dense_captioning` / `dense_captioning_gpt` / `dense_captioning_gemini`
|
| 157 |
-
- `video_summary` / `video_summary_gpt` / `video_summary_gemini`
|
| 158 |
-
- `region_caption` / `region_caption_gpt` / `region_caption_gemini`
|
| 159 |
-
- `skill_assessment` - Skill Assessment
|
| 160 |
-
- `cvs_assessment` - CVS Assessment
|
| 161 |
-
|
| 162 |
-
## Testing
|
| 163 |
-
|
| 164 |
-
### Test with Prediction-Only Format
|
| 165 |
-
|
| 166 |
-
```bash
|
| 167 |
-
# Create sample predictions
|
| 168 |
-
python -c "
|
| 169 |
-
import json
|
| 170 |
-
with open('data/sample_predictions.json') as f:
|
| 171 |
-
data = json.load(f)
|
| 172 |
-
print(f'Format: prediction-only')
|
| 173 |
-
print(f'Samples: {len(data)}')
|
| 174 |
-
print(f'Fields: {list(data[0].keys())}')
|
| 175 |
-
"
|
| 176 |
-
|
| 177 |
-
# Upload to leaderboard (web interface)
|
| 178 |
-
# Should show: "β Valid predictions file (prediction-only format) with 100 samples"
|
| 179 |
-
```
|
| 180 |
-
|
| 181 |
-
### Test with Merged Format
|
| 182 |
-
|
| 183 |
-
```bash
|
| 184 |
-
# Check merged format
|
| 185 |
-
python -c "
|
| 186 |
-
import json
|
| 187 |
-
with open('data/results.json') as f:
|
| 188 |
-
data = json.load(f)
|
| 189 |
-
records = list(data.values())
|
| 190 |
-
print(f'Format: merged')
|
| 191 |
-
print(f'Samples: {len(records)}')
|
| 192 |
-
print(f'Fields: {list(records[0].keys())}')
|
| 193 |
-
"
|
| 194 |
-
|
| 195 |
-
# Upload to leaderboard (web interface)
|
| 196 |
-
# Should show: "β Valid predictions file (merged format) with 6245 samples"
|
| 197 |
-
```
|
| 198 |
-
|
| 199 |
-
## Error Messages
|
| 200 |
-
|
| 201 |
-
| Error | Cause | Solution |
|
| 202 |
-
|-------|-------|----------|
|
| 203 |
-
| Missing required field: 'id' | Wrong format | Check if using merged format, should pass now |
|
| 204 |
-
| Missing required field: 'prediction' | Wrong format | Ensure prediction-only has 'prediction' field |
|
| 205 |
-
| Invalid format: Must be either... | Unrecognized structure | Check file structure matches one of two formats |
|
| 206 |
-
| Too few samples (X) | Incomplete file | Should have ~6245 samples for full test set |
|
| 207 |
-
| Invalid qa_type | Wrong task name | Use valid qa_types listed above |
|
| 208 |
-
|
| 209 |
-
## Implementation Files
|
| 210 |
-
|
| 211 |
-
- `app.py::validate_results_file()` - Format detection and validation
|
| 212 |
-
- `app.py::run_evaluation()` - Uses wrapper for evaluation
|
| 213 |
-
- `evaluation/evaluate_predictions.py` - Auto-detection wrapper
|
| 214 |
-
- `evaluation/evaluate_all_pai.py` - Core evaluation engine
|
| 215 |
-
|
| 216 |
-
## Security Notes
|
| 217 |
-
|
| 218 |
-
- Ground-truth data stored privately in `data/ground_truth.json`
|
| 219 |
-
- Never exposed to users
|
| 220 |
-
- Server-side merging ensures GT integrity
|
| 221 |
-
- Users only submit predictions
|
| 222 |
-
|
| 223 |
-
## Updates (2026-01-13)
|
| 224 |
-
|
| 225 |
-
- β
Added support for merged format in leaderboard
|
| 226 |
-
- β
Auto-detection for both formats
|
| 227 |
-
- β
Unified validation and evaluation
|
| 228 |
-
- β
Both formats now accepted on web interface
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
README.md
CHANGED
|
@@ -69,7 +69,9 @@ Run your model on the MedVidBench test set (6,245 samples) to generate predictio
|
|
| 69 |
|
| 70 |
### 2. Expected Results Format
|
| 71 |
|
| 72 |
-
|
|
|
|
|
|
|
| 73 |
|
| 74 |
```json
|
| 75 |
[
|
|
@@ -90,6 +92,41 @@ Your results file should be a JSON with this structure:
|
|
| 90 |
]
|
| 91 |
```
|
| 92 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
**Valid qa_types**:
|
| 94 |
- `tal` - Temporal Action Localization
|
| 95 |
- `stg` - Spatiotemporal Grounding
|
|
|
|
| 69 |
|
| 70 |
### 2. Expected Results Format
|
| 71 |
|
| 72 |
+
The leaderboard supports **two formats** for submission:
|
| 73 |
+
|
| 74 |
+
#### Format 1: Full Format (with Ground Truth)
|
| 75 |
|
| 76 |
```json
|
| 77 |
[
|
|
|
|
| 92 |
]
|
| 93 |
```
|
| 94 |
|
| 95 |
+
#### Format 2: Prediction-Only Format
|
| 96 |
+
|
| 97 |
+
```json
|
| 98 |
+
[
|
| 99 |
+
{
|
| 100 |
+
"id": "video_id&&start_frame&&end_frame&&fps",
|
| 101 |
+
"qa_type": "tal",
|
| 102 |
+
"prediction": "Your model's answer"
|
| 103 |
+
},
|
| 104 |
+
...
|
| 105 |
+
]
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
**Example**:
|
| 109 |
+
```json
|
| 110 |
+
[
|
| 111 |
+
{
|
| 112 |
+
"id": "kcOqlifSukA&&22425&&25124&&1.0",
|
| 113 |
+
"qa_type": "tal",
|
| 114 |
+
"prediction": "22.0-78.0, 89.0-94.0 seconds."
|
| 115 |
+
},
|
| 116 |
+
{
|
| 117 |
+
"id": "VsKw5d-4rq8&&13561&&16184&&1.0",
|
| 118 |
+
"qa_type": "stg",
|
| 119 |
+
"prediction": "[10, 20, 30, 40] 5.0-10.0 seconds."
|
| 120 |
+
}
|
| 121 |
+
]
|
| 122 |
+
```
|
| 123 |
+
|
| 124 |
+
**Key differences**:
|
| 125 |
+
- Format 1: Uses `response` + `ground_truth` fields with full metadata (dictionary format indexed by string keys "0", "1", etc.)
|
| 126 |
+
- Format 2: Uses `id` + `prediction` fields only (list format, GT merged automatically by **index position**)
|
| 127 |
+
- The `id` field format: `{video_id}&&{start_frame}&&{end_frame}&&{fps}` is included for reference but **matching is done by array index**
|
| 128 |
+
- **Important**: Predictions in Format 2 must be in the same order as the test set
|
| 129 |
+
|
| 130 |
**Valid qa_types**:
|
| 131 |
- `tal` - Temporal Action Localization
|
| 132 |
- `stg` - Spatiotemporal Grounding
|