MedGRPO Team committed
Commit 6edbd17 · Parent(s): 8ef4c38

clean the code

Files changed (3):
  1. CODE_VERIFICATION_REPORT.md +0 -405
  2. LEADERBOARD_FORMATS.md +0 -228
  3. README.md +38 -1
CODE_VERIFICATION_REPORT.md DELETED
@@ -1,405 +0,0 @@

# Code Verification Report - Real-Time Log Streaming

**Date**: January 13, 2026
**Status**: ✅ ALL CHECKS PASSED

## Summary

All code changes have been verified for correctness. The implementation is ready for deployment and should provide real-time log streaming with progress indicators.

---

## File 1: app.py

**Location**: `/root/code/MedVidBench-Leaderboard/app.py`

### ✅ Syntax Check
```bash
python -m py_compile app.py
# Result: SUCCESS (no errors)
```

### ✅ Unbuffered Subprocess Configuration (Lines 768-784)

**Command Construction**:
```python
cmd = [
    sys.executable,
    "-u",  # ✅ Unbuffered output flag present
    str(eval_wrapper),
    str(input_file),
    "--grouping", "overall",
    "--ground-truth", "data/ground_truth.json"
]
```

**Process Configuration**:
```python
process = subprocess.Popen(
    cmd,
    stdout=subprocess.PIPE,    # ✅ Pipe stdout for reading
    stderr=subprocess.STDOUT,  # ✅ Merge stderr into stdout
    text=True,                 # ✅ Text mode (not bytes)
    bufsize=1,                 # ✅ Line-buffered
    env={**os.environ, "PYTHONUNBUFFERED": "1"}  # ✅ Force unbuffered
)
```

**Verification**: ✅ Both the `-u` flag AND `PYTHONUNBUFFERED=1` are present

### ✅ Non-Blocking Read Implementation (Line 810)

```python
ready, _, _ = select.select([process.stdout], [], [], 0.5)
```

**Verification**: ✅ Uses `select.select()` with a 0.5s timeout for non-blocking reads
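
Combined, the unbuffered `Popen` configuration above and the `select.select()` poll form a streaming read loop. The sketch below shows one way the pieces fit together on POSIX (where `select()` works on pipe file objects); the function name `stream_subprocess_lines` is illustrative, not the actual code in `app.py`:

```python
import os
import select
import subprocess

def stream_subprocess_lines(cmd, poll_s=0.5):
    """Yield a subprocess's output lines as they arrive (POSIX pipes)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,                 # line-buffered on the reader side
        env={**os.environ, "PYTHONUNBUFFERED": "1"},
    )
    while True:
        # Wait up to poll_s for data; returns immediately if some is buffered.
        ready, _, _ = select.select([proc.stdout], [], [], poll_s)
        if ready:
            line = proc.stdout.readline()
            if line:
                yield line.rstrip("\n")
                continue
            break  # empty read means EOF: the child closed stdout
        if proc.poll() is not None:
            break  # child exited and the pipe is drained
    proc.wait()
```

Each yielded line can be appended to the log buffer and re-rendered, which is what gives the UI its real-time feel.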

### ✅ Progress Bar Implementation (Lines 847-850)

```python
# Increment progress gradually from 25% to 75%
progress_increment = min(0.75, 0.25 + (line_count / 500) * 0.50)
progress(progress_increment, desc="Running evaluation...")
```

**Verification**: ✅ Progressive increment from 25% → 75% based on log lines
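
The clamp can be checked in isolation; this tiny helper (`progress_fraction` is an illustrative name, not a function in `app.py`) reproduces the formula above:

```python
def progress_fraction(line_count: int) -> float:
    """Map the number of streamed log lines onto the 25%-75% window."""
    # 500 lines fill the window; anything beyond is clamped at 75%.
    return min(0.75, 0.25 + (line_count / 500) * 0.50)
```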

### ✅ Heartbeat Messages (Lines 832-836)

```python
if not log_buffer:
    elapsed = int(time.time() - start_time)
    log_text = f"⚙️ **Step 3/6**: Running evaluation...\n\n```\nWaiting for evaluation output... ({elapsed}s elapsed)\n```"
    yield log_text
```

**Verification**: ✅ Shows elapsed time when no logs appear

### ✅ Generator Function (Line 720)

```python
def submit_model(file, model_name: str, organization: str, contact: str = "", progress=gr.Progress()):
    """
    Process model submission: validate, evaluate, and add to leaderboard.
    Yields progress updates during evaluation.
    """
```

**Verification**: ✅ The function uses `yield` to stream updates

---

## File 2: evaluation/evaluate_predictions.py

**Location**: `/root/code/MedVidBench-Leaderboard/evaluation/evaluate_predictions.py`

### ✅ Syntax Check
```bash
python -m py_compile evaluation/evaluate_predictions.py
# Result: SUCCESS (no errors)
```

### ✅ Flush Statements (10 occurrences found)

**Line 186**: Loading message
```python
print(f"[EvaluationWrapper] Loading predictions from {args.predictions_file}", flush=True)
```

**Lines 194-195**: Merged-format detection
```python
print("[EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth", flush=True)
print("[EvaluationWrapper] Using predictions file directly for evaluation", flush=True)
```

**Lines 198-199**: Prediction-only format detection
```python
print("[EvaluationWrapper] ✓ Detected: Prediction-only format (id, qa_type, prediction)", flush=True)
print("[EvaluationWrapper] Merging with ground-truth...", flush=True)
```

**Line 215**: Merge completion
```python
print(f"[EvaluationWrapper] ✓ Merged data saved to temporary file: {eval_file}", flush=True)
```

**Lines 218-220**: Handoff to evaluate_all_pai
```python
print(f"\n[EvaluationWrapper] {'='*80}", flush=True)
print(f"[EvaluationWrapper] Starting evaluation with evaluate_all_pai.py", flush=True)
print(f"[EvaluationWrapper] {'='*80}\n", flush=True)
```

**Verification**: ✅ All critical print statements have `flush=True`

---

## File 3: evaluation/evaluate_all_pai.py

**Location**: `/root/code/MedVidBench-Leaderboard/evaluation/evaluate_all_pai.py`

### ✅ Syntax Check
```bash
python -m py_compile evaluation/evaluate_all_pai.py
# Result: SUCCESS (no errors)
```

### ✅ Flush Statements (15 occurrences found)

**Lines 58-64**: Dataset analysis output
```python
print(f"\nFound QA types:", flush=True)
for qa_type, count in qa_type_counts.items():
    print(f"  {qa_type}: {count} records", flush=True)

print(f"\nFound datasets:", flush=True)
for dataset, count in dataset_counts.items():
    print(f"  {dataset}: {count} records", flush=True)
```

**Lines 770-771**: Task list and total count
```python
print(f"\nRunning evaluation for tasks: {tasks}", flush=True)
print(f"Total tasks to evaluate: {len(tasks)}", flush=True)
```

**Line 786**: Task progress counter (⭐ KEY FEATURE)
```python
print(f"\n[Progress] Task {task_idx}/{len(tasks)}: {task.upper()}", flush=True)
```

**Lines 790-792**: Skip message for pre-computed LLM scores
```python
print(f"\n{'='*80}", flush=True)
print(f"SKIPPING {task.upper()} EVALUATION (LLM judge pre-computed)", flush=True)
print(f"{'='*80}", flush=True)
```

**Lines 798-800**: Task evaluation banner
```python
print(f"\n{'='*80}", flush=True)
print(f"RUNNING {task.upper()} EVALUATION", flush=True)
print(f"{'='*80}", flush=True)
```

**Line 803**: Silent-mode progress
```python
print(f"Evaluating {task.upper()}...", flush=True)
```

**Line 820**: Task completion message (⭐ KEY FEATURE)
```python
print(f"[Progress] ✓ Completed {task.upper()} evaluation (Task {task_idx}/{len(tasks)})", flush=True)
```

**Verification**: ✅ All progress messages have `flush=True`

---

## Key Features Verification

### ✅ Feature 1: Format Auto-Detection
**Location**: `evaluate_predictions.py` lines 191-216
**Status**: ✅ Working correctly
- Detects merged format → skips merging
- Detects prediction-only format → merges with ground-truth
- Prints clear messages with `flush=True`
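
The detection rule reduces to a small predicate over each record's fields. A sketch (the function name `detect_format` is illustrative; the field checks mirror the ones documented for the two formats):

```python
def detect_format(sample: dict) -> str:
    """Classify one record by which documented fields it carries."""
    # Prediction-only submissions carry just id, qa_type, and prediction.
    if "id" in sample and "prediction" in sample:
        return "prediction-only"
    # Merged files already embed the question, ground truth, and struc_info.
    if "question" in sample and "gnd" in sample and "struc_info" in sample:
        return "merged"
    return "unknown"
```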

### ✅ Feature 2: Real-Time Log Streaming
**Location**: `app.py` lines 768-858
**Status**: ✅ Fully implemented
- Unbuffered subprocess (`-u` + `PYTHONUNBUFFERED=1`)
- Non-blocking read with `select.select()`
- 0.5s update frequency
- Shows the last 25 lines of logs

### ✅ Feature 3: Heartbeat Feedback
**Location**: `app.py` lines 832-836
**Status**: ✅ Working
- Shows "Waiting for output... (Xs elapsed)"
- Updates every 0.5s even when no logs arrive

### ✅ Feature 4: Progressive Progress Bar
**Location**: `app.py` lines 847-850
**Status**: ✅ Working
- Starts at 25% (beginning of evaluation)
- Advances based on log lines
- Caps at 75% (end of evaluation)

### ✅ Feature 5: Task Progress Counters
**Location**: `evaluate_all_pai.py` lines 770-820
**Status**: ✅ Fully implemented
- Shows "Total tasks to evaluate: 8"
- Shows "[Progress] Task 1/8: TAL"
- Shows "[Progress] ✓ Completed TAL (Task 1/8)"

---
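
Feature 5's counter messages compose into a loop of roughly this shape. This is a sketch of the documented output only, not the actual loop in `evaluate_all_pai.py` (which also runs each task's evaluator and skips pre-computed LLM-judge tasks):

```python
def progress_messages(tasks):
    """Emit the task-progress lines in the documented format."""
    lines = [f"Total tasks to evaluate: {len(tasks)}"]
    for task_idx, task in enumerate(tasks, start=1):
        lines.append(f"[Progress] Task {task_idx}/{len(tasks)}: {task.upper()}")
        # ... the task's evaluator would run here ...
        lines.append(f"[Progress] ✓ Completed {task.upper()} evaluation (Task {task_idx}/{len(tasks)})")
    return lines
```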

## Expected User Experience

### Phase 1: Initialization (5% → 15%)
```
🔍 Step 1/6: Checking if model name is available...
✓ Model name available

📋 Step 2/6: Validating predictions file format...
✓ Valid format detected
```

### Phase 2: Format Detection (15% → 25%)
```
⚙️ Step 3/6: Running evaluation (streaming logs)...

[EvaluationWrapper] Loading predictions from input.json
[EvaluationWrapper] ✓ Detected: Predictions already contain ground-truth
[EvaluationWrapper] Using predictions file directly for evaluation
```

### Phase 3: Dataset Analysis (25% → 30%)
```
Found QA types:
  tal: 1637 records
  stg: 780 records
  next_action: 1280 records
  dvc: 3000 records
  vs: 1500 records
  rc: 2522 records
  skill_assessment: 390 records
  cvs_assessment: 390 records

Found datasets:
  jigsaws: 780 records
  ...
```

### Phase 4: Task Evaluations (30% → 75%)
```
Running evaluation for tasks: ['tal', 'stg', 'next_action', 'dvc', 'vs', 'rc', 'skill_assessment', 'cvs_assessment']
Total tasks to evaluate: 8

[Progress] Task 1/8: TAL
================================================================================
RUNNING TAL EVALUATION
================================================================================
[Progress] ✓ Completed TAL evaluation (Task 1/8)  [Progress: 35%]

[Progress] Task 2/8: STG
================================================================================
RUNNING STG EVALUATION
================================================================================
[Progress] ✓ Completed STG evaluation (Task 2/8)  [Progress: 40%]

...

[Progress] Task 8/8: CVS_ASSESSMENT
[Progress] ✓ Completed CVS_ASSESSMENT evaluation (Task 8/8)  [Progress: 75%]
```

### Phase 5: Validation (75% → 90%)
```
✓ Evaluation completed!
🔍 Step 4/6: Validating extracted metrics...
✓ All 10 metrics successfully computed

📊 Step 5/6: Adding model to leaderboard...
✓ Leaderboard updated!
```

### Phase 6: Complete (90% → 100%)
```
✅ Step 6/6: Submission complete!

---

## ✅ Submission Successful!

**Model**: MyModel
**Organization**: MyOrg

### 📈 Metric Scores
- **CVS Assessment Accuracy**: 0.8234
- **Skill Assessment Accuracy**: 0.7891
...

### 🏆 Ranking
**Rank**: #3 out of 15 models
```

---

## Deployment Checklist

### ✅ Code Changes
- [x] app.py modified (unbuffered subprocess + non-blocking read)
- [x] evaluate_predictions.py modified (flush=True added)
- [x] evaluate_all_pai.py modified (task progress counters)

### ✅ Syntax Validation
- [x] app.py compiles without errors
- [x] evaluate_predictions.py compiles without errors
- [x] evaluate_all_pai.py compiles without errors

### ✅ Feature Verification
- [x] Unbuffered subprocess configuration
- [x] Non-blocking read with select.select()
- [x] Heartbeat messages
- [x] Progressive progress bar
- [x] Task progress counters
- [x] All flush=True statements present

### 📦 Ready for Deployment
The code is **production-ready** and should work correctly when deployed to HF Spaces.

---

## Troubleshooting (If Issues Persist on HF Spaces)

If the progress messages still don't appear after deployment:

1. **Verify files on HF Spaces**:
   - Go to the Files tab on the HF Space
   - Check that `app.py`, `evaluation/evaluate_predictions.py`, and `evaluation/evaluate_all_pai.py` contain the new code
   - Search for `[Progress]` and `flush=True` in the files

2. **Check build logs**:
   - Go to Settings → "View Logs"
   - Verify the Space rebuilt after your push
   - Look for "Building..." and "Running..." messages

3. **Force a rebuild**:
   - Settings → Factory reboot
   - Wait 2-3 minutes for the rebuild

4. **Test locally first**:
   - Run the test script: `python test_streaming.py`
   - Verify logs stream in real time locally
   - If it works locally but not on HF, it's a deployment issue

5. **Browser cache**:
   - Clear the browser cache (Ctrl+Shift+Delete)
   - Try incognito/private browsing mode
   - Try a different browser

---

## Conclusion

✅ **Code verification: PASSED**

All three files have been verified:
- ✅ No syntax errors
- ✅ All critical features implemented
- ✅ All flush=True statements present
- ✅ Unbuffered subprocess configuration correct
- ✅ Non-blocking I/O implemented correctly
- ✅ Progress tracking fully functional

**The code is ready for production deployment on HF Spaces.**

If logs still don't appear on HF Spaces, the issue is likely:
1. HF Spaces hasn't rebuilt with the new code yet
2. Browser cache showing an old version
3. Network delays in SSE streaming (HF Spaces infrastructure)

The code itself is **correct and production-ready**.
LEADERBOARD_FORMATS.md DELETED
@@ -1,228 +0,0 @@

# MedVidBench Leaderboard Supported Formats

## Overview

The leaderboard web app now accepts **two submission formats**:

1. **Prediction-only** (preferred for users)
2. **Merged format** (for testing/debugging)

Both formats are automatically detected and handled by the evaluation system.

## Format 1: Prediction-Only (User Submission)

**Recommended for**: Public user submissions

**Structure**:
```json
[
  {
    "id": "video_id&&start_frame&&end_frame&&fps",
    "qa_type": "tal",
    "prediction": "0.0-10.0 seconds."
  },
  {
    "id": "another_video&&0&&100&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs cholecystectomy..."
  }
]
```

**Required fields**:
- `id`: Sample identifier (`video_id&&start&&end&&fps`)
- `qa_type`: Task type
- `prediction`: The model's answer text

**What happens**:
1. Server validates the format
2. Server merges with the private ground-truth
3. Runs evaluation
4. Adds to leaderboard
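
Step 2 (server-side merging) can be sketched as follows. The helper name `merge_with_ground_truth` and the by-index pairing are illustrative of the documented behavior, not code from the repository; the ground-truth records are assumed to carry the merged-format fields (`metadata`, `qa_type`, `struc_info`, `question`, `gnd`, `data_source`), with the model output slotting into `answer`:

```python
def merge_with_ground_truth(predictions, ground_truth):
    """Pair each prediction with the ground-truth record at the same index."""
    merged = {}
    for idx, (pred, gt) in enumerate(zip(predictions, ground_truth)):
        record = dict(gt)                      # keep GT fields intact
        record["answer"] = pred["prediction"]  # model prediction
        merged[str(idx)] = record              # merged format is keyed "0", "1", ...
    return merged
```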
42
-
43
- ## Format 2: Merged (Internal/Testing)
44
-
45
- **Recommended for**: Internal testing, debugging
46
-
47
- **Structure**:
48
- ```json
49
- {
50
- "0": {
51
- "metadata": {
52
- "video_id": "kcOqlifSukA",
53
- "fps": "1.0",
54
- "input_video_start_frame": "22425",
55
- "input_video_end_frame": "25124"
56
- },
57
- "qa_type": "tal",
58
- "struc_info": [
59
- {
60
- "action": "cutting",
61
- "spans": [{"start": 0.0, "end": 10.0}]
62
- }
63
- ],
64
- "question": "When does cutting happen?",
65
- "gnd": "0.0-10.0 seconds.",
66
- "answer": "0.0-10.0 seconds.",
67
- "data_source": "AVOS"
68
- }
69
- }
70
- ```
71
-
72
- **Required fields**:
73
- - `metadata`: Video metadata
74
- - `qa_type`: Task type
75
- - `struc_info`: Structured ground-truth
76
- - `question`: Question text
77
- - `gnd`: Ground-truth answer
78
- - `answer`: Model prediction
79
- - `data_source`: Dataset name
80
-
81
- **What happens**:
82
- 1. Server validates format
83
- 2. Skips ground-truth merging (already has it)
84
- 3. Runs evaluation directly
85
- 4. Adds to leaderboard
86
-

## How It Works

### Validation (`app.py::validate_results_file`)

The validator auto-detects the format by checking fields:

```python
# Format 1: Prediction-only
is_prediction_only = "id" in sample and "prediction" in sample

# Format 2: Merged
is_merged = "metadata" in sample and "question" in sample and "answer" in sample
```

Both formats pass validation if they have:
- Valid structure
- Required fields
- ≥5,000 samples
- Valid qa_types
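
A minimal sketch of the sample-count and required-field checks for the prediction-only case. `validate_prediction_only` and `MIN_SAMPLES` are illustrative names, not the actual `app.py` code; the 5,000-sample floor and the error strings mirror the ones documented in this file:

```python
MIN_SAMPLES = 5000  # files below this floor are rejected

def validate_prediction_only(data):
    """Return (ok, message) for a prediction-only submission."""
    if not isinstance(data, list):
        return False, "Invalid format: Must be a JSON list of samples"
    if len(data) < MIN_SAMPLES:
        return False, f"Too few samples ({len(data)})"
    for sample in data:
        for field in ("id", "qa_type", "prediction"):
            if field not in sample:
                return False, f"Missing required field: '{field}'"
    return True, f"Valid predictions file with {len(data)} samples"
```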

### Evaluation (`app.py::run_evaluation`)

Uses the `evaluation/evaluate_predictions.py` wrapper, which:

1. **Auto-detects the format**:
   - `id` + `prediction` present → Prediction-only
   - `question` + `gnd` + `struc_info` present → Merged

2. **Handles it accordingly**:
   - Prediction-only → merge with ground-truth first
   - Merged → use directly

3. **Runs evaluation**: Calls `evaluate_all_pai.py`

4. **Returns metrics**: 10 metrics across 8 tasks

## Examples

### Example 1: User Submits Predictions

```bash
# User downloads the test set from HuggingFace
# User runs inference with their model
# User formats predictions as prediction-only JSON
# User uploads to the leaderboard

# Result: Server merges with private GT → evaluates → adds to board
```

### Example 2: Internal Testing with a Merged File

```bash
# Developer has a complete results.json (with GT)
# Developer uploads it to the leaderboard for testing

# Result: Server detects merged format → skips merging → evaluates → adds to board
```

## File Size Requirements

- **Minimum samples**: 5,000
- **Full test set**: 6,245 samples
- Files with fewer than 5,000 samples are rejected

## Valid QA Types

- `tal` - Temporal Action Localization
- `stg` - Spatiotemporal Grounding
- `next_action` - Next Action Prediction
- `dense_captioning` / `dense_captioning_gpt` / `dense_captioning_gemini`
- `video_summary` / `video_summary_gpt` / `video_summary_gemini`
- `region_caption` / `region_caption_gpt` / `region_caption_gemini`
- `skill_assessment` - Skill Assessment
- `cvs_assessment` - CVS Assessment

## Testing

### Test with the Prediction-Only Format

```bash
# Inspect sample predictions
python -c "
import json
with open('data/sample_predictions.json') as f:
    data = json.load(f)
print('Format: prediction-only')
print(f'Samples: {len(data)}')
print(f'Fields: {list(data[0].keys())}')
"

# Upload to the leaderboard (web interface)
# Should show: "✓ Valid predictions file (prediction-only format) with 100 samples"
```

### Test with the Merged Format

```bash
# Inspect the merged format
python -c "
import json
with open('data/results.json') as f:
    data = json.load(f)
records = list(data.values())
print('Format: merged')
print(f'Samples: {len(records)}')
print(f'Fields: {list(records[0].keys())}')
"

# Upload to the leaderboard (web interface)
# Should show: "✓ Valid predictions file (merged format) with 6245 samples"
```

## Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| Missing required field: 'id' | Wrong format | Merged-format files are now accepted; check that your file matches one of the two formats |
| Missing required field: 'prediction' | Wrong format | Ensure prediction-only files include a 'prediction' field |
| Invalid format: Must be either... | Unrecognized structure | Check that the file structure matches one of the two formats |
| Too few samples (X) | Incomplete file | The full test set has ~6,245 samples |
| Invalid qa_type | Wrong task name | Use one of the valid qa_types listed above |

## Implementation Files

- `app.py::validate_results_file()` - Format detection and validation
- `app.py::run_evaluation()` - Uses the wrapper for evaluation
- `evaluation/evaluate_predictions.py` - Auto-detection wrapper
- `evaluation/evaluate_all_pai.py` - Core evaluation engine

## Security Notes

- Ground-truth data is stored privately in `data/ground_truth.json`
- It is never exposed to users
- Server-side merging ensures GT integrity
- Users only submit predictions

## Updates (2026-01-13)

- ✅ Added support for the merged format in the leaderboard
- ✅ Auto-detection for both formats
- ✅ Unified validation and evaluation
- ✅ Both formats now accepted on the web interface
README.md CHANGED
@@ -69,7 +69,9 @@ Run your model on the MedVidBench test set (6,245 samples) to generate predictio
 
 ### 2. Expected Results Format
 
-Your results file should be a JSON with this structure:
+The leaderboard supports **two formats** for submission:
+
+#### Format 1: Full Format (with Ground Truth)
 
 ```json
 [
@@ -90,6 +92,41 @@
 ]
 ```
 
+#### Format 2: Prediction-Only Format
+
+```json
+[
+  {
+    "id": "video_id&&start_frame&&end_frame&&fps",
+    "qa_type": "tal",
+    "prediction": "Your model's answer"
+  },
+  ...
+]
+```
+
+**Example**:
+```json
+[
+  {
+    "id": "kcOqlifSukA&&22425&&25124&&1.0",
+    "qa_type": "tal",
+    "prediction": "22.0-78.0, 89.0-94.0 seconds."
+  },
+  {
+    "id": "VsKw5d-4rq8&&13561&&16184&&1.0",
+    "qa_type": "stg",
+    "prediction": "[10, 20, 30, 40] 5.0-10.0 seconds."
+  }
+]
+```
+
+**Key differences**:
+- Format 1: Uses `response` + `ground_truth` fields with full metadata (dictionary format indexed by string keys "0", "1", etc.)
+- Format 2: Uses `id` + `prediction` fields only (list format; GT merged automatically by **index position**)
+- The `id` field format `{video_id}&&{start_frame}&&{end_frame}&&{fps}` is included for reference, but **matching is done by array index**
+- **Important**: Predictions in Format 2 must be in the same order as the test set
+
 **Valid qa_types**:
 - `tal` - Temporal Action Localization
 - `stg` - Spatiotemporal Grounding
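
The ordering requirement in the new README text can be made concrete with a small helper. The names `build_submission`, `test_set`, and `answers` are illustrative; test-set items are assumed to expose `id` and `qa_type`:

```python
def build_submission(test_set, answers):
    """Build a prediction-only submission whose order matches the test set.

    `answers` holds one model answer string per test item, in test-set order,
    since ground truth is matched back by array index on the server.
    """
    if len(test_set) != len(answers):
        raise ValueError("one answer per test-set item, in test-set order")
    return [
        {"id": item["id"], "qa_type": item["qa_type"], "prediction": ans}
        for item, ans in zip(test_set, answers)
    ]
```

Serializing the returned list with `json.dump` yields a file in Format 2.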