Spaces: UIIAmerica/MedVidBench-Leaderboard

MedVidBench-Leaderboard / evaluation

Commit History

update name

9aa418f

MedGRPO Team committed on Jan 14

update

0562da7

MedGRPO Team committed on Jan 14

improve the prompt

758710c

MedGRPO Team committed on Jan 14

update

4752404

MedGRPO Team committed on Jan 14

upload prediction only

b28cd8f

MedGRPO Team committed on Jan 14

Fix evaluate_all_pai to pass --skip-llm-judge to task main() functions

18339c0

MedGRPO Team committed on Jan 13

Fix eval_dvc.py main() to support --skip-llm-judge flag

5f41159

MedGRPO Team committed on Jan 13

Fix DVC evaluation to compute temporal F1 when --skip-llm-judge is set

dd1b9c6

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 13

Add semantic similarity matching for Next Action evaluation

a66b9a4

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 13

Fix TAL overall metrics computation and extraction

c8f4cad

MedGRPO Team committed on Jan 13

Use STG overall metrics instead of placeholder

d1337a3

MedGRPO Team committed on Jan 13

Fix STG evaluation to extract bbox_dict from struc_info

0b29eca

MedGRPO Team committed on Jan 13

Add placeholder for STG metrics when evaluation returns empty

6d00cf8

MedGRPO Team committed on Jan 13

Fix STG and next_action metric printing in overall mode

049c07c

MedGRPO Team committed on Jan 13

Show evaluation metrics even in silent mode (--grouping overall)

2323031

MedGRPO Team committed on Jan 13

Filter DVC/VS/RC from tasks list when skip-llm-judge is set

2bd924c

MedGRPO Team committed on Jan 13

Remove debug messages - system working correctly

80e3e7d

MedGRPO Team committed on Jan 13

Add debug inside loop to verify entry

d53d0f7

MedGRPO Team committed on Jan 13

Add debug at line 741

d878ea1

MedGRPO Team committed on Jan 13

Add debug before task list print

c886e23

MedGRPO Team committed on Jan 13

Debug before/after imports

4e448a4

MedGRPO Team committed on Jan 13

Add debug at function entry and after analyze

c736ac8

MedGRPO Team committed on Jan 13

Add explicit flush debug

1c45d6c

MedGRPO Team committed on Jan 13

update

f0846a5

MedGRPO Team committed on Jan 13

update

1af117c

MedGRPO Team committed on Jan 13

update

a605ebb

MedGRPO Team committed on Jan 13

Implement secure ground truth with prediction-only submission format

fe743f5

MedGRPO Team committed on Jan 8

Add support for pre-computed LLM judge scores

31817d3

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7

Update evaluation metrics and leaderboard display

a36b7fe

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7

Complete evaluator fixes for all 8 tasks

331979f

MedGRPO Team committed on Jan 7

Fix syntax errors and add TAL wrapper functions

8c805bc

MedGRPO Team committed on Jan 7

Fix evaluate_all_pai.py to use eval_caption_llm_judge

58dd6d7

MedGRPO Team committed on Jan 7

Consolidate evaluation scripts and remove hardcoded paths

3ea8a3a

MedGRPO Team committed on Jan 7

Remove all captioning_metrics dependencies

5a5d9ce

MedGRPO Team committed on Jan 7

Remove unused captioning_metrics folder

a09374a

MedGRPO Team committed on Jan 7

Add server-side LLM judge for caption evaluation

da2e674

MedGRPO Team committed on Jan 7

Make evaluation scripts fully self-contained

82f81ab

MedGRPO Team committed on Jan 7

Optimize evaluation scripts - remove optional files

f986294

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7

Copy evaluation scripts to leaderboard and clean up template code

ba8d0d4

MedGRPO Team and Claude Sonnet 4.5 committed on Jan 7
