============================================================
OPEN-ENDED EVAL: Rule-based + DeepSeek-V3 Judge
Results dir: /workspace/rl4phyx/RL4Phyx/SFT/sft_eval_footprint
============================================================
Testing DeepSeek API...
API OK: OK
============================================================
Scoring: Base (Qwen2.5-VL-3B) (1533 samples)
============================================================
[50/1533] acc=14.0% (rule=7, deepseek=0, uncertain_so_far=16)
[100/1533] acc=17.0% (rule=12, deepseek=5, uncertain_so_far=41)
[150/1533] acc=13.3% (rule=15, deepseek=5, uncertain_so_far=67)
[200/1533] acc=11.0% (rule=16, deepseek=6, uncertain_so_far=101)
[250/1533] acc=10.4% (rule=19, deepseek=7, uncertain_so_far=117)
[300/1533] acc=11.0% (rule=22, deepseek=11, uncertain_so_far=150)
[350/1533] acc=10.0% (rule=23, deepseek=12, uncertain_so_far=184)
[400/1533] acc=10.2% (rule=27, deepseek=14, uncertain_so_far=220)
[450/1533] acc=10.9% (rule=31, deepseek=18, uncertain_so_far=248)
[500/1533] acc=12.4% (rule=42, deepseek=20, uncertain_so_far=264)
[550/1533] acc=12.4% (rule=46, deepseek=22, uncertain_so_far=288)
[600/1533] acc=11.8% (rule=48, deepseek=23, uncertain_so_far=314)
[650/1533] acc=11.8% (rule=53, deepseek=24, uncertain_so_far=336)
[700/1533] acc=13.0% (rule=62, deepseek=29, uncertain_so_far=358)
[750/1533] acc=13.1% (rule=65, deepseek=33, uncertain_so_far=388)
[800/1533] acc=13.0% (rule=68, deepseek=36, uncertain_so_far=415)
[850/1533] acc=12.6% (rule=71, deepseek=36, uncertain_so_far=436)
[900/1533] acc=12.3% (rule=73, deepseek=38, uncertain_so_far=462)