| ============================================================ |
| OPEN-ENDED EVAL: Rule-based + DeepSeek-V3 Judge |
| Results dir: /workspace/rl4phyx/RL4Phyx/SFT/sft_eval_footprint |
| ============================================================ |
|
|
| Testing DeepSeek API... |
| API OK: OK |
|
|
| ============================================================ |
| Scoring: Base (Qwen2.5-VL-3B) (1533 samples) |
| ============================================================ |
| [50/1533] acc=14.0% (rule=7, deepseek=0, uncertain_so_far=16) |
| [100/1533] acc=17.0% (rule=12, deepseek=5, uncertain_so_far=41) |
| [150/1533] acc=13.3% (rule=15, deepseek=5, uncertain_so_far=67) |
| [200/1533] acc=11.0% (rule=16, deepseek=6, uncertain_so_far=101) |
| [250/1533] acc=10.4% (rule=19, deepseek=7, uncertain_so_far=117) |
| [300/1533] acc=11.0% (rule=22, deepseek=11, uncertain_so_far=150) |
| [350/1533] acc=10.0% (rule=23, deepseek=12, uncertain_so_far=184) |
| [400/1533] acc=10.2% (rule=27, deepseek=14, uncertain_so_far=220) |
| [450/1533] acc=10.9% (rule=31, deepseek=18, uncertain_so_far=248) |
| [500/1533] acc=12.4% (rule=42, deepseek=20, uncertain_so_far=264) |
| [550/1533] acc=12.4% (rule=46, deepseek=22, uncertain_so_far=288) |
| [600/1533] acc=11.8% (rule=48, deepseek=23, uncertain_so_far=314) |
| [650/1533] acc=11.8% (rule=53, deepseek=24, uncertain_so_far=336) |
| [700/1533] acc=13.0% (rule=62, deepseek=29, uncertain_so_far=358) |
| [750/1533] acc=13.1% (rule=65, deepseek=33, uncertain_so_far=388) |
| [800/1533] acc=13.0% (rule=68, deepseek=36, uncertain_so_far=415) |
| [850/1533] acc=12.6% (rule=71, deepseek=36, uncertain_so_far=436) |
| [900/1533] acc=12.3% (rule=73, deepseek=38, uncertain_so_far=462) |
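
Note on reading the progress lines above: in every line shown, the printed `acc` equals `(rule + deepseek) / samples_done`, for example (12 + 5) / 100 = 17.0% at the second checkpoint. This suggests that `rule` and `deepseek` are running counts of answers accepted by the rule-based check and by the DeepSeek-V3 judge. That reading is inferred from the numbers alone, not from the evaluation script, and the log does not say how `uncertain_so_far` is used. Below is a minimal sketch for parsing such lines and re-checking the arithmetic. The names `parse_progress`, `recomputed_acc`, and `is_consistent` are hypothetical helpers, not part of the original tooling.

```python
import re

# Hypothetical helper: matches progress lines in the format printed above,
# e.g. "[100/1533] acc=17.0% (rule=12, deepseek=5, uncertain_so_far=41)".
PROGRESS = re.compile(
    r"\[(?P<done>\d+)/(?P<total>\d+)\]\s+acc=(?P<acc>[\d.]+)%\s+"
    r"\(rule=(?P<rule>\d+), deepseek=(?P<deepseek>\d+), "
    r"uncertain_so_far=(?P<uncertain>\d+)\)"
)


def parse_progress(line):
    """Return the fields of one progress line as a dict, or None if no match."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    rec = {k: int(v) for k, v in m.groupdict().items() if k != "acc"}
    rec["acc"] = float(m.group("acc"))
    return rec


def recomputed_acc(rec):
    """Accuracy in percent, ASSUMING acc = (rule + deepseek) / done."""
    return 100.0 * (rec["rule"] + rec["deepseek"]) / rec["done"]


def is_consistent(rec, tol=0.051):
    """True if the logged acc matches the recomputed value within rounding."""
    return abs(recomputed_acc(rec) - rec["acc"]) <= tol
```

Applied to each line of a saved log, `is_consistent(parse_progress(line))` flags any checkpoint whose printed accuracy disagrees with its counts. The tolerance allows for the one-decimal rounding in the log.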
|
|