============================================================
  OPEN-ENDED EVAL: Rule-based + DeepSeek-V3 Judge
  Results dir: /workspace/rl4phyx/RL4Phyx/SFT/sft_eval_footprint
============================================================

Testing DeepSeek API...
  API OK: OK

============================================================
  Scoring: Base (Qwen2.5-VL-3B) (1533 samples)
============================================================
  [50/1533] acc=14.0% (rule=7, deepseek=0, uncertain_so_far=16)
  [100/1533] acc=17.0% (rule=12, deepseek=5, uncertain_so_far=41)
  [150/1533] acc=13.3% (rule=15, deepseek=5, uncertain_so_far=67)
  [200/1533] acc=11.0% (rule=16, deepseek=6, uncertain_so_far=101)
  [250/1533] acc=10.4% (rule=19, deepseek=7, uncertain_so_far=117)
  [300/1533] acc=11.0% (rule=22, deepseek=11, uncertain_so_far=150)
  [350/1533] acc=10.0% (rule=23, deepseek=12, uncertain_so_far=184)
  [400/1533] acc=10.2% (rule=27, deepseek=14, uncertain_so_far=220)
  [450/1533] acc=10.9% (rule=31, deepseek=18, uncertain_so_far=248)
  [500/1533] acc=12.4% (rule=42, deepseek=20, uncertain_so_far=264)
  [550/1533] acc=12.4% (rule=46, deepseek=22, uncertain_so_far=288)
  [600/1533] acc=11.8% (rule=48, deepseek=23, uncertain_so_far=314)
  [650/1533] acc=11.8% (rule=53, deepseek=24, uncertain_so_far=336)
  [700/1533] acc=13.0% (rule=62, deepseek=29, uncertain_so_far=358)
  [750/1533] acc=13.1% (rule=65, deepseek=33, uncertain_so_far=388)
  [800/1533] acc=13.0% (rule=68, deepseek=36, uncertain_so_far=415)
  [850/1533] acc=12.6% (rule=71, deepseek=36, uncertain_so_far=436)
  [900/1533] acc=12.3% (rule=73, deepseek=38, uncertain_so_far=462)