Seven Setting Eval Tables

Last updated: 2026-04-13 UTC

Notes:

  • pass@1 is taken from accuracy/mean.
  • combined is only defined from pass@4 onward, so pass@1 and pass@2 are left blank.
  • A blank cell means the number is not available yet, or that I left it blank on purpose because the desired eval is still pending.
  • For training runs, I pulled the metric from W&B history at the requested step (for example _step=400 or _step=1000).
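
The notes above do not say how pass@k is computed for k > 1. As a point of reference only, here is a minimal sketch of the widely used unbiased pass@k estimator, assuming each problem has `n` sampled answers of which `c` are correct. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical, and this is not confirmed to be the exact procedure behind the tables.

```python
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the problem, c: how many were correct, k: budget.
    Assumption: this is the standard estimator, not necessarily the one
    used to produce the tables in this document.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # Product form of C(n - c, k) / C(n, k); avoids huge binomials.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With `k = 1` the estimator reduces to `c / n`, the per-problem mean accuracy, which is consistent with taking pass@1 from accuracy/mean.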

Setting 1

Qwen2.5-0.5B-Instruct, GSM8K training for 2000 steps, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | s4bxcc1l | 512 | resume global_step_2000 |
| 4-LoRA Single/Combined | rk9ic9kk | 2048 | resume global_step_2000 |
| MERL Single/Combined | (pending) | 2048 | eval run not finished yet; left blank |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.6264 | 0.6151 |  |  |  |
| 2 | 0.5943 | 0.6898 | 0.6960 |  |  |  |
| 4 | 0.7012 | 0.7442 | 0.7594 | 0.8048 |  |  |
| 8 | 0.7878 | 0.7915 | 0.8118 | 0.8568 |  |  |
| 16 | 0.8560 | 0.8318 | 0.8544 | 0.8963 |  |  |
| 32 | 0.9065 | 0.8663 | 0.8885 | 0.9252 |  |  |
| 64 | 0.9417 | 0.8953 | 0.9157 | 0.9463 |  |  |
| 128 | 0.9651 | 0.9176 | 0.9365 | 0.9626 |  |  |
| 256 | 0.9799 | 0.9350 | 0.9516 | 0.9751 |  |  |
| 512 | 0.9909 | 0.9487 | 0.9622 | 0.9838 |  |  |

Setting 2

Qwen2.5-0.5B-Instruct, GSM8K training for 200 steps, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | xw4w9c0u | 512 | resume global_step_200 |
| 4-LoRA Single/Combined | 2rytl841 | 2048 | resume global_step_200 |
| MERL Single/Combined | 0041qzrm | 2048 | resume global_step_200 |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.5942 | 0.5703 |  | 0.5335 |  |
| 2 | 0.5943 | 0.6842 | 0.6656 |  | 0.6450 |  |
| 4 | 0.7012 | 0.7557 | 0.7438 | 0.7772 | 0.7374 | 0.7584 |
| 8 | 0.7878 | 0.8125 | 0.8069 | 0.8389 | 0.8116 | 0.8308 |
| 16 | 0.8560 | 0.8590 | 0.8572 | 0.8871 | 0.8694 | 0.8861 |
| 32 | 0.9065 | 0.8978 | 0.8969 | 0.9237 | 0.9127 | 0.9266 |
| 64 | 0.9417 | 0.9285 | 0.9271 | 0.9503 | 0.9437 | 0.9544 |
| 128 | 0.9651 | 0.9497 | 0.9491 | 0.9682 | 0.9646 | 0.9723 |
| 256 | 0.9799 | 0.9636 | 0.9647 | 0.9795 | 0.9785 | 0.9841 |
| 512 | 0.9909 | 0.9727 | 0.9754 | 0.9870 | 0.9880 | 0.9920 |

Setting 3

Qwen3-0.6B-Base, MATH training for 400 steps, Math eval.

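| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | 1eidnqtd | 512 | base eval on Math500 |
| Single Avg | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |
| Combined | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2154 |  |  |
| 2 | 0.3370 |  |  |
| 4 | 0.4754 |  |  |
| 8 | 0.6065 |  |  |
| 16 | 0.7143 |  |  |
| 32 | 0.7946 |  |  |
| 64 | 0.8513 |  |  |
| 128 | 0.8916 |  |  |
| 256 | 0.9207 |  |  |
| 512 | 0.9416 |  |  |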

Setting 4

Qwen2.5-0.5B-Instruct, MATH training for 400 steps, Math eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | ub2ua0fb | 512 | base eval on Math500 |
| Single Avg | bfgx3ra4 | 2048 | resume global_step_400 |
| Combined | bfgx3ra4 | 2048 | resume global_step_400 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.3081 | 0.3568 |  |
| 2 | 0.4144 | 0.4484 |  |
| 4 | 0.5162 | 0.5351 | 0.5514 |
| 8 | 0.6078 | 0.6140 | 0.6305 |
| 16 | 0.6890 | 0.6847 | 0.7014 |
| 32 | 0.7598 | 0.7463 | 0.7634 |
| 64 | 0.8180 | 0.7977 | 0.8141 |
| 128 | 0.8627 | 0.8398 | 0.8549 |
| 256 | 0.8956 | 0.8750 | 0.8883 |
| 512 | 0.9195 | 0.9054 | 0.9147 |

Setting 5

SmolLM2-360M-Instruct, GSM8K training for 1000 steps, GSM8K eval.

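| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) |  | no standalone base eval run found |
| Single Avg | uw2s3olq @ _step=1000 | 2048 | training-run history |
| Combined | uw2s3olq @ _step=1000 | 2048 | training-run history |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 |  | 0.2237 |  |
| 2 |  | 0.2939 |  |
| 4 |  | 0.3664 | 0.4218 |
| 8 |  | 0.4397 | 0.5067 |
| 16 |  | 0.5130 | 0.5902 |
| 32 |  | 0.5850 | 0.6704 |
| 64 |  | 0.6530 | 0.7439 |
| 128 |  | 0.7147 | 0.8064 |
| 256 |  | 0.7692 | 0.8564 |
| 512 |  | 0.8166 | 0.8968 |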
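
Sources written as `run @ _step=N` were read from W&B training-run history rather than from a standalone eval run. A minimal sketch of that lookup follows, assuming the history is available as an iterable of row dictionaries that each carry a `_step` field. The helper name `metric_at_step` and the metric key in the usage comment are hypothetical; the actual logged key names are not given in this document.

```python
from typing import Any, Iterable, Mapping, Optional


def metric_at_step(
    rows: Iterable[Mapping[str, Any]], key: str, step: int
) -> Optional[float]:
    """Return the value logged under `key` at `_step == step`, else None.

    If several rows match, the last one wins. Rows at the requested step
    that do not contain `key` (or hold None) are skipped.
    """
    value: Optional[float] = None
    for row in rows:
        if row.get("_step") == step and row.get(key) is not None:
            value = float(row[key])
    return value


# Hypothetical usage with the W&B public API; the entity, project, and
# metric key are placeholders, not names taken from this document:
#   import wandb
#   run = wandb.Api().run("<entity>/<project>/<run_id>")
#   value = metric_at_step(run.scan_history(), "<metric_key>", 1000)
```

Returning `None` when the step or key is missing maps naturally onto a blank table cell.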

Setting 6

SmolLM2-360M-Instruct, GSM8K training for 200 steps, GSM8K eval.

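| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) |  | no standalone base eval run found |
| Single Avg | zv5xbryh | 2048 | resume global_step_200 |
| Combined | zv5xbryh | 2048 | resume global_step_200 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 |  | 0.1588 |  |
| 2 |  | 0.2213 |  |
| 4 |  | 0.2925 | 0.3359 |
| 8 |  | 0.3718 | 0.4268 |
| 16 |  | 0.4564 | 0.5222 |
| 32 |  | 0.5410 | 0.6159 |
| 64 |  | 0.6196 | 0.7016 |
| 128 |  | 0.6895 | 0.7739 |
| 256 |  | 0.7512 | 0.8315 |
| 512 |  | 0.8056 | 0.8767 |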

Setting 7

Qwen3-0.6B-Base, GSM8K training for 400 steps, GSM8K eval.

| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m2nt7fyg | 512 | base eval on GSM8K |
| Single Avg | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |
| Combined | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2707 | 0.7743 |  |
| 2 | 0.4321 | 0.8348 |  |
| 4 | 0.6106 | 0.8782 | 0.9012 |
| 8 | 0.7616 | 0.9098 | 0.9302 |
| 16 | 0.8629 | 0.9330 | 0.9509 |
| 32 | 0.9222 | 0.9503 | 0.9655 |
| 64 | 0.9553 | 0.9628 | 0.9754 |
| 128 | 0.9741 | 0.9716 | 0.9826 |
| 256 | 0.9843 | 0.9778 | 0.9881 |
| 512 | 0.9901 | 0.9830 | 0.9921 |