Seven Setting Eval Tables
Last updated: 2026-04-13 UTC
Notes:
- pass@1 is taken from accuracy/mean.
- combined is only defined from pass@4 onward, so its pass@1 and pass@2 cells are left blank.
- Blank cells mean the number is not available yet, or I intentionally left it blank because the desired eval is still pending.
- For training runs, I pulled the metric from W&B history at the requested step (for example, _step=400 or _step=1000).
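These notes do not say which pass@k estimator produced the numbers. As a minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for a problem with n samples of which c are correct, the per-k values could be computed as below. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and not taken from the eval code. With k=1 the estimator reduces to c/n, which is consistent with pass@1 being read from accuracy/mean.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (assumed estimator).

    n: number of samples drawn for the problem
    c: number of those samples that are correct
    k: budget of attempts, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. Python integers are exact, so large n is safe here.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimate over an eval set."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For example, `mean_pass_at_k(counts, n=512, k=64)` would give one cell of a table, where `counts` holds the number of correct samples per problem. How the combined column pools samples across adapters is not specified here, so this sketch does not cover it.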
Setting 1
Qwen2.5-0.5B-Instruct, GSM8K train 2000 step, GSM8K eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | s4bxcc1l | 512 | resume global_step_2000 |
| 4-LoRA Single/Combined | rk9ic9kk | 2048 | resume global_step_2000 |
| MERL Single/Combined | (pending) | 2048 | eval run not finished yet; left blank |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.6264 | 0.6151 | | | |
| 2 | 0.5943 | 0.6898 | 0.6960 | | | |
| 4 | 0.7012 | 0.7442 | 0.7594 | 0.8048 | | |
| 8 | 0.7878 | 0.7915 | 0.8118 | 0.8568 | | |
| 16 | 0.8560 | 0.8318 | 0.8544 | 0.8963 | | |
| 32 | 0.9065 | 0.8663 | 0.8885 | 0.9252 | | |
| 64 | 0.9417 | 0.8953 | 0.9157 | 0.9463 | | |
| 128 | 0.9651 | 0.9176 | 0.9365 | 0.9626 | | |
| 256 | 0.9799 | 0.9350 | 0.9516 | 0.9751 | | |
| 512 | 0.9909 | 0.9487 | 0.9622 | 0.9838 | | |
Setting 2
Qwen2.5-0.5B-Instruct, GSM8K train 200 step, GSM8K eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m3ocmw3l | 512 | shared baseline |
| 1-LoRA | xw4w9c0u | 512 | resume global_step_200 |
| 4-LoRA Single/Combined | 2rytl841 | 2048 | resume global_step_200 |
| MERL Single/Combined | 0041qzrm | 2048 | resume global_step_200 |

| k | Base | 1-LoRA | 4-LoRA Single | 4-LoRA Combined | MERL Single | MERL Combined |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4661 | 0.5942 | 0.5703 | | 0.5335 | |
| 2 | 0.5943 | 0.6842 | 0.6656 | | 0.6450 | |
| 4 | 0.7012 | 0.7557 | 0.7438 | 0.7772 | 0.7374 | 0.7584 |
| 8 | 0.7878 | 0.8125 | 0.8069 | 0.8389 | 0.8116 | 0.8308 |
| 16 | 0.8560 | 0.8590 | 0.8572 | 0.8871 | 0.8694 | 0.8861 |
| 32 | 0.9065 | 0.8978 | 0.8969 | 0.9237 | 0.9127 | 0.9266 |
| 64 | 0.9417 | 0.9285 | 0.9271 | 0.9503 | 0.9437 | 0.9544 |
| 128 | 0.9651 | 0.9497 | 0.9491 | 0.9682 | 0.9646 | 0.9723 |
| 256 | 0.9799 | 0.9636 | 0.9647 | 0.9795 | 0.9785 | 0.9841 |
| 512 | 0.9909 | 0.9727 | 0.9754 | 0.9870 | 0.9880 | 0.9920 |
Setting 3
Qwen3-0.6B-Base, MATH train 400 step, Math eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | 1eidnqtd | 512 | base eval on Math500 |
| Single Avg | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |
| Combined | (pending) | 2048 | new eval launched in tmux 0:0; left blank for now |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2154 | | |
| 2 | 0.3370 | | |
| 4 | 0.4754 | | |
| 8 | 0.6065 | | |
| 16 | 0.7143 | | |
| 32 | 0.7946 | | |
| 64 | 0.8513 | | |
| 128 | 0.8916 | | |
| 256 | 0.9207 | | |
| 512 | 0.9416 | | |
Setting 4
Qwen2.5-0.5B-Instruct, MATH train 400 step, Math eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | ub2ua0fb | 512 | base eval on Math500 |
| Single Avg | bfgx3ra4 | 2048 | resume global_step_400 |
| Combined | bfgx3ra4 | 2048 | resume global_step_400 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.3081 | 0.3568 | |
| 2 | 0.4144 | 0.4484 | |
| 4 | 0.5162 | 0.5351 | 0.5514 |
| 8 | 0.6078 | 0.6140 | 0.6305 |
| 16 | 0.6890 | 0.6847 | 0.7014 |
| 32 | 0.7598 | 0.7463 | 0.7634 |
| 64 | 0.8180 | 0.7977 | 0.8141 |
| 128 | 0.8627 | 0.8398 | 0.8549 |
| 256 | 0.8956 | 0.8750 | 0.8883 |
| 512 | 0.9195 | 0.9054 | 0.9147 |
Setting 5
SmolLM2-360M-Instruct, GSM8K train 1000 step, GSM8K eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | uw2s3olq @ _step=1000 | 2048 | training-run history |
| Combined | uw2s3olq @ _step=1000 | 2048 | training-run history |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.2237 | |
| 2 | | 0.2939 | |
| 4 | | 0.3664 | 0.4218 |
| 8 | | 0.4397 | 0.5067 |
| 16 | | 0.5130 | 0.5902 |
| 32 | | 0.5850 | 0.6704 |
| 64 | | 0.6530 | 0.7439 |
| 128 | | 0.7147 | 0.8064 |
| 256 | | 0.7692 | 0.8564 |
| 512 | | 0.8166 | 0.8968 |
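The Single Avg and Combined values in this setting were read from W&B training-run history at `_step=1000` rather than from a standalone eval run. A minimal sketch of that lookup follows, assuming the history is available as an iterable of dict rows that each carry a `_step` field. The helper name `metric_at_step`, the metric key `val/pass@4`, and the sample rows are all hypothetical placeholders, not real logged values.

```python
from typing import Any, Iterable, Mapping, Optional


def metric_at_step(
    rows: Iterable[Mapping[str, Any]], step: int, key: str
) -> Optional[float]:
    """Return the value logged under `key` at `_step == step`, else None.

    Rows that lack the key at that step are skipped, because a run can log
    different metrics on different rows.
    """
    for row in rows:
        if row.get("_step") == step and row.get(key) is not None:
            return float(row[key])
    return None


# Placeholder rows for illustration only; the key name is an assumption.
# With the wandb client installed, rows could instead come from
# wandb.Api().run("<entity>/<project>/<run_id>").scan_history().
example_history = [
    {"_step": 999, "train/loss": 1.5},
    {"_step": 1000, "train/loss": 1.25},
    {"_step": 1000, "val/pass@4": 0.5},
]

value = metric_at_step(example_history, step=1000, key="val/pass@4")
```

Returning `None` for a missing step keeps "not logged" distinct from a logged value of 0.0, which matches how blank cells are used in these tables.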
Setting 6
SmolLM2-360M-Instruct, GSM8K train 200 step, GSM8K eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | (not found) | | no standalone base eval run found |
| Single Avg | zv5xbryh | 2048 | resume global_step_200 |
| Combined | zv5xbryh | 2048 | resume global_step_200 |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | | 0.1588 | |
| 2 | | 0.2213 | |
| 4 | | 0.2925 | 0.3359 |
| 8 | | 0.3718 | 0.4268 |
| 16 | | 0.4564 | 0.5222 |
| 32 | | 0.5410 | 0.6159 |
| 64 | | 0.6196 | 0.7016 |
| 128 | | 0.6895 | 0.7739 |
| 256 | | 0.7512 | 0.8315 |
| 512 | | 0.8056 | 0.8767 |
Setting 7
Qwen3-0.6B-Base, GSM8K train 400 step, GSM8K eval.
| Variant | Source | N_VAL | Note |
| --- | --- | --- | --- |
| Base | m2nt7fyg | 512 | base eval on GSM8K |
| Single Avg | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |
| Combined | nqta9blp @ _step=400 | 2048 | training-run history; checkpoint no longer on local disk |

| k | Base | Single Avg | Combined |
| --- | --- | --- | --- |
| 1 | 0.2707 | 0.7743 | |
| 2 | 0.4321 | 0.8348 | |
| 4 | 0.6106 | 0.8782 | 0.9012 |
| 8 | 0.7616 | 0.9098 | 0.9302 |
| 16 | 0.8629 | 0.9330 | 0.9509 |
| 32 | 0.9222 | 0.9503 | 0.9655 |
| 64 | 0.9553 | 0.9628 | 0.9754 |
| 128 | 0.9741 | 0.9716 | 0.9826 |
| 256 | 0.9843 | 0.9778 | 0.9881 |
| 512 | 0.9901 | 0.9830 | 0.9921 |