Tested out GRPO training on domain-specific adapters followed by a WAVE merge, in a very scattershot way.
It apparently came out smarter than when I tried to trim down to just the "best" adapters from the exploration, though, so scattershot it is.
I'm not sure this is a "better" base model overall; domain-wise it may have lost slightly on other domains while improving mainly on Python code.
The lm-eval diagnostic tasks below look promising, though.
| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Merge Δ vs Base | GRPO-Wave | Wave Δ vs Base | Wave Δ vs Merge |
|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% |
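
For reference, here is a minimal sketch of re-running these diagnostic tasks with the lm-evaluation-harness Python API. The task list mirrors the table; the model id, dtype, and batch size are assumptions — swap in whichever checkpoint (base, GRPO-Merge, or GRPO-Wave) you want to score.

```python
# Hedged sketch: re-running the diagnostic tasks above with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Lambent/Qwen3-4B-Base-Continued-GRPO-Wave,dtype=bfloat16",
    tasks=["arc_easy", "lambada_openai", "openbookqa", "piqa"],
    batch_size=8,  # assumption; adjust to your hardware
)

# Per-task metrics (acc, acc_norm, perplexity) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```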
This is a merge of pre-trained language models created using mergekit.
## Merge Details
### Merge Method
This model was merged using the WAVE merge method, with Qwen/Qwen3-4B-Base as the base.
### Models Merged
The following models were included in the merge:
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-aware-5e6
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-merge-llm-judge-ep2
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-ao3-minilm
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-ao3-qwen
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-aware-test
- Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-python-creative
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-bbc-qwen
### Configuration
The following YAML configuration was used to produce this model:
```yaml
models:
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-python-creative
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-minilm
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-5e6
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-test
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-merge-llm-judge-ep2
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-bbc-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
merge_method: wave
base_model: Qwen/Qwen3-4B-Base
parameters:
  synergy: 0.5 # 0.0 to 1.0; higher keeps more "controversial" high-variance parameters
  entropy: 0.1 # adds slight noise to break ties / prevent overfitting
dtype: bfloat16
tokenizer_source: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
```
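
A minimal sketch of applying this configuration via mergekit's Python API (the `mergekit-yaml` CLI does the same job). The config file name and output path are placeholders, and it assumes the `wave` merge method and its `synergy`/`entropy` parameters are available in your mergekit install.

```python
# Hedged sketch: running the YAML config above with mergekit's Python API.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "wave-merge.yaml" is a placeholder name for the configuration shown above.
with open("wave-merge.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./Qwen3-4B-Base-Continued-GRPO-Wave",  # placeholder output directory
    options=MergeOptions(cuda=True, copy_tokenizer=True, lazy_unpickle=True),
)
```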