Tested out GRPO training on domain-specific adapters followed by a WAVE merge, in a very scattershot way.
It apparently came out smarter than when I tried to trim down to just the "best" adapters from the exploration, though, so scattershot it is.
I'm not sure this is a "better" base model overall; domain-wise it may have lost slightly on other domains while improving mainly on Python code.
The lm-eval diagnostic tasks below look promising, though.
| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Merge Δ vs Base | GRPO-Wave | Wave Δ vs Base | Wave Δ vs Merge |
|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% |
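
For reference, here is a minimal sketch of re-running these diagnostic tasks with the lm-evaluation-harness Python API. The task list mirrors the table; the model id, dtype, and batch size are assumptions — swap in whichever checkpoint (base, GRPO-Merge, or GRPO-Wave) you want to score.

```python
# Hedged sketch: re-running the diagnostic tasks above with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Lambent/Qwen3-4B-Base-Continued-GRPO-Wave,dtype=bfloat16",
    tasks=["arc_easy", "lambada_openai", "openbookqa", "piqa"],
    batch_size=8,  # assumption; adjust to your hardware
)

# Per-task metrics (acc, acc_norm, perplexity) live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```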
This is a merge of pre-trained language models created using mergekit.
## Merge Details
### Merge Method
This model was merged using the WAVE merge method, with Qwen/Qwen3-4B-Base as the base.
### Models Merged
The following models were included in the merge:
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-aware-5e6
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-merge-llm-judge-ep2
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-ao3-minilm
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-ao3-qwen
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-aware-test
- Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-python-creative
- Lambent/Qwen3-4B-Base-Continued-GRPO + ../rlvr-envs/grpo-bbc-qwen
### Configuration
The following YAML configuration was used to produce this model:
```yaml
models:
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-python-creative
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-minilm
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-5e6
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-test
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-merge-llm-judge-ep2
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-bbc-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
merge_method: wave
base_model: Qwen/Qwen3-4B-Base
parameters:
  synergy: 0.5 # 0.0 to 1.0; higher keeps more "controversial" high-variance parameters
  entropy: 0.1 # adds slight noise to break ties / prevent overfitting
dtype: bfloat16
tokenizer_source: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
```
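
A minimal sketch of applying this configuration via mergekit's Python API (the `mergekit-yaml` CLI does the same job). The config file name and output path are placeholders, and it assumes the `wave` merge method and its `synergy`/`entropy` parameters are available in your mergekit install.

```python
# Hedged sketch: running the YAML config above with mergekit's Python API.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "wave-merge.yaml" is a placeholder name for the configuration shown above.
with open("wave-merge.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./Qwen3-4B-Base-Continued-GRPO-Wave",  # placeholder output directory
    options=MergeOptions(cuda=True, copy_tokenizer=True, lazy_unpickle=True),
)
```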