Tested out GRPO training on domain-specific adapters, then combined them with a WAVE merge in a very... uh, scattershot way.

Apparently the scattershot merge came out smarter than when I tried trimming down to the "best" adapters from the exploration, though, so scattershot it is.

I'm not sure this is a "better" base model overall; domain-wise it may have lost slightly in other areas and improved mainly on Python code.

The lm-eval diagnostic tasks here look promising though.

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ vs Base | GRPO-Wave | Δ vs Base | Δ vs Merge |
|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% |
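The Δ columns are relative (percentage) changes, not absolute point differences; for example, the GRPO-Wave arc_easy acc gain over base is (0.7912 − 0.7891) / 0.7891 ≈ +0.27%. A minimal sketch of the convention (function name is mine, not from any harness):

```python
def rel_delta(new, old):
    """Relative change in percent, as used in the Δ columns."""
    return 100.0 * (new - old) / old

# arc_easy acc: GRPO-Wave vs. Qwen3-4B-Base
print(f"{rel_delta(0.7912, 0.7891):+.2f}%")  # ≈ +0.27%
# lambada_openai perplexity: lower is better, so a negative Δ is an improvement
print(f"{rel_delta(3.9616, 4.2433):+.2f}%")  # ≈ -6.64%
```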

This is a merge of pre-trained language models created using mergekit.

## Merge Details

### Merge Method

This model was merged using the WAVE merge method using Qwen/Qwen3-4B-Base as a base.

### Models Merged

The following models were included in the merge:

* Lambent/Qwen3-4B-Base-Continued-GRPO with the rlvr-envs adapters (grpo-python-creative, grpo-ao3-minilm, grpo-aware-5e6, grpo-aware-test, grpo-merge-llm-judge-ep2, grpo-bbc-qwen, grpo-ao3-qwen)
* Lambent/Qwen3-4B-Base-Continued-GRPO-Merge

### Configuration

The following YAML configuration was used to produce this model:

```yaml
models:
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-python-creative
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-minilm
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-5e6
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-aware-test
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-merge-llm-judge-ep2
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-bbc-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO+../rlvr-envs/grpo-ao3-qwen
  - model: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
merge_method: wave
base_model: Qwen/Qwen3-4B-Base
parameters:
  synergy: 0.5  # 0.0 to 1.0. Higher = keep more "controversial" high-variance parameters
  entropy: 0.1  # Adds slight noise to break ties/prevent overfitting
dtype: bfloat16
tokenizer_source: Lambent/Qwen3-4B-Base-Continued-GRPO-Merge
```
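WAVE's actual implementation lives in mergekit; as a rough conceptual sketch of what a variance-aware merge with `synergy` and `entropy` knobs could look like (the function and its logic are illustrative assumptions of mine, not mergekit's code):

```python
import numpy as np

def wave_like_merge(base, task_params, synergy=0.5, entropy=0.1, seed=0):
    """Illustrative variance-aware merge (NOT mergekit's WAVE implementation).

    base: base-model parameter tensor
    task_params: list of fine-tuned parameter tensors
    synergy: how strongly high-variance ("controversial") deltas are kept
    entropy: scale of tie-breaking noise
    """
    rng = np.random.default_rng(seed)
    deltas = np.stack([p - base for p in task_params])  # task vectors
    mean = deltas.mean(axis=0)
    std = deltas.std(axis=0)
    # Upweight parameters the fine-tunes disagree on, proportional to synergy.
    weight = 1.0 + synergy * std / (std.mean() + 1e-8)
    # Small noise proportional to entropy breaks exact ties between models.
    noise = entropy * std * rng.standard_normal(base.shape)
    return base + weight * mean + noise

base = np.zeros(4)
merged = wave_like_merge(base, [np.ones(4), -0.5 * np.ones(4)], entropy=0.0)
```

With `entropy=0.0` the merge is deterministic: here every delta has mean 0.25 and std 0.75, so each merged parameter is 1.5 × 0.25 = 0.375.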
