---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- rlm
- recursive-language-model
- lora
- qwen3
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---

# RL4RLM-STaR: Iterative Self-Improvement RLM
LoRA adapter for Qwen3-1.7B trained as a Recursive Language Model (RLM) — a model that writes Python code to decompose and solve long-context tasks via a persistent REPL environment.
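To illustrate the loop the description implies (a minimal sketch, not the project's actual harness): the model emits Python snippets that are executed in a namespace that persists across turns, so intermediate variables survive while the model decomposes the long context step by step. The class and variable names here are hypothetical.

```python
class PersistentREPL:
    """Toy persistent REPL: model-written code snippets are exec'd into one
    shared namespace, so state carries across turns of the RLM loop."""

    def __init__(self, context: str):
        # The long document is exposed as a variable the model's code can slice.
        self.namespace = {"context": context}

    def run(self, code: str):
        # Execute one model-generated snippet; variables persist between calls.
        exec(code, self.namespace)
        return self.namespace.get("result")

repl = PersistentREPL("needle: 42 " + "filler " * 1000)
repl.run("chunk = context[:40]")               # turn 1: inspect a slice
out = repl.run("result = 'needle' in chunk")   # turn 2: reuse `chunk` from turn 1
```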
## Paper
*Training Native Recursive Language Models* — CS234 Final Project, Stanford University (Winter 2026)
- GitHub: pythonomar22/rl4rlm
## Training Details

- Method: second round of SFT (STaR-style) on a combined set of 132 trajectories
- Data: 87 original NIAH trajectories + 45 self-generated on harder tasks (multi-needle, document classification)
- Training: 5 epochs from the base model on the combined set
- Key result: the most balanced model — 58.4% Multi-NIAH, 83.4% DocClassify
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "omar81939/rl4rlm-star")
tokenizer = AutoTokenizer.from_pretrained("omar81939/rl4rlm-star")
```
## Results

| Model | NIAH (100) | Multi-NIAH (24) | DocClassify (20) | Avg |
|---|---|---|---|---|
| Base | 72.0 | 38.3 | 80.3 | 63.5 |
| SFT | 90.0 | 57.9 | 82.4 | 76.8 |
| **STaR (this model)** | 87.0 | 58.4 | 83.4 | 76.3 |
| DPO | 83.0 | 87.9 | 82.6 | 84.5 |
| GRPO-v4 | 82.0 | 85.1 | 83.2 | 83.4 |
## LoRA Config
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target modules: all attention and MLP projections
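The settings above correspond to a `peft.LoraConfig` roughly like the following sketch. The exact target-module list is not stated on this card; the names below are the standard Qwen3 attention/MLP projection names, assumed rather than confirmed.

```python
from peft import LoraConfig

# Sketch of the training-time adapter config; module names are the standard
# Qwen3 projections (an assumption, not taken from the card).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```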