---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- rlm
- recursive-language-model
- lora
- qwen3
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---
# RL4RLM-SFT: Supervised Fine-Tuned RLM

LoRA adapter for Qwen/Qwen3-1.7B trained as a Recursive Language Model (RLM): a model that writes Python code to decompose and solve long-context tasks via a persistent REPL environment.
## Paper

**Training Native Recursive Language Models**, CS234 Final Project, Stanford University (Winter 2026)

- GitHub: [pythonomar22/rl4rlm](https://github.com/pythonomar22/rl4rlm)
## Training Details

- **Method:** LoRA SFT on 87 self-bootstrapped trajectories from Qwen3-1.7B
- **Data:** needle-in-a-haystack (NIAH) tasks (5K-100K documents), cleaned and template-fixed
- **Training:** 5 epochs, lr 2e-4, batch size 16; 34 seconds on 1 H200
- **Key result:** +18.0 pp on NIAH and +19.6 pp on Multi-NIAH over the base model
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
model = PeftModel.from_pretrained(base, "omar81939/rl4rlm-sft")
tokenizer = AutoTokenizer.from_pretrained("omar81939/rl4rlm-sft")
```
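The adapter is meant to drive a persistent-REPL loop: the model emits Python, the harness executes it in a shared namespace, and the printed output is fed back as context for the next turn. The project's actual harness lives in the GitHub repo; the sketch below only illustrates the REPL mechanism, with a list of canned snippets standing in for real model generations:

```python
# Illustrative persistent-REPL driver. The `snippets` list stands in
# for successive model generations; it is not the project's harness.
import io
import contextlib

def run_repl(snippets):
    """Execute code snippets in one shared namespace, collecting
    whatever each snippet prints as feedback for the next turn."""
    namespace = {}   # persists across turns, like a live REPL session
    feedback = []
    for code in snippets:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)  # state carries over between snippets
        feedback.append(buf.getvalue().strip())
    return feedback

# Two "model turns": the second reuses state defined in the first,
# mimicking how an RLM narrows down a long document step by step.
turns = [
    "chunks = ['needle in chunk 2', 'hay']\nprint(len(chunks))",
    "hits = [c for c in chunks if 'needle' in c]\nprint(hits[0])",
]
print(run_repl(turns))  # → ['2', 'needle in chunk 2']
```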
## Results
| Model | NIAH (100) | Multi-NIAH (24) | DocClassify (20) | Avg |
|---|---|---|---|---|
| Base | 72.0 | 38.3 | 80.3 | 63.5 |
| SFT | 90.0 | 57.9 | 82.4 | 76.8 |
| STaR | 87.0 | 58.4 | 83.4 | 76.3 |
| DPO | 83.0 | 87.9 | 82.6 | 84.5 |
| GRPO-v4 | 82.0 | 85.1 | 83.2 | 83.4 |
## LoRA Config
- Rank: 16, Alpha: 32, Dropout: 0.05
- Target modules: all attention and MLP projections
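For reference, the configuration above corresponds to a peft `LoraConfig` along these lines. The `target_modules` list names the standard Qwen3 attention and MLP projections; it is an assumption matching "all attention and MLP projections," not read from the adapter files:

```python
from peft import LoraConfig

# Sketch of the stated setup; target module names are assumed, not
# taken from the published adapter config.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```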