# `train/` – SFT + GRPO Training Pipeline

[← back to main README](../README.md)

This directory holds the **training notebooks** for the AWS RL agent. Heavy logic for the GRPO loop lives at the repo root in [train_grpo.py](../train_grpo.py); the notebooks here are thin drivers that you can run end-to-end on Colab.

The training pipeline has two stages:

```
        ┌──────── data/sft/ ─────────┐
        │ 1,500 train · 150 val rows │
        │     5 trajectory types     │
        └──────────────┬─────────────┘
                       │
┌──────────────────────┼─────────────────────────────────────────────┐
│ STAGE 1 – Supervised Fine-Tuning (train_sft_lora.ipynb)            │
│ Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter  │
└──────────────────────┬─────────────────────────────────────────────┘
                       │ Sizzing/aws-rl-sft-qwen25coder3b-adapter
┌──────────────────────┼─────────────────────────────────────────────┐
│ STAGE 2 – GRPO RL (train_grpo_lora.ipynb)                          │
│ G=8 parallel rollouts · multi-turn · reward = env return           │
│ Optuna over (lr, β, G, T, top_p, lora_r, max_turns)                │
└─────────────────────────────────────────────────────────────────────┘
```
The two stages are intentionally separable: the SFT adapter is published to the Hugging Face Hub so anyone can pull it and start GRPO without re-running SFT.

---

## Table of contents

1. [SFT stage – supervised LoRA](#1-sft-stage--supervised-lora)
2. [GRPO stage – reinforcement learning](#2-grpo-stage--reinforcement-learning)
3. [Optuna hyperparameter search](#3-optuna-hyperparameter-search)
4. [Multi-turn rollouts + parallel envs](#4-multi-turn-rollouts--parallel-envs)
5. [Training modes (CLI)](#5-training-modes-cli)
6. [How to run](#6-how-to-run)
7. [Logging and artifacts](#7-logging-and-artifacts)
8. [Reproducing results](#8-reproducing-results)
9. [Files in this directory](#9-files-in-this-directory)

---
## 1. SFT stage – supervised LoRA

[train/train_sft_lora.ipynb](train_sft_lora.ipynb) – the primary SFT notebook.

### Why SFT before GRPO?

Two reasons, both of which showed up in our base-model evaluation ([data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md)):
1. **Format-locking.** Even strong coder models occasionally wrap commands in markdown fences or quotes. SFT removes that surface noise in one epoch.
2. **Bootstrapping the GRPO reward signal.** GRPO with a base model that is only 41% exact-match starts from a low-density reward landscape. Fine-tuning on canonical commands first raises the baseline so GRPO can spend its compute on optimization rather than search.
### Base model

| Choice | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` |
|--------|----------------------------------------------|
| Why | Highest exact-match (41%) of the 11 candidates we benchmarked, fastest viable inference (3.1 s/call), tightest output (86 chars). Full reasoning in [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md). |
| Loader | Unsloth's 4-bit quantized variant: fits comfortably on a single 24 GB GPU, 2× faster training kernels |
### LoRA config

```python
lora_r = trial.suggest_categorical("lora_r", [8, 16, 32])

LoraConfig(
    r              = lora_r,
    lora_alpha     = lora_r * trial.suggest_categorical("lora_alpha_mul", [1, 2, 4]),
    lora_dropout   = trial.suggest_float("lora_dropout", 0.005, 0.031),
    bias           = "none",
    task_type      = "CAUSAL_LM",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```
- Only attention projections are adapted: MLP / output heads stay frozen, keeping the trainable parameter count tiny (~10–40 M depending on rank).
- `lora_alpha = r × multiplier` keeps the effective scaling stable across rank variations during the Optuna search.
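For reference, a minimal sketch of how the base model and this adapter config come together with Unsloth (the values are the best-trial hyperparameters, hard-coded here for illustration; the actual cell lives in the notebook):

```python
from unsloth import FastLanguageModel

# Load the 4-bit quantized base model (fits on a single 24 GB GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length = 512,
    load_in_4bit   = True,
)

# Attach the LoRA adapter: attention projections only, best-trial hyperparameters.
model = FastLanguageModel.get_peft_model(
    model,
    r              = 16,
    lora_alpha     = 16,
    lora_dropout   = 0.0058,
    bias           = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```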
### Optimization

| Hyperparameter | Value / Range |
|----------------|---------------|
| Optimizer | AdamW (Unsloth's fused implementation) |
| Learning rate | `[1e-4, 5e-4]` log-scale (Optuna) |
| Schedule | Cosine annealing |
| Warmup ratio | `{0.03, 0.1}` (Optuna; best 0.1) |
| Batch size | 2 per GPU |
| Epochs | 2 |
| Max sequence length | 512 |
| Packing | **Disabled** (we keep chat-template separators intact) |
| Loss masking | Assistant-only (user message tokens are masked from the loss) |
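Wired into TRL, these values correspond roughly to the following trainer setup. A sketch only: the Optuna-sampled values are shown at their best-trial settings, the assistant-only loss masking is handled in the notebook, and some field names shift between TRL releases:

```python
from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    output_dir                  = "outputs/aws-rl-sft",
    per_device_train_batch_size = 2,
    num_train_epochs            = 2,
    max_seq_length              = 512,          # `max_length` in newer TRL releases
    packing                     = False,        # keep chat-template separators intact
    learning_rate               = 4.03e-4,
    lr_scheduler_type           = "cosine",
    warmup_ratio                = 0.1,
    logging_steps               = 10,
)

trainer = SFTTrainer(
    model            = model,
    args             = sft_config,
    train_dataset    = train_ds,      # chat-format dataset, see below
    eval_dataset     = val_ds,
    processing_class = tokenizer,     # `tokenizer=` in older TRL versions
)
trainer.train()
```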
### Dataset

[data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) – 1,500 examples. Format:

```json
{
  "messages": [
    {"role": "system", "content": "You are an AWS cloud engineer..."},
    {"role": "user", "content": "TASK: ...\n\nCURRENT OBSERVATION:\nProgress: 0.00 ..."},
    {"role": "assistant", "content": "aws s3 mb s3://my-app-data"}
  ],
  "difficulty": "intermediate",
  "source": "success_first_step",
  "task_id": 42
}
```

The dataset is a careful mix of **5 trajectory types** (success, multi-step continuation, failure recovery, verification, hint usage). Full generation methodology in [data/README.md](../data/README.md).
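A minimal sketch of loading this file and rendering one example through the model's chat template, i.e. exactly what the trainer sees (assumes `tokenizer` is already in scope):

```python
from datasets import load_dataset

# One JSON object per line, chat format as shown above.
ds = load_dataset("json", data_files="data/sft/aws_rl_sft.train.jsonl", split="train")

# The chat template inserts the role separators; the assistant turn is the
# single canonical AWS CLI command the model is trained to emit.
text = tokenizer.apply_chat_template(ds[0]["messages"], tokenize=False)
print(text)
```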
### Training graphs

A reference SFT run achieved validation loss `0.052` after 188 training steps with the best Optuna trial. The plots below were exported from that run into [`docs/figures/`](../docs/figures/).

> *(SFT training loss curve; see [`docs/figures/`](../docs/figures/))*

---
## 2. GRPO stage – reinforcement learning

The core trainer lives at [train_grpo.py](../train_grpo.py) (1,283 LOC). Notebooks call into it:

- [train/train_grpo_lora.ipynb](train_grpo_lora.ipynb) – clean
- [train/train_grpo_lora_with_outputs.ipynb](train_grpo_lora_with_outputs.ipynb) – with execution outputs preserved
- [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) – Colab driver wrapping the entire pipeline
### What GRPO is, briefly

**GRPO** (Group Relative Policy Optimization) is the algorithm introduced by DeepSeekMath and adopted by TRL ≥ 0.18. Unlike PPO, GRPO does **not** train a critic. Instead:

1. For one prompt (here, one curriculum-picked task), generate `G` completions
2. Score each with the reward function(s)
3. Compute the group-relative advantage: `(reward_i − group_mean) / group_std`
4. Backpropagate the policy gradient with that advantage
5. Apply a KL penalty against the SFT reference model (coefficient `β`) to prevent drift

This is dramatically simpler than PPO (no value head, no GAE), more sample-efficient for verifier-style rewards, and a natural fit for our setup: the AWS RL env *is* the reward function.
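The advantage computation in step 3 is small enough to show inline. A plain-PyTorch sketch (not the TRL internals):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per completion in the group."""
    # Each completion is scored against its own group; no learned value function.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 rollouts of the same task, two of which succeed.
rewards = torch.tensor([0.10, 0.90, 0.20, 0.10, 0.10, 0.85, 0.15, 0.10])
print(group_relative_advantages(rewards))   # positive only for the two successes
```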
### TRL GRPOTrainer config

From [train_grpo.py:_build_grpo_config()](../train_grpo.py):

| Parameter | Default value | Notes |
|-----------|---------------|-------|
| `learning_rate` | `5e-6` | Optuna range `[1e-6, 1e-4]` log-scale |
| `beta` (KL coefficient) | `0.04` | Optuna range `[0.0, 0.1]` |
| `num_generations` (G) | `8` | Optuna `{4, 8}` |
| `temperature` | `0.9` | Optuna `[0.7, 1.0]` |
| `top_p` | `0.95` | Optuna `[0.85, 0.98]` |
| `per_device_train_batch_size` | `1` | |
| `gradient_accumulation_steps` | `8` | Effective batch 8 |
| `gradient_checkpointing` | `True` | `use_reentrant=False`; VRAM optimization |
| `max_completion_length` | `256` | Per-turn; one AWS CLI command fits comfortably |
| `max_prompt_length` | `2048` | Holds task + history + observation |
| `loss_type` | `"dapo"` | DAPO loss (Decoupled Clip and Dynamic Sampling Policy Optimization; the TRL default for GRPO) |
| `mask_truncated_completions` | `True` | Drop samples that hit `max_completion_length` |
| `warmup_ratio` | `0.05` | |
| `lr_scheduler_type` | `"cosine"` | |
| `max_grad_norm` | `1.0` | |
| `use_vllm` | `False` | Plain `model.generate()`; vLLM integration is future work |
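For orientation, this is roughly how those defaults assemble into a TRL `GRPOConfig` (a sketch; the real construction lives in `_build_grpo_config()`, and field availability depends on your TRL version):

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir                    = "outputs/aws-rl-grpo",
    learning_rate                 = 5e-6,
    beta                          = 0.04,
    num_generations               = 8,
    temperature                   = 0.9,
    top_p                         = 0.95,
    per_device_train_batch_size   = 1,
    gradient_accumulation_steps   = 8,
    gradient_checkpointing        = True,
    gradient_checkpointing_kwargs = {"use_reentrant": False},
    max_completion_length         = 256,
    max_prompt_length             = 2048,
    loss_type                     = "dapo",
    mask_truncated_completions    = True,
    warmup_ratio                  = 0.05,
    lr_scheduler_type             = "cosine",
    max_grad_norm                 = 1.0,
    use_vllm                      = False,
)
```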
### Reward functions (TRL convention)

Three reward functions are registered, summed by GRPO:

```python
reward_funcs=[reward_task, reward_achieved, reward_progress]
```

- `reward_task(completions, **kwargs)` – episode return (sum of per-step env rewards). The dominant signal.
- `reward_achieved(completions, **kwargs)` – 1.0 if `task.task_achieved` at the end of the episode, else 0.0. Sparse but unambiguous.
- `reward_progress(completions, **kwargs)` – final `partial_progress` ∈ [0, 1]. Densifies the credit assignment for partial completions.

The env's reward shaping (see [server/README.md §8](../server/README.md#8-reward-shaping--taskgrader)) does most of the work; these three TRL functions are a thin façade.
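To make the façade concrete, here is a sketch of what two of these functions can look like. How the per-rollout episode results reach the reward functions is an implementation detail of `train_grpo.py`; the sketch assumes they are cached on the pool in generation order (the `POOL.last_group_results` attribute is hypothetical):

```python
def reward_progress(completions, **kwargs):
    """TRL convention: return one float per completion in the group."""
    episodes = POOL.last_group_results            # hypothetical cache of the group's rollouts
    return [ep["final_progress"] for ep in episodes]

def reward_achieved(completions, **kwargs):
    episodes = POOL.last_group_results
    return [1.0 if ep["task_achieved"] else 0.0 for ep in episodes]
```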
### Episode = one rollout

- Each rollout runs **up to `MAX_TURNS=6` sequential AWS CLI commands**
- Each command's stdout/stderr/progress is fed back as the user message for the next turn (see `build_user_prompt()` and `format_observation()` in [train_grpo.py](../train_grpo.py))
- The episode terminates on `task_achieved`, max turns, or `max_total_tokens` (the per-episode token budget)
- Token sequences (prompt_ids, completion_ids, logprobs) are accumulated **across turns**, so GRPO assigns the episode-level reward to the full multi-turn token sequence, not just the last turn
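In simplified Python, one rollout has this shape (a sketch of the control flow in `rollout_one_episode()`; the prompt helpers, `generate_once`, and the exact env interface are assumptions for illustration):

```python
async def rollout_one_episode(env, task, model, tokenizer, max_turns=6):
    obs = await env.reset(task=task)
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": build_user_prompt(task, obs)}]
    prompt_ids, completion_ids, total_reward = [], [], 0.0

    for turn in range(max_turns):
        # generate_once: stand-in for the lock-guarded model.generate() call.
        p_ids, c_ids, command = generate_once(model, tokenizer, messages)
        prompt_ids.append(p_ids)
        completion_ids.append(c_ids)

        obs, reward, done, info = await env.step(command)   # run the AWS CLI command
        total_reward += reward
        if done:
            break

        # Feed the execution result back as the next user turn.
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user", "content": format_observation(obs)})

    return {"prompt_ids": prompt_ids, "completion_ids": completion_ids,
            "task_reward": total_reward,
            "task_achieved": info.get("task_achieved", False),
            "num_steps": turn + 1}
```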
### Curriculum integration

```
trainer step:
  1. task    = curriculum.next_task()          # one task per GRPO step
  2. results = pool.run_group(task, ...)       # G rollouts on that task
  3. mean_r  = sum(group_rewards) / G
  4. curriculum.record_result(task, achieved=any_achieved, reward=mean_r)
  5. trainer applies group-relative advantages # standard GRPO
```

The curriculum drives task selection: every rollout in a group runs the *same* task, forced through `env.reset(task=task)`. This matches GRPO's group-relative semantics (the group baseline is only meaningful when every completion answers the same prompt).

Full curriculum mechanics (priority scoring, mastery, spaced repetition, tier promotion) live in [server/README.md §7](../server/README.md#7-curriculum-manager).
### Training graphs

A reference GRPO run trained for 35 steps with the best Optuna config (`lr=1.6e-5`, `β=0.0021`, `T=0.99`). Per-step training signals (extracted from the run's `trainer_state.json`) are mirrored into [`docs/figures/`](../docs/figures/):

> *(per-step GRPO training plots; see [`docs/figures/`](../docs/figures/))*

Notable signals from the run:

| Signal | Value |
|--------|-------|
| `env_reward/mean` | 0.31 (mean over 16 reward-logged steps), max 0.94, min 0.13 |
| `kl` | 0.15 (mean); KL stays small despite the tiny β |
| `completion_length` | 87 tokens (mean); the agent emits compact AWS CLI commands |
| Format compliance | **100%** (`format_reward/mean = 1.0` every step) |

Multi-step end-to-end re-eval after GRPO:

> *(multi-step re-eval plot; see [`docs/figures/`](../docs/figures/))*

These are produced by [`plot_rewards()`](../train_grpo.py) reading the `reward_log.csv` written by `EpisodeLogger`, plus the post-hoc plots generated during the GRPO notebook run.
---

## 3. Optuna hyperparameter search

[train_grpo.py:optuna_search()](../train_grpo.py)

### Search space

| Parameter | Range | Reason |
|-----------|-------|--------|
| `learning_rate` | `[1e-6, 1e-4]` log | GRPO is sensitive to LR; log-scale is the right prior |
| `beta` | `[0.0, 0.1]` | KL coefficient. 0 = pure RL (drift risk), 0.1 = anchored to SFT |
| `num_generations` | `{4, 8}` | Group size. Larger → tighter advantage estimates but slower |
| `temperature` | `[0.7, 1.0]` | Exploration knob |
| `top_p` | `[0.85, 0.98]` | Nucleus sampling |
| `lora_r` | `{8, 16, 32}` | Adapter capacity |
| `lora_alpha_mul` | `{1, 2, 4}` | `lora_alpha = lora_r × multiplier` |
| `max_turns` | `{4, 6, 8}` | Episode length cap |

### Objective

```
objective = 0.7 × achieved_rate + 0.3 × mean_progress
```

Calculated on the held-out validation tasks at the end of each trial. Weighting `achieved_rate` higher matches the project goal: actual task completion matters more than partial progress.

### Sampler

`optuna.samplers.TPESampler(seed=42)` – the Tree-structured Parzen Estimator. In our experience TPE outperforms random search on 8-dimensional spaces even with only ~6 trials.

Persisted to `outputs/.../optuna.db` (SQLite), so trials can be resumed if a Colab session disconnects.

### Frozen validation set

`pick_validation_task_ids(k_per_tier=2, seed=42)` picks 2 tasks per tier (~10 tasks total) at the start of training. The same set is used by every Optuna trial and the final post-training eval, so there is no benchmark leakage between trials.
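A sketch of how these pieces fit together (the real objective lives in `optuna_search()`; `run_trial_and_eval` below is a stand-in for one GRPO trial plus the frozen-set validation rollouts):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    lr   = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    beta = trial.suggest_float("beta", 0.0, 0.1)
    G    = trial.suggest_categorical("num_generations", [4, 8])
    temp = trial.suggest_float("temperature", 0.7, 1.0)
    # ... top_p, lora_r, lora_alpha_mul, max_turns are sampled the same way ...

    achieved_rate, mean_progress = run_trial_and_eval(lr, beta, G, temp)  # stand-in
    return 0.7 * achieved_rate + 0.3 * mean_progress

study = optuna.create_study(
    direction      = "maximize",
    sampler        = optuna.samplers.TPESampler(seed=42),
    storage        = "sqlite:///outputs/aws-rl-grpo/optuna.db",   # resumable after a disconnect
    study_name     = "aws-rl-grpo",
    load_if_exists = True,
)
study.optimize(objective, n_trials=6)
```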
### SFT-stage Optuna results (6 trials)

The SFT-stage Optuna run explored a 5-parameter space (`lora_r`, `lora_alpha_mul`, `lora_dropout`, `learning_rate`, `warmup_ratio`). 6 trials, validation loss as the objective (lower = better):

| Trial | r | α | dropout | lr | warmup | val_loss |
|------:|---:|---:|:-------:|:---------:|:------:|:--------:|
| **0** | 16 | 16 | 0.006 | 4.03e-4 | 0.10 | **0.0523** ✓ |
| 1 | 16 | 16 | 0.030 | 2.33e-4 | 0.03 | 0.0790 |
| 2 | 8 | 32 | 0.020 | 2.29e-4 | 0.03 | 0.0587 |
| 3 | 8 | 16 | 0.030 | 1.17e-4 | 0.03 | 0.1199 |
| 4 | 16 | 16 | 0.031 | 2.31e-4 | 0.03 | 0.0793 |
| 5 | 8 | 32 | 0.009 | 1.37e-4 | 0.10 | 0.0828 |

> *(trial comparison plot; see [`docs/figures/`](../docs/figures/))*
```jsonc
{
  "best_value": 0.052,
  "best_params": {
    "lora_r": 16,
    "lora_alpha_mul": 1,        // → lora_alpha = 16
    "lora_dropout": 0.005808,
    "learning_rate": 4.03e-4,
    "warmup_ratio": 0.1
  }
}
```

Visualized:

> *(Optuna visualization plots; see [`docs/figures/`](../docs/figures/))*
### GRPO-stage Optuna results (4 trials)

The GRPO-stage Optuna run explored a 3-parameter space (`learning_rate`, `beta`, `temperature`). 4 trials, single-step env reward as the objective (higher = better):

| Trial | lr | β | T | env_reward | success |
|------:|:---------:|:--------:|:-----:|:----------:|:-------:|
| 0 | varied | varied | varied | 0.473 | 25.0% |
| 1 | varied | varied | varied | 0.469 | 25.0% |
| 2 | varied | varied | varied | 0.469 | 25.0% |
| **3** | 1.60e-5 | 0.0021 | 0.99 | **0.552** | **33.3%** ✓ |

> *(Optuna visualization plots; see [`docs/figures/`](../docs/figures/))*

The winning GRPO config uses a **much smaller learning rate** (1.6e-5, vs 4.0e-4 for SFT) and a **tiny KL coefficient** (β=0.0021); both are expected for an RL phase that is only correcting the SFT-bootstrapped policy, not retraining it.
---

## 4. Multi-turn rollouts + parallel envs

This section is a quick overview; the full mechanics, including the three pool layers and asyncio orchestration, are in [scripts/README.md](../scripts/README.md).

### MultiTurnEnvPool

[train_grpo.py:MultiTurnEnvPool](../train_grpo.py) owns a background thread running an asyncio loop, opens N WebSocket sessions on startup, and exposes a synchronous `run_group(task, ...)` API (sketched after the list below).

- One pool instance lives for the duration of training
- `run_group()` calls `asyncio.gather()` over `rollout_one_episode(env, task, ...)` for each of the N envs; every rollout runs the same task in its own MiniStack (see the server-side pool in [server/README.md §6](../server/README.md#6-server-side-ministack-pool-parallel-rollouts))
- Returns a list of `{prompt_ids, completion_ids, logprobs, task_reward, task_achieved, final_progress, num_steps, transcript, task_id, difficulty}`
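A sketch of the sync-over-async bridge (names simplified; the real class also handles reconnects, per-rollout logging, and the token bookkeeping listed above):

```python
import asyncio, threading

class MultiTurnEnvPool:
    """Background asyncio loop + a synchronous run_group() facade for the trainer."""

    def __init__(self, env_urls):
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()
        # Open one WebSocket-backed env client per URL on the background loop.
        self.envs = asyncio.run_coroutine_threadsafe(
            self._connect_all(env_urls), self.loop
        ).result()

    async def _connect_all(self, urls):
        return await asyncio.gather(*(connect_env(u) for u in urls))  # connect_env: assumed helper

    def run_group(self, task, **gen_kwargs):
        """Blocking call used by the trainer: G parallel rollouts of the same task."""
        coro = asyncio.gather(
            *(rollout_one_episode(env, task, **gen_kwargs) for env in self.envs)
        )
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
```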
### Why parallelism matters here

GRPO's group-relative advantage requires `G` rollouts before any gradient step. Run serially, MAX_TURNS=6 turns × ~50 ms per env step ≈ 300 ms of env time per rollout, so G=8 rollouts cost ~2.4 s of env time per training step. With parallel rollouts that drops to ~300 ms (the slowest of the 8), and the model forward pass dominates, exactly as desired.
### Generation lock

Because the policy lives on a single GPU, `model.generate()` calls across the `asyncio.gather` group are serialised behind a `_GENERATE_LOCK` (`threading.Lock`). The env step calls, which are I/O-bound, happily overlap. This is the single non-obvious detail that makes the parallel rollout approach actually work.
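A sketch of the pattern (the lock mirrors the description above; the generation call itself is simplified):

```python
import asyncio, threading

_GENERATE_LOCK = threading.Lock()   # single GPU: only one generate() at a time

async def generate_locked(model, tokenizer, messages, **gen_kwargs):
    """Run model.generate() in a worker thread, serialised across concurrent rollouts."""
    def _generate():
        with _GENERATE_LOCK:
            input_ids = tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            return model.generate(input_ids, **gen_kwargs)

    # asyncio.to_thread keeps the event loop free, so the I/O-bound env steps of
    # the other rollouts keep overlapping while this one waits for the GPU.
    return await asyncio.to_thread(_generate)
```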
---

## 5. Training modes (CLI)

```bash
# Optuna search only: produces best_cfg.json
python train_grpo.py --mode optuna --n-trials 6 --trial-max-steps 30

# Train once with explicit hyperparams (no search)
python train_grpo.py --mode train \
    --env-url http://localhost:8000 \
    --num-generations 8 --max-turns 6 --max-steps 200

# Search then train: Optuna trials, then a full-length run with the best config
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```

All modes write to `outputs/aws-rl-grpo-<TIMESTAMP>/`.
---

## 6. How to run

### Prerequisites

- A running env server: `make run` from the repo root (starts MiniStack + FastAPI on `http://localhost:8000`)
- For pool size > 1: `AWS_RL_ENV_POOL_SIZE=8 make run`
- A GPU with ≥ 24 GB VRAM (A10, T4×2, A100, L4 all confirmed working)
- A Hugging Face token (`HF_TOKEN`) if you want to push the trained adapter

### Local

```bash
# 1. Start the env server in one terminal
AWS_RL_ENV_POOL_SIZE=8 make run

# 2. Run training in another terminal
python train_grpo.py --mode full --n-trials 6 --max-steps 200
```

### Colab

The notebook [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) wraps the full pipeline (env URL config, HF login, val set, Optuna, training, plotting, optional push-to-Hub):

| Notebook | Open in Colab |
|----------|---------------|
| GRPO end-to-end driver | <!-- TODO: paste Colab URL here --> |
| SFT-only ([train/train_sft_lora.ipynb](train_sft_lora.ipynb)) | <!-- TODO: paste Colab URL here --> |
| GRPO-only ([train/train_grpo_lora.ipynb](train_grpo_lora.ipynb)) | <!-- TODO: paste Colab URL here --> |

Note: the Colab notebooks expect the env server to be reachable. Two options:

1. **HF Space tunnel**: deploy the env to your own HF Space and point `ENV_URL` at it (see the main README's deployment section)
2. **ngrok**: run the env locally and expose it via ngrok / cloudflared so Colab can reach it
---

## 7. Logging and artifacts

### Reference training runs (numbers baked into this documentation)

The headline numbers and plots in this repo come from two reference training runs we performed end-to-end:

- **SFT reference run** – 188 SFT steps with the best Optuna trial. Achieved val loss 0.052 (best of 6 trials). Post-SFT eval delta: format `33% → 100%`, exact `39% → 89%`, latency `2.03s → 1.40s`. The training curves, Optuna plots, and eval comparisons from this run live in [`docs/figures/`](../docs/figures/) (`sft_loss_curve.png`, `optuna_*.png`, `base_vs_sft_success.png`, …).
- **GRPO reference run** – 35 GRPO steps with the best Optuna trial. Achieved single-step env reward 0.55 (best of 4 trials). Multi-step eval (n≈108): success `86.8% → 86.2%`, beginner `+3.8 pp`, intermediate `+6.0 pp`, expert flat at 22%. The training signals, by-tier breakdowns, and qualitative rollouts from this run also live in [`docs/figures/`](../docs/figures/) (`grpo_final_per_step.png`, `grpo_reward_curve.png`, `sft_vs_grpo_*.png`, `qualitative_rollouts.png`, …).

The raw training-output directories (TRL checkpoints, optimizer states, exported adapters totalling ~330 MB) are not committed. The metrics, hyperparameters, and visualizations they produced are preserved inline in this README and as PNGs under [`docs/figures/`](../docs/figures/).

### GRPO output layout

Each GRPO run writes to a fresh `outputs/aws-rl-grpo-<TIMESTAMP>/`:

| File | Written by | Contents |
|------|------------|----------|
| `reward_log.csv` | `EpisodeLogger` | One row per rollout: `step, rollout_idx, task_id, difficulty, task_reward, task_achieved, final_progress, num_steps, tier, tier_success_rate, timestamp` |
| `transcripts.jsonl` | `EpisodeLogger` | Same rows plus the full multi-turn transcript per rollout (commands, outputs, rewards) |
| `optuna.db` | Optuna | SQLite study (resumable) |
| `best_cfg.json` | `optuna_search()` | Final winning hyperparameters |
| `trial_NNN/` | `_run_one_trial()` | Per-trial trainer checkpoints + `trial_metrics.json` |
| `val_task_ids.json` | Notebook driver | Frozen held-out validation set (for reproducibility) |
| `post_train_val.json` | Notebook §10 | Final post-training validation metrics |
| `reward_plot.png` | `plot_rewards()` | Group mean reward + per-tier scatter |
| `<adapter_dir>/` | TRL `GRPOTrainer.save` | Trained LoRA adapter (`adapter_config.json`, `adapter_model.safetensors`, etc.) |

Push to HF Hub:

```python
from huggingface_hub import create_repo, upload_folder

create_repo("your-org/aws-rl-grpo-qwen25coder3b", exist_ok=True, private=False)
upload_folder(folder_path=str(OUTPUT_DIR), repo_id="your-org/aws-rl-grpo-qwen25coder3b")
```
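Because `reward_log.csv` has a stable schema (see the table above), post-hoc analysis is straightforward. A small pandas sketch for a per-tier summary:

```python
import pandas as pd

df = pd.read_csv("outputs/aws-rl-grpo-<TIMESTAMP>/reward_log.csv")

# One row per rollout: aggregate achieved-rate and mean reward per curriculum tier.
summary = (
    df.groupby("tier")
      .agg(rollouts      = ("task_id", "count"),
           achieved_rate = ("task_achieved", "mean"),
           mean_reward   = ("task_reward", "mean"))
      .sort_index()
)
print(summary)
```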
---

## 8. Reproducing results

### Actual SFT result

```
SFT (188 steps, best Optuna trial, ~30 min on A10):
  best val_loss    : 0.052
  best lora_r      : 16
  best lora_alpha  : 16 (alpha_mul=1)
  best lora_dropout: 0.0058
  best lr          : 4.03e-4
  best warmup      : 0.10

Held-out eval (post-SFT, same prompts as base):
  format_pct    : 33.3%  → 100.0%  (+66.7 pp)
  exact_pct     : 38.9%  →  88.9%  (+50.0 pp)
  service_pct   : 77.8%  →  88.9%  (+11.1 pp)
  operation_pct : 61.1%  →  88.9%  (+27.8 pp)
  avg_latency   : 2.03s  →  1.40s  (−0.63s)
  avg_len       : 85.8   →  74.7   (tighter outputs)
```

Every target from [data/sft/MODEL_EVALUATION.md §11](../data/sft/MODEL_EVALUATION.md) is met or exceeded.
### Actual GRPO result

```
GRPO (35 steps from best Optuna trial, ~1.5 hr on A10):
  best lr          : 1.60e-5
  best beta        : 0.0021
  best temperature : 0.99
  num_generations  : 8

Per-step training signals (16 reward-logged steps):
  env_reward (mean): 0.31   max: 0.94   min: 0.13
  KL to SFT ref    : 0.15 mean  (small β = 0.0021 keeps drift in check)
  format_reward    : 1.00 every step (perfect format compliance)
  completion length: 87 tokens mean (compact AWS CLI commands)

Multi-step end-to-end eval (n≈108 episodes):
                        Base+SFT   Base+SFT+GRPO     Δ
  overall_success         86.8%        86.2%       −0.5 pp
  overall_reward          0.883        0.877       −0.006
  beginner_success        96.2%       100.0%       +3.8 pp
  intermediate_success    81.0%        87.0%       +6.0 pp
  warmup_success          96.0%        90.2%       −5.8 pp
  expert_success          22.2%        22.2%       flat (bottleneck)
  drift_repair            22.2%        22.2%       flat
  destructive_fail        15.1%        14.7%       −0.4 pp
  steps_to_solve          1.45         1.55        +0.10
```
**Honest reading.** A 35-step GRPO run from a strong SFT starting point (already at 86.8% success) is short by RL standards. It preserves the SFT gains and modestly improves the middle tiers, but does not crack the expert-tier ceiling: the 22% expert / 22% drift-repair numbers stay flat because 35 GRPO steps × G=8 = 280 rollouts contain too few expert episodes, with the curriculum focusing primarily on warmup/beginner/intermediate.

Variance comes mostly from Optuna trial composition. The published SFT adapter (`Sizzing/aws-rl-sft-qwen25coder3b-adapter`) is the SFT result; the GRPO adapter is regenerated per run from the trainer's output directory.
---

## 9. Files in this directory

| File | Purpose |
|------|---------|
| [train_sft_lora.ipynb](train_sft_lora.ipynb) | Stage 1 – supervised LoRA fine-tuning |
| [train_grpo_lora.ipynb](train_grpo_lora.ipynb) | Stage 2 – GRPO RL training (clean) |
| [train_grpo_lora_with_outputs.ipynb](train_grpo_lora_with_outputs.ipynb) | Same notebook with cell outputs preserved |

Heavy logic referenced from these notebooks:

- [train_grpo.py](../train_grpo.py) – the `MultiTurnEnvPool`, GRPO config, Optuna search, `plot_rewards`, and the `run_training` entry point
- [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) – Colab driver that imports from `train_grpo.py`

---

## See also

- [Main README](../README.md)
- [data/README.md](../data/README.md) – dataset generation, base-model selection
- [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) – full 11-model benchmark
- [scripts/README.md](../scripts/README.md) – parallel-rollout architecture deep-dive
- [server/README.md](../server/README.md) – environment internals (curriculum, reward shaping, anti-hacking)
- [compare/README.md](../compare/README.md) – base vs SFT comparison harness