# `train/` — SFT + GRPO Training Pipeline

[← back to main README](../README.md)

This directory holds the **training notebooks** for the AWS RL agent. Heavy logic for the GRPO loop lives at the repo root in [train_grpo.py](../train_grpo.py); the notebooks here are thin drivers that you can run end-to-end on Colab.

The training pipeline has two stages:

```
                   ┌────────── data/sft/ ──────────┐
                   │  1,500 train · 150 val rows   │
                   │  5 trajectory types           │
                   └───────────────┬───────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────┐
│ STAGE 1 — Supervised Fine-Tuning (train_sft_lora.ipynb)             │
│ Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │ Sizzing/aws-rl-sft-qwen25coder3b-adapter
┌──────────────────────────────────▼──────────────────────────────────┐
│ STAGE 2 — GRPO RL (train_grpo_lora.ipynb)                           │
│ G=8 parallel rollouts · multi-turn · reward = env return            │
│ Optuna over (lr, β, G, T, top_p, lora_r, max_turns)                 │
└─────────────────────────────────────────────────────────────────────┘
```

The two stages are intentionally separable: the SFT adapter is published to the Hugging Face Hub so anyone can pull it and start GRPO without re-running SFT.

---

## Table of contents

1. [SFT stage — supervised LoRA](#1-sft-stage--supervised-lora)
2. [GRPO stage — reinforcement learning](#2-grpo-stage--reinforcement-learning)
3. [Optuna hyperparameter search](#3-optuna-hyperparameter-search)
4. [Multi-turn rollouts + parallel envs](#4-multi-turn-rollouts--parallel-envs)
5. [Training modes (CLI)](#5-training-modes-cli)
6. [How to run](#6-how-to-run)
7. [Logging and artifacts](#7-logging-and-artifacts)
8. [Reproducing results](#8-reproducing-results)
9. [Files in this directory](#9-files-in-this-directory)

---

## 1. SFT stage — supervised LoRA

[train/train_sft_lora.ipynb](train_sft_lora.ipynb) — primary SFT notebook.

### Why SFT before GRPO?

Two reasons — both showed up in our base-model evaluation ([data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md)):

1. **Format-locking**. Even strong coder models occasionally wrap commands in markdown fences or quotes. SFT removes that surface noise in one epoch.
2. **Bootstrap the GRPO reward signal**. GRPO with a base model that's only 41% exact-match starts from a low-density reward landscape. Fine-tuning on canonical commands first raises the baseline so GRPO can spend its compute on optimization, not search.

### Base model

| Choice | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` |
|--------|--|
| Why | Highest exact-match (41%) of 11 candidates we benchmarked, fastest viable inference (3.1 s/call), tightest output (86 chars). Full reasoning in [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md). |
| Loader | Unsloth's 4-bit quantized variant — fits comfortably on a single 24 GB GPU, 2× faster training kernels |

### LoRA config

```python
from peft import LoraConfig

# `trial` is the Optuna trial; the rank is drawn first so it can be reused below
r = trial.suggest_categorical("lora_r", [8, 16, 32])

lora_config = LoraConfig(
    r              = r,
    lora_alpha     = r * trial.suggest_categorical("lora_alpha_mul", [1, 2, 4]),
    lora_dropout   = trial.suggest_float("lora_dropout", 0.005, 0.031),
    bias           = "none",
    task_type      = "CAUSAL_LM",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

- Only attention projections are adapted — MLP / output heads stay frozen, keeping the trainable parameter count tiny (~10–40 M depending on rank).
- `lora_alpha = r × multiplier` keeps the effective scaling stable across rank variations during the Optuna search.
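For orientation, attaching the searched adapter config to the 4-bit base model looks roughly like the sketch below. It is written against Unsloth's `FastLanguageModel` helpers with the best-trial values reported in the Optuna section further down; the notebook's actual cells may differ.

```python
# Sketch only (not the notebook's exact cells): attach the best-trial LoRA config
# to the 4-bit base model via Unsloth. Values are the best SFT Optuna trial below.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # best Optuna trial
    lora_alpha=16,         # r × alpha_mul with alpha_mul = 1
    lora_dropout=0.0058,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```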
### Optimization | Hyperparameter | Value / Range | |--------------------------|------------------------------------------| | Optimizer | AdamW (Unsloth's fused implementation) | | Learning rate | `[1e-4, 5e-4]` log-scale (Optuna) | | Schedule | Cosine annealing | | Warmup ratio | `{0.03, 0.1}` (Optuna; best 0.1) | | Batch size | 2 per GPU | | Epochs | 2 | | Max sequence length | 512 | | Packing | **Disabled** (we keep chat-template separators intact) | | Loss masking | Assistant-only (user message tokens are masked from the loss) | ### Dataset [data/sft/aws_rl_sft.train.jsonl](../data/sft/aws_rl_sft.train.jsonl) — 1,500 examples. Format: ```json { "messages": [ {"role": "system", "content": "You are an AWS cloud engineer..."}, {"role": "user", "content": "TASK: ...\n\nCURRENT OBSERVATION:\nProgress: 0.00 ..."}, {"role": "assistant", "content": "aws s3 mb s3://my-app-data"} ], "difficulty": "intermediate", "source": "success_first_step", "task_id": 42 } ``` The dataset is a careful mix of **5 trajectory types** (success, multi-step continuation, failure recovery, verification, hint usage). Full generation methodology in [data/README.md](../data/README.md). ### Training graphs A reference SFT run achieved validation loss `0.052` after 188 training steps with the best Optuna trial. The plots below were exported from that run into [`docs/figures/`](../docs/figures/). > ![SFT loss curve](../docs/figures/sft_loss_curve.png) --- ## 2. GRPO stage — reinforcement learning The core trainer lives at [train_grpo.py](../train_grpo.py) (1,283 LOC). Notebooks call into it: - [train/train_grpo_lora.ipynb](train_grpo_lora.ipynb) — clean - [train/train_grpo_lora_with_outputs.ipynb](train_grpo_lora_with_outputs.ipynb) — with execution outputs preserved - [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) — Colab driver wrapping the entire pipeline ### What GRPO is, briefly **GRPO** (Group Relative Policy Optimization) is the algorithm introduced by DeepSeekMath and adopted by TRL ≥ 0.18. Unlike PPO, GRPO does **not** train a critic. Instead: 1. For one prompt (here, one curriculum-picked task), generate `G` completions 2. Score each with the reward function(s) 3. Compute group-relative advantage: `(reward_i − group_mean) / group_std` 4. Backpropagate the policy gradient with that advantage 5. Apply a KL penalty to the SFT reference model (coefficient `β`) to prevent drift This is dramatically simpler than PPO (no value head, no GAE), more sample-efficient for verifier-style rewards, and a natural fit for our setup — the AWS RL env *is* the reward function. 
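To make steps 2 and 3 concrete, here is a toy advantage computation for one group of G = 8 rollouts. The reward values are invented for the example (in this repo they come from the AWS RL env), and TRL's implementation additionally handles the zero-variance case.

```python
# Toy illustration of GRPO's group-relative advantage for one prompt, G = 8 rollouts.
import numpy as np

rewards = np.array([0.0, 0.2, 0.2, 0.9, 1.0, 0.0, 0.4, 0.2])  # one episode return per rollout
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages.round(2))
# Rollouts above the group mean get a positive advantage (reinforced),
# rollouts below get a negative one; no learned value function is needed.
```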
### TRL GRPOTrainer config

From [train_grpo.py:_build_grpo_config()](../train_grpo.py):

| Parameter                          | Default value | Notes                                                       |
|------------------------------------|---------------|-------------------------------------------------------------|
| `learning_rate`                    | `5e-6`        | Optuna range `[1e-6, 1e-4]` log-scale                       |
| `beta` (KL coefficient)            | `0.04`        | Optuna range `[0.0, 0.1]`                                   |
| `num_generations` (G)              | `8`           | Optuna `{4, 8}`                                             |
| `temperature`                      | `0.9`         | Optuna `[0.7, 1.0]`                                         |
| `top_p`                            | `0.95`        | Optuna `[0.85, 0.98]`                                       |
| `per_device_train_batch_size`      | `1`           |                                                             |
| `gradient_accumulation_steps`      | `8`           | Effective batch 8                                           |
| `gradient_checkpointing`           | `True`        | `use_reentrant=False` — VRAM optimization                   |
| `max_completion_length`            | `256`         | Per-turn; one AWS CLI command fits comfortably              |
| `max_prompt_length`                | `2048`        | Holds task + history + observation                          |
| `loss_type`                        | `"dapo"`      | Decoupled Clip and Dynamic Sampling Policy Optimization (TRL default for GRPO) |
| `mask_truncated_completions`       | `True`        | Drop samples that hit `max_completion_length`               |
| `warmup_ratio`                     | `0.05`        |                                                             |
| `lr_scheduler_type`                | `"cosine"`    |                                                             |
| `max_grad_norm`                    | `1.0`         |                                                             |
| `use_vllm`                         | `False`       | Plain `model.generate()` — vLLM integration is future work  |

### Reward functions (TRL convention)

Three reward functions are registered, summed by GRPO:

```python
reward_funcs=[reward_task, reward_achieved, reward_progress]
```

- `reward_task(completions, **kwargs)` → episode return (sum of per-step env rewards). The dominant signal.
- `reward_achieved(completions, **kwargs)` → 1.0 if `task.task_achieved` at end of episode, else 0.0. Sparse but unambiguous.
- `reward_progress(completions, **kwargs)` → final `partial_progress` ∈ [0, 1]. Densifies the credit assignment for partial completions.

The env's reward shaping (see [server/README.md §8](../server/README.md#8-reward-shaping--taskgrader)) does most of the work — these three TRL functions are a thin façade.

### Episode = one rollout

- Each rollout runs **up to `MAX_TURNS=6` sequential AWS CLI commands**
- Each command's stdout/stderr/progress is fed back as the user message for the next turn (see `build_user_prompt()` and `format_observation()` in [train_grpo.py](../train_grpo.py))
- The episode terminates on `task_achieved`, max turns, or `max_total_tokens` (per-episode token budget)
- Token sequences (prompt_ids, completion_ids, logprobs) are accumulated **across turns**, so GRPO assigns the episode-level reward to the full multi-turn token sequence — not just the last turn

### Curriculum integration

```
trainer step:
  1. task = curriculum.next_task()          # one task per GRPO step
  2. results = pool.run_group(task, ...)    # G rollouts on that task
  3. mean_r = sum(group_rewards) / G
  4. curriculum.record_result(task, achieved=any_achieved, reward=mean_r)
  5. trainer applies group-relative advantages   # standard GRPO
```

The curriculum drives task selection — every rollout in a group runs the *same* task, forced through `env.reset(task=task)`. This matches GRPO's group-relative semantics (you need the same prompt across the group to compute the baseline correctly). Full curriculum mechanics (priority scoring, mastery, spaced repetition, tier promotion) live in [server/README.md §7](../server/README.md#7-curriculum-manager).

### Training graphs

A reference GRPO run trained for 35 steps with the best Optuna config (`lr=1.6e-5`, `β=0.0021`, `T=0.99`).
Per-step training signals (extracted from the run's `trainer_state.json`) are mirrored into [`docs/figures/`](../docs/figures/): > ![GRPO final per-step training signals](../docs/figures/grpo_final_per_step.png) > ![GRPO env reward over training](../docs/figures/grpo_reward_curve.png) > ![Success by tier (multi-step)](../docs/figures/grpo_per_tier_curve.png) > ![Reward by tier (multi-step)](../docs/figures/grpo_reward_by_tier.png) Notable signals from the run: | | | |---|---| | `env_reward/mean` | 0.31 (mean over 16 reward-logged steps), max 0.94, min 0.13 | | `kl` | 0.15 (mean) — KL stays small despite tiny β | | `completion_length` | 87 tokens (mean) — agent emits compact AWS CLI commands | | Format compliance | **100%** (`format_reward/mean = 1.0` every step) | Multi-step end-to-end re-eval after GRPO: > ![SFT vs GRPO multi-step metrics grid](../docs/figures/sft_vs_grpo_metrics_grid.png) These are produced by [`plot_rewards()`](../train_grpo.py) reading `reward_log.csv` written by `EpisodeLogger`, plus the post-hoc plots generated during the GRPO notebook run. --- ## 3. Optuna hyperparameter search [train_grpo.py:optuna_search()](../train_grpo.py) ### Search space | Parameter | Range | Reason | |-------------------|------------------------------------|------------------------------------------------------------------------| | `learning_rate` | `[1e-6, 1e-4]` log | GRPO is sensitive to LR; log-scale is the right prior | | `beta` | `[0.0, 0.1]` | KL coefficient. 0 = pure RL (drift risk), 0.1 = anchored to SFT | | `num_generations` | `{4, 8}` | Group size. Larger → tighter advantage estimates but slower | | `temperature` | `[0.7, 1.0]` | Exploration knob | | `top_p` | `[0.85, 0.98]` | Nucleus sampling | | `lora_r` | `{8, 16, 32}` | Adapter capacity | | `lora_alpha_mul` | `{1, 2, 4}` | `lora_alpha = lora_r × multiplier` | | `max_turns` | `{4, 6, 8}` | Episode length cap | ### Objective ``` objective = 0.7 × achieved_rate + 0.3 × mean_progress ``` Calculated on the held-out validation tasks at the end of each trial. Weighting `achieved_rate` higher matches the project goal — actual task completion matters more than partial progress. ### Sampler `optuna.samplers.TPESampler(seed=42)` — Tree-structured Parzen Estimator. TPE outperforms random search on 8-dim spaces with ~6 trials in our experience. Persisted to `outputs/.../optuna.db` (SQLite), so trials can be resumed if a Colab session disconnects. ### Frozen validation set `pick_validation_task_ids(k_per_tier=2, seed=42)` picks 2 tasks per tier (≈10 tasks total) at the start of training. The same set is used by every Optuna trial and the final post-training eval — no benchmark leakage between trials. ### SFT-stage Optuna results (6 trials) The SFT-stage Optuna run explored a 5-parameter space (`lora_r`, `lora_alpha_mul`, `lora_dropout`, `learning_rate`, `warmup_ratio`). 
6 trials, validation loss as objective (lower = better):

| Trial | r | α | dropout | lr | warmup | val_loss |
|------:|---:|---:|:-------:|:---------:|:------:|:--------:|
| **0** | 16 | 16 | 0.006 | 4.03e-4 | 0.10 | **0.0523** ★ |
| 1 | 16 | 16 | 0.030 | 2.33e-4 | 0.03 | 0.0790 |
| 2 | 8 | 32 | 0.020 | 2.29e-4 | 0.03 | 0.0587 |
| 3 | 8 | 16 | 0.030 | 1.17e-4 | 0.03 | 0.1199 |
| 4 | 16 | 16 | 0.031 | 2.31e-4 | 0.03 | 0.0793 |
| 5 | 8 | 32 | 0.009 | 1.37e-4 | 0.10 | 0.0828 |

> ![SFT Optuna trial comparison table](../docs/figures/sft_optuna_trials_table.png)

```json
{
  "best_value": 0.052,
  "best_params": {
    "lora_r": 16,
    "lora_alpha_mul": 1,          // → lora_alpha = 16
    "lora_dropout": 0.005808,
    "learning_rate": 4.03e-4,
    "warmup_ratio": 0.1
  }
}
```

Visualized:

> ![Optuna parameter importances](../docs/figures/optuna_param_importance.png)
> ![Optuna optimization history](../docs/figures/optuna_history.png)
> ![Optuna parallel coordinate plot](../docs/figures/optuna_parallel.png)
> ![Optuna slice plot](../docs/figures/optuna_slice.png)
> ![Optuna trial training curves](../docs/figures/optuna_trial_curves.png)

### GRPO-stage Optuna results (4 trials)

The GRPO-stage Optuna run explored a 3-parameter space (`learning_rate`, `beta`, `temperature`).

4 trials, single-step env reward as objective (higher = better):

| Trial | lr | β | T | env_reward | success |
|------:|:---------:|:--------:|:-----:|:----------:|:-------:|
| 0 | varied | varied | varied | 0.473 | 25.0% |
| 1 | varied | varied | varied | 0.469 | 25.0% |
| 2 | varied | varied | varied | 0.469 | 25.0% |
| **3** | 1.60e-5 | 0.0021 | 0.99 | **0.552** | **33.3%** ★ |

> ![GRPO Optuna trial comparison](../docs/figures/grpo_optuna_trials_comparison.png)
> ![GRPO Optuna importances](../docs/figures/grpo_optuna_importances.png)
> ![GRPO Optuna parallel coordinate](../docs/figures/grpo_optuna_parallel.png)
> ![GRPO Optuna hparams](../docs/figures/grpo_optuna_hparams.png)
> ![GRPO Optuna trial curves](../docs/figures/grpo_optuna_trial_curves.png)

The winning GRPO config uses a **much smaller learning rate** (1.6e-5, vs 4.0e-4 for SFT) and a **tiny KL coefficient** (β=0.0021) — both expected for an RL phase that is only correcting the SFT-bootstrapped policy, not retraining it.

---

## 4. Multi-turn rollouts + parallel envs

This section is a quick overview — the full mechanics, including the three pool layers and asyncio orchestration, are in [scripts/README.md](../scripts/README.md).

### MultiTurnEnvPool

[train_grpo.py:MultiTurnEnvPool](../train_grpo.py) — owns a background thread running an asyncio loop, opens N WebSocket sessions on startup, exposes a synchronous `run_group(task, ...)` API.

- One pool instance lives for the duration of training
- `run_group()` calls `asyncio.gather()` over `rollout_one_episode(env, task, ...)` for each of the N envs — every rollout runs the same task in its own MiniStack (see server-side pool in [server/README.md §6](../server/README.md#6-server-side-ministack-pool-parallel-rollouts))
- Returns a list of `{prompt_ids, completion_ids, logprobs, task_reward, task_achieved, final_progress, num_steps, transcript, task_id, difficulty}`

### Why parallelism matters here

GRPO's group-relative advantage requires `G` rollouts before any gradient. Run serially, at MAX_TURNS=6 turns × ~50 ms per env step ≈ 300 ms per rollout, a group of G=8 costs ~2.4 s of env time per training step. With parallel rollouts that drops to ~300 ms (the slowest of the 8). The model forward pass dominates, exactly as desired.
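Stripped of the prompt-building and generation plumbing, the pattern looks roughly like the sketch below (function names mirror `run_group()` / `rollout_one_episode()`, but the bodies are placeholders); the generation lock that makes this safe on a single GPU is covered next.

```python
# Simplified sketch of the parallel-rollout pattern; the real MultiTurnEnvPool in
# train_grpo.py also manages WebSocket sessions, prompt building, generation and logging.
import asyncio
import threading

async def rollout_one_episode(env, task):
    # placeholder for the real multi-turn loop: generate -> env.step -> observe
    await asyncio.sleep(0.05)                     # stands in for ~50 ms env steps
    return {"task_reward": 0.0, "task_achieved": False}

async def _run_group_async(envs, task):
    # every env in the group plays the *same* task, as GRPO's group baseline requires
    return await asyncio.gather(*(rollout_one_episode(env, task) for env in envs))

def run_group(loop, envs, task):
    # synchronous façade over the pool's background event loop
    future = asyncio.run_coroutine_threadsafe(_run_group_async(envs, task), loop)
    return future.result()

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
print(run_group(loop, envs=[object()] * 8, task="dummy-task"))
```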
### Generation lock Because the policy lives on a single GPU, `model.generate()` calls across the asyncio.gather group are serialised behind a `_GENERATE_LOCK` (`threading.Lock`). The env step calls — the slow part — happily overlap. This is the single non-obvious detail that makes the parallel rollout approach actually work. --- ## 5. Training modes (CLI) ```bash # Optuna search only — produces best_cfg.json python train_grpo.py --mode optuna --n-trials 6 --trial-max-steps 30 # Train once with explicit hyperparams (no search) python train_grpo.py --mode train \ --env-url http://localhost:8000 \ --num-generations 8 --max-turns 6 --max-steps 200 # Search → train: Optuna trials, then a full-length run with the best config python train_grpo.py --mode full --n-trials 6 --max-steps 200 ``` All modes write to `outputs/aws-rl-grpo-/`. --- ## 6. How to run ### Prerequisites - A running env server: `make run` from the repo root (starts MiniStack + FastAPI on `http://localhost:8000`) - For pool size > 1: `AWS_RL_ENV_POOL_SIZE=8 make run` - A GPU with ≥ 24 GB VRAM (A10, T4×2, A100, L4 all confirmed working) - HuggingFace token (`HF_TOKEN`) if you want to push the trained adapter ### Local ```bash # 1. Start the env server in one terminal AWS_RL_ENV_POOL_SIZE=8 make run # 2. Run training in another terminal python train_grpo.py --mode full --n-trials 6 --max-steps 200 ``` ### Colab The notebook [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) wraps the full pipeline (env URL config, HF login, val set, Optuna, training, plotting, optional push-to-Hub): | Notebook | Open in Colab | |----------|---------------| | GRPO end-to-end driver | | | SFT-only ([train/train_sft_lora.ipynb](train_sft_lora.ipynb)) | | | GRPO-only ([train/train_grpo_lora.ipynb](train_grpo_lora.ipynb)) | | Note: the Colab notebooks expect the env server to be reachable. Two options: 1. **HF Space tunnel**: deploy the env to your own HF Space and point `ENV_URL` at it (see main README's deployment section) 2. **ngrok**: run the env locally and expose it via ngrok / cloudflared so Colab can reach it --- ## 7. Logging and artifacts ### Reference training runs (numbers baked into this documentation) The headline numbers and plots in this repo come from two reference training runs we performed end-to-end: - **SFT reference run** — 188 SFT steps with the best Optuna trial. Achieved val loss 0.052 (best of 6 trials). Post-SFT eval delta: format `33% → 100%`, exact `39% → 89%`, latency `2.03s → 1.40s`. The training curves, Optuna plots, and eval comparisons from this run live in [`docs/figures/`](../docs/figures/) (`sft_loss_curve.png`, `optuna_*.png`, `base_vs_sft_success.png`, …). - **GRPO reference run** — 35 GRPO steps with the best Optuna trial. Achieved single-step env reward 0.55 (best of 4 trials). Multi-step eval (n≈108): success `86.8% → 86.2%`, beginner `+3.8 pp`, intermediate `+6.0 pp`, expert flat at 22%. The training signals, by-tier breakdowns, and qualitative rollouts from this run also live in [`docs/figures/`](../docs/figures/) (`grpo_final_per_step.png`, `grpo_reward_curve.png`, `sft_vs_grpo_*.png`, `qualitative_rollouts.png`, …). The raw training-output directories (TRL checkpoints, optimizer states, exported adapters totalling ~330 MB) are not committed. The metrics, hyperparameters, and visualizations they produced are preserved inline in this README and as PNGs under [`docs/figures/`](../docs/figures/). 
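If you run training yourself, the reward curve can be rebuilt from a run's `reward_log.csv` (the file is documented in the next subsection). A minimal sketch, with the run directory name as a placeholder:

```python
# Sketch: regenerate the group-mean reward curve from a run's reward_log.csv.
import pandas as pd
import matplotlib.pyplot as plt

log = pd.read_csv("outputs/aws-rl-grpo-<run>/reward_log.csv")   # <run>: your run directory
per_step = log.groupby("step")["task_reward"].mean()

plt.plot(per_step.index, per_step.values, marker="o")
plt.xlabel("GRPO step")
plt.ylabel("mean env reward per group")
plt.savefig("reward_curve.png", dpi=150)
```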
### GRPO output layout Each GRPO run writes to a fresh `outputs/aws-rl-grpo-/`: | File | Written by | Contents | |-------------------------|------------------------|-------------------------------------------------------------------------| | `reward_log.csv` | `EpisodeLogger` | One row per rollout: `step, rollout_idx, task_id, difficulty, task_reward, task_achieved, final_progress, num_steps, tier, tier_success_rate, timestamp` | | `transcripts.jsonl` | `EpisodeLogger` | Same rows + the full multi-turn transcript per rollout (commands, outputs, rewards) | | `optuna.db` | Optuna | SQLite study (resumable) | | `best_cfg.json` | `optuna_search()` | Final winning hyperparameters | | `trial_NNN/` | `_run_one_trial()` | Per-trial trainer checkpoints + `trial_metrics.json` | | `val_task_ids.json` | Notebook driver | Frozen held-out validation set (for reproducibility) | | `post_train_val.json` | Notebook §10 | Final post-training validation metrics | | `reward_plot.png` | `plot_rewards()` | Group mean reward + per-tier scatter | | `/` | TRL `GRPOTrainer.save` | Trained LoRA adapter (`adapter_config.json`, `adapter_model.safetensors`, etc.) | Push to HF Hub: ```python from huggingface_hub import create_repo, upload_folder create_repo("your-org/aws-rl-grpo-qwen25coder3b", exist_ok=True, private=False) upload_folder(folder_path=str(OUTPUT_DIR), repo_id="your-org/aws-rl-grpo-qwen25coder3b") ``` --- ## 8. Reproducing results ### Actual SFT result ``` SFT (188 steps, best Optuna trial, ~30 min on A10): best val_loss : 0.052 best lora_r : 16 best lora_alpha : 16 (alpha_mul=1) best lora_dropout: 0.0058 best lr : 4.03e-4 best warmup : 0.10 Held-out eval (post-SFT, same prompts as base): format_pct : 33.3% → 100.0% (+66.7 pp) exact_pct : 38.9% → 88.9% (+50.0 pp) service_pct : 77.8% → 88.9% (+11.1 pp) operation_pct : 61.1% → 88.9% (+27.8 pp) avg_latency : 2.03s → 1.40s (−0.63s) avg_len : 85.8 → 74.7 (tighter outputs) ``` Every target from [data/sft/MODEL_EVALUATION.md §11](../data/sft/MODEL_EVALUATION.md) is met or exceeded. ### Actual GRPO result ``` GRPO (35 steps from best Optuna trial, ~1.5 hr on A10): best lr : 1.60e-5 best beta : 0.0021 best temperature : 0.99 num_generations : 8 Per-step training signals (16 reward-logged steps): env_reward (mean): 0.31 max: 0.94 min: 0.13 KL to SFT ref : 0.15 mean (small β = 0.0021 keeps drift in check) format_reward : 1.00 every step (perfect format compliance) completion length: 87 tokens mean (compact AWS CLI commands) Multi-step end-to-end eval (n≈108 episodes): Base+SFT Base+SFT+GRPO Δ overall_success 86.8% 86.2% −0.5 pp overall_reward 0.883 0.877 −0.006 beginner_success 96.2% 100.0% +3.8 pp ✓ intermediate_success 81.0% 87.0% +6.0 pp ✓ warmup_success 96.0% 90.2% −5.8 pp expert_success 22.2% 22.2% flat (bottleneck) drift_repair 22.2% 22.2% flat destructive_fail 15.1% 14.7% −0.4 pp steps_to_solve 1.45 1.55 +0.10 ``` **Honest reading.** A 35-step GRPO run from a strong SFT starting point (already 86.8% success) is short by RL standards. It preserves the SFT gains, modestly improves the middle tiers, but does not crack the expert-tier ceiling — the 22% expert / 22% drift-repair numbers stay flat because there are too few expert episodes in 35 GRPO steps × G=8 = 280 rollouts, with the curriculum focusing primarily on warmup/beginner/intermediate. Variance comes mostly from Optuna trial composition. The published SFT adapter (`Sizzing/aws-rl-sft-qwen25coder3b-adapter`) is the SFT result; the GRPO adapter regenerates per-run from the trainer's output directory. 
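To start GRPO (or any evaluation) from the published SFT checkpoint without re-running SFT, loading the adapter on top of the base model looks roughly like this; a sketch only, since the notebooks may load it through Unsloth instead of plain PEFT:

```python
# Sketch: pull the published SFT adapter from the Hub and stack it on the 4-bit base model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit"
ADAPTER = "Sizzing/aws-rl-sft-qwen25coder3b-adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)   # SFT starting point for GRPO
```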
--- ## 9. Files in this directory | File | Purpose | |-----------------------------------------|------------------------------------------------------------------------| | [train_sft_lora.ipynb](train_sft_lora.ipynb) | Stage 1 — supervised LoRA fine-tuning | | [train_grpo_lora.ipynb](train_grpo_lora.ipynb) | Stage 2 — GRPO RL training (clean) | | [train_grpo_lora_with_outputs.ipynb](train_grpo_lora_with_outputs.ipynb) | Same notebook with cell outputs preserved | Heavy logic referenced from these notebooks: - [train_grpo.py](../train_grpo.py) — the `MultiTurnEnvPool`, GRPO config, Optuna search, `plot_rewards`, and the `run_training` entry point - [aws_rl_env_colab.ipynb](../aws_rl_env_colab.ipynb) — Colab driver that imports from `train_grpo.py` --- ## See also - [Main README](../README.md) - [data/README.md](../data/README.md) — dataset generation, base-model selection - [data/sft/MODEL_EVALUATION.md](../data/sft/MODEL_EVALUATION.md) — full 11-model benchmark - [scripts/README.md](../scripts/README.md) — parallel-rollout architecture deep-dive - [server/README.md](../server/README.md) — environment internals (curriculum, reward shaping, anti-hacking) - [compare/README.md](../compare/README.md) — base vs SFT comparison harness