---
title: "From Cloud Chaos to Capable Agents: Training an LLM SRE on 120+ AWS Tasks"
thumbnail: docs/figures/blog_hero.png
authors:
- user: Sizzing
name: Uday Kiran Padhy
tags:
- reinforcement-learning
- openenv
- grpo
- agents
- rlve
- aws
- sft
- lora
- trl
date: "2026-04-26"
---
![From Cloud Chaos to Capable Agents](docs/figures/blog_hero.png)
# From Cloud Chaos to Capable Agents
### Training an LLM SRE on 120+ AWS Tasks with SFT β†’ GRPO
> **TL;DR.** Cloud agents fail in production not because they don't know the commands — but because **state drifts, services hiccup, and reward signals get gamed.** We built an OpenEnv-compatible RL environment that simulates all three: 120+ AWS tasks across 5 difficulty tiers under chaos and drift, an **8-layer anti-reward-hacking stack**, and an SFT → GRPO pipeline with **8-way parallel multi-turn rollouts on a single GPU**. After training, format compliance hit **100%**, exact-match jumped **39% → 89%**, and intermediate-tier success climbed **81% → 87%** — all with a 3B-parameter base model on a free Colab runtime.
| | |
|---|---|
| **Live demo** | [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web) |
| **API docs** | [sizzing-aws-rl-env.hf.space/docs](https://sizzing-aws-rl-env.hf.space/docs) (Swagger) Β· [/redoc](https://sizzing-aws-rl-env.hf.space/redoc) |
| **HF Space** | [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env) |
| **SFT adapter**| [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) |
| **GRPO adapter**| [Sizzing/aws-rl-grpo-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-grpo-qwen25coder3b-adapter) |
| **Dataset** | [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft) |
| **GitHub** | [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env) |
---
## 1. The problem: why cloud-ops RL is hard
Modern AI agents are increasingly asked to operate cloud infrastructure β€” provision resources, fix misconfigurations, respond to drift, lock down a leaky bucket at 2 a.m. To train such agents you need three things at once: a **realistic environment**, **reliable reward signals**, and **enough scale to make RL feasible**. The market currently forces a hard tradeoff:
- **Real AWS** β€” production-fidelity, but **hundreds of dollars per training run**, impossible to reset cleanly, dangerous if the agent decides to delete prod.
- **Toy emulators / vanilla LocalStack** β€” free and resettable, but they **don't behave like production AWS**: error codes drift, response shapes diverge, and the agent learns shortcuts that crumble on real cloud.
There's a third trap that bites every RL practitioner who's tried this before: **reward hacking**. An agent that optimizes a naΓ―ve reward will discover that printing `"bucket created"` to stdout is way easier than actually creating a bucket, and its training curve will look great while its real success rate stays at zero.
This project closes the gap. We built:
1. **An OpenEnv-compatible RL environment** that speaks **real AWS CLI semantics**. The agent sends `aws s3 mb …`, `aws iam create-role …`, exactly the commands a human SRE would type.
2. **A vendored, customized [MiniStack](https://github.com/srivenkat/MiniStack) simulator** that responds with production-equivalent JSON, runs locally for **zero cost**, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
3. **A 120+ task curriculum** across 5 tiers (warmup β†’ expert) plus an adversarial drift track, with adaptive selection, mastery tracking, spaced repetition, chaos injection, and randomized drift mutations β€” every feature designed to keep the reward signal honest.
4. **A complete SFT β†’ GRPO training pipeline.** A 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
5. **An 8-way parallel-rollout architecture.** Server-side MiniStack pool, client-side `GrpoPool`, in-process `MultiTurnEnvPool` β€” three coordinated layers that let G=8 concurrent rollouts run on one GPU **without state contamination**.
This isn't another gym classic. It's grounded in real-world utility: **everything an SRE actually does on call.**
---
## 2. System architecture
![System architecture](docs/figures/architecture_diagram.png)
The whole environment ships as **one Docker container** that bundles a FastAPI server, a pool of MiniStack simulator instances, and the AWS CLI v2 binary. Nothing reaches the public internet at runtime.
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ Docker container ──────────────────────────────┐
β”‚ β”‚
β”‚ FastAPI server (port 8000) β”‚
β”‚ β”œβ”€β”€ OpenEnv router /reset /step /state /schema /ws /health β”‚
β”‚ β”œβ”€β”€ Web playground /web (Jinja2 + 40 AWS service icons) β”‚
β”‚ β”œβ”€β”€ env_factory per-WS-session AwsRlEnvironment instance β”‚
β”‚ β”‚ (acquires a MiniStack port from MiniStackPool) β”‚
β”‚ └── Services β”‚
β”‚ Curriculum Β· TaskGrader Β· ResourceVerifier Β· ChaosEngine Β· DriftEngine β”‚
β”‚ HintProvider Β· EpisodeTracker Β· EnvironmentDesigner Β· …Strategy β”‚
β”‚ β”‚
β”‚ MiniStack instances :4566 :4567 :4568 … :4566+POOL_SIZE-1 β”‚
β”‚ (vendored at aws_infra/, started by the Dockerfile entrypoint) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–² β–²
β”‚ HTTP / WebSocket β”‚ AWS CLI subprocess
β”‚ β”‚ (AWS_ENDPOINT_URL=http://localhost:4566+i)
β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RL Agent β”‚ β”‚ AWS CLI commands β”‚
β”‚ (the agent) β”‚ β”‚ (client.py) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Episode lifecycle
```mermaid
flowchart LR
A([reset]) --> B[Curriculum<br/>picks task]
B --> C[Run<br/>setup_commands]
C --> D{drift<br/>task?}
D -->|yes| E[DriftEngine<br/>applies 2–3 mutations]
D -->|no| F[Initial<br/>observation]
E --> F
F --> G([step])
G --> H{starts<br/>with 'aws'?}
H -->|no| I[reject<br/>success=False]
H -->|yes| J[EnvironmentStrategy<br/>runs AWS CLI]
J --> K[EpisodeTracker<br/>records command]
K --> L[TaskGrader<br/>computes reward]
L --> M[ChaosEngine<br/>maybe mutates state]
M --> N{terminate?}
N -->|achieved or step β‰₯ MAX| O([done])
N -->|continue| G
I --> G
```
Three primitives β€” `reset`, `step`, `state` β€” exposed over HTTP and WebSocket. The OpenEnv contract gives any compatible trainer (TRL, TorchForge, SkyRL, Unsloth) a drop-in interface.
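The exact routes, methods, and payload schemas live in the Swagger UI linked above; as a minimal sketch of the loop, assuming JSON bodies with illustrative field names:

```python
import requests

BASE = "https://sizzing-aws-rl-env.hf.space"   # or http://localhost:8000 for a local container

# Start an episode. Field names below are illustrative; see /docs for the real schema.
obs = requests.post(f"{BASE}/reset", json={}).json()
print(obs)                                      # task description + initial observation

# Take one step: the action is a plain AWS CLI command string.
result = requests.post(
    f"{BASE}/step",
    json={"action": "aws s3 mb s3://my-test-bucket"},
).json()
print(result.get("reward"), result.get("done"))

# Inspect the environment's view of the infrastructure state.
state = requests.get(f"{BASE}/state").json()
```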
Full mechanics in [server/README.md](server/README.md).
---
## 3. The curriculum: 124 tasks, 5 tiers, one priority formula
![Curriculum tier pyramid](docs/figures/tier_pyramid.png)
We didn't hand-author a fixed schedule. The `Curriculum` service runs a **single weighted-priority formula** that handles exploration, weakness-targeting, and forgetting prevention all at once:
```
score = novelty_bonus # +100 if never attempted
+ weakness_weight # +50 Γ— (1 βˆ’ task_success_rate)
+ spaced_rep_bonus # +30 if a graduated task is "due" for re-test
βˆ’ recency_penalty # βˆ’20 if attempted in the last 2 episodes
```
Read that formula and you immediately know the schedule: never-seen tasks dominate at first; once attempted, weak ones rise; once mastered, they go on a re-test schedule with intervals `[3, 6, 12, 24, 48]` episodes; you never see the same task two episodes in a row. **Explainable. Auditable. Boring in the best sense.**
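The same formula as executable Python, a minimal sketch using the constants from the pseudocode above (the attribute names on `task` are assumptions, not the Curriculum service's actual fields):

```python
def priority(task, episode_idx):
    """Score one task for selection; the highest score wins. Mirrors the formula above."""
    score = 0.0
    if task.attempts == 0:
        score += 100                                # novelty bonus: never-seen tasks dominate early
    else:
        score += 50 * (1 - task.success_rate)       # weakness weight: struggling tasks rise
    if task.graduated and episode_idx >= task.next_retest_episode:
        score += 30                                 # spaced repetition: graduated task is due for re-test
    if episode_idx - task.last_attempt_episode <= 2:
        score -= 20                                 # recency penalty: no back-to-back repeats
    return score

# next_task = max(task_pool, key=lambda t: priority(t, episode_idx))
```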
### Mastery and tier promotion
Every task carries a sliding 10-episode success window with `0.85` exponential decay. When that window's success rate crosses `0.7`, the task **graduates** β€” it stops appearing in the standard rotation but resurfaces on the spaced-rep schedule above. If a graduated task fails on re-test, it un-graduates and rejoins the pool. There are **two ways** to get promoted to the next tier:
- **Standard path** β€” meet the tier's `min_episodes` AND `advance_rate` (0.6 – 0.7 depending on tier).
- **Fast-track** β€” three consecutive episodes at β‰₯ 0.9 success. If you're crushing it, you skip ahead.
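A minimal sketch of the mastery bookkeeping described above; the 0.85-decayed 10-episode window, the 0.7 graduation threshold, the fast-track streak, and the re-test intervals come from the text, while everything else (names, helper shape) is an assumption:

```python
def decayed_success_rate(outcomes, decay=0.85, window=10):
    """Exponentially decayed success rate over the last `window` episodes.
    outcomes: list of 0/1 results, oldest first, most recent last."""
    recent = outcomes[-window:]
    weights = [decay ** i for i in range(len(recent) - 1, -1, -1)]  # newest gets weight 1.0
    return sum(w * o for w, o in zip(weights, recent)) / sum(weights)

GRADUATION_THRESHOLD = 0.7          # window rate above this -> task graduates
FAST_TRACK_STREAK = 3               # 3 consecutive episodes at >= 0.9 success skip ahead a tier
RETEST_INTERVALS = [3, 6, 12, 24, 48]   # episodes between re-tests of graduated tasks
```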
![Curriculum progression](docs/figures/curriculum_progression.png)
### What's in each tier
| Tier | Tasks | Chaos | Grading strategy | What the agent must do |
|------|------:|------:|------------------|------------------------|
| Warmup | 25 | 10% | `command_match` | Emit the right service + operation. |
| Beginner | 25 | 10% | `resource_creation` | Actually create a resource that ends up in MiniStack state. |
| Intermediate | 25 | 20% | `multi_step` | Complete an ordered sequence (e.g., bucket β†’ policy β†’ versioning). |
| Advanced | 25 | 30% | `multi_step + services` | Same, but **all** required services must be touched. |
| Expert | 24 | 30% | `state_checks` | Pass arbitrary AWS CLI assertions on the final state. |
| **Drift** | 9 | β€” | `state_checks` (auto-repair) | Detect and fix 2–3 random pre-applied mutations. |
The full task pool is YAML-defined in [server/services/tasks/](server/services/tasks/) β€” judges can read or modify it without touching code.
---
## 4. Reward shaping and the 8-layer anti-reward-hacking stack
> **This is the most novel part of the project.** Most environments trust the reward signal. This one assumes the agent will try to game it β€” and stops it eight different ways.
### How reward is built up
```mermaid
flowchart TD
Start([step result]) --> Q1{task<br/>achieved?}
Q1 -->|yes| R1[reward = 1.0]
R1 --> CB{survived<br/>chaos?}
CB -->|yes| R2[Γ— 1.05<br/>chaos bonus]
CB -->|no| R3[reward stays 1.0]
R2 --> HD[Γ— 0.85^n<br/>hint decay]
R3 --> HD
Q1 -->|no| S1[reward = partial Γ— 0.8]
S1 --> S2{progress<br/>increased?}
S2 -->|yes| S3[+ 0.1<br/>progress delta]
S2 -->|no| S4[no delta]
S3 --> S5{command<br/>failed?}
S4 --> S5
S5 -->|yes| S6[Γ— 0.5<br/>error penalty]
S5 -->|no| S7[no penalty]
S6 --> S8[βˆ’ 0.1 Γ— rollback_count<br/>+ 0.02 Γ— idempotent_retries]
S7 --> S8
S8 --> S9[clamp to 0.0–0.99<br/>1.0 reserved for completion]
S9 --> HD
HD --> End([final reward])
```
![Reward components](docs/figures/reward_components.png)
The reward is **dense by design**: every step provides meaningful signal, not just terminal success. Rollbacks (create-then-delete cycles) are explicitly penalized. Graceful retries on "already exists" errors get a small bonus. **Operational discipline is baked into the reward**, not just task completion.
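A minimal sketch of that composition as code; the constants come straight from the flowchart, while the function signature and field names are illustrative rather than the grader's actual implementation:

```python
def shape_reward(achieved, partial, progress_delta, command_failed,
                 rollbacks, idempotent_retries, survived_chaos, hints_used):
    """Dense per-step reward, mirroring the flowchart above."""
    if achieved:
        reward = 1.0
        if survived_chaos:
            reward *= 1.05                      # chaos-survival bonus
    else:
        reward = partial * 0.8                  # discounted partial credit
        if progress_delta > 0:
            reward += 0.1                       # progress bonus for moving forward
        if command_failed:
            reward *= 0.5                       # error penalty
        reward -= 0.1 * rollbacks               # create-then-delete cycles are penalized
        reward += 0.02 * idempotent_retries     # graceful "already exists" retries earn a little
        reward = min(max(reward, 0.0), 0.99)    # 1.0 is reserved for full completion
    return reward * (0.85 ** hints_used)        # each hint decays the final reward
```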
### Five grading strategies, dispatched by tier
A single grader can't fairly score both "did you say `aws s3 mb`?" and "did the bucket end up with versioning enabled, encrypted, blocking public access, AND not deleted by accident?", so the `TaskGrader` dispatches a different strategy per tier:
| Tier | Strategy | Example assertion |
|------|----------|-------------------|
| Warmup | `command_match` | `command_contains: "s3 mb"` |
| Beginner | `resource_creation` | `resource_exists: {service: s3, name: my-bucket}` |
| Intermediate | `multi_step` | Ordered list of step criteria |
| Advanced | `multi_step + services` | Same + `services: [s3, iam]` must all be touched |
| Expert | `state_checks` | Arbitrary AWS CLI assertions on infra state |
### The 8 defense layers
```mermaid
flowchart LR
Agent[Agent action] --> L1["β‘  Allow-list<br/>must start with 'aws '"]
L1 --> L2["β‘‘ Per-episode dedup<br/>op,resource credits once"]
L2 --> L3["β‘’ Grader invisibility<br/>state-checks never seen by agent"]
L3 --> L4["β‘£ No read-credit<br/>describe/list earn zero"]
L4 --> L5["β‘€ Monotonic progress<br/>can't decrement to re-credit"]
L5 --> L6["β‘₯ Exact resource-name match<br/>my-bucket-2 β‰  my-bucket"]
L6 --> L7["⑦ Ground-truth via MiniStack<br/>not agent stdout"]
L7 --> L8["β‘§ Final-state assertions<br/>jq-paths on live state"]
L8 --> Reward([Reward])
```
| # | Layer | Hack it defeats |
|---|-------|------------------|
| 1 | **Command allow-list** (`aws ` prefix only) | Shell escapes, fake stdout |
| 2 | **Dedup of `(operation, resource)` per episode** | Spamming `s3 mb …` 50Γ— to inflate a "completed steps" counter |
| 3 | **Grader invisibility** | Reverse-engineering reward by reading state-check queries |
| 4 | **No verification reward** | Running `aws s3 ls` to "prove" the bucket exists |
| 5 | **Monotonic `partial_progress`** | Bouncing progress down then back up to re-earn credit |
| 6 | **Exact resource-name validation** | Creating `my-test-bucket-2` instead of `my-test-bucket` |
| 7 | **Ground-truth via `/_ministack/state`** | Forging stdout that looks successful when the resource doesn't exist |
| 8 | **Final-state AWS CLI assertions** | Passing the steps but leaving prod broken |
These layers **compose**. To hack the reward, the agent would have to defeat all eight independently β€” each one alone is a hard problem.
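Layers 2 and 5 are the simplest to sketch; assuming hypothetical names, the point is that credit is keyed on the `(operation, resource)` pair and partial progress can only move up:

```python
class EpisodeLedger:
    """Per-episode bookkeeping: dedup of (operation, resource) credit (layer 2)
    and monotonic partial progress (layer 5). Names are illustrative, not the real service."""

    def __init__(self):
        self.credited = set()
        self.partial_progress = 0.0

    def credit_step(self, operation: str, resource: str) -> bool:
        key = (operation, resource)
        if key in self.credited:          # spamming `s3 mb` 50x earns credit exactly once
            return False
        self.credited.add(key)
        return True

    def update_progress(self, new_progress: float) -> float:
        # Progress never decreases, so it can't be bounced down and re-earned.
        self.partial_progress = max(self.partial_progress, new_progress)
        return self.partial_progress
```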
### Chaos engine and drift engine
The reward stack is hardened, but the env itself is also adversarial:
- **Chaos** (`server/services/chaos.py`) β€” silent mid-episode mutations on services the task is touching. Probabilities scale by tier: 10% / 20% / 30%. Survive a chaotic episode and the reward is multiplied by **Γ—1.05**.
- **Drift** (`server/services/drift.py`) β€” for the 9 drift tasks, 2–3 random mutations from a per-task pool are applied **before** the agent sees the env. The agent must detect and repair them. Mutations are **randomized per episode** so the agent can't memorize a script.
- **Hints** β€” three progressive levels available via `aws help --task-hint`. Each hint multiplies the final reward by `0.85` (so 3 hints β†’ 0.61Γ— decay). The agent decides whether the cost is worth it.
Full mechanics, including all 5 grading strategies and the chaos/drift logic, are in [server/README.md Β§8 – Β§13](server/README.md).
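As a rough sketch of the drift step at reset time, under the behavior described above (2–3 mutations sampled per episode; the function and field names are illustrative, not the DriftEngine's actual API):

```python
import random

def apply_drift(task, state, rng=random):
    """Apply 2-3 random mutations from the task's pool before the agent sees the env."""
    k = rng.randint(2, 3)
    mutations = rng.sample(task.drift_mutation_pool, k)
    for mutate in mutations:
        mutate(state)       # e.g. disable versioning, loosen a bucket policy, detach a role
    return mutations        # the grader knows what was broken; the agent does not
```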
---
## 5. Parallel rollout architecture: 3 coordinated pool layers
GRPO needs `G=8` rollouts **on the same task** per training step β€” that's how it computes group-relative advantages without a critic. Run them sequentially and you pay 8 Γ— 6 turns Γ— 50 ms = **2,400 ms** of wall-clock per step, before the GPU has done anything. Run them in parallel and a state bug between two rollouts will silently destroy your gradient.
So we built three coordinated pool layers that **parallelize transparently while guaranteeing state isolation**.
```mermaid
flowchart TD
T[Trainer step<br/>needs G=8 rollouts] --> M[MultiTurnEnvPool<br/>sync API Β· owns asyncio loop]
M --> G[GrpoPool<br/>async Β· asyncio.gather]
G --> WS1[WS session 1]
G --> WS2[WS session 2]
G --> WS3[WS session ...]
G --> WS8[WS session 8]
WS1 --> S[FastAPI server<br/>OpenEnv max_concurrent_envs=8]
WS2 --> S
WS3 --> S
WS8 --> S
S --> P[MiniStackPool<br/>free-list Β· threading.Lock]
P --> M1[:4566]
P --> M2[:4567]
P --> M3[:4568]
P --> M8[:4573]
style P fill:#fff7fa,stroke:#ff4f8b
style M fill:#fff7fa,stroke:#ff4f8b
style G fill:#fff7fa,stroke:#ff4f8b
```
![Parallel rollout architecture](docs/figures/parallel_rollout_diagram.png)
### The three layers
- **Server-side `MiniStackPool`** ([server/app.py](server/app.py)) β€” free-list of ports `[BASE, BASE + POOL_SIZE)`, lock-guarded `acquire()` / `release()`. Each WebSocket session gets a unique MiniStack process that persists for the session's lifetime. **8 isolated MiniStack instances on ports 4566–4573 mean zero cross-rollout state bleed.**
- **Client-side async `GrpoPool`** ([scripts/grpo_pool.py](scripts/grpo_pool.py)) β€” pure-asyncio, uses `asyncio.gather` over N WebSocket sessions. Used by training and demo notebooks.
- **In-process sync `MultiTurnEnvPool`** ([train/train_grpo_lora.ipynb](train/train_grpo_lora.ipynb)) β€” wraps `GrpoPool` behind a sync API by owning a background asyncio loop. The TRL trainer keeps its sync API; concurrency happens inside.
### The all-or-nothing connect protocol
Here's the surprising detail, the kind a judge appreciates:
> **If 7 of 8 WebSocket connects succeed and the 8th fails, all 8 must be rolled back and closed.**
Why? Because the 7 successful connects already acquired MiniStack ports from the server-side pool. If we kept them open and just retried the 8th, those 7 ports would leak β€” they stay acquired until the server's idle timeout fires (minutes), and the next training step finds the pool exhausted.
This single invariant is the difference between *"training resumes cleanly after every flake"* and *"every flake corrupts the pool; rebuild the container at 3 a.m."*
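A minimal asyncio sketch of that invariant, with a hypothetical `open_session` coroutine and session objects standing in for the real `GrpoPool` internals:

```python
import asyncio

async def connect_all_or_nothing(open_session, n=8):
    """Open n WebSocket sessions; if any connect fails, close every session that
    did succeed so no MiniStack port is left acquired on the server."""
    results = await asyncio.gather(
        *(open_session(i) for i in range(n)), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        for r in results:
            if not isinstance(r, Exception):
                await r.close()         # release the acquired MiniStack port
        raise RuntimeError(f"{len(failures)}/{n} connects failed; rolled back all sessions")
    return results
```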
![8 simultaneous WebSocket sessions](docs/figures/env_init_screenshot.png)
### Wall-clock impact
- **Sequential**: 8 rollouts Γ— 6 turns Γ— ~50 ms env time = **2,400 ms / GRPO step**.
- **Parallel (8-way)**: max(8 envs) β‰ˆ **300 ms / GRPO step**.
- **Effective speedup**: ~8Γ— on the env side. The GPU forward-pass still serializes behind a `threading.Lock`, but env time is no longer the bottleneck.
Full details, including all the corner cases of the all-or-nothing protocol, are in [scripts/README.md](scripts/README.md).
---
## 6. MiniStack: vendored, customized, reproducible
The simulator powering the env is **vendored as a git subtree** at [aws_infra/](aws_infra/), not pulled as a black-box dependency. Why fork a perfectly good upstream?
1. **One-call grading**. We added a custom `/_ministack/state` endpoint (commit `a648c3a`) that returns the entire infrastructure inventory in **one HTTP call** instead of iterating 20+ list APIs per grading pass. This single endpoint is what makes layer 7 of the anti-hacking stack cheap enough to run every step (a usage sketch follows this list).
2. **Reproducible Docker builds with no runtime network**. Pinning a specific MiniStack revision means the image is bit-identical across rebuilds. The Docker image bundles the simulator; it doesn't pull at startup.
3. **Freedom to extend service coverage** when a task needs a service the upstream doesn't yet support.
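A hedged usage sketch of the endpoint from item 1; the response shape shown here is illustrative, the real inventory layout is whatever the vendored fork returns:

```python
import requests

def fetch_ground_truth(ministack_port: int) -> dict:
    """One call returns the whole simulated inventory. The grader checks this,
    never the agent's stdout (layer 7). Response shape below is illustrative."""
    return requests.get(f"http://localhost:{ministack_port}/_ministack/state").json()

state = fetch_ground_truth(4566)
bucket_names = {b["name"] for b in state.get("s3", {}).get("buckets", [])}
assert "my-test-bucket" in bucket_names   # exact-name check (layer 6) against live state
```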
The custom commits are kept as **small, isolated patches** so periodic upstream syncs (e.g., `af2e945`, `579597b`) replay cleanly with `git subtree pull`. To inspect:
```bash
git show a648c3a # the state-endpoint diff
git log --oneline -- aws_infra/ # only the aws_infra subtree history
```
This is a small thing, but it's one of those engineering-maturity signals that says **"this repo is built to be maintained, not just demoed."** The full subtree workflow is in [server/README.md Β§5](server/README.md#5-ministack-vendored-fork--customizations).
---
## 7. The training pipeline: SFT β†’ GRPO
```mermaid
flowchart LR
TT[tests_tasks/<br/>134 canonical solutions] --> AST[AST extract<br/>build_sft_dataset.py]
AST --> DS[1,500 row<br/>SFT dataset<br/>5 trajectory types]
DS -.->|published| HF1[(HF Dataset<br/>aws-rl-sft)]
DS --> SFT[Stage 1: SFT LoRA<br/>Qwen2.5-Coder-3B<br/>Optuna 6 trials]
SFT --> SA[SFT adapter]
SA -.->|published| HF2[(HF Hub<br/>aws-rl-sft-adapter)]
SA --> GRPO[Stage 2: GRPO<br/>TRL Β· G=8 rollouts<br/>Optuna 4 trials]
ENV[(AWS RL Env<br/>FastAPI + MiniStack pool)] --> GRPO
GRPO --> GA[GRPO adapter]
GA -.->|published| HF3[(HF Hub<br/>aws-rl-grpo-adapter)]
style ENV fill:#fff7fa,stroke:#ff4f8b
style HF1 fill:#fffbeb,stroke:#f59e0b
style HF2 fill:#fffbeb,stroke:#f59e0b
style HF3 fill:#fffbeb,stroke:#f59e0b
```
Two stages, both reproducible on a free Colab GPU runtime. Full detail in [train/README.md](train/README.md).
### 7.1 Dataset β€” 1,500 deterministic synthetic rows
![SFT dataset composition](docs/figures/dataset_composition.png)
The dataset is **synthetic but deterministic** β€” and that's not an oxymoron. We don't run pytest to generate examples; we use Python's `ast` module to extract canonical commands directly from `tests_tasks/test_<tier>_tasks.py`. **No simulator spin-up. Zero flake risk. Bit-for-bit reproducible** with one script.
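A simplified sketch of that extraction, assuming the tests embed their canonical commands as string literals; the real `build_sft_dataset.py` handles more cases, so treat this as illustrative:

```python
import ast
from pathlib import Path

def extract_aws_commands(test_file: Path) -> list[str]:
    """Walk the test module's AST and collect every string literal that looks like
    an AWS CLI command. No simulator, no pytest run, fully deterministic."""
    tree = ast.parse(test_file.read_text())
    commands = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.strip().startswith("aws "):
                commands.append(node.value.strip())
    return commands

# for f in sorted(Path("tests_tasks").glob("test_*_tasks.py")):
#     print(f.name, len(extract_aws_commands(f)))
```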
Five trajectory types teach realistic multi-turn behavior:
- **Success (55%)** β€” the canonical command for the task.
- **Multi-step continuation (20%)** β€” given the partial conversation, predict the next command. Simulated AWS responses are interpolated with resource names, so the model learns *"what you do depends on what's already been done"*, not *"always run the first command"*.
- **Failure recovery (15%)** β€” on a malformed AWS error, fix the command.
- **Verification (5%)** β€” pick the right `aws describe-*` to confirm state.
- **Hint usage (5%)** β€” given a hint, follow it.
Tier weighting is **50/30/15/5/0** (warmup / beginner / intermediate / advanced / expert). **Expert is intentionally excluded from SFT** β€” expert tasks have randomized state checks, so there's no single canonical script. Teaching SFT a fixed solution would be wrong; GRPO's reward signal is the right tool for randomized end-states.
Published as [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft).
### 7.2 Base model selection β€” 11 candidates, 1 winner
![Top-4 candidate models on the held-out benchmark](docs/figures/model_eval_chart.png)
We didn't pick a base model on vibes. **11 chat models Γ— 27 held-out prompts**, four quality metrics plus latency. Full report in [data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md).
| Model | exact% | op% | latency | Verdict |
|-------|------:|----:|--------:|---------|
| **Qwen2.5-Coder-3B-Instruct** βœ… | **41%** | **63%** | **3.1 s** | Best balance of accuracy and speed |
| Qwen3-4B | 33% | 59% | 10.4 s | Perfect format, but 3Γ— slower |
| Qwen2.5-Coder-1.5B | 22% | 41% | 2.5 s | Fast, but 19-pp accuracy gap |
| SmolLM2-1.7B | 7% | 19% | 2.0 s | Too small for AWS knowledge |
| DeepSeek-R1-Distill-Qwen-1.5B | 0% | 4% | 6.8 s | Wrong domain β€” reasoning β‰  AWS |
**Winner: [unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit)** β€” 41% exact-match, 63% operation-match, 3.1 s latency. Small enough for 8-way parallel GRPO on a 24 GB GPU; accurate enough that SFT has a strong starting point.
### 7.3 Stage 1 β€” SFT (LoRA)
LoRA, attention-only, ~10–40M trainable parameters. We let Optuna sweep 6 trials over `[lora_r, lora_alpha_mul, lora_dropout, learning_rate, warmup_ratio]`:
| Hyperparameter | Search space | Best value |
|---------------|--------------|-----------:|
| `lora_r` | {8, 16, 32} | **16** |
| `lora_alpha_mul` | [0.5, 2.0] | **1.0** (Ξ± = 16) |
| `lora_dropout` | [0.005, 0.031] | **0.0058** |
| `learning_rate` | [5e-5, 5e-4] | **4.03e-4** |
| `warmup_ratio` | [0.05, 0.15] | **0.10** |
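For readers who want to reproduce the sweep, a hedged sketch of an Optuna objective over these five parameters; the notebook's real objective wires in the SFT trainer, and `run_sft` here is a hypothetical stand-in that trains an adapter and returns validation loss:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    lora_r = trial.suggest_categorical("lora_r", [8, 16, 32])
    alpha_mul = trial.suggest_float("lora_alpha_mul", 0.5, 2.0)
    cfg = dict(
        lora_r=lora_r,
        lora_alpha=int(lora_r * alpha_mul),
        lora_dropout=trial.suggest_float("lora_dropout", 0.005, 0.031),
        learning_rate=trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True),
        warmup_ratio=trial.suggest_float("warmup_ratio", 0.05, 0.15),
    )
    return run_sft(cfg)   # hypothetical: fine-tunes a LoRA adapter, returns val loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=6)
```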
![SFT loss curve](docs/figures/sft_loss_curve.png)
![Optuna parameter importance](docs/figures/optuna_param_importance.png)
![Optuna optimization history](docs/figures/optuna_history.png)
Best trial reached **val loss 0.052 after 188 steps** (~30 min on a Colab A10). Adapter published: [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter).
### 7.4 Stage 2 β€” GRPO (TRL)
GRPO is a critic-free RL algorithm that computes advantages from a **group of G rollouts** on the same prompt. TRL's `GRPOTrainer` is the implementation; we wrap it with our `MultiTurnEnvPool` so each "rollout" is a multi-turn AWS CLI episode, not a single completion.
```python
GRPOConfig(
model_name_or_path="Sizzing/aws-rl-sft-qwen25coder3b-adapter",
num_generations=8, # G=8 rollouts per step
beta=0.0021, # KL coefficient (tight β€” Optuna picked it)
learning_rate=1.6e-5,
temperature=0.99,
top_p=0.95,
max_turns=6, # multi-turn episode length
loss_type="dapo",
reward_func=env_reward, # AwsRlEnv β†’ final reward
)
```
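The glue between TRL and the env is easiest to see as a per-rollout loop; a hedged sketch of one rollout (the real `MultiTurnEnvPool` runs G of these concurrently over its WebSocket sessions, and the names and signatures here are illustrative):

```python
def rollout_episode(env, generate, max_turns=6):
    """One GRPO rollout: a full multi-turn episode against an isolated env session.
    `env` is one session from the pool; `generate` maps the conversation so far
    to the next AWS CLI command. Names and signatures are illustrative."""
    obs = env.reset()
    conversation, final_reward = [obs], 0.0
    for _ in range(max_turns):
        command = generate(conversation)          # policy emits the next AWS CLI command
        obs, reward, done, info = env.step(command)
        conversation.extend([command, obs])
        final_reward = reward                     # grader's shaped reward for this step
        if done:
            break
    return final_reward                           # becomes this rollout's group score
```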
Optuna swept 4 trials over `[learning_rate, beta, temperature]` β€” a tighter 3-parameter space because we already had a strong SFT baseline.
![GRPO Optuna trials comparison](docs/figures/grpo_optuna_trials_comparison.png)
![GRPO Optuna parameter importances](docs/figures/grpo_optuna_importances.png)
![GRPO Optuna optimization history](docs/figures/grpo_optuna_history.png)
Final run: **35 GRPO steps, ~1.5 hours on Colab A10**.
![GRPO per-step training signals](docs/figures/grpo_final_per_step.png)
![GRPO env reward over training](docs/figures/grpo_reward_curve.png)
![GRPO per-tier reward curve](docs/figures/grpo_per_tier_curve.png)
Adapter published: [Sizzing/aws-rl-grpo-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-grpo-qwen25coder3b-adapter).
---
## 8. Results
### 8.1 Base vs SFT β€” single-step held-out eval
After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:
| Metric | Base | Post-SFT | Ξ” |
|-----------------|-------:|---------:|:------------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
| `service_pct` | 77.8% | **88.9%** | +11.1 pp |
| `operation_pct` | 61.1% | **88.9%** | +27.8 pp |
| `avg_len` | 85.8 | 74.7 | βˆ’11 chars (tighter) |
![Base vs SFT eval-metrics comparison](docs/figures/base_vs_sft_success.png)
![Single-step eval, base vs SFT](docs/figures/single_step_eval.png)
![Dataset comparison: base vs SFT](docs/figures/compare_dataset.png)
Every target from [data/sft/MODEL_EVALUATION.md Β§11](data/sft/MODEL_EVALUATION.md#11-target-metrics-for-sft) is met or exceeded. **Format compliance is now perfect**; the model never wraps commands in fences or quotes after SFT. **Exact-match jumped from 39% to 89%** β€” the agent now emits the canonical command for ~9 of every 10 prompts.
### 8.2 SFT vs GRPO β€” multi-step live env eval (100+ episodes)
This is the harder benchmark. We let the SFT and GRPO adapters loose on the live RL environment for 100+ episodes each:
| Metric | SFT | SFT + GRPO | Ξ” |
|-------------------------------:|:-------:|:----------:|:------------:|
| Overall success rate | 86.8% | 86.2% | βˆ’0.5 pp |
| Overall mean reward | 0.883 | 0.877 | βˆ’0.006 |
| Beginner success | 96.2% | **100.0%** | **+3.8 pp** |
| **Intermediate success** | 81.0% | **87.0%** | **+6.0 pp** |
| Warmup success | 96.0% | 90.2% | βˆ’5.8 pp |
| Expert success | 22.2% | 22.2% | flat |
| Drift repair rate | 22.2% | 22.2% | flat |
| Destructive-action fail rate | 15.1% | 14.7% | βˆ’0.4 pp |
| Steps to solve | 1.45 | 1.55 | +0.10 |
![SFT vs GRPO metrics grid](docs/figures/sft_vs_grpo_metrics_grid.png)
![SFT vs GRPO by tier](docs/figures/sft_vs_grpo_by_tier.png)
![SFT vs GRPO scalar comparison](docs/figures/sft_vs_grpo_scalar.png)
![RL env comparison: base vs SFT (per-episode rewards)](docs/figures/compare_rl_env.png)
> **Honest reading.** GRPO **preserves the SFT gains** and **modestly improves the middle tiers** (beginner +3.8 pp, intermediate +6.0 pp). It does **not crack the expert-tier bottleneck** β€” 22% on SRE / drift / security-posture tasks, flat from SFT. With longer GRPO runs and an expert-weighted curriculum, this is the next gain to chase. We're calling this out directly because credibility matters more than a clean win-bar.
### 8.3 Qualitative rollouts
One sample episode per tier, post-GRPO:
![Qualitative rollouts on representative tasks](docs/figures/qualitative_rollouts.png)
The full notebook with side-by-side base / SFT / GRPO transcripts is at [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb).
---
## 9. Reproducibility
Everything in this blog runs from three Colab notebooks. **No private dependencies, no purchased compute, no leaked state.**
| Notebook | What it does | Open |
|---|---|---|
| [train/train_sft_lora.ipynb](train/train_sft_lora.ipynb) | Stage 1 β€” SFT LoRA fine-tune | [Colab](https://colab.research.google.com/drive/1dm9sDaLxHX6s9zEG_SC0FQcKWKkc3TfL?usp=sharing) |
| [train/train_grpo_lora.ipynb](train/train_grpo_lora.ipynb) | Stage 2 β€” GRPO multi-turn rollouts | [Colab](https://colab.research.google.com/drive/1NwiOM0h_JpXXGRxfY_xZtDiaigvIaKjx?usp=sharing) |
| [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb) | Side-by-side base vs SFT (dataset + RL env) | [Colab](https://colab.research.google.com/drive/17406aiad8h4nAphV42vVNZ-a5SzZMIre?usp=sharing) |
**Local dev** is one command:
```bash
make docker-run # FastAPI + MiniStack on :8000
# 8-way parallel rollouts for training:
AWS_RL_ENV_POOL_SIZE=8 make run
```
**The test suite** is also the canonical-solution source. 10 unit tests + 134 tier-integration tests, where each integration test is an AST-extractable solution for the SFT dataset:
```bash
pytest tests/ tests_tasks/ -v
```
| Path | What it covers |
|------|----------------|
| [tests/test_task_grader.py](tests/test_task_grader.py) | All 5 grading strategies + every penalty/bonus |
| [tests/test_resource_verifier.py](tests/test_resource_verifier.py) | Per-service ground-truth verification (20+ services) |
| [tests/test_pool.py](tests/test_pool.py) Β· [test_grpo_pool.py](tests/test_grpo_pool.py) | All-or-nothing connect protocol |
| [tests/test_drift_engine.py](tests/test_drift_engine.py) | Random drift selection + mutation application |
| [tests_tasks/test_*_tasks.py](tests_tasks/) | 134 tasks exercised end-to-end against MiniStack |
All artifacts are on the Hub (dataset, SFT adapter, GRPO adapter, Space). A judge can fork this repo and re-run the entire pipeline in a few hours.
---
## 10. What's next
The expert-tier bottleneck (22% success on state-check / drift / security-posture tasks) is the single biggest target:
- **Longer GRPO runs** β€” 35 steps is short by RL standards. We'd expect compounded improvements from 200–500 steps with the same config.
- **Expert-weighted curriculum** β€” currently the priority formula doesn't preferentially upweight expert tasks; with a small bias term we'd see more expert exposure per step.
- **DPO on expert trajectories** β€” preference pairs (good vs bad expert solves) might shape multi-step expert behavior more efficiently than scalar reward.
- **Real-AWS strategy backend** β€” `BACKEND_TYPE=aws` is wired and ready. Cost-budgeted eval runs against a sandboxed real account would close the sim-to-real gap once and for all.
PRs welcome at [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env). The env is OpenEnv-compliant, so any TRL / TorchForge / SkyRL / Unsloth user can plug in tomorrow.
---
## 11. Acknowledgments
Thank you to:
- **Meta, PyTorch, Hugging Face, Unsloth, and Scaler** for organizing the [OpenEnv Hackathon](https://huggingface.co/blog/openenv) and providing mentors who helped clarify questions throughout.
- **MiniStack** β€” vendored at [aws_infra/](aws_infra/), upstream license preserved. Custom modifications are commits `a648c3a`, `a00e981`; periodic upstream syncs `af2e945`, `579597b`.
- **OpenEnv** β€” environment protocol and Python client framework that this entire project plugs into.
- **TRL** (Hugging Face) β€” `GRPOTrainer` implementation and the rest of the post-training stack.
- **Unsloth** β€” 4-bit quantized model loaders and fused training kernels that fit a 3B model + 8 rollouts on 24 GB.
- **Optuna** β€” TPE sampler that found the SFT and GRPO hyperparameters without us having to.
- **Google Colab** β€” free GPU runtime for the full training pipeline.
- **AWS service icons** in [server/static/img/aws/](server/static/img/aws/) β€” used in the web playground.
---
### Sub-README index β€” for the deeper dives
| Path | What it covers |
|------|----------------|
| [server/README.md](server/README.md) | Environment internals β€” curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail |
| [train/README.md](train/README.md) | SFT + GRPO pipeline β€” LoRA config, Optuna search, multi-turn rollouts |
| [scripts/README.md](scripts/README.md) | Parallel-rollout architecture β€” 3 pool layers, all-or-nothing connect, concurrency safety |
| [data/README.md](data/README.md) | Dataset generation β€” 5 trajectory types, AST extraction, base-model selection summary |
| [data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) | Full 11-model benchmark report β€” methodology, per-model verdicts |
| [compare/README.md](compare/README.md) | Base vs SFT comparison harness |
| [aws_infra/README.md](aws_infra/README.md) | Vendored MiniStack upstream documentation |
---
*Built for the **OpenEnv Hackathon 2026** β€” Apr 26, 2026. Questions / feedback? Open an issue or PR at [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env).*
### Short explanation video
- [Recorded Video](https://share.zight.com/NQu0pLvQ)