---
title: AWS RL Environment Server
emoji: πŸ₯‡
colorFrom: pink
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# AWS Cloud Operations β€” RL Environment & Training Pipeline
> Cloud agents fail in production not because they don’t know the commands β€” but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent’s own weak spots. After SFT β†’ GRPO on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact-match jumped 39% β†’ 89%, and intermediate-tier success climbed 81% β†’ 87%.
| | |
|---|---|
| **Live demo** | [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web) β€” try the playground in a browser |
| **API docs** | [sizzing-aws-rl-env.hf.space/docs](https://sizzing-aws-rl-env.hf.space/docs) (Swagger), [/redoc](https://sizzing-aws-rl-env.hf.space/redoc) |
| **HF Space** | [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env) |
| **SFT adapter**| [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) |
| **Dataset** | [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft) |
---
## Table of contents
1. [What this is & why it matters](#1-what-this-is--why-it-matters)
2. [Highlights β€” full feature inventory](#2-highlights--full-feature-inventory)
3. [Architecture](#3-architecture)
4. [Live demo & Quick Start](#4-live-demo--quick-start)
5. [Run on Colab](#5-run-on-colab)
6. [Action / Observation spec](#6-action--observation-spec)
7. [Curriculum & Reward (overview)](#7-curriculum--reward-overview)
8. [Training pipeline (SFT β†’ GRPO)](#8-training-pipeline-sft--grpo)
9. [Parallel rollout architecture](#9-parallel-rollout-architecture)
10. [MiniStack: vendored & customized](#10-ministack-vendored--customized)
11. [Results & Benchmarks](#11-results--benchmarks)
12. [Repository map](#12-repository-map)
13. [Configuration & Running](#13-configuration--running)
14. [Testing](#14-testing)
15. [Tech stack](#15-tech-stack)
16. [Links](#16-links)
17. [Acknowledgments](#17-acknowledgments)
---
## 1. What this is & why it matters
Modern AI agents are increasingly asked to operate cloud infrastructure β€” provisioning resources, fixing misconfigurations, responding to drift. Training such agents needs (a) a realistic environment, (b) reliable reward signals, and (c) enough scale to make RL feasible. Existing options force a hard tradeoff: real AWS costs hundreds of dollars per training run and is impossible to reset; toy emulators don't behave like production AWS.
**This project closes that gap.** We built:
1. **An OpenEnv-compatible RL environment** that speaks real AWS CLI semantics. The agent sends `aws s3 mb …`, `aws iam create-role …`, and so on β€” the exact same commands a human SRE would type.
2. **A vendored, customized MiniStack simulator** that responds with production-equivalent JSON, runs locally for zero cost, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
3. **A 120+ task curriculum** across 5 tiers (warmup β†’ expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios β€” every feature designed to keep the reward signal honest and prevent the agent from gaming it.
4. **A complete SFT β†’ GRPO training pipeline.** A 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
5. **An 8-way parallel-rollout architecture.** Server-side MiniStack pool, client-side `GrpoPool`, in-process `MultiTurnEnvPool` β€” three coordinated layers that let G=8 concurrent rollouts run on one GPU without state contamination.
Everything is reproducible: the dataset is generated by a deterministic script, the model selection is documented end-to-end, training entry points run on Colab, and the env runs locally in a single Docker container with no external network requirement.
---
## 2. Highlights β€” full feature inventory
This is the complete surface area of the project. Each entry links to deeper documentation in the corresponding sub-README.
### Environment & Curriculum
- **[120+ tasks across 5 tiers plus drift](server/services/tasks/)** β€” warmup (25), beginner (25), intermediate (25), advanced (25), expert (24), drift (9). YAML-defined task spec per tier.
- **[Curriculum learning with priority scoring](server/README.md#7-curriculum-manager)** β€” `score = novelty + weakness βˆ’ recency + spaced_rep_bonus` drives task selection.
- **[Mastery tracking](server/README.md#7-curriculum-manager)** β€” sliding 10-episode window, 0.7 threshold, 0.85 exponential decay, supports un-graduation.
- **[Spaced repetition](server/README.md#7-curriculum-manager)** β€” graduated tasks resurface at intervals `[3, 6, 12, 24, 48]` to prevent forgetting.
- **[Tier promotion](server/README.md#7-curriculum-manager)** β€” standard (min episodes + success rate) + fast-track (3 consecutive β‰₯90% episodes).
- **[Strategy pattern: simulator vs real AWS](server/README.md#4-strategy-pattern-simulator-vs-real-aws)** β€” `BACKEND_TYPE=simulator` (default) or `aws`, no code fork.
### Reward shaping
- **[Five grading strategies](server/README.md#8-reward-shaping--taskgrader)** β€” command-match (warmup), resource-creation (beginner), multi-step (intermediate), multi-step+services (advanced), state-checks (expert).
- **[Dense partial-progress signal](server/README.md#8-reward-shaping--taskgrader)** β€” clamped to `[0.0, 0.99]`, `1.0` reserved for verified completion.
- **[Rollback penalty](server/README.md#8-reward-shaping--taskgrader)** β€” `βˆ’0.1` per `(create-X, …, delete-X)` pair.
- **[Idempotency bonus](server/README.md#8-reward-shaping--taskgrader)** β€” `+0.02` for graceful "already exists" retry.
- **[Hint decay](server/README.md#13-hint-provider)** β€” three-level progressive hints with `0.85^n` reward multiplier.
- **[Chaos survival bonus](server/README.md#11-chaos-engine)** β€” `Γ—1.05` if the agent completes a chaotic task.
### Resilience & adversarial features
- **[Chaos injection](server/README.md#11-chaos-engine)** β€” silent mid-episode mutations, tier-scaled probabilities (10/20/30%) on services the task is touching.
- **[Drift detection](server/README.md#12-drift-engine)** β€” 9 drift tasks, 2–3 random mutations from a per-task pool, randomized per episode (no memorization).
- **[Security-posture audit tasks](server/README.md#17-security-posture-audit-examples)** β€” S3 public bucket lockdown, IAM least-privilege, Lambda secret rotation.
- **[8-layer anti-reward-hacking](server/README.md#9-anti-reward-hacking--8-defense-layers)** β€” ground-truth verification, dedup, grader invisibility, command allow-list, no-credit-for-reads, monotonic progress, exact resource-name validation, final state checks.
### Training pipeline
- **[Synthetic SFT dataset (1,500 rows)](data/README.md)** β€” 5 trajectory types: success / multi-step continuation / failure recovery / verification / hint usage.
- **[Rigorous base-model selection](data/sft/MODEL_EVALUATION.md)** β€” 11 models Γ— 27 prompts, [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) wins.
- **[LoRA SFT](train/README.md#1-sft-stage--supervised-lora)** β€” `r ∈ {8,16,32}`, `lora_alpha = r Γ— multiplier`, attention-only adaptation.
- **[GRPO RL via TRL](train/README.md#2-grpo-stage--reinforcement-learning)** β€” group-relative advantages, KL to SFT reference, `dapo` loss, no critic.
- **[Multi-turn rollouts](train/README.md#4-multi-turn-rollouts--parallel-envs)** β€” up to `MAX_TURNS=6`, observation fed back as next-turn user message.
- **[Optuna hyperparameter search](train/README.md#3-optuna-hyperparameter-search)** β€” TPE sampler over 8-dim space, frozen held-out validation set.
- **[HuggingFace integration](data/README.md#7-huggingface-publishing)** β€” adapter + dataset published to Hub, OpenEnv Space deployment.
### Parallel rollout architecture
- **[Server-side MiniStack pool](server/README.md#6-server-side-ministack-pool-parallel-rollouts)** β€” `MiniStackPool` ([server/app.py](server/app.py)), free-list of ports, lock-guarded acquire/release.
- **[Client-side GrpoPool](scripts/README.md#2-three-coordinated-pool-layers)** β€” async-native, all-or-nothing connect, asyncio.gather for concurrent rollouts.
- **[In-process MultiTurnEnvPool](train/README.md#4-multi-turn-rollouts--parallel-envs)** β€” sync API, owns a background asyncio loop, used by the trainer.
- **[8 isolated rollouts on one server](scripts/README.md#7-running-the-multi-connection-demo)** β€” proof in [scripts/TestMultipleConnects.ipynb](scripts/TestMultipleConnects.ipynb).
### Vendored simulator
- **[MiniStack as git subtree](server/README.md#5-ministack-vendored-fork--customizations)** β€” vendored at [aws_infra/](aws_infra/) (commit `2c38c0b`). 34 AWS services. MIT.
- **[Custom `/_ministack/state` endpoint](server/README.md#5-ministack-vendored-fork--customizations)** β€” added in commit `a648c3a`; returns full infra inventory in one call.
- **[Upstream sync workflow](server/README.md#5-ministack-vendored-fork--customizations)** β€” periodic `git subtree pull`; isolated patches keep conflicts minimal.
### Operations & deployment
- **[OpenEnv-compliant](https://github.com/openai/openenv)** β€” `/reset`, `/step`, `/state`, `/schema`, `/ws` HTTP+WebSocket endpoints.
- **[Web playground UI](server/README.md#19-web-playground)** β€” `/web` route, 40 AWS service icons, Jinja2 + JS frontend.
- **[Docker-first deployment](Dockerfile)** β€” multi-stage build, container ships server + N MiniStack instances + AWS CLI.
- **[Comprehensive test suite](#14-testing)** β€” 10 unit tests + 6 tier-integration suites covering 133 tasks.
---
## 3. Architecture
> ![System architecture](docs/figures/architecture_diagram.png)
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ Docker container ──────────────────────────────────┐
β”‚ β”‚
β”‚ FastAPI server (port 8000) β”‚
β”‚ β”œβ”€β”€ OpenEnv router /reset /step /state /schema /ws /health β”‚
β”‚ β”œβ”€β”€ Web playground /web (Jinja2 + 40 AWS icon SVGs) β”‚
β”‚ β”œβ”€β”€ env_factory per-WS-session AwsRlEnvironment instance β”‚
β”‚ β”‚ (acquires a MiniStack port from MiniStackPool) β”‚
β”‚ └── Services β”‚
β”‚ Curriculum Β· TaskGrader Β· ResourceVerifier Β· ChaosEngine Β· DriftEngine β”‚
β”‚ HintProvider Β· EpisodeTracker Β· EnvironmentDesigner Β· EnvironmentStrategy β”‚
β”‚ β”‚
β”‚ β”‚
β”‚ MiniStack instances :4566 :4567 :4568 … :4566+POOL_SIZE-1 β”‚
β”‚ (vendored at aws_infra/, started by the Dockerfile entrypoint) β”‚
β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–² β–²
β”‚ HTTP/WS β”‚ AWS CLI subprocess
β”‚ β”‚ (AWS_ENDPOINT_URL=http://localhost:4566+i)
β”‚ β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RL Agent β”‚ β”‚ AWS CLI commands β”‚
β”‚ the agent emits β”‚ β”‚ (client.py) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Episode lifecycle
1. **`reset()`** β€” wipes simulator state, picks next task from the curriculum, runs `setup_commands`, applies drift if applicable, returns initial observation.
2. **`step(action)`** β€” validates the command (must start with `aws `), intercepts hint requests, executes via the strategy, records in tracker, grades with shaped reward, optionally injects chaos, returns observation.
3. **Hint** β€” agent sends `aws help --task-hint`; intercepted before reaching MiniStack; returns next-level hint, increments `hints_used` (which decays final reward by `0.85^n`).
4. **Termination** β€” `task_achieved=True` or `step_count >= MAX_STEPS` (default 15).
Full mechanics in [server/README.md](server/README.md).
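To make the lifecycle concrete, here's a minimal agent-loop sketch against the client API from Β§4; `my_policy` is a hypothetical stand-in for whatever model drives the agent:
```python
from aws_rl_env import AwsRlAction, AwsRlEnv

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()                          # 1. curriculum picks a task
while not result.done:
    obs = result.observation
    if obs.step_count == 3 and obs.hints_used == 0:
        # 3. stuck? ask for a hint (final reward decays by 0.85 per hint)
        result = env.step(AwsRlAction(command="aws help --task-hint"))
        continue
    command = my_policy(obs)                  # hypothetical policy call
    result = env.step(AwsRlAction(command=command))  # 2. validate, execute, grade
# 4. terminated: task_achieved or step_count >= MAX_STEPS (default 15)
```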
---
## 4. Live demo & Quick Start
### Try it in a browser
The hosted playground lets you click around any task without writing code:
> **[Hugging Face Spaces Playground](https://sizzing-aws-rl-env.hf.space/web#playground)**
### Python client
```python
from aws_rl_env import AwsRlAction, AwsRlEnv
with AwsRlEnv.from_docker_image("aws-rl-env:latest") as env:
    result = env.reset()
    print(f"Task: {result.observation.task.description}")
    result = env.step(AwsRlAction(command="aws s3 mb s3://my-bucket"))
    print(f"Reward: {result.reward}, Done: {result.done}")
```
Or against a running server:
```python
env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
result = env.step(AwsRlAction(command="aws s3 ls"))
```
### WebSocket API
```python
import asyncio, json, websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())
        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

asyncio.run(main())
```
### Local Docker
```bash
make docker-build # build the image
make docker-run # foreground; serves on :8000
make docker-run-detach # background
make docker-health # liveness probe
```
For training (8-way parallel rollouts):
```bash
AWS_RL_ENV_POOL_SIZE=8 make run
```
---
## 5. Run on Colab
The full pipeline is reproducible on a Colab GPU runtime. Drop your HF token into Colab Secrets, set `ENV_BASE_URL` to your HF Space (or a local server exposed via ngrok), and run.
| Notebook | What it does | Open in Colab |
|-------------------------------------------------------------------------------------|-------------------------------------------------------|----------------------------------------------|
| [train/train_sft_lora.ipynb](train/train_sft_lora.ipynb) | Stage 1 β€” SFT LoRA fine-tuning of Qwen2.5-Coder-3B | https://colab.research.google.com/drive/1dm9sDaLxHX6s9zEG_SC0FQcKWKkc3TfL?usp=sharing|
| [train/train_grpo_lora.ipynb](train/train_grpo_lora.ipynb) | Stage 2 β€” GRPO RL training with multi-turn rollouts | https://colab.research.google.com/drive/1NwiOM0h_JpXXGRxfY_xZtDiaigvIaKjx?usp=sharing |
| [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb) | Side-by-side: base model vs SFT adapter (dataset + RL env) | https://colab.research.google.com/drive/17406aiad8h4nAphV42vVNZ-a5SzZMIre?usp=sharing |
---
## 6. Action / Observation spec
The full Pydantic data models β€” kept inline so any reader can wire up an agent without leaving this page. Source: [models.py](models.py).
### Action
```python
class AwsRlAction(Action):
command: str # AWS CLI command, e.g. "aws s3 ls"
```
The environment validates that `command` starts with `aws `; anything else is rejected with `success=False`.
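For example, reusing the client from Β§4:
```python
result = env.step(AwsRlAction(command="ls -la"))    # not an AWS CLI command
assert result.observation.command_success is False  # rejected by the prefix check
```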
### Observation
```python
class AwsRlObservation(Observation):
episode_id: EpisodeID
step_count: StepCount
command_success: bool # exit code == 0
command_output: str # stdout from the AWS CLI invocation
error: str # stderr (empty if success)
task: TaskInfo | None # masked task definition (no success criteria)
task_achieved: bool
partial_progress: float # current task progress in [0.0, 1.0]
hints_used: int # cumulative hint count this episode
hint_text: str # most recent hint text (if any)
```
### State
```python
class AwsRlState(State):
current_task: Task | None # full task assigned for the episode
tracker: TrackerState # episode tracker snapshot
infra_state: dict # AWS infrastructure state keyed by service name
chaos_occurred: bool # whether chaos was injected this episode
current_tier: str # agent's current difficulty tier
class TrackerState:
step_count: int # steps taken this episode
hints_used: int # hints requested this episode
progress: float # current partial progress [0.0, 1.0]
commands_executed: list[str] # commands executed this episode
credited_operations: list[str] # (operation, resource) pairs that earned credit
```
### Task definitions
```python
class Task:
task_id: TaskID
difficulty: TaskDifficulty # warmup | beginner | intermediate | advanced | expert
description: str # human-readable goal
success_criteria: SuccessCriteria
setup_commands: list[SetupCommand] # pre-provision for SRE tasks
desired_state_spec: str | None # natural-language desired end state (drift tasks)
possible_drifts: list[SetupCommand] # pool of mutations for DriftEngine
class TaskInfo:
"""Agent-visible subset of Task β€” masks success_criteria, setup_commands, and possible_drifts."""
task_id: TaskID
difficulty: TaskDifficulty
description: str
desired_state_spec: str | None
class SuccessCriteria:
command_contains: str | None # warmup/beginner
operation: str | None # warmup/beginner
resource_exists: ResourceExistsCheck | None # beginner
steps: list[StepCriteria] # intermediate/advanced/expert
services: list[AwsService] # advanced/expert
state_checks: list[StateCheck] # expert (ground truth)
```
### Curriculum config
```python
class TierConfig:
min_episodes: int # minimum episodes before promotion
advance_rate: float # tier success rate threshold (0.6 - 1.0)
mastery_window: int # sliding window size (default: 10)
mastery_threshold: float # per-task graduation threshold (default: 0.7)
fast_track_rate: float # early promotion threshold (default: 0.9)
chaos_probability: float # probability of chaos injection per step
class SpacedRepState:
interval: int # episodes until next re-test (3 β†’ 48)
last_graduated_episode: int # when last graduated
```
---
## 7. Curriculum & Reward (overview)
The curriculum and reward stack is the heart of the project. This section is the elevator pitch; **the full mechanics β€” priority scoring math, anti-reward-hacking layers, chaos engine, drift engine β€” live in [server/README.md](server/README.md)**.
### Priority scoring (one-formula task selection)
```
score = novelty_bonus # +100 if never attempted
+ weakness_weight # +50 Γ— (1 βˆ’ task_success_rate)
+ spaced_rep_bonus # +30 if a graduated task is "due" for re-test
βˆ’ recency_penalty # βˆ’20 if attempted in the last 2 episodes
```
Exploration, weakness-targeting, anti-forgetting, and variety β€” all balanced by one weighted sum.
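A minimal sketch of that selection rule in Python, using the weights from the comments above; the `stats` bookkeeping fields are illustrative, not the curriculum manager's real attribute names:
```python
def priority_score(stats: dict, episode: int) -> float:
    """Illustrative restatement of the curriculum's weighted sum."""
    score = 0.0
    if stats.get("attempts", 0) == 0:
        score += 100.0                                      # novelty bonus
    score += 50.0 * (1.0 - stats.get("success_rate", 0.0))  # weakness weight
    if stats.get("spaced_rep_due", False):
        score += 30.0                                       # graduated task due for re-test
    if episode - stats.get("last_attempt", -10) <= 2:
        score -= 20.0                                       # recency penalty
    return score

# Pick the next task: highest score wins.
# next_task = max(tasks, key=lambda t: priority_score(stats[t], episode))
```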
### Reward shaping
```
if task_achieved:
    reward = 1.0
    if survived_chaos: reward *= 1.05       # chaos survival bonus
else:
    reward = partial_progress * 0.8         # ≀ 0.8 from steps alone
    if progress_increased: reward += 0.1    # dense progress signal
    if command_failed: reward *= 0.5        # error penalty
    reward -= 0.1 * rollback_count          # waste penalty
    reward += 0.02 * idempotent_retries     # graceful retry bonus
    reward = clamp(reward, 0.0, 0.99)       # 1.0 reserved for completion
reward *= 0.85 ** hints_used                # hint decay applied last
```
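Worked example: completing a chaotic task with one hint used nets `1.0 Γ— 1.05 Γ— 0.85 β‰ˆ 0.89`; a failed command stuck at 0.5 partial progress earns at most `(0.5 Γ— 0.8) Γ— 0.5 = 0.2` before any rollback penalty.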
The agent's loss surface is intentionally narrow: only doing the task earns full reward, and every reward-hacking shortcut we identified during design has a defense layer (full list in [server/README.md Β§9](server/README.md#9-anti-reward-hacking--8-defense-layers)).
> ![Curriculum progression: 5 tiers, priority scoring formula, mastery + spaced rep + fast-track](docs/figures/curriculum_progression.png)
---
## 8. Training pipeline (SFT β†’ GRPO)
The training pipeline runs in two stages, both reproducible on Colab. Full detail in **[train/README.md](train/README.md)**.
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ data/sft/ ──────────┐
β”‚ 1,500 train Β· 150 val rows β”‚
β”‚ 5 trajectory types β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–Ό
STAGE 1 β€” Supervised Fine-Tuning train/train_sft_lora.ipynb
Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) β†’ SFT adapter
β”‚
β”‚ Sizzing/aws-rl-sft-qwen25coder3b-adapter
β–Ό
STAGE 2 β€” GRPO RL train/train_grpo_lora.ipynb
G=8 parallel rollouts Β· multi-turn Β· reward = env return
Optuna over (lr, Ξ², G, T, top_p, lora_r, max_turns)
```
### Numbers worth knowing
| | |
|---|---|
| **Base model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` β€” picked via the [11-model evaluation](data/sft/MODEL_EVALUATION.md) |
| **SFT LoRA** | `r ∈ {8,16,32}`, `lora_alpha = r Γ— multiplier`, target = attention only, dropout `[0.005, 0.031]` |
| **GRPO config** | `G=8`, `Ξ²=0.04`, `lr=5e-6`, `T=0.9`, `top_p=0.95`, `max_turns=6`, loss=`dapo` |
| **Optuna search** | TPE sampler, 6 trials Γ— 30 GRPO steps, frozen 10-task held-out val set |
| **Final training** | 200 GRPO steps with best config |
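As a rough sketch of how those numbers map onto TRL's `GRPOConfig` (the repo's actual wiring lives in [train_grpo.py](train_grpo.py); this block is illustrative, not a copy of it):
```python
from trl import GRPOConfig

config = GRPOConfig(
    num_generations=8,    # G: parallel rollouts per task per step
    beta=0.04,            # KL penalty toward the SFT reference policy
    learning_rate=5e-6,
    temperature=0.9,      # rollout sampling temperature T
    top_p=0.95,
    loss_type="dapo",
    max_steps=200,        # final training run length
)
```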
### Training graphs
> ![SFT loss curve](docs/figures/sft_loss_curve.png)
> ![GRPO mean reward over training](docs/figures/grpo_reward_curve.png)
> ![Per-rollout reward by curriculum tier](docs/figures/grpo_per_tier_curve.png)
> ![Optuna parameter importance](docs/figures/optuna_param_importance.png)
---
## 9. Parallel rollout architecture
GRPO needs `G` rollouts on the same task per training step. We run all G in parallel with **state isolation guaranteed**. Three coordinated pool layers make it work:
```
Trainer (G=8 generations needed per step)
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β–Ό β–Ό β–Ό
MultiTurnEnvPool GrpoPool (in-process)
(train_grpo.py) (scripts/grpo_pool.py)
sync API async API
β”‚ β”‚
└─────── 8 WebSocket connections β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
FastAPI server :8000
+ OpenEnv max_concurrent_envs=8
β”‚
β–Ό
MiniStackPool (free-list, lock-guarded)
acquire(port) on connect, release on disconnect
β”‚
β–Ό
8 isolated MiniStack instances :4566..:4573
```
Wall-clock impact: an 8-rollout Γ— 6-turn episode runs in ~300 ms of env time vs ~2.4 s sequential. Full mechanics, including the **all-or-nothing connect protocol** that prevents pool-slot leakage on flake, are in **[scripts/README.md](scripts/README.md)**.
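In the spirit of that protocol, a minimal all-or-nothing connect can be sketched like this (names are illustrative, not the actual `GrpoPool` API):
```python
import asyncio

async def connect_all(make_conn, n: int = 8):
    """Open n env connections concurrently; on any failure, close the rest."""
    results = await asyncio.gather(
        *(make_conn() for _ in range(n)), return_exceptions=True
    )
    conns = [r for r in results if not isinstance(r, BaseException)]
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        # all-or-nothing: release every pool slot that did get acquired
        await asyncio.gather(*(c.close() for c in conns), return_exceptions=True)
        raise errors[0]
    return conns
```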
> ![Parallel rollout: 3 coordinated pool layers](docs/figures/parallel_rollout_diagram.png)
---
## 10. MiniStack: vendored & customized
The simulator powering the env is **vendored** as a git subtree at [aws_infra/](aws_infra/), not pulled as a black-box dependency. We forked it because we needed:
1. A custom `/_ministack/state` JSON endpoint so the grader can read the entire infra inventory in **one HTTP call** instead of iterating 20+ list APIs per grading pass. Added in commit `a648c3a "feat: Add support for service state retrieval and action listing across multiple AWS services"`.
2. A reproducible build with no runtime network requirement β€” the Docker image bundles a specific MiniStack revision.
3. The freedom to extend service coverage on demand.
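To make (1) concrete, a grader-side read of the custom endpoint might look like this; the keyed-by-service payload shape shown here is an assumption for illustration:
```python
import httpx

# One call replaces 20+ per-service list APIs per grading pass.
state = httpx.get("http://localhost:4566/_ministack/state").json()
print(state.get("s3"))  # e.g. the S3 portion of the inventory (shape assumed)
```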
Custom commits live as small, isolated patches so periodic upstream syncs (`af2e945`, `579597b`) replay cleanly. To inspect:
```bash
git show a648c3a # the state-endpoint diff
git log --oneline -- aws_infra/ # only the aws_infra subtree history
```
Full subtree workflow + commit-by-commit detail in [server/README.md Β§5](server/README.md#5-ministack-vendored-fork--customizations). Upstream MiniStack docs (81 KB) are preserved at [aws_infra/README.md](aws_infra/README.md).
---
## 11. Results & Benchmarks
### Base-model selection
We evaluated 11 chat models on 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** wins on every metric that matters: 41% exact match (highest), 63% operation match (highest), 3.1 s/call (3Γ— faster than the 4B runner-up). Full report:
> **[data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md)** β€” 270-line writeup, per-model verdicts, methodology
> ![Top 4 candidate models on the held-out benchmark](docs/figures/model_eval_chart.png)
### Base vs SFT β€” actual results
After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:
| Metric | Base | Post-SFT | Delta |
|-----------------|:------:|:--------:|:-----------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
| `service_pct` | 77.8% | **88.9%** | +11.1 pp |
| `operation_pct` | 61.1% | **88.9%** | +27.8 pp |
| `avg_len` | 85.8 | 74.7 | βˆ’11 chars (tighter) |
> ![Base vs SFT eval-metrics comparison](docs/figures/base_vs_sft_success.png)
Every target from [data/sft/MODEL_EVALUATION.md Β§11](data/sft/MODEL_EVALUATION.md) is met or exceeded. Format compliance is now perfect; the model never wraps commands in fences or quotes after SFT. Exact-match jumped from 39% to 89% β€” the agent now emits the canonical command for ~9 of every 10 prompts.
The richer two-mode benchmark (dataset eval + live RL env eval) is in [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb); methodology in [compare/README.md](compare/README.md).
> ![Dataset comparison: base vs SFT (per-row scores)](docs/figures/compare_dataset.png)
> ![RL env comparison: base vs SFT (per-episode rewards)](docs/figures/compare_rl_env.png)
### SFT training curves
> ![SFT loss curve over training](docs/figures/sft_loss_curve.png)
### Optuna SFT search
The best SFT trial (out of 6) used `lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1` β€” see [train/README.md Β§3](train/README.md#3-optuna-hyperparameter-search) for the full Optuna study table.
> ![Optuna parameter importances](docs/figures/optuna_param_importance.png)
> ![Optuna optimization history](docs/figures/optuna_history.png)
### GRPO results (live multi-step env eval)
After 35 GRPO steps on top of the SFT adapter (best Optuna config: `lr=1.6e-5, Ξ²=0.0021, T=0.99`), we re-evaluated end-to-end on 100+ episodes:
| Metric | Base + SFT | Base + SFT + GRPO | Ξ” |
|-------------------------------|:---------:|:-----------------:|:------------:|
| Overall success rate | 86.8% | 86.2% | βˆ’0.6 pp |
| Overall mean reward | 0.883 | 0.877 | βˆ’0.006 |
| Beginner success | 96.2% | **100.0%** | **+3.8 pp** |
| Intermediate success | 81.0% | **87.0%** | **+6.0 pp** |
| Warmup success | 96.0% | 90.2% | βˆ’5.8 pp |
| Expert success | 22.2% | 22.2% | flat |
| Drift repair rate | 22.2% | 22.2% | flat |
| Destructive-action fail rate | 15.1% | 14.7% | βˆ’0.4 pp |
| Steps to solve | 1.45 | 1.55 | +0.10 |
> ![SFT vs GRPO metrics grid](docs/figures/sft_vs_grpo_metrics_grid.png)
> ![SFT vs GRPO by tier](docs/figures/sft_vs_grpo_by_tier.png)
**Honest reading:** the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers (beginner +3.8 pp, intermediate +6.0 pp) β€” but does not crack the **expert-tier bottleneck** (22% success on SRE / drift / security-posture tasks). With longer GRPO runs and more curriculum exposure to expert tasks, this is the next gain to chase.
### GRPO training curves
Per-step training signals from the final 35-step GRPO run:
> ![GRPO final per-step training signals](docs/figures/grpo_final_per_step.png)
> ![GRPO env reward over training](docs/figures/grpo_reward_curve.png)
Optuna search across 4 trials picked the final config:
> ![GRPO Optuna trial comparison](docs/figures/grpo_optuna_trials_comparison.png)
> ![GRPO Optuna parameter importances](docs/figures/grpo_optuna_importances.png)
> ![GRPO Optuna optimization history](docs/figures/grpo_optuna_history.png)
### Qualitative rollouts (post-GRPO)
One sample episode per tier:
> ![Qualitative rollouts on representative tasks](docs/figures/qualitative_rollouts.png)
---
## 12. Repository map
| Path | Purpose | Sub-README |
|--------------------------------|--------------------------------------------------------------------|-----------------------------------------|
| [server/](server/) | OpenEnv FastAPI server, env logic, services, web playground | [server/README.md](server/README.md) |
| [train/](train/) | SFT and GRPO training notebooks | [train/README.md](train/README.md) |
| [data/](data/) | SFT dataset, base-model selection, eval harness | [data/README.md](data/README.md) Β· [MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) |
| [compare/](compare/) | Base vs SFT side-by-side benchmark | [compare/README.md](compare/README.md) |
| [scripts/](scripts/) | Parallel-rollout architecture + multi-connection demo | [scripts/README.md](scripts/README.md) |
| [aws_infra/](aws_infra/) | Vendored MiniStack simulator (git subtree) | [aws_infra/README.md](aws_infra/README.md) |
| [tests/](tests/), [tests_tasks/](tests_tasks/) | Unit + tier-integration test suites | (see [Β§14](#14-testing)) |
| [models.py](models.py) | Pydantic data models for action/observation/task | (inline Β§6) |
| [client.py](client.py) | OpenEnv HTTP/WebSocket client wrapper | β€” |
| [inference.py](inference.py) | Single-model agent loop (matches RL eval mode of `compare/`) | β€” |
| [train_grpo.py](train_grpo.py) | GRPO trainer (1,283 LOC) β€” `MultiTurnEnvPool`, Optuna, plotting | (see [train/README.md](train/README.md)) |
| [aws_rl_env_colab.ipynb](aws_rl_env_colab.ipynb) | Colab driver for the full training pipeline | β€” |
| [docs/figures/](docs/figures/) | All README graphs and screenshots | β€” |
---
## 13. Configuration & Running
### Docker (recommended)
```bash
make docker-build # build the image
make docker-run # foreground on :8000
make docker-run-detach # background
make docker-health # liveness probe
```
### OpenEnv deployment
```bash
make openenv-validate # validate config
make openenv-build # build environment
make openenv-push # push to HuggingFace Spaces
```
### Environment variables
| Variable | Default | Description |
|-------------------------------------|--------------------------|-------------------------------------------------------------------|
| `AWS_INFRA_URL` | `http://localhost:4566` | MiniStack endpoint (used when `POOL_SIZE=1`) |
| `AWS_RL_ENV_POOL_SIZE` | `1` | **Server-side MiniStack pool size; set to 8 for GRPO training** |
| `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers `[BASE, BASE + POOL_SIZE)` |
| `BACKEND_TYPE` | `simulator` | `simulator` (MiniStack) or `aws` (real AWS, no pool) |
| `AWS_ACCESS_KEY_ID` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_SECRET_ACCESS_KEY` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region |
| `MAX_STEPS` | `15` | Max steps per episode |
| `API_BASE_URL` | β€” | LLM API endpoint for [inference.py](inference.py) |
| `MODEL_NAME` | β€” | LLM model name for [inference.py](inference.py) |
| `HF_TOKEN` | β€” | HuggingFace token (dataset/adapter access, push) |
| `TEMPERATURE` | `0.7` | LLM sampling temperature |
### Curriculum stats API
```python
curriculum.get_stats()
# {
# "episode_count": 42,
# "tier": "intermediate",
# "tier_episodes": 12,
# "tier_success_rate": 0.75,
# "graduated_tasks": [0, 2, 4],
# "weak_spots": [11, 12],
# "skill_profile": {0: 0.95, 1: 0.8, ...},
# "spaced_rep_due": [0, 2],
# "avg_reward_last_10": 0.65
# }
```
---
## 14. Testing
The test suite covers both isolated unit logic and end-to-end task execution against MiniStack.
### Unit tests β€” [tests/](tests/)
```bash
pytest tests/ -v
```
| File | Covers |
|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------|
| [test_aws_rl_env_environment.py](tests/test_aws_rl_env_environment.py) | Environment lifecycle, reset/step semantics, reward integration |
| [test_task_grader.py](tests/test_task_grader.py) | All 5 grading strategies, partial progress, penalties, bonuses |
| [test_resource_verifier.py](tests/test_resource_verifier.py) | Per-service ground-truth verification (20+ services) |
| [test_episode_tracker.py](tests/test_episode_tracker.py) | Command parsing, dedup, monotonic progress, rollback detection |
| [test_episode_context.py](tests/test_episode_context.py) | Per-episode context lifecycle |
| [test_drift_engine.py](tests/test_drift_engine.py) | Random drift selection, mutation application |
| [test_hint_provider.py](tests/test_hint_provider.py) | Three-level progressive hints, decay computation |
| [test_environment_designer.py](tests/test_environment_designer.py) | Setup-command provisioning |
| [test_pool.py](tests/test_pool.py) | Server-side `MiniStackPool` acquire/release, exhaustion |
| [test_grpo_pool.py](tests/test_grpo_pool.py) | Client-side `GrpoPool` connect/close, all-or-nothing rollback |
### Tier integration tests β€” [tests_tasks/](tests_tasks/)
```bash
pytest tests_tasks/ -v
```
133 tasks exercised end-to-end:
| File | Tasks |
|-----------------------------------------------------------------------------------------------------|------:|
| [test_warmup_tasks.py](tests_tasks/test_warmup_tasks.py) | 25 |
| [test_beginner_tasks.py](tests_tasks/test_beginner_tasks.py) | 25 |
| [test_intermediate_tasks.py](tests_tasks/test_intermediate_tasks.py) | 25 |
| [test_advanced_tasks.py](tests_tasks/test_advanced_tasks.py) | 25 |
| [test_expert_tasks.py](tests_tasks/test_expert_tasks.py) | 24 |
| [test_drift_tasks.py](tests_tasks/test_drift_tasks.py) | 9 |
| **Total** | **133** |
These tests double as the source of truth for canonical solutions used by the SFT dataset generator (extracted via AST β€” see [data/README.md Β§1](data/README.md#1-sft-dataset-generation)).
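A hedged sketch of that AST extraction idea (the generator's real logic is documented in [data/README.md Β§1](data/README.md#1-sft-dataset-generation); names here are illustrative):
```python
import ast
from pathlib import Path

def extract_canonical_commands(test_file: str) -> list[str]:
    """Walk a tier-test file's AST and collect AWS CLI string literals."""
    tree = ast.parse(Path(test_file).read_text())
    return [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith("aws ")
    ]

print(extract_canonical_commands("tests_tasks/test_warmup_tasks.py"))
```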
---
## 15. Tech stack
- **Python 3.12**, [`uv`](https://github.com/astral-sh/uv) for dependency management, multi-stage Docker
- **FastAPI**, **OpenEnv** (HTTP + WebSocket env protocol), **uvicorn**
- **TRL β‰₯ 0.21** (`GRPOTrainer`, `GRPOConfig`)
- **PEFT** (LoRA), **Unsloth** (4-bit quantized base, fused training kernels)
- **Transformers β‰₯ 4.45**, **datasets β‰₯ 2.20**, **HuggingFace Hub β‰₯ 0.24**
- **Optuna β‰₯ 3.6** (TPE sampler, SQLite study storage)
- **asyncio** + **websockets** + **httpx** (parallel rollout orchestration)
- **MiniStack** (vendored at [aws_infra/](aws_infra/), 34 AWS services)
- **AWS CLI v2** (subprocess invocation against MiniStack endpoint)
- **matplotlib**, **plotly** (training curves, Optuna visualizations)
- **pytest** (16 test files, ~250 KB of test code)
---
## 16. Links
- **Live demo**: [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web)
- **HF Space**: [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env)
- **API docs**: [/docs](https://sizzing-aws-rl-env.hf.space/docs) Β· [/redoc](https://sizzing-aws-rl-env.hf.space/redoc)
- **SFT adapter**: [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter)
- **GRPO adapter**: [Sizzing/aws-rl-grpo-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-grpo-qwen25coder3b-adapter)
- **Dataset**: [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft)
- **GitHub**: [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env)
---
## 17. Acknowledgments
- **MiniStack** β€” vendored at [aws_infra/](aws_infra/). Upstream license preserved. Custom modifications attributable to commits `a648c3a`, `a00e981`; periodic upstream syncs `af2e945`, `579597b`.
- **OpenEnv** β€” environment protocol and Python client framework.
- **TRL** (HuggingFace) β€” `GRPOTrainer` implementation.
- **Unsloth** β€” 4-bit quantized model loaders + fused training kernels.
- **Google Colab** β€” GPU infrastructure for the training runs.
- **AWS service icons** in [server/static/img/aws/](server/static/img/aws/) β€” used in the web playground.
---
## Sub-README index
For deep technical detail on any subsystem:
- [server/README.md](server/README.md) β€” environment internals (curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail)
- [train/README.md](train/README.md) β€” SFT + GRPO training pipeline (LoRA config, Optuna search, multi-turn rollouts)
- [scripts/README.md](scripts/README.md) β€” parallel-rollout architecture (3 pool layers, all-or-nothing connect, concurrency safety)
- [data/README.md](data/README.md) β€” dataset generation (5 trajectory types, AST extraction) + base-model selection summary
- [data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) β€” full 11-model benchmark report
- [compare/README.md](compare/README.md) β€” base vs SFT comparison harness
- [aws_infra/README.md](aws_infra/README.md) β€” vendored MiniStack upstream documentation (81 KB)
## Video walkthrough
- [Recorded video walkthrough of the core functionality](https://share.zight.com/NQu0pLvQ)