---
title: AWS RL Environment Server
emoji: 🔥
colorFrom: pink
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# AWS Cloud Operations – RL Environment & Training Pipeline
> Cloud agents fail in production not because they don't know the commands – but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent's own weak spots. After SFT → GRPO on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact-match jumped 39% → 89%, and intermediate-tier success climbed 81% → 87%.
| | |
|---|---|
| **Live demo** | [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web) – try the playground in a browser |
| **API docs** | [sizzing-aws-rl-env.hf.space/docs](https://sizzing-aws-rl-env.hf.space/docs) (Swagger), [/redoc](https://sizzing-aws-rl-env.hf.space/redoc) |
| **HF Space** | [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env) |
| **SFT adapter** | [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) |
| **Dataset** | [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft) |
---
## Table of contents
1. [What this is & why it matters](#1-what-this-is--why-it-matters)
2. [Highlights β full feature inventory](#2-highlights--full-feature-inventory)
3. [Architecture](#3-architecture)
4. [Live demo & Quick Start](#4-live-demo--quick-start)
5. [Run on Colab](#5-run-on-colab)
6. [Action / Observation spec](#6-action--observation-spec)
7. [Curriculum & Reward (overview)](#7-curriculum--reward-overview)
8. [Training pipeline (SFT → GRPO)](#8-training-pipeline-sft--grpo)
9. [Parallel rollout architecture](#9-parallel-rollout-architecture)
10. [MiniStack: vendored & customized](#10-ministack-vendored--customized)
11. [Results & Benchmarks](#11-results--benchmarks)
12. [Repository map](#12-repository-map)
13. [Configuration & Running](#13-configuration--running)
14. [Testing](#14-testing)
15. [Tech stack](#15-tech-stack)
16. [Links](#16-links)
17. [Acknowledgments](#17-acknowledgments)
---
## 1. What this is & why it matters
Modern AI agents are increasingly asked to operate cloud infrastructure – provisioning resources, fixing misconfigurations, responding to drift. Training such agents needs (a) a realistic environment, (b) reliable reward signals, and (c) enough scale to make RL feasible. Existing options force a hard tradeoff: real AWS costs hundreds of dollars per training run and is impossible to reset; toy emulators don't behave like production AWS.
**This project closes that gap.** We built:
1. **An OpenEnv-compatible RL environment** that speaks real AWS CLI semantics. The agent sends `aws s3 mb …`, `aws iam create-role …`, and so on – the exact same commands a human SRE would type.
2. **A vendored, customized MiniStack simulator** that responds with production-equivalent JSON, runs locally for zero cost, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
3. **A 120+ task curriculum** across 5 tiers (warmup → expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios – every feature designed to keep the reward signal honest and prevent the agent from gaming it.
4. **A complete SFT → GRPO training pipeline.** A 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
5. **An 8-way parallel-rollout architecture.** Server-side MiniStack pool, client-side `GrpoPool`, in-process `MultiTurnEnvPool` – three coordinated layers that let G=8 concurrent rollouts run on one GPU without state contamination.
Everything is reproducible: the dataset is generated by a deterministic script, the model selection is documented end-to-end, training entry points run on Colab, and the env runs locally in a single Docker container with no external network requirement.
---
## 2. Highlights β full feature inventory
This is the complete surface area of the project. Each entry links to deeper documentation in the corresponding sub-README.
### Environment & Curriculum
- **[120+ tasks across 5 tiers](server/services/tasks/)** – warmup (25), beginner (25), intermediate (25), advanced (25), expert (24), drift (9). YAML-defined task spec per tier.
- **[Curriculum learning with priority scoring](server/README.md#7-curriculum-manager)** – `score = novelty + weakness − recency + spaced_rep_bonus` drives task selection.
- **[Mastery tracking](server/README.md#7-curriculum-manager)** – sliding 10-episode window, 0.7 threshold, 0.85 exponential decay, supports un-graduation (see the sketch after this list).
- **[Spaced repetition](server/README.md#7-curriculum-manager)** – graduated tasks resurface at intervals `[3, 6, 12, 24, 48]` to prevent forgetting.
- **[Tier promotion](server/README.md#7-curriculum-manager)** – standard (min episodes + success rate) + fast-track (3 consecutive ≥90% episodes).
- **[Strategy pattern: simulator vs real AWS](server/README.md#4-strategy-pattern-simulator-vs-real-aws)** – `BACKEND_TYPE=simulator` (default) or `aws`, no code fork.
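Below is a minimal sketch of how the mastery-tracking and spaced-repetition rules above fit together. Class and method names are hypothetical – the real logic lives in the curriculum manager ([server/README.md §7](server/README.md#7-curriculum-manager)):

```python
from collections import deque

# Spaced-repetition ladder: a graduated task is re-tested after 3
# episodes, then 6, 12, 24, and finally every 48 episodes.
INTERVALS = [3, 6, 12, 24, 48]

class TaskMastery:
    """Hypothetical per-task record mirroring the documented rules."""

    def __init__(self, window: int = 10, threshold: float = 0.7, decay: float = 0.85):
        self.outcomes = deque(maxlen=window)  # sliding 10-episode window
        self.threshold = threshold            # graduation threshold (0.7)
        self.decay = decay                    # exponential decay (0.85)
        self.skill = 0.0                      # decayed success estimate
        self.rung = 0                         # position on the spaced-rep ladder

    def record(self, success: bool) -> None:
        self.outcomes.append(1.0 if success else 0.0)
        # Decay keeps stale successes from masking fresh failures.
        self.skill = self.decay * self.skill + (1 - self.decay) * self.outcomes[-1]

    @property
    def graduated(self) -> bool:
        # A failed re-test later can pull the rate back down, which is
        # how a task "un-graduates".
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and sum(self.outcomes) / len(self.outcomes) >= self.threshold

    def next_retest_in(self) -> int:
        # The real manager advances the rung on each successful re-test.
        return INTERVALS[min(self.rung, len(INTERVALS) - 1)]
```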
### Reward shaping
- **[Five grading strategies](server/README.md#8-reward-shaping--taskgrader)** – command-match (warmup), resource-creation (beginner), multi-step (intermediate), multi-step+services (advanced), state-checks (expert).
- **[Dense partial-progress signal](server/README.md#8-reward-shaping--taskgrader)** – clamped to `[0.0, 0.99]`, `1.0` reserved for verified completion.
- **[Rollback penalty](server/README.md#8-reward-shaping--taskgrader)** – `−0.1` per `(create-X, …, delete-X)` pair.
- **[Idempotency bonus](server/README.md#8-reward-shaping--taskgrader)** – `+0.02` for graceful "already exists" retry.
- **[Hint decay](server/README.md#13-hint-provider)** – three-level progressive hints with `0.85^n` reward multiplier.
- **[Chaos survival bonus](server/README.md#11-chaos-engine)** – `×1.05` if the agent completes a chaotic task.
### Resilience & adversarial features
- **[Chaos injection](server/README.md#11-chaos-engine)** – silent mid-episode mutations, tier-scaled probabilities (10/20/30%) on services the task is touching.
- **[Drift detection](server/README.md#12-drift-engine)** – 6 expert tasks, 2–3 random mutations from a per-task pool, randomized per episode (no memorization).
- **[Security-posture audit tasks](server/README.md#17-security-posture-audit-examples)** – S3 public bucket lockdown, IAM least-privilege, Lambda secret rotation.
- **[8-layer anti-reward-hacking](server/README.md#9-anti-reward-hacking--8-defense-layers)** – ground-truth verification, dedup, grader invisibility, command allow-list, no-credit-for-reads, monotonic progress, exact resource-name validation, final state checks.
### Training pipeline
- **[Synthetic SFT dataset (1,500 rows)](data/README.md)** – 5 trajectory types: success / multi-step continuation / failure recovery / verification / hint usage.
- **[Rigorous base-model selection](data/sft/MODEL_EVALUATION.md)** – 11 models × 27 prompts, [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) wins.
- **[LoRA SFT](train/README.md#1-sft-stage--supervised-lora)** – `r ∈ {8,16,32}`, `lora_alpha = r × multiplier`, attention-only adaptation.
- **[GRPO RL via TRL](train/README.md#2-grpo-stage--reinforcement-learning)** – group-relative advantages, KL to SFT reference, `dapo` loss, no critic.
- **[Multi-turn rollouts](train/README.md#4-multi-turn-rollouts--parallel-envs)** – up to `MAX_TURNS=6`, observation fed back as next-turn user message.
- **[Optuna hyperparameter search](train/README.md#3-optuna-hyperparameter-search)** – TPE sampler over 8-dim space, frozen held-out validation set.
- **[HuggingFace integration](data/README.md#7-huggingface-publishing)** – adapter + dataset published to Hub, OpenEnv Space deployment.
### Parallel rollout architecture
- **[Server-side MiniStack pool](server/README.md#6-server-side-ministack-pool-parallel-rollouts)** – `MiniStackPool` ([server/app.py](server/app.py)), free-list of ports, lock-guarded acquire/release.
- **[Client-side GrpoPool](scripts/README.md#2-three-coordinated-pool-layers)** – async-native, all-or-nothing connect, asyncio.gather for concurrent rollouts.
- **[In-process MultiTurnEnvPool](train/README.md#4-multi-turn-rollouts--parallel-envs)** – sync API, owns a background asyncio loop, used by the trainer.
- **[8 isolated rollouts on one server](scripts/README.md#7-running-the-multi-connection-demo)** – proof in [scripts/TestMultipleConnects.ipynb](scripts/TestMultipleConnects.ipynb).
### Vendored simulator
- **[MiniStack as git subtree](server/README.md#5-ministack-vendored-fork--customizations)** – vendored at [aws_infra/](aws_infra/) (commit `2c38c0b`). 34 AWS services. MIT.
- **[Custom `/_ministack/state` endpoint](server/README.md#5-ministack-vendored-fork--customizations)** – added in commit `a648c3a`; returns full infra inventory in one call.
- **[Upstream sync workflow](server/README.md#5-ministack-vendored-fork--customizations)** – periodic `git subtree pull`; isolated patches keep conflicts minimal.
### Operations & deployment
- **[OpenEnv-compliant](https://github.com/openai/openenv)** – `/reset`, `/step`, `/state`, `/schema`, `/ws` HTTP+WebSocket endpoints.
- **[Web playground UI](server/README.md#19-web-playground)** – `/web` route, 40 AWS service icons, Jinja2 + JS frontend.
- **[Docker-first deployment](Dockerfile)** – multi-stage build, container ships server + N MiniStack instances + AWS CLI.
- **[Comprehensive test suite](#14-testing)** – 10 unit tests + 6 tier-integration suites covering 133 tasks.
---
## 3. Architecture
> 
```
┌──────────────────────────────── Docker container ────────────────────────────────┐
│                                                                                   │
│  FastAPI server (port 8000)                                                       │
│  ├── OpenEnv router    /reset /step /state /schema /ws /health                    │
│  ├── Web playground    /web (Jinja2 + 40 AWS icon SVGs)                           │
│  ├── env_factory       per-WS-session AwsRlEnvironment instance                   │
│  │                     (acquires a MiniStack port from MiniStackPool)             │
│  └── Services                                                                     │
│        Curriculum · TaskGrader · ResourceVerifier · ChaosEngine · DriftEngine     │
│        HintProvider · EpisodeTracker · EnvironmentDesigner · EnvironmentStrategy  │
│                                                                                   │
│  MiniStack instances   :4566 :4567 :4568 … :4566+POOL_SIZE-1                      │
│  (vendored at aws_infra/, started by the Dockerfile entrypoint)                   │
│                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────┘
          ▲                                    ▲
          │ HTTP/WS                            │ AWS CLI subprocess
          │                                    │ (AWS_ENDPOINT_URL=http://localhost:4566+i)
          │                                    │
┌─────────┴──────────┐              ┌──────────┴─────────┐
│      RL Agent      │              │  AWS CLI commands  │
│    (client.py)     │              │   the agent emits  │
└────────────────────┘              └────────────────────┘
```
### Episode lifecycle
1. **`reset()`** – wipes simulator state, picks next task from the curriculum, runs `setup_commands`, applies drift if applicable, returns initial observation.
2. **`step(action)`** – validates the command (must start with `aws `), intercepts hint requests, executes via the strategy, records in tracker, grades with shaped reward, optionally injects chaos, returns observation.
3. **Hint** – agent sends `aws help --task-hint`; intercepted before reaching MiniStack; returns next-level hint, increments `hints_used` (which decays final reward by `0.85^n`).
4. **Termination** – `task_achieved=True` or `step_count >= MAX_STEPS` (default 15).

Full mechanics in [server/README.md](server/README.md).
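To make the lifecycle concrete, here is a minimal hand-rolled agent loop against the client from §4. The `choose_command` policy is a placeholder – a real agent would condition an LLM on the observation fields instead:

```python
from aws_rl_env import AwsRlAction, AwsRlEnv

def choose_command(obs) -> str:
    # Placeholder policy: request one hint, then act. A real agent would
    # prompt a model with obs.task.description, obs.command_output, and obs.error.
    if obs.hints_used == 0:
        return "aws help --task-hint"   # intercepted before MiniStack
    return "aws s3 ls"                  # dummy action

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
print("Task:", result.observation.task.description)

# Episode ends on task_achieved=True or step_count >= MAX_STEPS (15).
while not result.done:
    result = env.step(AwsRlAction(command=choose_command(result.observation)))
    obs = result.observation
    print(f"step={obs.step_count} progress={obs.partial_progress:.2f} "
          f"reward={result.reward:.3f}")
```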
---
## 4. Live demo & Quick Start
### Try it in a browser
The hosted playground lets you click around any task without writing code:
> **[Hugging Face Spaces Playground](https://sizzing-aws-rl-env.hf.space/web#playground)**
### Python client
```python
from aws_rl_env import AwsRlAction, AwsRlEnv
with AwsRlEnv.from_docker_image("aws-rl-env:latest") as env:
    result = env.reset()
    print(f"Task: {result.observation.task.description}")
    result = env.step(AwsRlAction(command="aws s3 mb s3://my-bucket"))
    print(f"Reward: {result.reward}, Done: {result.done}")
```
Or against a running server:
```python
env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
result = env.step(AwsRlAction(command="aws s3 ls"))
```
### WebSocket API
```python
import asyncio, json
import websockets

async def main():
    async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
        await ws.send(json.dumps({"type": "reset"}))
        obs = json.loads(await ws.recv())
        await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
        obs = json.loads(await ws.recv())

asyncio.run(main())
```
### Local Docker
```bash
make docker-build # build the image
make docker-run # foreground; serves on :8000
make docker-run-detach # background
make docker-health # liveness probe
```
For training (8-way parallel rollouts):
```bash
AWS_RL_ENV_POOL_SIZE=8 make run
```
---
## 5. Run on Colab
The full pipeline is reproducible on a Colab GPU runtime. Drop your token into Colab Secrets, set `ENV_BASE_URL` to your HF Space (or local with ngrok), and run.
| Notebook | What it does | Open in Colab |
|-------------------------------------------------------------------------------------|-------------------------------------------------------|----------------------------------------------|
| [train/train_sft_lora.ipynb](train/train_sft_lora.ipynb) | Stage 1 β SFT LoRA fine-tuning of Qwen2.5-Coder-3B | https://colab.research.google.com/drive/1dm9sDaLxHX6s9zEG_SC0FQcKWKkc3TfL?usp=sharing|
| [train/train_grpo_lora.ipynb](train/train_grpo_lora.ipynb) | Stage 2 β GRPO RL training with multi-turn rollouts | https://colab.research.google.com/drive/1NwiOM0h_JpXXGRxfY_xZtDiaigvIaKjx?usp=sharing |
| [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb) | Side-by-side: base model vs SFT adapter (dataset + RL env) | https://colab.research.google.com/drive/17406aiad8h4nAphV42vVNZ-a5SzZMIre?usp=sharing |
---
## 6. Action / Observation spec
The full Pydantic data models – kept inline so any reader can wire up an agent without leaving this page. Source: [models.py](models.py).
### Action
```python
class AwsRlAction(Action):
    command: str   # AWS CLI command, e.g. "aws s3 ls"
```
The environment validates that `command` starts with `aws `; anything else is rejected with `success=False`.
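A sketch of that guard, for clarity (the helper name is hypothetical; the real check lives in the environment's step handler):

```python
def is_valid_action(command: str) -> bool:
    # Anything that does not start with "aws " is rejected with
    # success=False before it ever reaches the simulator.
    return command.startswith("aws ")

assert is_valid_action("aws s3 ls")
assert not is_valid_action("rm -rf /")   # never executed
```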
### Observation
```python
class AwsRlObservation(Observation):
    episode_id: EpisodeID
    step_count: StepCount
    command_success: bool     # exit code == 0
    command_output: str       # stdout from the AWS CLI invocation
    error: str                # stderr (empty if success)
    task: TaskInfo | None     # masked task definition (no success criteria)
    task_achieved: bool
    partial_progress: float   # current task progress in [0.0, 1.0]
    hints_used: int           # cumulative hint count this episode
    hint_text: str            # most recent hint text (if any)
```
### State
```python
class AwsRlState(State):
    current_task: Task | None   # full task assigned for the episode
    tracker: TrackerState       # episode tracker snapshot
    infra_state: dict           # AWS infrastructure state keyed by service name
    chaos_occurred: bool        # whether chaos was injected this episode
    current_tier: str           # agent's current difficulty tier

class TrackerState:
    step_count: int                  # steps taken this episode
    hints_used: int                  # hints requested this episode
    progress: float                  # current partial progress [0.0, 1.0]
    commands_executed: list[str]     # commands executed this episode
    credited_operations: list[str]   # (operation, resource) pairs that earned credit
```
### Task definitions
```python
class Task:
    task_id: TaskID
    difficulty: TaskDifficulty            # warmup | beginner | intermediate | advanced | expert
    description: str                      # human-readable goal
    success_criteria: SuccessCriteria
    setup_commands: list[SetupCommand]    # pre-provision for SRE tasks
    desired_state_spec: str | None        # natural-language desired end state (drift tasks)
    possible_drifts: list[SetupCommand]   # pool of mutations for DriftEngine

class TaskInfo:
    """Agent-visible subset of Task – masks success_criteria, setup_commands, and possible_drifts."""
    task_id: TaskID
    difficulty: TaskDifficulty
    description: str
    desired_state_spec: str | None

class SuccessCriteria:
    command_contains: str | None                  # warmup/beginner
    operation: str | None                         # warmup/beginner
    resource_exists: ResourceExistsCheck | None   # beginner
    steps: list[StepCriteria]                     # intermediate/advanced/expert
    services: list[AwsService]                    # advanced/expert
    state_checks: list[StateCheck]                # expert (ground truth)
```
### Curriculum config
```python
class TierConfig:
    min_episodes: int            # minimum episodes before promotion
    advance_rate: float          # tier success rate threshold (0.6 - 1.0)
    mastery_window: int          # sliding window size (default: 10)
    mastery_threshold: float     # per-task graduation threshold (default: 0.7)
    fast_track_rate: float       # early promotion threshold (default: 0.9)
    chaos_probability: float     # probability of chaos injection per step

class SpacedRepState:
    interval: int                # episodes until next re-test (3 → 48)
    last_graduated_episode: int  # when last graduated
```
---
## 7. Curriculum & Reward (overview)
The curriculum and reward stack is the heart of the project. This section is the elevator pitch; **the full mechanics – priority scoring math, anti-reward-hacking layers, chaos engine, drift engine – live in [server/README.md](server/README.md)**.
### Priority scoring (one-formula task selection)
```
score = novelty_bonus       # +100 if never attempted
      + weakness_weight     # +50 × (1 − task_success_rate)
      + spaced_rep_bonus    # +30 if a graduated task is "due" for re-test
      − recency_penalty     # −20 if attempted in the last 2 episodes
```
Exploration, weakness-targeting, anti-forgetting, and variety – all balanced by one weighted sum.
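The same formula as runnable Python, with a hypothetical per-task record standing in for the curriculum manager's state:

```python
from dataclasses import dataclass

@dataclass
class TaskStats:
    # Hypothetical per-task record; the real manager keeps richer state.
    attempts: int = 0
    success_rate: float = 0.0
    last_attempt: int = -100     # episode index of the last attempt
    spaced_rep_due: bool = False

def priority_score(s: TaskStats, episode: int) -> float:
    score = 0.0
    if s.attempts == 0:
        score += 100.0                       # novelty bonus
    score += 50.0 * (1.0 - s.success_rate)   # weakness weight
    if s.spaced_rep_due:
        score += 30.0                        # graduated task due for re-test
    if episode - s.last_attempt <= 2:
        score -= 20.0                        # recency penalty
    return score

# A never-attempted task (150.0) outranks a weak, recently tried one (20.0):
print(priority_score(TaskStats(), episode=5))
print(priority_score(TaskStats(attempts=3, success_rate=0.2, last_attempt=4), episode=5))
```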
### Reward shaping
```
if task_achieved:
    reward = 1.0
    if survived_chaos: reward *= 1.05      # chaos survival bonus
else:
    reward = partial_progress * 0.8        # ≤ 0.8 from steps alone
    if progress_increased: reward += 0.1   # dense progress signal
    if command_failed: reward *= 0.5       # error penalty
    reward -= 0.1 * rollback_count         # waste penalty
    reward += 0.02 * idempotent_retries    # graceful retry bonus
    reward = clamp(reward, 0.0, 0.99)      # 1.0 reserved for completion
reward *= 0.85 ** hints_used               # hint decay applied last
```
The agent's loss surface is intentionally narrow: only doing the task earns full reward, and every reward-hacking shortcut we identified during design has a defense layer (full list in [server/README.md §9](server/README.md#9-anti-reward-hacking--8-defense-layers)).
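For readers wiring up their own grader, the shaping pseudocode above translates directly into a small pure function (a sketch – in the real system the tracker computes these inputs):

```python
def shape_reward(task_achieved: bool, survived_chaos: bool,
                 partial_progress: float, progress_increased: bool,
                 command_failed: bool, rollback_count: int,
                 idempotent_retries: int, hints_used: int) -> float:
    if task_achieved:
        reward = 1.0
        if survived_chaos:
            reward *= 1.05                    # chaos survival bonus
    else:
        reward = partial_progress * 0.8       # at most 0.8 from steps alone
        if progress_increased:
            reward += 0.1                     # dense progress signal
        if command_failed:
            reward *= 0.5                     # error penalty
        reward -= 0.1 * rollback_count        # waste penalty
        reward += 0.02 * idempotent_retries   # graceful retry bonus
        reward = max(0.0, min(reward, 0.99))  # 1.0 reserved for completion
    return reward * 0.85 ** hints_used        # hint decay applied last
```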
> 
---
## 8. Training pipeline (SFT β GRPO)
The training pipeline runs in two stages, both reproducible on Colab. Full detail in **[train/README.md](train/README.md)**.
```
┌────────── data/sft/ ──────────┐
│  1,500 train · 150 val rows   │
│       5 trajectory types      │
└───────────────┬───────────────┘
                ▼
STAGE 1 – Supervised Fine-Tuning          train/train_sft_lora.ipynb
  Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna) → SFT adapter
                │
                │  Sizzing/aws-rl-sft-qwen25coder3b-adapter
                ▼
STAGE 2 – GRPO RL                         train/train_grpo_lora.ipynb
  G=8 parallel rollouts · multi-turn · reward = env return
  Optuna over (lr, β, G, T, top_p, lora_r, max_turns)
```
### Numbers worth knowing
| | |
|---|---|
| **Base model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` – picked via the [model evaluation](data/sft/MODEL_EVALUATION.md) |
| **SFT LoRA** | `r ∈ {8,16,32}`, `lora_alpha = r × multiplier`, target = attention only, dropout `[0.005, 0.031]` |
| **GRPO config** | `G=8`, `β=0.04`, `lr=5e-6`, `T=0.9`, `top_p=0.95`, `max_turns=6`, loss=`dapo` |
| **Optuna search** | TPE sampler, 6 trials × 30 GRPO steps, frozen 10-task held-out val set |
| **Final training** | 200 GRPO steps with best config |
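For orientation, here is roughly how those numbers map onto PEFT and TRL config objects. This is a sketch, not the notebook code – argument names follow PEFT/TRL ≥ 0.21, the attention-module list is an assumption, and the notebooks remain the source of truth:

```python
from peft import LoraConfig
from trl import GRPOConfig

# SFT LoRA – best Optuna trial (r=16, alpha=16, dropout=0.0058).
lora = LoraConfig(
    r=16,
    lora_alpha=16,  # lora_alpha = r * multiplier (multiplier 1 here)
    lora_dropout=0.0058,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)

# GRPO stage – the documented config row above.
grpo = GRPOConfig(
    output_dir="grpo-out",
    learning_rate=5e-6,
    beta=0.04,          # KL penalty against the SFT reference
    num_generations=8,  # G = 8 rollouts per prompt
    temperature=0.9,
    top_p=0.95,
    loss_type="dapo",   # group-relative advantages, no critic
)
```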
### Training graphs
> *(Training graphs to be embedded from the executed notebooks – see [docs/figures/](docs/figures/).)*
---
## 9. Parallel rollout architecture
GRPO needs `G` rollouts on the same task per training step. We run all G in parallel with **state isolation guaranteed**. Three coordinated pool layers make it work:
```
Trainer (G=8 generations needed per step)
                  │
      ┌───────────┴────────────┐
      ▼                        ▼
MultiTurnEnvPool            GrpoPool
(train_grpo.py,        (scripts/grpo_pool.py)
 in-process, sync API)      async API
      │                        │
      └── 8 WebSocket connections ──┘
                  │
                  ▼
       FastAPI server :8000
       + OpenEnv max_concurrent_envs=8
                  │
                  ▼
       MiniStackPool (free-list, lock-guarded)
       acquire(port) on connect, release on disconnect
                  │
                  ▼
       8 isolated MiniStack instances :4566..:4573
```
Wall-clock impact: an 8-rollout × 6-turn episode runs in ~300 ms of env time vs ~2.4 s sequential. Full mechanics, including the **all-or-nothing connect protocol** that prevents pool-slot leakage on flake, are in **[scripts/README.md](scripts/README.md)**.
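The all-or-nothing connect pattern, reduced to its core. This sketch uses raw `websockets` against the documented `/ws` protocol; `GrpoPool` wraps the same idea with slot bookkeeping:

```python
import asyncio, json
import websockets

async def connect_all(url: str, n: int = 8):
    # Open all n sockets or none: a partial success would strand
    # acquired MiniStackPool slots on the server.
    conns = []
    try:
        for _ in range(n):
            conns.append(await websockets.connect(url))
        return conns
    except Exception:
        for ws in conns:
            await ws.close()
        raise

async def rollout(ws):
    await ws.send(json.dumps({"type": "reset"}))
    json.loads(await ws.recv())
    await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
    return json.loads(await ws.recv())

async def main():
    conns = await connect_all("ws://localhost:8000/ws", n=8)
    try:
        # Eight isolated episodes advance concurrently, one MiniStack each.
        results = await asyncio.gather(*(rollout(ws) for ws in conns))
        print(f"{len(results)} rollouts completed")
    finally:
        await asyncio.gather(*(ws.close() for ws in conns))

asyncio.run(main())
```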
> 
---
## 10. MiniStack: vendored & customized
The simulator powering the env is **vendored** as a git subtree at [aws_infra/](aws_infra/), not pulled as a black-box dependency. We forked it because we needed:
1. A custom `/_ministack/state` JSON endpoint so the grader can read the entire infra inventory in **one HTTP call** instead of iterating 20+ list APIs per grading pass. Added in commit `a648c3a "feat: Add support for service state retrieval and action listing across multiple AWS services"`.
2. A reproducible build with no runtime network requirement β the Docker image bundles a specific MiniStack revision.
3. The freedom to extend service coverage on demand.
Custom commits live as small, isolated patches so periodic upstream syncs (`af2e945`, `579597b`) replay cleanly. To inspect:
```bash
git show a648c3a # the state-endpoint diff
git log --oneline -- aws_infra/ # only the aws_infra subtree history
```
Full subtree workflow + commit-by-commit detail in [server/README.md §5](server/README.md#5-ministack-vendored-fork--customizations). Upstream MiniStack docs (81 KB) are preserved at [aws_infra/README.md](aws_infra/README.md).
---
## 11. Results & Benchmarks
### Base-model selection
We evaluated 11 chat models on 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** wins on every metric that matters: 41% exact match (highest), 63% operation match (highest), 3.1 s/call (3× faster than the 4B runner-up). Full report:
> **[data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md)** – 270-line writeup, per-model verdicts, methodology
> 
### Base vs SFT – actual results
After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:
| Metric | Base | Post-SFT | Delta |
|-----------------|:------:|:--------:|:-----------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
| `service_pct` | 77.8% | **88.9%** | +11.1 pp |
| `operation_pct` | 61.1% | **88.9%** | +27.8 pp |
| `avg_len` | 85.8 | 74.7 | −11 chars (tighter) |
> 
Every target from [data/sft/MODEL_EVALUATION.md §11](data/sft/MODEL_EVALUATION.md) is met or exceeded. Format compliance is now perfect; the model never wraps commands in fences or quotes after SFT. Exact-match jumped from 39% to 89% – the agent now emits the canonical command for ~9 of every 10 prompts.
The richer two-mode benchmark (dataset eval + live RL env eval) is in [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb); methodology in [compare/README.md](compare/README.md).
> 
> 
### SFT training curves
> 
### Optuna SFT search
The best SFT trial (out of 6) used `lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1` – see [train/README.md §3](train/README.md#3-optuna-hyperparameter-search) for the full Optuna study table.
> 
> 
### GRPO results (live multi-step env eval)
After 35 GRPO steps on top of the SFT adapter (best Optuna config: `lr=1.6e-5, β=0.0021, T=0.99`), we re-evaluated end-to-end on 100+ episodes:
| Metric                        | Base + SFT | Base + SFT + GRPO |      Δ       |
|-------------------------------|:----------:|:-----------------:|:------------:|
| Overall success rate          |   86.8%    |       86.2%       |   −0.5 pp    |
| Overall mean reward           |   0.883    |       0.877       |    −0.006    |
| Beginner success              |   96.2%    |    **100.0%**     | **+3.8 pp**  |
| Intermediate success          |   81.0%    |     **87.0%**     | **+6.0 pp**  |
| Warmup success                |   96.0%    |       90.2%       |   −5.8 pp    |
| Expert success                |   22.2%    |       22.2%       |     flat     |
| Drift repair rate             |   22.2%    |       22.2%       |     flat     |
| Destructive-action fail rate  |   15.1%    |       14.7%       |   −0.4 pp    |
| Steps to solve                |    1.45    |       1.55        |    +0.10     |
> 
> 
**Honest reading:** the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers (beginner +3.8 pp, intermediate +6.0 pp) – but does not crack the **expert-tier bottleneck** (22% success on SRE / drift / security-posture tasks). With longer GRPO runs and more curriculum exposure to expert tasks, this is the next gain to chase.
### GRPO training curves
Per-step training signals from the final 35-step GRPO run are plotted in [docs/figures/](docs/figures/).
Optuna search across 4 trials picked the final config; the study plots are in [docs/figures/](docs/figures/).
### Qualitative rollouts (post-GRPO)
One sample episode per tier is captured in [docs/figures/](docs/figures/).
---
## 12. Repository map
| Path | Purpose | Sub-README |
|--------------------------------|--------------------------------------------------------------------|-----------------------------------------|
| [server/](server/) | OpenEnv FastAPI server, env logic, services, web playground | [server/README.md](server/README.md) |
| [train/](train/) | SFT and GRPO training notebooks | [train/README.md](train/README.md) |
| [data/](data/) | SFT dataset, base-model selection, eval harness | [data/README.md](data/README.md) Β· [MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) |
| [compare/](compare/) | Base vs SFT side-by-side benchmark | [compare/README.md](compare/README.md) |
| [scripts/](scripts/) | Parallel-rollout architecture + multi-connection demo | [scripts/README.md](scripts/README.md) |
| [aws_infra/](aws_infra/) | Vendored MiniStack simulator (git subtree) | [aws_infra/README.md](aws_infra/README.md) |
| [tests/](tests/), [tests_tasks/](tests_tasks/) | Unit + tier-integration test suites | (see [Β§14](#14-testing)) |
| [models.py](models.py) | Pydantic data models for action/observation/task | (inline §6) |
| [client.py](client.py) | OpenEnv HTTP/WebSocket client wrapper | – |
| [inference.py](inference.py) | Single-model agent loop (matches RL eval mode of `compare/`) | – |
| [train_grpo.py](train_grpo.py) | GRPO trainer (1,283 LOC) – `MultiTurnEnvPool`, Optuna, plotting | (see [train/README.md](train/README.md)) |
| [aws_rl_env_colab.ipynb](aws_rl_env_colab.ipynb) | Colab driver for the full training pipeline | – |
| [docs/figures/](docs/figures/) | All README graphs and screenshots | – |
---
## 13. Configuration & Running
### Docker (recommended)
```bash
make docker-build # build the image
make docker-run # foreground on :8000
make docker-run-detach # background
make docker-health # liveness probe
```
### OpenEnv deployment
```bash
make openenv-validate # validate config
make openenv-build # build environment
make openenv-push # push to HuggingFace Spaces
```
### Environment variables
| Variable | Default | Description |
|-------------------------------------|--------------------------|-------------------------------------------------------------------|
| `AWS_INFRA_URL` | `http://localhost:4566` | MiniStack endpoint (used when `POOL_SIZE=1`) |
| `AWS_RL_ENV_POOL_SIZE` | `1` | **Server-side MiniStack pool size; set to 8 for GRPO training** |
| `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers `[BASE, BASE + POOL_SIZE)` |
| `BACKEND_TYPE` | `simulator` | `simulator` (MiniStack) or `aws` (real AWS, no pool) |
| `AWS_ACCESS_KEY_ID` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_SECRET_ACCESS_KEY` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region |
| `MAX_STEPS` | `15` | Max steps per episode |
| `API_BASE_URL` | – | LLM API endpoint for [inference.py](inference.py) |
| `MODEL_NAME` | – | LLM model name for [inference.py](inference.py) |
| `HF_TOKEN` | – | HuggingFace token (dataset/adapter access, push) |
| `TEMPERATURE` | `0.7` | LLM sampling temperature |
### Curriculum stats API
```python
curriculum.get_stats()
# {
# "episode_count": 42,
# "tier": "intermediate",
# "tier_episodes": 12,
# "tier_success_rate": 0.75,
# "graduated_tasks": [0, 2, 4],
# "weak_spots": [11, 12],
# "skill_profile": {0: 0.95, 1: 0.8, ...},
# "spaced_rep_due": [0, 2],
# "avg_reward_last_10": 0.65
# }
```
---
## 14. Testing
The test suite covers both isolated unit logic and end-to-end task execution against MiniStack.
### Unit tests β [tests/](tests/)
```bash
pytest tests/ -v
```
| File | Covers |
|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------|
| [test_aws_rl_env_environment.py](tests/test_aws_rl_env_environment.py) | Environment lifecycle, reset/step semantics, reward integration |
| [test_task_grader.py](tests/test_task_grader.py) | All 5 grading strategies, partial progress, penalties, bonuses |
| [test_resource_verifier.py](tests/test_resource_verifier.py) | Per-service ground-truth verification (20+ services) |
| [test_episode_tracker.py](tests/test_episode_tracker.py) | Command parsing, dedup, monotonic progress, rollback detection |
| [test_episode_context.py](tests/test_episode_context.py) | Per-episode context lifecycle |
| [test_drift_engine.py](tests/test_drift_engine.py) | Random drift selection, mutation application |
| [test_hint_provider.py](tests/test_hint_provider.py) | Three-level progressive hints, decay computation |
| [test_environment_designer.py](tests/test_environment_designer.py) | Setup-command provisioning |
| [test_pool.py](tests/test_pool.py) | Server-side `MiniStackPool` acquire/release, exhaustion |
| [test_grpo_pool.py](tests/test_grpo_pool.py) | Client-side `GrpoPool` connect/close, all-or-nothing rollback |
### Tier integration tests β [tests_tasks/](tests_tasks/)
```bash
pytest tests_tasks/ -v
```
133 tasks exercised end-to-end:
| File | Tasks |
|-----------------------------------------------------------------------------------------------------|------:|
| [test_warmup_tasks.py](tests_tasks/test_warmup_tasks.py) | 25 |
| [test_beginner_tasks.py](tests_tasks/test_beginner_tasks.py) | 25 |
| [test_intermediate_tasks.py](tests_tasks/test_intermediate_tasks.py) | 25 |
| [test_advanced_tasks.py](tests_tasks/test_advanced_tasks.py) | 25 |
| [test_expert_tasks.py](tests_tasks/test_expert_tasks.py) | 24 |
| [test_drift_tasks.py](tests_tasks/test_drift_tasks.py) | 9 |
| **Total** | **133** |
These tests double as the source of truth for canonical solutions used by the SFT dataset generator (extracted via AST β see [data/README.md Β§1](data/README.md#1-sft-dataset-generation)).
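A sketch of the AST-extraction idea (the real generator in `data/` is more involved; this only shows the mechanism):

```python
import ast
from pathlib import Path

def extract_aws_commands(test_file: str) -> list[str]:
    # Walk the test module's AST and collect every string literal that
    # looks like an AWS CLI command – the canonical solutions.
    tree = ast.parse(Path(test_file).read_text())
    return [node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.value.startswith("aws ")]

print(extract_aws_commands("tests_tasks/test_warmup_tasks.py"))
```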
---
## 15. Tech stack
- **Python 3.12**, [`uv`](https://github.com/astral-sh/uv) for dependency management, multi-stage Docker
- **FastAPI**, **OpenEnv** (HTTP + WebSocket env protocol), **uvicorn**
- **TRL ≥ 0.21** (`GRPOTrainer`, `GRPOConfig`)
- **PEFT** (LoRA), **Unsloth** (4-bit quantized base, fused training kernels)
- **Transformers ≥ 4.45**, **datasets ≥ 2.20**, **HuggingFace Hub ≥ 0.24**
- **Optuna ≥ 3.6** (TPE sampler, SQLite study storage)
- **asyncio** + **websockets** + **httpx** (parallel rollout orchestration)
- **MiniStack** (vendored at [aws_infra/](aws_infra/), 34 AWS services)
- **AWS CLI v2** (subprocess invocation against MiniStack endpoint)
- **matplotlib**, **plotly** (training curves, Optuna visualizations)
- **pytest** (16 test files, ~250 KB of test code)
---
## 16. Links
- **Live demo**: [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web)
- **HF Space**: [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env)
- **API docs**: [/docs](https://sizzing-aws-rl-env.hf.space/docs) Β· [/redoc](https://sizzing-aws-rl-env.hf.space/redoc)
- **SFT adapter**: [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter)
- **GRPO adapter**: [Sizzing/aws-rl-grpo-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-grpo-qwen25coder3b-adapter)
- **Dataset**: [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft)
- **GitHub**: [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env)
---
## 17. Acknowledgments
- **MiniStack** – vendored at [aws_infra/](aws_infra/). Upstream license preserved. Custom modifications attributable to commits `a648c3a`, `a00e981`; periodic upstream syncs `af2e945`, `579597b`.
- **OpenEnv** – environment protocol and Python client framework.
- **TRL** (HuggingFace) – `GRPOTrainer` implementation.
- **Unsloth** – 4-bit quantized model loaders + fused training kernels.
- **Google Colab** – GPU infrastructure for the training runs.
- **AWS service icons** in [server/static/img/aws/](server/static/img/aws/) – used in the web playground.
---
## Sub-README index
For deep technical detail on any subsystem:
- [server/README.md](server/README.md) – environment internals (curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail)
- [train/README.md](train/README.md) – SFT + GRPO training pipeline (LoRA config, Optuna search, multi-turn rollouts)
- [scripts/README.md](scripts/README.md) – parallel-rollout architecture (3 pool layers, all-or-nothing connect, concurrency safety)
- [data/README.md](data/README.md) – dataset generation (5 trajectory types, AST extraction) + base-model selection summary
- [data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) – full 11-model benchmark report
- [compare/README.md](compare/README.md) – base vs SFT comparison harness
- [aws_infra/README.md](aws_infra/README.md) – vendored MiniStack upstream documentation (81 KB)
## Video walkthrough
- [Recorded video explaining the core functionality](https://share.zight.com/NQu0pLvQ)