---
title: AWS RL Environment Server
emoji: πŸ₯‡
colorFrom: pink
colorTo: pink
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# AWS Cloud Operations β€” RL Environment & Training Pipeline

> Cloud agents fail in production not because they don’t know the commands β€” but because state drifts, services hiccup, and reward signals get gamed. We built an environment that simulates all three: 120+ AWS tasks under chaos and drift, an 8-layer anti-reward-hacking stack, and an adversarial curriculum that targets the agent’s own weak spots. After SFT β†’ GRPO on a single GPU with 8 parallel rollouts, format compliance hit 100%, exact-match jumped 39% β†’ 89%, and intermediate-tier success climbed 81% β†’ 87%.

| | |
|---|---|
| **Live demo** | [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web) β€” try the playground in a browser |
| **API docs** | [sizzing-aws-rl-env.hf.space/docs](https://sizzing-aws-rl-env.hf.space/docs) (Swagger), [/redoc](https://sizzing-aws-rl-env.hf.space/redoc) |
| **HF Space** | [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env) |
| **SFT adapter** | [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter) |
| **Dataset** | [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft) |

---

## Table of contents

1. [What this is & why it matters](#1-what-this-is--why-it-matters)
2. [Highlights β€” full feature inventory](#2-highlights--full-feature-inventory)
3. [Architecture](#3-architecture)
4. [Live demo & Quick Start](#4-live-demo--quick-start)
5. [Run on Colab](#5-run-on-colab)
6. [Action / Observation spec](#6-action--observation-spec)
7. [Curriculum & Reward (overview)](#7-curriculum--reward-overview)
8. [Training pipeline (SFT β†’ GRPO)](#8-training-pipeline-sft--grpo)
9. [Parallel rollout architecture](#9-parallel-rollout-architecture)
10. [MiniStack: vendored & customized](#10-ministack-vendored--customized)
11. [Results & Benchmarks](#11-results--benchmarks)
12. [Repository map](#12-repository-map)
13. [Configuration & Running](#13-configuration--running)
14. [Testing](#14-testing)
15. [Tech stack](#15-tech-stack)
16. [Links](#16-links)
17. [Acknowledgments](#17-acknowledgments)

---

## 1. What this is & why it matters

Modern AI agents are increasingly asked to operate cloud infrastructure β€” provisioning resources, fixing misconfigurations, responding to drift. Training such agents needs (a) a realistic environment, (b) reliable reward signals, and (c) enough scale to make RL feasible. Existing options force a hard tradeoff: real AWS costs hundreds of dollars per training run and cannot be reset at will; toy emulators don't behave like production AWS.

**This project closes that gap.** We built:

1. **An OpenEnv-compatible RL environment** that speaks real AWS CLI semantics. The agent sends `aws s3 mb …`, `aws iam create-role …`, and so on β€” the exact same commands a human SRE would type.
2. **A vendored, customized MiniStack simulator** that responds with production-equivalent JSON, runs locally for zero cost, supports 34 AWS services, and exposes a single-call state-introspection endpoint we added so the grader has cheap ground-truth access.
3. **A 120+ task curriculum** across 5 tiers (warmup β†’ expert) with adaptive selection, mastery tracking, spaced repetition, chaos injection, and drift-detection scenarios β€” every feature designed to keep the reward signal honest and prevent the agent from gaming it.
4. **A complete SFT β†’ GRPO training pipeline.** A 1,500-row synthetic dataset spanning 5 trajectory shapes, an 11-model base benchmark, LoRA fine-tuning, and TRL GRPO with multi-turn rollouts and Optuna hyperparameter search.
5. **An 8-way parallel-rollout architecture.** Server-side MiniStack pool, client-side `GrpoPool`, in-process `MultiTurnEnvPool` β€” three coordinated layers that let G=8 concurrent rollouts run on one GPU without state contamination.

Everything is reproducible: the dataset is generated by a deterministic script, the model selection is documented end-to-end, training entry points run on Colab, and the env runs locally in a single Docker container with no external network requirement.

---

## 2. Highlights β€” full feature inventory

This is the complete surface area of the project. Each entry links to deeper documentation in the corresponding sub-README.

### Environment & Curriculum

- **[120+ tasks across 5 tiers](server/services/tasks/)** β€” warmup (25), beginner (25), intermediate (25), advanced (25), expert (24), drift (9). YAML-defined task spec per tier.
- **[Curriculum learning with priority scoring](server/README.md#7-curriculum-manager)** β€” `score = novelty + weakness βˆ’ recency + spaced_rep_bonus` drives task selection.
- **[Mastery tracking](server/README.md#7-curriculum-manager)** β€” sliding 10-episode window, 0.7 threshold, 0.85 exponential decay, supports un-graduation.
- **[Spaced repetition](server/README.md#7-curriculum-manager)** β€” graduated tasks resurface at intervals `[3, 6, 12, 24, 48]` to prevent forgetting.
- **[Tier promotion](server/README.md#7-curriculum-manager)** β€” standard (min episodes + success rate) + fast-track (3 consecutive β‰₯90% episodes).
- **[Strategy pattern: simulator vs real AWS](server/README.md#4-strategy-pattern-simulator-vs-real-aws)** β€” `BACKEND_TYPE=simulator` (default) or `aws`, no code fork.

### Reward shaping

- **[Five grading strategies](server/README.md#8-reward-shaping--taskgrader)** β€” command-match (warmup), resource-creation (beginner), multi-step (intermediate), multi-step+services (advanced), state-checks (expert).
- **[Dense partial-progress signal](server/README.md#8-reward-shaping--taskgrader)** β€” clamped to `[0.0, 0.99]`, `1.0` reserved for verified completion.
- **[Rollback penalty](server/README.md#8-reward-shaping--taskgrader)** β€” `βˆ’0.1` per `(create-X, …, delete-X)` pair.
- **[Idempotency bonus](server/README.md#8-reward-shaping--taskgrader)** β€” `+0.02` for graceful "already exists" retry.
- **[Hint decay](server/README.md#13-hint-provider)** β€” three-level progressive hints with `0.85^n` reward multiplier.
- **[Chaos survival bonus](server/README.md#11-chaos-engine)** β€” `Γ—1.05` if the agent completes a chaotic task.

### Resilience & adversarial features

- **[Chaos injection](server/README.md#11-chaos-engine)** β€” silent mid-episode mutations, tier-scaled probabilities (10/20/30%) on services the task is touching.
- **[Drift detection](server/README.md#12-drift-engine)** β€” 6 expert tasks, 2–3 random mutations from a per-task pool, randomized per episode (no memorization).
- **[Security-posture audit tasks](server/README.md#17-security-posture-audit-examples)** β€” S3 public bucket lockdown, IAM least-privilege, Lambda secret rotation.
- **[8-layer anti-reward-hacking](server/README.md#9-anti-reward-hacking--8-defense-layers)** β€” ground-truth verification, dedup, grader invisibility, command allow-list, no-credit-for-reads, monotonic progress, exact resource-name validation, final state checks.

### Training pipeline

- **[Synthetic SFT dataset (1,500 rows)](data/README.md)** β€” 5 trajectory types: success / multi-step continuation / failure recovery / verification / hint usage.
- **[Rigorous base-model selection](data/sft/MODEL_EVALUATION.md)** β€” 11 models Γ— 27 prompts, [Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit) wins.
- **[LoRA SFT](train/README.md#1-sft-stage--supervised-lora)** β€” `r ∈ {8,16,32}`, `lora_alpha = r Γ— multiplier`, attention-only adaptation.
- **[GRPO RL via TRL](train/README.md#2-grpo-stage--reinforcement-learning)** β€” group-relative advantages, KL to SFT reference, `dapo` loss, no critic.
- **[Multi-turn rollouts](train/README.md#4-multi-turn-rollouts--parallel-envs)** β€” up to `MAX_TURNS=6`, observation fed back as next-turn user message.
- **[Optuna hyperparameter search](train/README.md#3-optuna-hyperparameter-search)** β€” TPE sampler over 8-dim space, frozen held-out validation set.
- **[HuggingFace integration](data/README.md#7-huggingface-publishing)** β€” adapter + dataset published to Hub, OpenEnv Space deployment.

### Parallel rollout architecture

- **[Server-side MiniStack pool](server/README.md#6-server-side-ministack-pool-parallel-rollouts)** β€” `MiniStackPool` ([server/app.py](server/app.py)), free-list of ports, lock-guarded acquire/release.
- **[Client-side GrpoPool](scripts/README.md#2-three-coordinated-pool-layers)** β€” async-native, all-or-nothing connect, asyncio.gather for concurrent rollouts.
- **[In-process MultiTurnEnvPool](train/README.md#4-multi-turn-rollouts--parallel-envs)** β€” sync API, owns a background asyncio loop, used by the trainer.
- **[8 isolated rollouts on one server](scripts/README.md#7-running-the-multi-connection-demo)** β€” proof in [scripts/TestMultipleConnects.ipynb](scripts/TestMultipleConnects.ipynb).

### Vendored simulator

- **[MiniStack as git subtree](server/README.md#5-ministack-vendored-fork--customizations)** β€” vendored at [aws_infra/](aws_infra/) (commit `2c38c0b`). 34 AWS services. MIT.
- **[Custom `/_ministack/state` endpoint](server/README.md#5-ministack-vendored-fork--customizations)** β€” added in commit `a648c3a`; returns full infra inventory in one call.
- **[Upstream sync workflow](server/README.md#5-ministack-vendored-fork--customizations)** β€” periodic `git subtree pull`; isolated patches keep conflicts minimal.

### Operations & deployment

- **[OpenEnv-compliant](https://github.com/meta-pytorch/OpenEnv)** β€” `/reset`, `/step`, `/state`, `/schema`, `/ws` HTTP+WebSocket endpoints.
- **[Web playground UI](server/README.md#19-web-playground)** β€” `/web` route, 40 AWS service icons, Jinja2 + JS frontend.
- **[Docker-first deployment](Dockerfile)** β€” multi-stage build, container ships server + N MiniStack instances + AWS CLI.
- **[Comprehensive test suite](#14-testing)** β€” 10 unit tests + 6 tier-integration suites covering 133 tasks.

---
## 3. Architecture

> ![System architecture](docs/figures/architecture_diagram.png)

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ Docker container β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                                β”‚
β”‚  FastAPI server (port 8000)                                                    β”‚
β”‚  β”œβ”€β”€ OpenEnv router   /reset /step /state /schema /ws /health                  β”‚
β”‚  β”œβ”€β”€ Web playground   /web (Jinja2 + 40 AWS icon SVGs)                         β”‚
β”‚  β”œβ”€β”€ env_factory      per-WS-session AwsRlEnvironment instance                 β”‚
β”‚  β”‚                    (acquires a MiniStack port from MiniStackPool)           β”‚
β”‚  └── Services                                                                  β”‚
β”‚        Curriculum Β· TaskGrader Β· ResourceVerifier Β· ChaosEngine Β· DriftEngine  β”‚
β”‚        HintProvider Β· EpisodeTracker Β· EnvironmentDesigner Β·                   β”‚
β”‚        EnvironmentStrategy                                                     β”‚
β”‚                                                                                β”‚
β”‚  MiniStack instances  :4566 :4567 :4568 … :4566+POOL_SIZE-1                    β”‚
β”‚  (vendored at aws_infra/, started by the Dockerfile entrypoint)                β”‚
β”‚                                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β–²                          β–²
        β”‚ HTTP/WS                  β”‚ AWS CLI subprocess
        β”‚                          β”‚ (AWS_ENDPOINT_URL=http://localhost:4566+i)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RL Agent          β”‚      β”‚ AWS CLI commands  β”‚
β”‚                   β”‚      β”‚ the agent emits   β”‚
β”‚ (client.py)       β”‚      β”‚                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Episode lifecycle

1. **`reset()`** β€” wipes simulator state, picks next task from the curriculum, runs `setup_commands`, applies drift if applicable, returns initial observation.
2. **`step(action)`** β€” validates the command (must start with `aws `), intercepts hint requests, executes via the strategy, records in tracker, grades with shaped reward, optionally injects chaos, returns observation.
3. **Hint** β€” agent sends `aws help --task-hint`; intercepted before reaching MiniStack; returns next-level hint, increments `hints_used` (which decays final reward by `0.85^n`).
4. **Termination** β€” `task_achieved=True` or `step_count >= MAX_STEPS` (default 15).

Full mechanics in [server/README.md](server/README.md).
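To make the lifecycle concrete, here is a minimal agent loop against a local server. It is a sketch: `choose_command` is a hypothetical policy stub standing in for the model, while the client API (`AwsRlEnv`, `AwsRlAction`, `reset`/`step`) is the one shown in Β§4 and Β§6.

```python
# Minimal agent loop over the lifecycle above. `choose_command` is a
# hypothetical policy stub; swap in your model. Everything else uses
# the documented client API.
from aws_rl_env import AwsRlAction, AwsRlEnv

def choose_command(observation) -> str:
    if observation.hints_used == 0:
        return "aws help --task-hint"   # intercepted hint request, never reaches MiniStack
    return "aws s3 ls"                  # placeholder action; replace with model output

env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()                    # wipe state, draw the next task from the curriculum
print("Task:", result.observation.task.description)

while not result.done:                  # ends on task_achieved or MAX_STEPS (default 15)
    result = env.step(AwsRlAction(command=choose_command(result.observation)))
    print(f"step={result.observation.step_count}  reward={result.reward}")
```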
---

## 4. Live demo & Quick Start

### Try it in a browser

The hosted playground lets you click around any task without writing code:

> **[Hugging Face Spaces Playground](https://sizzing-aws-rl-env.hf.space/web#playground)**

### Python client

```python
from aws_rl_env import AwsRlAction, AwsRlEnv

with AwsRlEnv.from_docker_image("aws-rl-env:latest") as env:
    result = env.reset()
    print(f"Task: {result.observation.task.description}")

    result = env.step(AwsRlAction(command="aws s3 mb s3://my-bucket"))
    print(f"Reward: {result.reward}, Done: {result.done}")
```

Or against a running server:

```python
env = AwsRlEnv(base_url="http://localhost:8000")
result = env.reset()
result = env.step(AwsRlAction(command="aws s3 ls"))
```

### WebSocket API

```python
import websockets, json

async with websockets.connect("wss://sizzing-aws-rl-env.hf.space/ws") as ws:
    await ws.send(json.dumps({"type": "reset"}))
    obs = json.loads(await ws.recv())

    await ws.send(json.dumps({"type": "step", "data": {"command": "aws s3 ls"}}))
    obs = json.loads(await ws.recv())
```

### Local Docker

```bash
make docker-build        # build the image
make docker-run          # foreground; serves on :8000
make docker-run-detach   # background
make docker-health       # liveness probe
```

For training (8-way parallel rollouts):

```bash
AWS_RL_ENV_POOL_SIZE=8 make run
```

---

## 5. Run on Colab

The full pipeline is reproducible on a Colab GPU runtime. Drop your HF token into Colab Secrets, set `ENV_BASE_URL` to your HF Space (or local with ngrok), and run.

| Notebook | What it does | Open in Colab |
|----------|--------------|---------------|
| [train/train_sft_lora.ipynb](train/train_sft_lora.ipynb) | Stage 1 β€” SFT LoRA fine-tuning of Qwen2.5-Coder-3B | [Open in Colab](https://colab.research.google.com/drive/1dm9sDaLxHX6s9zEG_SC0FQcKWKkc3TfL?usp=sharing) |
| [train/train_grpo_lora.ipynb](train/train_grpo_lora.ipynb) | Stage 2 β€” GRPO RL training with multi-turn rollouts | [Open in Colab](https://colab.research.google.com/drive/1NwiOM0h_JpXXGRxfY_xZtDiaigvIaKjx?usp=sharing) |
| [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb) | Side-by-side: base model vs SFT adapter (dataset + RL env) | [Open in Colab](https://colab.research.google.com/drive/17406aiad8h4nAphV42vVNZ-a5SzZMIre?usp=sharing) |

---

## 6. Action / Observation spec

The full Pydantic data models β€” kept inline so any reader can wire up an agent without leaving this page. Source: [models.py](models.py).

### Action

```python
class AwsRlAction(Action):
    command: str   # AWS CLI command, e.g. "aws s3 ls"
```

The environment validates that `command` starts with `aws `; anything else is rejected with `success=False`.
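A quick way to see that gate in action, assuming a local server (the exact error text is illustrative):

```python
# Non-`aws` input never reaches the simulator; the validator rejects it.
from aws_rl_env import AwsRlAction, AwsRlEnv

env = AwsRlEnv(base_url="http://localhost:8000")
env.reset()
result = env.step(AwsRlAction(command="ls -la"))   # not an AWS CLI command
print(result.observation.command_success)          # False: rejected before execution
print(result.observation.error)                    # validator message (wording varies)
```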
### Observation

```python
class AwsRlObservation(Observation):
    episode_id: EpisodeID
    step_count: StepCount
    command_success: bool     # exit code == 0
    command_output: str       # stdout from the AWS CLI invocation
    error: str                # stderr (empty if success)
    task: TaskInfo | None     # masked task definition (no success criteria)
    task_achieved: bool
    partial_progress: float   # current task progress in [0.0, 1.0]
    hints_used: int           # cumulative hint count this episode
    hint_text: str            # most recent hint text (if any)
```

### State

```python
class AwsRlState(State):
    current_task: Task | None        # full task assigned for the episode
    tracker: TrackerState            # episode tracker snapshot
    infra_state: dict                # AWS infrastructure state keyed by service name
    chaos_occurred: bool             # whether chaos was injected this episode
    current_tier: str                # agent's current difficulty tier

class TrackerState:
    step_count: int                  # steps taken this episode
    hints_used: int                  # hints requested this episode
    progress: float                  # current partial progress [0.0, 1.0]
    commands_executed: list[str]     # commands executed this episode
    credited_operations: list[str]   # (operation, resource) pairs that earned credit
```

### Task definitions

```python
class Task:
    task_id: TaskID
    difficulty: TaskDifficulty            # warmup | beginner | intermediate | advanced | expert
    description: str                      # human-readable goal
    success_criteria: SuccessCriteria
    setup_commands: list[SetupCommand]    # pre-provision for SRE tasks
    desired_state_spec: str | None        # natural-language desired end state (drift tasks)
    possible_drifts: list[SetupCommand]   # pool of mutations for DriftEngine

class TaskInfo:
    """Agent-visible subset of Task β€” masks success_criteria,
    setup_commands, and possible_drifts."""
    task_id: TaskID
    difficulty: TaskDifficulty
    description: str
    desired_state_spec: str | None

class SuccessCriteria:
    command_contains: str | None                  # warmup/beginner
    operation: str | None                         # warmup/beginner
    resource_exists: ResourceExistsCheck | None   # beginner
    steps: list[StepCriteria]                     # intermediate/advanced/expert
    services: list[AwsService]                    # advanced/expert
    state_checks: list[StateCheck]                # expert (ground truth)
```

### Curriculum config

```python
class TierConfig:
    min_episodes: int             # minimum episodes before promotion
    advance_rate: float           # tier success rate threshold (0.6 - 1.0)
    mastery_window: int           # sliding window size (default: 10)
    mastery_threshold: float      # per-task graduation threshold (default: 0.7)
    fast_track_rate: float        # early promotion threshold (default: 0.9)
    chaos_probability: float      # probability of chaos injection per step

class SpacedRepState:
    interval: int                 # episodes until next re-test (3 β†’ 48)
    last_graduated_episode: int   # when last graduated
```

---

## 7. Curriculum & Reward (overview)

The curriculum and reward stack is the heart of the project. This section is the elevator pitch; **the full mechanics β€” priority scoring math, anti-reward-hacking layers, chaos engine, drift engine β€” live in [server/README.md](server/README.md)**.

### Priority scoring (one-formula task selection)

```
score = novelty_bonus       # +100 if never attempted
      + weakness_weight     # +50 Γ— (1 βˆ’ task_success_rate)
      + spaced_rep_bonus    # +30 if a graduated task is "due" for re-test
      βˆ’ recency_penalty     # βˆ’20 if attempted in the last 2 episodes
```

Exploration, weakness-targeting, anti-forgetting, and variety β€” all balanced by one weighted sum.
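As an illustration, the formula reads directly as a scoring function. This is a sketch only: the weights are the documented ones, but the `TaskStats` fields are assumptions; the real logic lives in the curriculum manager.

```python
# Illustrative re-implementation of the priority formula above.
# The stats fields are assumed for the sketch; the real implementation
# lives in the server's curriculum manager.
from dataclasses import dataclass

@dataclass
class TaskStats:
    attempts: int              # how many times the task was attempted
    success_rate: float        # running success rate in [0, 1]
    last_attempt_episode: int  # episode index of the last attempt
    spaced_rep_due: bool       # graduated task whose re-test interval elapsed

def priority(stats: TaskStats, episode: int) -> float:
    score = 0.0
    if stats.attempts == 0:
        score += 100                            # novelty bonus
    score += 50 * (1 - stats.success_rate)      # weakness targeting
    if stats.spaced_rep_due:
        score += 30                             # spaced-repetition re-test
    if episode - stats.last_attempt_episode <= 2:
        score -= 20                             # recency penalty
    return score

# Selection is then a plain argmax over the task pool:
# next_task = max(tasks, key=lambda t: priority(stats[t], episode))
```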
### Reward shaping

```
if task_achieved:
    reward = 1.0
    if survived_chaos: reward *= 1.05        # chaos survival bonus
else:
    reward = partial_progress * 0.8          # ≀ 0.8 from steps alone
    if progress_increased: reward += 0.1     # dense progress signal
    if command_failed:     reward *= 0.5     # error penalty
    reward -= 0.1  * rollback_count          # waste penalty
    reward += 0.02 * idempotent_retries      # graceful retry bonus
    reward = clamp(reward, 0.0, 0.99)        # 1.0 reserved for completion

reward *= 0.85 ** hints_used                 # hint decay applied last
```

The exploitable surface is intentionally narrow: only doing the task earns full reward, and every reward-hacking shortcut we identified during design has a defense layer (full list in [server/README.md Β§9](server/README.md#9-anti-reward-hacking--8-defense-layers)).

> ![Curriculum progression: 5 tiers, priority scoring formula, mastery + spaced rep + fast-track](docs/figures/curriculum_progression.png)

---

## 8. Training pipeline (SFT β†’ GRPO)

The training pipeline runs in two stages, both reproducible on Colab. Full detail in **[train/README.md](train/README.md)**.

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ data/sft/ ──────────┐
β”‚ 1,500 train Β· 150 val rows    β”‚
β”‚ 5 trajectory types            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β–Ό
  STAGE 1 β€” Supervised Fine-Tuning
  train/train_sft_lora.ipynb
  Qwen2.5-Coder-3B-Instruct + LoRA r=8/16/32 (Optuna)
  β†’ SFT adapter
                β”‚
                β”‚ Sizzing/aws-rl-sft-qwen25coder3b-adapter
                β–Ό
  STAGE 2 β€” GRPO RL
  train/train_grpo_lora.ipynb
  G=8 parallel rollouts Β· multi-turn Β· reward = env return
  Optuna over (lr, Ξ², G, T, top_p, lora_r, max_turns)
```

### Numbers worth knowing

| | |
|---|---|
| **Base model** | `unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit` β€” picked via the [model evaluation](data/sft/MODEL_EVALUATION.md) |
| **SFT LoRA** | `r ∈ {8,16,32}`, `lora_alpha = r Γ— multiplier`, target = attention only, dropout `[0.005, 0.031]` |
| **GRPO config** | `G=8`, `Ξ²=0.04`, `lr=5e-6`, `T=0.9`, `top_p=0.95`, `max_turns=6`, loss=`dapo` |
| **Optuna search** | TPE sampler, 6 trials Γ— 30 GRPO steps, frozen 10-task held-out val set |
| **Final training** | 200 GRPO steps with best config |
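For orientation, here is roughly how those numbers map onto TRL's `GRPOConfig`. This is a sketch under assumptions: the exact field set depends on your TRL version (the project pins TRL β‰₯ 0.21), and `train_grpo.py` layers multi-turn rollouts, the env reward function, and Optuna on top of it.

```python
# Sketch: the documented GRPO numbers expressed as a TRL GRPOConfig.
# Not the project's exact trainer setup; see train_grpo.py for that.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-aws-rl",
    num_generations=8,     # G=8 rollouts per prompt, advantages computed per group
    beta=0.04,             # KL penalty toward the SFT reference policy
    learning_rate=5e-6,
    temperature=0.9,       # T=0.9 sampling during rollouts
    top_p=0.95,
    loss_type="dapo",      # group-relative objective, no critic
    max_steps=200,         # final run: 200 GRPO steps with the best config
)
```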
### Training graphs

> ![SFT loss curve](docs/figures/sft_loss_curve.png)
> ![GRPO mean reward over training](docs/figures/grpo_reward_curve.png)
> ![Per-rollout reward by curriculum tier](docs/figures/grpo_per_tier_curve.png)
> ![Optuna parameter importance](docs/figures/optuna_param_importance.png)

---

## 9. Parallel rollout architecture

GRPO needs `G` rollouts on the same task per training step. We run all `G` in parallel with **state isolation guaranteed**. Three coordinated pool layers make it work:

```
        Trainer (G=8 generations needed per step)
                         β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β–Ό                                   β–Ό
  MultiTurnEnvPool (in-process)       GrpoPool
  (train_grpo.py) Β· sync API          (scripts/grpo_pool.py) Β· async API
        β”‚                                   β”‚
        └──────── 8 WebSocket connections β”€β”€β”˜
                         β”‚
                         β–Ό
        FastAPI server :8000 + OpenEnv
        max_concurrent_envs=8
                         β”‚
                         β–Ό
        MiniStackPool (free-list, lock-guarded)
        acquire(port) on connect, release on disconnect
                         β”‚
                         β–Ό
        8 isolated MiniStack instances :4566..:4573
```

Wall-clock impact: an 8-rollout Γ— 6-turn episode runs in ~300 ms of env time vs ~2.4 s sequential.

Full mechanics, including the **all-or-nothing connect protocol** that prevents pool-slot leakage on flake, are in **[scripts/README.md](scripts/README.md)**.

> ![Parallel rollout: 3 coordinated pool layers](docs/figures/parallel_rollout_diagram.png)
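Seen from the client side, the core of the trick is just N independent WebSocket sessions driven concurrently. A bare-bones sketch using the raw protocol from Β§4 (the real `GrpoPool` adds the all-or-nothing connect and cleanup on failure):

```python
# 8 isolated rollouts over the documented WS protocol. Each connection
# gets its own env instance and MiniStack port server-side, so the
# rollouts cannot contaminate each other's state.
import asyncio
import json

import websockets

async def rollout(url: str, commands: list[str]) -> list[dict]:
    observations = []
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"type": "reset"}))
        observations.append(json.loads(await ws.recv()))
        for command in commands:
            await ws.send(json.dumps({"type": "step", "data": {"command": command}}))
            observations.append(json.loads(await ws.recv()))
    return observations

async def main():
    results = await asyncio.gather(
        *(rollout("ws://localhost:8000/ws", ["aws s3 ls"]) for _ in range(8))
    )
    print(f"{len(results)} isolated rollouts completed")

asyncio.run(main())
```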
---

## 10. MiniStack: vendored & customized

The simulator powering the env is **vendored** as a git subtree at [aws_infra/](aws_infra/), not pulled as a black-box dependency. We forked it because we needed:

1. A custom `/_ministack/state` JSON endpoint so the grader can read the entire infra inventory in **one HTTP call** instead of iterating 20+ list APIs per grading pass. Added in commit `a648c3a "feat: Add support for service state retrieval and action listing across multiple AWS services"`.
2. A reproducible build with no runtime network requirement β€” the Docker image bundles a specific MiniStack revision.
3. The freedom to extend service coverage on demand.

Custom commits live as small, isolated patches so periodic upstream syncs (`af2e945`, `579597b`) replay cleanly. To inspect:

```bash
git show a648c3a                  # the state-endpoint diff
git log --oneline -- aws_infra/   # only the aws_infra subtree history
```

Full subtree workflow + commit-by-commit detail in [server/README.md Β§5](server/README.md#5-ministack-vendored-fork--customizations). Upstream MiniStack docs (81 KB) are preserved at [aws_infra/README.md](aws_infra/README.md).
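A sketch of what the grader-side read looks like: one GET instead of 20+ list calls. The service-keyed layout follows the `infra_state` description in Β§6, but treat the exact response shape as an assumption.

```python
# One call for the whole simulated-infra inventory.
# Response layout assumed service-keyed, per the infra_state field in Β§6.
import httpx

state = httpx.get("http://localhost:4566/_ministack/state").json()
for service, resources in state.items():   # e.g. "s3", "iam", "lambda", …
    print(f"{service}: {resources}")
```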
---

## 11. Results & Benchmarks

### Base-model selection

We evaluated 11 chat models on 27 held-out prompts. **Qwen2.5-Coder-3B-Instruct** wins on every metric that matters: 41% exact match (highest), 63% operation match (highest), 3.1 s/call (3Γ— faster than the 4B runner-up). Full report:

> **[data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md)** β€” 270-line writeup, per-model verdicts, methodology

> ![Top 4 candidate models on the held-out benchmark](docs/figures/model_eval_chart.png)

### Base vs SFT β€” actual results

After running the SFT pipeline end-to-end, the eval delta on the same held-out prompts is striking:

| Metric | Base | Post-SFT | Delta |
|-----------------|:------:|:--------:|:-----------:|
| `format_pct` | 33.3% | **100.0%** | **+66.7 pp** |
| `exact_pct` | 38.9% | **88.9%** | **+50.0 pp** |
| `service_pct` | 77.8% | **88.9%** | +11.1 pp |
| `operation_pct` | 61.1% | **88.9%** | +27.8 pp |
| `avg_len` | 85.8 | 74.7 | βˆ’11 chars (tighter) |

> ![Base vs SFT eval-metrics comparison](docs/figures/base_vs_sft_success.png)

Every target from [data/sft/MODEL_EVALUATION.md Β§11](data/sft/MODEL_EVALUATION.md) is met or exceeded. Format compliance is now perfect; the model never wraps commands in fences or quotes after SFT. Exact-match jumped from 39% to 89% β€” the agent now emits the canonical command for ~9 of every 10 prompts.

The richer two-mode benchmark (dataset eval + live RL env eval) is in [compare/compare_base_vs_sft.ipynb](compare/compare_base_vs_sft.ipynb); methodology in [compare/README.md](compare/README.md).

> ![Dataset comparison: base vs SFT (per-row scores)](docs/figures/compare_dataset.png)
> ![RL env comparison: base vs SFT (per-episode rewards)](docs/figures/compare_rl_env.png)

### SFT training curves

> ![SFT loss curve over training](docs/figures/sft_loss_curve.png)

### Optuna SFT search

The best SFT trial (out of 6) used `lora_r=16, lora_alpha=16, dropout=0.0058, lr=4.03e-4, warmup=0.1` β€” see [train/README.md Β§3](train/README.md#3-optuna-hyperparameter-search) for the full Optuna study table.

> ![Optuna parameter importances](docs/figures/optuna_param_importance.png)
> ![Optuna optimization history](docs/figures/optuna_history.png)

### GRPO results (live multi-step env eval)

After 35 GRPO steps on top of the SFT adapter (best Optuna config: `lr=1.6e-5, Ξ²=0.0021, T=0.99`), we re-evaluated end-to-end on 100+ episodes:

| Metric | Base + SFT | Base + SFT + GRPO | Ξ” |
|-------------------------------|:---------:|:-----------------:|:------------:|
| Overall success rate | 86.8% | 86.2% | βˆ’0.6 pp |
| Overall mean reward | 0.883 | 0.877 | βˆ’0.006 |
| Beginner success | 96.2% | **100.0%** | **+3.8 pp** |
| Intermediate success | 81.0% | **87.0%** | **+6.0 pp** |
| Warmup success | 96.0% | 90.2% | βˆ’5.8 pp |
| Expert success | 22.2% | 22.2% | flat |
| Drift repair rate | 22.2% | 22.2% | flat |
| Destructive-action fail rate | 15.1% | 14.7% | βˆ’0.4 pp |
| Steps to solve | 1.45 | 1.55 | +0.10 |

> ![SFT vs GRPO metrics grid](docs/figures/sft_vs_grpo_metrics_grid.png)
> ![SFT vs GRPO by tier](docs/figures/sft_vs_grpo_by_tier.png)

**Honest reading:** the 35-step GRPO run preserves the SFT gains and modestly improves the middle tiers (beginner +3.8 pp, intermediate +6.0 pp) β€” but does not crack the **expert-tier bottleneck** (22% success on SRE / drift / security-posture tasks). With longer GRPO runs and more curriculum exposure to expert tasks, this is the next gain to chase.

### GRPO training curves

Per-step training signals from the final 35-step GRPO run:

> ![GRPO final per-step training signals](docs/figures/grpo_final_per_step.png)
> ![GRPO env reward over training](docs/figures/grpo_reward_curve.png)

Optuna search across 4 trials picked the final config:

> ![GRPO Optuna trial comparison](docs/figures/grpo_optuna_trials_comparison.png)
> ![GRPO Optuna parameter importances](docs/figures/grpo_optuna_importances.png)
> ![GRPO Optuna optimization history](docs/figures/grpo_optuna_history.png)

### Qualitative rollouts (post-GRPO)

One sample episode per tier:

> ![Qualitative rollouts on representative tasks](docs/figures/qualitative_rollouts.png)

---

## 12. Repository map

| Path | Purpose | Sub-README |
|--------------------------------|--------------------------------------------------------------------|-----------------------------------------|
| [server/](server/) | OpenEnv FastAPI server, env logic, services, web playground | [server/README.md](server/README.md) |
| [train/](train/) | SFT and GRPO training notebooks | [train/README.md](train/README.md) |
| [data/](data/) | SFT dataset, base-model selection, eval harness | [data/README.md](data/README.md) Β· [MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) |
| [compare/](compare/) | Base vs SFT side-by-side benchmark | [compare/README.md](compare/README.md) |
| [scripts/](scripts/) | Parallel-rollout architecture + multi-connection demo | [scripts/README.md](scripts/README.md) |
| [aws_infra/](aws_infra/) | Vendored MiniStack simulator (git subtree) | [aws_infra/README.md](aws_infra/README.md) |
| [tests/](tests/), [tests_tasks/](tests_tasks/) | Unit + tier-integration test suites | (see [Β§14](#14-testing)) |
| [models.py](models.py) | Pydantic data models for action/observation/task | (inline Β§6) |
| [client.py](client.py) | OpenEnv HTTP/WebSocket client wrapper | β€” |
| [inference.py](inference.py) | Single-model agent loop (matches RL eval mode of `compare/`) | β€” |
| [train_grpo.py](train_grpo.py) | GRPO trainer (1,283 LOC) β€” `MultiTurnEnvPool`, Optuna, plotting | (see [train/README.md](train/README.md)) |
| [aws_rl_env_colab.ipynb](aws_rl_env_colab.ipynb) | Colab driver for the full training pipeline | β€” |
| [docs/figures/](docs/figures/) | All README graphs and screenshots | β€” |

---

## 13. Configuration & Running

### Docker (recommended)

```bash
make docker-build        # build the image
make docker-run          # foreground on :8000
make docker-run-detach   # background
make docker-health       # liveness probe
```

### OpenEnv deployment

```bash
make openenv-validate   # validate config
make openenv-build      # build environment
make openenv-push       # push to HuggingFace Spaces
```

### Environment variables

| Variable | Default | Description |
|-------------------------------------|--------------------------|-------------------------------------------------------------------|
| `AWS_INFRA_URL` | `http://localhost:4566` | MiniStack endpoint (used when `POOL_SIZE=1`) |
| `AWS_RL_ENV_POOL_SIZE` | `1` | **Server-side MiniStack pool size; set to 8 for GRPO training** |
| `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers `[BASE, BASE + POOL_SIZE)` |
| `BACKEND_TYPE` | `simulator` | `simulator` (MiniStack) or `aws` (real AWS, no pool) |
| `AWS_ACCESS_KEY_ID` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_SECRET_ACCESS_KEY` | `test` | AWS credentials (any value works for the simulator) |
| `AWS_DEFAULT_REGION` | `us-east-1` | AWS region |
| `MAX_STEPS` | `15` | Max steps per episode |
| `API_BASE_URL` | β€” | LLM API endpoint for [inference.py](inference.py) |
| `MODEL_NAME` | β€” | LLM model name for [inference.py](inference.py) |
| `HF_TOKEN` | β€” | HuggingFace token (dataset/adapter access, push) |
| `TEMPERATURE` | `0.7` | LLM sampling temperature |

### Curriculum stats API

```python
curriculum.get_stats()
# {
#   "episode_count": 42,
#   "tier": "intermediate",
#   "tier_episodes": 12,
#   "tier_success_rate": 0.75,
#   "graduated_tasks": [0, 2, 4],
#   "weak_spots": [11, 12],
#   "skill_profile": {0: 0.95, 1: 0.8, ...},
#   "spaced_rep_due": [0, 2],
#   "avg_reward_last_10": 0.65
# }
```
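A small monitoring sketch built on that dict, for code running inside the server process where `curriculum` is available. The keys are the documented ones; the threshold and log lines are illustrative.

```python
# Illustrative use of the stats dict above; thresholds are arbitrary.
stats = curriculum.get_stats()

print(f"episode {stats['episode_count']} Β· tier={stats['tier']} "
      f"({stats['tier_success_rate']:.0%} over {stats['tier_episodes']} episodes)")

if stats["weak_spots"]:
    print("needs work:", stats["weak_spots"])          # low-success task ids
if stats["spaced_rep_due"]:
    print("due for re-test:", stats["spaced_rep_due"])  # graduated tasks resurfacing
if stats["avg_reward_last_10"] < 0.5:                   # illustrative alert threshold
    print("reward trending low, inspect recent episodes")
```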
---

## 14. Testing

The test suite covers both isolated unit logic and end-to-end task execution against MiniStack.

### Unit tests β€” [tests/](tests/)

```bash
pytest tests/ -v
```

| File | Covers |
|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------|
| [test_aws_rl_env_environment.py](tests/test_aws_rl_env_environment.py) | Environment lifecycle, reset/step semantics, reward integration |
| [test_task_grader.py](tests/test_task_grader.py) | All 5 grading strategies, partial progress, penalties, bonuses |
| [test_resource_verifier.py](tests/test_resource_verifier.py) | Per-service ground-truth verification (20+ services) |
| [test_episode_tracker.py](tests/test_episode_tracker.py) | Command parsing, dedup, monotonic progress, rollback detection |
| [test_episode_context.py](tests/test_episode_context.py) | Per-episode context lifecycle |
| [test_drift_engine.py](tests/test_drift_engine.py) | Random drift selection, mutation application |
| [test_hint_provider.py](tests/test_hint_provider.py) | Three-level progressive hints, decay computation |
| [test_environment_designer.py](tests/test_environment_designer.py) | Setup-command provisioning |
| [test_pool.py](tests/test_pool.py) | Server-side `MiniStackPool` acquire/release, exhaustion |
| [test_grpo_pool.py](tests/test_grpo_pool.py) | Client-side `GrpoPool` connect/close, all-or-nothing rollback |

### Tier integration tests β€” [tests_tasks/](tests_tasks/)

```bash
pytest tests_tasks/ -v
```

133 tasks exercised end-to-end:

| File | Tasks |
|-----------------------------------------------------------------------------------------------------|------:|
| [test_warmup_tasks.py](tests_tasks/test_warmup_tasks.py) | 25 |
| [test_beginner_tasks.py](tests_tasks/test_beginner_tasks.py) | 25 |
| [test_intermediate_tasks.py](tests_tasks/test_intermediate_tasks.py) | 25 |
| [test_advanced_tasks.py](tests_tasks/test_advanced_tasks.py) | 25 |
| [test_expert_tasks.py](tests_tasks/test_expert_tasks.py) | 24 |
| [test_drift_tasks.py](tests_tasks/test_drift_tasks.py) | 9 |
| **Total** | **133** |

These tests double as the source of truth for canonical solutions used by the SFT dataset generator (extracted via AST β€” see [data/README.md Β§1](data/README.md#1-sft-dataset-generation)).

---

## 15. Tech stack

- **Python 3.12**, [`uv`](https://github.com/astral-sh/uv) for dependency management, multi-stage Docker
- **FastAPI**, **OpenEnv** (HTTP + WebSocket env protocol), **uvicorn**
- **TRL β‰₯ 0.21** (`GRPOTrainer`, `GRPOConfig`)
- **PEFT** (LoRA), **Unsloth** (4-bit quantized base, fused training kernels)
- **Transformers β‰₯ 4.45**, **datasets β‰₯ 2.20**, **HuggingFace Hub β‰₯ 0.24**
- **Optuna β‰₯ 3.6** (TPE sampler, SQLite study storage)
- **asyncio** + **websockets** + **httpx** (parallel rollout orchestration)
- **MiniStack** (vendored at [aws_infra/](aws_infra/), 34 AWS services)
- **AWS CLI v2** (subprocess invocation against MiniStack endpoint)
- **matplotlib**, **plotly** (training curves, Optuna visualizations)
- **pytest** (16 test files, ~250 KB of test code)
---

## 16. Links

- **Live demo**: [sizzing-aws-rl-env.hf.space/web](https://sizzing-aws-rl-env.hf.space/web)
- **HF Space**: [huggingface.co/spaces/Sizzing/aws_rl_env](https://huggingface.co/spaces/Sizzing/aws_rl_env)
- **API docs**: [/docs](https://sizzing-aws-rl-env.hf.space/docs) Β· [/redoc](https://sizzing-aws-rl-env.hf.space/redoc)
- **SFT adapter**: [Sizzing/aws-rl-sft-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-sft-qwen25coder3b-adapter)
- **GRPO adapter**: [Sizzing/aws-rl-grpo-qwen25coder3b-adapter](https://huggingface.co/Sizzing/aws-rl-grpo-qwen25coder3b-adapter)
- **Dataset**: [Sizzing/aws-rl-sft](https://huggingface.co/datasets/Sizzing/aws-rl-sft)
- **GitHub**: [github.com/udaykiranpadhy/aws-rl-env](https://github.com/udaykiranpadhy/aws-rl-env)

---

## 17. Acknowledgments

- **MiniStack** β€” vendored at [aws_infra/](aws_infra/). Upstream license preserved. Custom modifications attributable to commits `a648c3a`, `a00e981`; periodic upstream syncs `af2e945`, `579597b`.
- **OpenEnv** β€” environment protocol and Python client framework.
- **TRL** (HuggingFace) β€” `GRPOTrainer` implementation.
- **Unsloth** β€” 4-bit quantized model loaders + fused training kernels.
- **Google Colab** β€” GPU runtime used for training.
- **AWS service icons** in [server/static/img/aws/](server/static/img/aws/) β€” used in the web playground.

---

## Sub-README index

For deep technical detail on any subsystem:

- [server/README.md](server/README.md) β€” environment internals (curriculum, reward shaping, anti-hacking, chaos, drift, MiniStack-fork detail)
- [train/README.md](train/README.md) β€” SFT + GRPO training pipeline (LoRA config, Optuna search, multi-turn rollouts)
- [scripts/README.md](scripts/README.md) β€” parallel-rollout architecture (3 pool layers, all-or-nothing connect, concurrency safety)
- [data/README.md](data/README.md) β€” dataset generation (5 trajectory types, AST extraction) + base-model selection summary
- [data/sft/MODEL_EVALUATION.md](data/sft/MODEL_EVALUATION.md) β€” full 11-model benchmark report
- [compare/README.md](compare/README.md) β€” base vs SFT comparison harness
- [aws_infra/README.md](aws_infra/README.md) β€” vendored MiniStack upstream documentation (81 KB)

## Video walkthrough

- [Recorded video explaining the core functionality](https://share.zight.com/NQu0pLvQ)