Spaces:
Running
Running
| # `server/` β AWS RL Environment Internals | |
| [β back to main README](../README.md) | |
| This directory implements the **OpenEnv-compatible FastAPI server** that powers the AWS RL Environment. The server exposes HTTP and WebSocket endpoints to a training agent, executes AWS CLI commands against a backing simulator (or real AWS), runs a reward / curriculum stack, and returns shaped observations. | |
| If you only have time for the headline numbers, read [the main README](../README.md). This document is the reference for **how** the environment actually works β every defended invariant, every edge case, every config knob. | |
| --- | |
| ## Table of contents | |
| 1. [Architecture overview](#1-architecture-overview) | |
| 2. [HTTP / WebSocket endpoints](#2-http--websocket-endpoints) | |
| 3. [Episode lifecycle](#3-episode-lifecycle) | |
| 4. [Strategy pattern: Simulator vs Real AWS](#4-strategy-pattern-simulator-vs-real-aws) | |
| 5. [MiniStack: vendored fork & customizations](#5-ministack-vendored-fork--customizations) | |
| 6. [Server-side MiniStack pool (parallel rollouts)](#6-server-side-ministack-pool-parallel-rollouts) | |
| 7. [Curriculum manager](#7-curriculum-manager) | |
| 8. [Reward shaping & TaskGrader](#8-reward-shaping--taskgrader) | |
| 9. [Anti-reward-hacking β 8 defense layers](#9-anti-reward-hacking--8-defense-layers) | |
| 10. [Resource verifier](#10-resource-verifier) | |
| 11. [Chaos engine](#11-chaos-engine) | |
| 12. [Drift engine](#12-drift-engine) | |
| 13. [Hint provider](#13-hint-provider) | |
| 14. [Episode tracker](#14-episode-tracker) | |
| 15. [Environment designer](#15-environment-designer) | |
| 16. [Task definitions (YAML schema)](#16-task-definitions-yaml-schema) | |
| 17. [Security-posture audit examples](#17-security-posture-audit-examples) | |
| 18. [Curriculum stats API](#18-curriculum-stats-api) | |
| 19. [Web playground](#19-web-playground) | |
| --- | |
| ## 1. Architecture overview | |
| ``` | |
| βββββββββββββββββββββββββββββββββ server/ process βββββββββββββββββββββββββββββββββ | |
| β β | |
| β FastAPI app (server/app.py) β | |
| β βββ OpenEnv router /reset /step /state /schema /ws /health β | |
| β βββ Web router /web /web/reset /web/step /web/state /web/solution β | |
| β βββ env_factory βββΊ AwsRlEnvironment(strategy=β¦) β | |
| β β β | |
| β βββ EpisodeTracker (per-episode state) β | |
| β βββ Curriculum (priority + mastery) β | |
| β βββ EnvironmentDesigner (setup commands) β | |
| β βββ HintProvider (3-level hints) β | |
| β βββ ChaosEngine (mid-episode mutations) β | |
| β βββ DriftEngine (drift-task injection) β | |
| β βββ TaskGrader (5-strategy dispatcher) β | |
| β βββ ResourceVerifier (ground-truth state) β | |
| β βββ EnvironmentStrategy βββΊ SimulatorStrategy β | |
| β β² (talks to MiniStack) β | |
| β β² AwsStrategy β | |
| β (talks to real AWS) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β | |
| βΌ | |
| MiniStack process(es) on :4566+ | |
| (own port per pool slot when AWS_RL_ENV_POOL_SIZE > 1) | |
| ``` | |
| Files: | |
| - [server/app.py](app.py) β FastAPI app, OpenEnv integration, MiniStack pool, web routes | |
| - [server/aws_rl_env_environment.py](aws_rl_env_environment.py) β main `AwsRlEnvironment` orchestrator | |
| - [server/services/](services/) β pluggable services (one concern per file, listed in Β§7βΒ§16) | |
| - [server/services/tasks/](services/tasks/) β YAML task definitions, one file per tier | |
| - [server/templates/index.html](templates/index.html) β playground HTML | |
| - [server/static/](static/) β playground JS/CSS, 40 AWS service icons | |
| --- | |
| ## 2. HTTP / WebSocket endpoints | |
| OpenEnv-compatible (created via `openenv.core.env_server.http_server.create_app`): | |
| | Method | Path | Purpose | | |
| |--------|----------|-----------------------------------------------------------------| | |
| | POST | `/reset` | Wipe infra, pick next task from curriculum, return observation | | |
| | POST | `/step` | Execute action, grade, optionally inject chaos, return obs | | |
| | GET | `/state` | Full `AwsRlState` snapshot (current task, tracker, infra state) | | |
| | GET | `/schema`| JSON schemas for `AwsRlAction` / `AwsRlObservation` | | |
| | GET | `/health`| Liveness probe | | |
| | WS | `/ws` | Persistent session (one MiniStack acquired per connection) | | |
| Web playground (always mounted; backed by a dedicated lazy MiniStack β see Β§6): | |
| | Method | Path | Purpose | | |
| |--------|------------------|-----------------------------------------------------------| | |
| | GET | `/` | Redirect β `/web` | | |
| | GET | `/web` | HTML playground (Jinja2 template `index.html`) | | |
| | POST | `/web/reset` | Stateful reset for the playground's shared env | | |
| | POST | `/web/step` | Stateful step for the playground's shared env | | |
| | GET | `/web/state` | Current `AwsRlState` for the shared env | | |
| | GET | `/web/solution` | Reveal next canonical solution command (debug aid) | | |
| Auto-generated docs: `/docs` (Swagger), `/redoc` (ReDoc). | |
| --- | |
| ## 3. Episode lifecycle | |
| 1. **`reset()`** | |
| 1. `EnvironmentStrategy.reset_environment()` β wipes simulator state (no-op for real AWS) | |
| 2. `Curriculum.next_task()` β picks the next task (see Β§7 priority scoring) | |
| 3. `EnvironmentDesigner.provision(task.setup_commands)` β runs preflight CLI commands to create the broken / insecure infra the agent must fix (used by SRE, drift, security-posture tasks) | |
| 4. `DriftEngine.inject(task)` β for drift tasks, randomly applies 2β3 mutations from `task.possible_drifts` | |
| 5. `EpisodeTracker.start(task)` β fresh tracker | |
| 6. Returns initial `AwsRlObservation` with the masked `TaskInfo` (task description but **not** success criteria) | |
| 2. **`step(action)`** | |
| 1. **Validate** β only commands starting with `aws ` are accepted (see Β§9 layer 4) | |
| 2. **Intercept hint requests** β `aws help --task-hint` returns next-level hint, increments `hints_used`, never reaches the simulator | |
| 3. `EnvironmentStrategy.execute(command)` β runs the AWS CLI invocation, returns stdout / stderr / exit_code | |
| 4. `EpisodeTracker.record(...)` β parses command, dedup-checks, updates `partial_progress` | |
| 5. `TaskGrader.grade(...)` β returns shaped reward (see Β§8) | |
| 6. `ChaosEngine.maybe_inject(...)` β at tier-scaled probability, executes a destructive mutation on a resource the agent just touched | |
| 7. `Curriculum.record_step(...)` β accumulates step-level signal | |
| 8. Returns updated `AwsRlObservation` | |
| 3. **Termination** | |
| - `obs.task_achieved == True`, **or** | |
| - `step_count >= MAX_STEPS` (default 15, configurable via env var) | |
| - On terminate: `Curriculum.record_result(task, achieved, reward)` updates per-task mastery and may promote the agent's tier | |
| --- | |
| ## 4. Strategy pattern: Simulator vs Real AWS | |
| The environment supports two backends, swapped via the `BACKEND_TYPE` env var (default `simulator`): | |
| ### `SimulatorStrategy` β [services/simulator_strategy.py](services/simulator_strategy.py) | |
| - Talks to a MiniStack instance over HTTP (`AWS_INFRA_URL`, default `http://localhost:4566`) | |
| - AWS CLI invocations are subprocessed with `AWS_ENDPOINT_URL` set so they hit MiniStack | |
| - `reset_environment()` calls MiniStack's `/_ministack/reset` endpoint to wipe state | |
| - `get_state()` reads the **custom** `/_ministack/state` endpoint (see Β§5) β one HTTP call returns the entire infra inventory used by `ResourceVerifier` | |
| ### `AwsStrategy` β [services/aws_strategy.py](services/aws_strategy.py) | |
| - Uses ambient AWS credentials (whatever the standard AWS CLI credential chain finds) | |
| - No `AWS_ENDPOINT_URL` override β commands hit real AWS | |
| - `reset_environment()` is a **no-op** (we cannot wipe a real AWS account; expert-level task scenarios assume a clean / sandboxed sub-account) | |
| - Useful for end-to-end demonstrations, less so for RL training | |
| Switching backends: | |
| ```bash | |
| export BACKEND_TYPE=aws # or "simulator" (default) | |
| make run | |
| ``` | |
| The factory in [server/app.py](app.py) wires the right strategy at startup. | |
| --- | |
| ## 5. MiniStack: vendored fork & customizations | |
| > **Why this matters:** the simulator that the grader queries is not a black-box pip dependency β it's vendored in-tree as a git subtree at [aws_infra/](../aws_infra/) so we can extend it. The custom endpoints we added there are how `ResourceVerifier` and the grader can read full infra state in a single round-trip. | |
| ### Vendored as a git subtree | |
| `aws_infra/` was imported via `git subtree add` in commit **[`2c38c0b` "Bring mini stack to local"](../aws_infra/)** (PR #5). Upstream is the public MiniStack project. The full upstream README is preserved at [aws_infra/README.md](../aws_infra/README.md) (81 KB). | |
| Why we vendored instead of taking a pip dependency: | |
| 1. **Custom endpoints**: we needed JSON state-introspection endpoints (`/_ministack/state`, `/_ministack/actions`) that upstream did not ship. These are the integration seams between our env grader and the simulator. | |
| 2. **Reproducible builds**: the Docker image ships a specific MiniStack revision; no runtime network fetch, identical behavior across environments. | |
| 3. **Service-coverage extensions**: occasional patches to individual service handlers (e.g. RDS state retrieval used by `ResourceVerifier`). | |
| ### Custom modifications on top of upstream | |
| Each modification is a separate, cleanly-cherry-pickable commit so future upstream syncs are low-conflict. | |
| | Commit | Title | What it adds | | |
| |-----------|----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| | |
| | `a648c3a` | feat: Add support for service state retrieval and action listing across multiple AWS services | `/_ministack/state` returns the entire infra inventory as JSON in one call (the grader's primary read path). `/_ministack/actions` lists every supported operation per service β used by tooling and tests. | | |
| | `a00e981` | chor: Small Fixes | Tightening / typo fixes on top of `a648c3a`. | | |
| | `af2e945` | Sync MiniStack with latest changes | Periodic upstream sync. Replays our custom commits cleanly because they are isolated and well-scoped. | | |
| | `579597b` | Sync MiniStack with latest changes | Subsequent upstream sync. | | |
| To inspect any of these: | |
| ```bash | |
| git show a648c3a # see the full diff for the state endpoint | |
| git log --oneline -- aws_infra/ # see only the aws_infra/ history | |
| ``` | |
| ### Build integration | |
| - [aws_infra/pyproject.toml](../aws_infra/pyproject.toml) declares MiniStack as its own package; we install it as an editable dependency via `make install-all`. | |
| - The [Dockerfile](../Dockerfile) stages MiniStack explicitly so the resulting container has no external network requirement at runtime. | |
| - The [aws_infra/Makefile](../aws_infra/Makefile) provides `make build` and `make test` targets if you want to work on MiniStack itself. | |
| - `aws_infra/docker-compose.yml` lets you run MiniStack alone for debugging. | |
| ### Upstream sync workflow | |
| ```bash | |
| # From the repo root | |
| git subtree pull --prefix=aws_infra <upstream-remote> main --squash | |
| # Resolve any conflicts (rare, because our patches live in identifiable commits) | |
| # Test: | |
| pytest tests/ -k "verifier or grader" | |
| ``` | |
| --- | |
| ## 6. Server-side MiniStack pool (parallel rollouts) | |
| > **Why:** GRPO training generates `G=8` rollouts per step on the same task and computes group-relative advantages. To run those 8 rollouts truly in parallel **without state bleed**, every rollout needs its own AWS world. The server-side pool makes that possible. | |
| ### Design β [server/app.py:75β138](app.py) | |
| When the server boots, `make_env_factory(POOL_SIZE, BASE_PORT, BACKEND_TYPE)` decides which factory to install: | |
| | Mode | What gets created | | |
| |-------------------------------------------------|--------------------------------------------------------------------------------| | |
| | `BACKEND_TYPE=aws` | No pool. All sessions share `AwsStrategy`. Pool would be meaningless on real AWS. | | |
| | `AWS_RL_ENV_POOL_SIZE=1` (default) | No pool object; one shared `SimulatorStrategy` on the default port. | | |
| | `AWS_RL_ENV_POOL_SIZE=N` (`N>1`, simulator) | A `MiniStackPool` (thread-safe free-list of ports `BASE..BASE+N-1`). Each WebSocket session calls `pool.acquire()` to get its own MiniStack port; on disconnect `env.close()` triggers `pool.release(port)`. | | |
| The pool's `acquire()` raises `RuntimeError("MiniStack pool exhausted")` if a 9th client tries to connect when `POOL_SIZE=8`. OpenEnv's `create_app(..., max_concurrent_envs=POOL_SIZE)` enforces the same cap upstream so callers see a clean 503 instead. | |
| ### The Dockerfile launches N MiniStacks | |
| The container's entrypoint starts `POOL_SIZE` MiniStack processes on ports `4566..4566+POOL_SIZE-1` before the FastAPI server is ready to accept connections. Each MiniStack runs the same image but has its own in-memory state β so the 8 rollouts cannot accidentally see each other's S3 buckets, IAM roles, etc. | |
| ### Web playground gets its own MiniStack (lazy, on a constant port) | |
| The pool owns `[BASE..BASE+N-1]` for WebSocket sessions. The web playground's shared `_env` cannot share those ports β a `/web/step` would clobber whichever rollout currently holds the same MiniStack. Instead, the web UI uses a **dedicated MiniStack on a constant port outside the pool's range** (`AWS_RL_ENV_WEB_MINISTACK_PORT`, default `4565`). The pool is constructed as `range(BASE, BASE+N)`, so `pool.acquire()` can never hand out the web port. | |
| That dedicated MiniStack is **spawned lazily** by the FastAPI server on the first `/web/*` request (`subprocess.Popen(["ministack", "-d"], env={"GATEWAY_PORT": "4565", ...})`). Training-only deployments β the common case β pay zero cost: the extra MiniStack only exists if a user actually opens the playground. First request takes ~1β3s for the bind; subsequent requests are fast (cached `_env`). A startup assertion refuses to boot if `AWS_RL_ENV_WEB_MINISTACK_PORT` falls inside the pool's range. | |
| `POOL_SIZE=1` keeps the legacy single-MiniStack path: the web env shares `:4566` with the lone pool MiniStack β no extra process, no extra port. | |
| ### Configuration | |
| | Env var | Default | Purpose | | |
| |------------------------------------|---------|---------------------------------------------------------------| | |
| | `AWS_RL_ENV_POOL_SIZE` | `1` | Number of MiniStack instances + WebSocket session capacity | | |
| | `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers `[BASE, BASE + N)` | | |
| | `AWS_RL_ENV_WEB_MINISTACK_PORT` | `4565` | Web playground's dedicated MiniStack port (lazy spawn; must lie outside the pool's range when `POOL_SIZE>1`) | | |
| | `BACKEND_TYPE` | `simulator` | `simulator` (default, MiniStack) or `aws` (real AWS, pool disabled) | | |
| ### Cross-link | |
| The **client side** of this pool β the `GrpoPool` and `MultiTurnEnvPool` that open N persistent WebSocket connections and run rollouts concurrently β is documented in [scripts/README.md](../scripts/README.md). Read that doc for the full multi-turn + multi-rollout walkthrough. | |
| --- | |
| ## 7. Curriculum manager | |
| >  | |
| [services/curriculum.py](services/curriculum.py) β 536 LOC. Adaptive task selection with mastery tracking, spaced repetition, and tier promotion. | |
| ### Per-tier configuration | |
| | Tier | min_episodes | advance_rate | mastery_window | mastery_threshold | fast_track_rate | chaos_probability | | |
| |--------------|:------------:|:------------:|:--------------:|:-----------------:|:---------------:|:-----------------:| | |
| | warmup | 5 | 0.6 | 10 | 0.7 | 0.9 | 0.0 | | |
| | beginner | 10 | 0.65 | 10 | 0.7 | 0.9 | 0.0 | | |
| | intermediate | 15 | 0.65 | 10 | 0.7 | 0.9 | 0.10 | | |
| | advanced | 15 | 0.7 | 10 | 0.7 | 0.9 | 0.20 | | |
| | expert | 20 | 0.7 | 10 | 0.7 | 0.9 | 0.30 | | |
| ### Priority scoring | |
| For each episode the curriculum picks the highest-scored task within the agent's current tier: | |
| ``` | |
| score = novelty_bonus # +100 if never attempted | |
| + weakness_weight # +50 Γ (1 β task_success_rate) | |
| + spaced_rep_bonus # +30 if a graduated task is "due" for re-test | |
| β recency_penalty # β20 if attempted in the last 2 episodes | |
| ``` | |
| This single formula simultaneously enforces exploration (novelty), targets weak spots (weakness), prevents forgetting (spaced rep), and avoids rut behavior (recency). No hand-coded scheduling β it falls out of the score. | |
| ### Mastery model | |
| - **Window**: the last 10 episodes for each task | |
| - **Threshold**: a task graduates when its weighted success rate crosses 0.7 | |
| - **Decay**: `0.85` exponential β recent results count for more | |
| - **Un-graduation**: if a graduated task drops back below threshold, it loses graduation and re-enters the rotation | |
| ### Spaced repetition | |
| Graduated tasks resurface at intervals `[3, 6, 12, 24, 48]` episodes. Pass on re-test β interval doubles (capped at 48). Fail β interval resets to 3. The `+30` priority bonus in the scoring formula is what surfaces them. | |
| ### Tier promotion | |
| Two paths: | |
| - **Standard**: `tier_episodes >= min_episodes` and `tier_success_rate >= advance_rate` | |
| - **Fast-track**: 3 consecutive episodes at β₯ `fast_track_rate` (0.9) β bypasses the minimum | |
| Demotion is **not** supported β the agent's "ratchet" only goes up. (Mastery on individual tasks does decay; the *tier* does not.) | |
| ### Notable APIs | |
| - `Curriculum.next_task() -> Task` β selection | |
| - `Curriculum.record_result(task, achieved, reward)` β episode-level callback | |
| - `Curriculum.get_task_by_id(task_id) -> Task` β used by the GRPO validation harness for frozen held-out tasks | |
| - `Curriculum.get_stats() -> dict` β see Β§18 | |
| --- | |
| ## 8. Reward shaping & TaskGrader | |
| [services/task_grader.py](services/task_grader.py) β 264 LOC. The grader is the single source of reward truth. | |
| ### Reward formula | |
| ``` | |
| if task_achieved: | |
| reward = 1.0 | |
| if survived_chaos: reward *= 1.05 # β€ 1.05 cap | |
| else: | |
| reward = partial_progress * 0.8 # β€ 0.8 from steps alone | |
| if progress_increased: reward += 0.1 # dense progress signal | |
| if command_failed: reward *= 0.5 # error penalty | |
| reward -= 0.1 * rollback_count # createβdelete pairs | |
| reward += 0.02 * idempotent_retries # graceful "already exists" | |
| reward = clamp(reward, 0.0, 0.99) # 1.0 reserved for completion | |
| reward *= 0.85 ** hints_used # hint decay applied last | |
| ``` | |
| This is **dense by design** β the agent gets meaningful feedback on every step, not just at episode end. | |
| ### Five grading strategies (dispatcher pattern) | |
| `TaskGrader.grade()` dispatches on `task.success_criteria.grading_strategy`: | |
| | Tier | Strategy | Mechanism | Partial-progress source | | |
| |--------------|---------------------------|--------------------------------------------------------------------------------------------|--------------------------------------| | |
| | Warmup | `command_match` | Latest command contains correct service + operation | Binary 0 or 1.0 | | |
| | Beginner | `resource_creation` | Command match (0.5) + `ResourceVerifier` confirms exact resource exists in state (1.0) | Two-stage (0.5 β 1.0) | | |
| | Intermediate | `multi_step` | Ordered list of `(operation, resource)` pairs; credit each new step | `completed_steps / total_steps` | | |
| | Advanced | `multi_step + services` | Same as multi_step **and** all `services_required` must be touched | `completed_steps / total_steps` (capped until services satisfied) | | |
| | Expert | `state_checks` | `ResourceVerifier` runs arbitrary AWS CLI commands at grading time and asserts on output | `0.7 Γ steps + 0.3 Γ state_checks` | | |
| State-check assertions support two forms: | |
| - `output_contains: <substring>` β substring match on stdout | |
| - `json_path: <jq-style path>` + `expected: <value>` β JSON extraction with expected value | |
| This per-tier polymorphism is critical: a single grading rule would be too lax for warmup or too crude for SRE tasks. | |
| ### Chaos survival bonus | |
| If `ChaosEngine` injected a mutation during the episode and the agent still completed, reward is `1.05` instead of `1.0` (5% bonus) β and that bonus *stacks under* hint decay (so the agent that solves a chaotic task without hints gets the maximum). | |
| ### Rollback penalty & idempotency bonus | |
| - **Rollback** (`-0.1` per pair): `EpisodeTracker.detect_rollbacks()` scans the command history for `(create-X, β¦ , delete-X)` pairs on the same resource. Production-style waste β heavily penalized. | |
| - **Idempotency** (`+0.02`): if a command fails with a known "already exists" pattern (`BucketAlreadyExists`, `ResourceInUseException`, etc.) and the next command continues productively, the agent is rewarded for graceful retry behavior. | |
| This is the first RL environment we know of that rewards *operational discipline* directly. | |
| --- | |
| ## 9. Anti-reward-hacking β 8 defense layers | |
| The agent's only loss surface is the reward signal. We harden it so that the cheapest path to a high reward is *actually doing the task* β not gaming the grader. | |
| ### Layer 1 β Ground-truth verification via MiniStack | |
| The grader **never trusts agent command output**. For every resource check it asks `ResourceVerifier` to query MiniStack (or AWS) directly. Even if the agent crafts a perfectly-formed fake JSON response, the grader looks at server-side state. | |
| > ResourceVerifier covers 20+ services with bespoke verification methods (S3, DynamoDB, Lambda, SQS, SNS, IAM, Secrets Manager, API Gateway, Cognito, RDS, EFS, ElastiCache, EC2, Step Functions, Glue, Athena, EMR, ECS, EKS, EventBridge, Kinesis, β¦). | |
| ### Layer 2 β Deduplication | |
| `EpisodeTracker.has_executed_operation()` records every `(operation, resource)` pair that earned credit. Re-running the same successful command does **not** re-earn `partial_progress`. Each unique operation pays out exactly once. | |
| ### Layer 3 β Grader invisibility | |
| The CLI commands run by `ResourceVerifier` and `state_checks` happen **server-side** and are not echoed into the agent's observation. The agent never sees which queries the grader is making, so it cannot reverse-engineer "fake outputs" that match the grader's expectations. | |
| ### Layer 4 β Command allow-listing | |
| `step()` rejects anything that doesn't start with `aws ` (`success=False`, no execution). No shell metacharacters, no piping, no redirection, no escape from the AWS CLI sandbox. | |
| ### Layer 5 β No verification reward | |
| If the agent's command exactly matches one of the task's `state_checks` commands (e.g. `aws s3api get-bucket-versioning --bucket app-config-store`), it gets **zero** progress credit. Only mutating commands (create / put / update / delete) earn credit. Read-only auditing is freely allowed but not rewarded β exactly mirroring the grader's behavior. | |
| ### Layer 6 β Monotonic progress | |
| `partial_progress` only ever increases within an episode. It is clamped at `0.99`; reaching `1.0` requires fully verified completion. The agent cannot lose progress, but it also cannot re-earn lost progress, so cycling strategies (create β delete β create) yield zero net gain. | |
| ### Layer 7 β Resource-name validation | |
| `ResourceVerifier` checks the **exact** resource name from the task definition. Creating `my-test-bucket-2` does not satisfy a check for `my-test-bucket`. The agent cannot creatively name its way around the spec. | |
| ### Layer 8 β State checks verify the final state | |
| For expert SRE tasks, the grader runs the canonical `state_checks` commands at grading time against the live MiniStack. The grade is "what is true now?", not "what did the agent claim?". This is the single hardest layer to circumvent. | |
| These layers compose: even if one is bypassed (e.g. a clever exact-match name), the others independently still produce the right reward. | |
| --- | |
| ## 10. Resource verifier | |
| [services/resource_verifier.py](services/resource_verifier.py) β 362 LOC. | |
| - **Per-service `verify_*` methods** for 20+ AWS services. Each method knows which API calls expose state for that service and how to read the response (e.g. `verify_s3_bucket(name)` calls `s3api list-buckets`, `verify_dynamodb_table(name)` calls `dynamodb describe-table`, etc.). | |
| - **Single-shot state path**: when called via `SimulatorStrategy.get_state()`, the verifier reads MiniStack's custom `/_ministack/state` endpoint (added in commit `a648c3a`, see Β§5) which returns the full infra inventory in one HTTP call. This is dramatically faster than iterating 20+ list APIs per grading pass. | |
| - **State-check evaluator**: handles `output_contains` (substring) and `json_path` + `expected` (JSON extraction with deep-path support) assertion types used by expert-tier tasks. | |
| - **Live ground-truth source** β the verifier never consumes the agent's stdout. Always fresh state from the simulator. | |
| --- | |
| ## 11. Chaos engine | |
| [services/chaos_engine.py](services/chaos_engine.py) β 168 LOC. | |
| Probabilistically perturbs AWS resource state mid-episode. Tests whether the agent can detect and recover from unexpected drift β a critical SRE skill. | |
| - **Tier-scaled probability**: 0% warmup/beginner, 10% intermediate, 20% advanced, 30% expert | |
| - **Service-scoped templates**: a chaos roll only fires on services the current task is touching. Resource names are extracted from the agent's recent successful commands via service-specific regex (e.g. `aws s3 mb s3://(\S+)` β bucket name). | |
| - **Five service templates**: S3 policy / versioning changes, DynamoDB throughput modifications, Lambda configuration alterations, IAM detach-role-policy, SNS subscription mutations | |
| - **Silent**: chaos commands run server-side; the agent observes only the *consequence* (a state inconsistency), never the cause | |
| - **Reward bonus**: surviving chaos and completing the task pays `1.05` instead of `1.0` | |
| The combination of "tier-scaled probability" + "task-scoped resource selection" means chaos is rare for warmup tasks (0%) and frequent for SRE tasks (30%) β exactly where it matters. | |
| --- | |
| ## 12. Drift engine | |
| [services/drift_engine.py](services/drift_engine.py) β 67 LOC. | |
| Specialised for the 6 drift-detection expert tasks defined in [services/tasks/drift.yaml](services/tasks/drift.yaml). | |
| - Each drift task ships a pool of `possible_drifts` (each a small list of CLI commands that mutates a resource away from the desired spec). | |
| - On `reset()`, the engine **randomly selects 2β3 drifts** from that pool and applies them after the setup-command phase. | |
| - The agent sees a `desired_state_spec` (natural language) and must audit the environment, identify which resources drifted, and fix only those. | |
| - Random selection per episode means **no memorization** β the agent must reason about desired vs actual state, not recall a fix script. | |
| - Examples: S3 versioning/encryption drift, DynamoDB throughput changes, SNS subscription modifications, Lambda env-var tampering. | |
| --- | |
| ## 13. Hint provider | |
| [services/hint_provider.py](services/hint_provider.py) β 137 LOC. | |
| Three-level progressive hints, requested via the special action `aws help --task-hint`: | |
| | Level | What it reveals | Example | | |
| |-------|---------------------------------------|----------------------------------------------------------| | |
| | 1 | Required AWS services | "You'll need IAM and Lambda" | | |
| | 2 | Operation sequence | "Start with `create-role`, then `put-role-policy`" | | |
| | 3 | Near-complete command structure | "Use: `aws iam create-role --role-name β¦`" | | |
| - Hints are **auto-derived** from the `SuccessCriteria` fields (services list, ordered steps, operation names) β no hand-written hint text per task. | |
| - Reward decay: `final_reward *= 0.85 ** hints_used`. With three hints (max), the agent caps at `0.85Β³ β 0.614` of normal reward. | |
| - The hint command is **intercepted before reaching MiniStack** so it does not consume an episode step nor affect simulator state. | |
| --- | |
| ## 14. Episode tracker | |
| [services/episode_tracker.py](services/episode_tracker.py) β 241 LOC. | |
| Single source of per-episode state. Maintains: | |
| - Step count, hint count, command history (raw + parsed) | |
| - `partial_progress: float β [0, 1]` (monotonic β see anti-hack layer 6) | |
| - `credited_operations: set[(operation, resource)]` (for dedup β anti-hack layer 2) | |
| - Rollback detection: scans history for `(create-X, β¦, delete-X)` pairs on same resource | |
| - Idempotency detection: looks for known "already exists" error patterns | |
| Parses each AWS CLI invocation into a structured tuple `(service, operation, resource_name)` for downstream services to query without re-parsing. | |
| --- | |
| ## 15. Environment designer | |
| [services/environment_designer.py](services/environment_designer.py) β 99 LOC. | |
| Provisioning helper for SRE / security-posture / drift tasks. A task can declare `setup_commands: list[SetupCommand]` β these are executed (server-side) **before** the agent starts so the world begins in a deliberately broken / insecure / over-provisioned state. Examples: | |
| - "Public S3 bucket lockdown" (Β§17): creates `public-assets` with a wide-open bucket policy | |
| - "IAM least-privilege": creates `app-role` with `Action: *` / `Resource: *` | |
| - Drift tasks: provision the *correct* infra so the drift engine can mutate it | |
| Setup failures abort the reset β partial setup is never exposed to the agent. | |
| --- | |
| ## 16. Task definitions (YAML schema) | |
| [services/tasks/](services/tasks/) β one YAML file per tier: | |
| - [warmup.yaml](services/tasks/warmup.yaml) β 25 listing tasks | |
| - [beginner.yaml](services/tasks/beginner.yaml) β 25 single-resource creation tasks | |
| - [intermediate.yaml](services/tasks/intermediate.yaml) β 25 multi-step workflows | |
| - [advanced.yaml](services/tasks/advanced.yaml) β 25 cross-service architectures | |
| - [expert.yaml](services/tasks/expert.yaml) β 24 SRE / security tasks | |
| - [drift.yaml](services/tasks/drift.yaml) β 9 drift detection tasks | |
| Sample task: | |
| ```yaml | |
| - task_id: 42 | |
| description: Create an S3 bucket named my-app-data and enable versioning on it. | |
| difficulty: intermediate | |
| success_criteria: | |
| grading_strategy: multi_step | |
| steps: | |
| - operation: create-bucket | |
| resource: my-app-data | |
| - operation: put-bucket-versioning | |
| resource: my-app-data | |
| services: [s3] | |
| setup_commands: [] | |
| possible_drifts: [] | |
| ``` | |
| Expert / drift tasks add `state_checks`, `desired_state_spec`, and `setup_commands`. | |
| --- | |
| ## 17. Security-posture audit examples | |
| These three expert-tier tasks test reasoning about *configuration state* β the infra is functional but insecure. The agent must read existing config and recognize the vulnerability. | |
| ### Public S3 bucket lockdown | |
| - **Setup**: bucket `public-assets` is provisioned with a bucket policy granting `Principal: *` access | |
| - **Task**: replace the policy so only IAM role `app-role` can `s3:GetObject` | |
| - **State checks**: bucket policy denies `Principal: *`, allows only `app-role` | |
| ### IAM least privilege | |
| - **Setup**: role `app-role` exists with an inline policy `Action: *, Resource: *` | |
| - **Task**: replace with a least-privilege policy allowing only `dynamodb:GetItem` and `dynamodb:PutItem` on the users table | |
| - **State checks**: policy document matches the expected ARN-scoped permissions | |
| ### Lambda secret rotation | |
| - **Setup**: Lambda `data-processor` has env var `DB_PASSWORD=hunter2` (plaintext) | |
| - **Task**: create a Secrets Manager secret, add `SECRET_ARN` env var, remove `DB_PASSWORD` | |
| - **State checks**: secret exists, Lambda has `SECRET_ARN`, no `DB_PASSWORD` remains | |
| These are not hypothetical scenarios β they're the most common cloud-misconfiguration findings in real audits. | |
| --- | |
| ## 18. Curriculum stats API | |
| `Curriculum.get_stats()` returns: | |
| ```python | |
| { | |
| "episode_count": 42, | |
| "tier": "intermediate", | |
| "tier_episodes": 12, | |
| "tier_success_rate": 0.75, | |
| "graduated_tasks": [0, 2, 4], | |
| "weak_spots": [11, 12], | |
| "skill_profile": {0: 0.95, 1: 0.8, ...}, # per-task weighted success | |
| "spaced_rep_due": [0, 2], # graduated tasks due for re-test | |
| "avg_reward_last_10": 0.65, | |
| } | |
| ``` | |
| Useful for: | |
| - Dashboarding training progress | |
| - Logging into the GRPO `EpisodeLogger` CSV (see [train_grpo.py:635](../train_grpo.py)) | |
| - Driving the web playground's progress bar | |
| --- | |
| ## 19. Web playground | |
| Always mounted at [http://localhost:8000/web](http://localhost:8000/web). When `POOL_SIZE>1` the playground is backed by a **dedicated lazy-spawned MiniStack** on `AWS_RL_ENV_WEB_MINISTACK_PORT` (default `4565`) β see Β§6. First request takes ~1β3s while that MiniStack binds; subsequent requests are fast. | |
| - HTML: [server/templates/index.html](templates/index.html) | |
| - Static assets: [server/static/](static/) β CSS, JS, and **40 AWS service icons** in [server/static/img/aws/](static/img/aws/) | |
| - The playground talks to `/web/reset`, `/web/step`, `/web/state`, and `/web/solution` (the last one reveals the next canonical solution command β handy for demos and debugging task definitions). | |
| The playground runs a **single shared environment instance** on its own MiniStack (or, with `POOL_SIZE=1`, the lone pool MiniStack on `:4566`). It is intentionally separate from the per-WebSocket sessions used during training so a curious user clicking around the web UI cannot interfere with an active GRPO rollout. | |
| --- | |
| ## See also | |
| - [Main README](../README.md) β project overview, results, Colab links | |
| - [scripts/README.md](../scripts/README.md) β client-side parallel rollout pool (`GrpoPool`, `MultiTurnEnvPool`, asyncio orchestration) | |
| - [train/README.md](../train/README.md) β SFT + GRPO training pipeline | |
| - [data/README.md](../data/README.md) β dataset generation + base-model selection | |
| - [aws_infra/README.md](../aws_infra/README.md) β vendored MiniStack upstream docs (81 KB) | |