
server/ — AWS RL Environment Internals

← back to main README

This directory implements the OpenEnv-compatible FastAPI server that powers the AWS RL Environment. The server exposes HTTP and WebSocket endpoints to a training agent, executes AWS CLI commands against a backing simulator (or real AWS), runs a reward / curriculum stack, and returns shaped observations.

If you only have time for the headline numbers, read the main README. This document is the reference for how the environment actually works — every defended invariant, every edge case, every config knob.


Table of contents

  1. Architecture overview
  2. HTTP / WebSocket endpoints
  3. Episode lifecycle
  4. Strategy pattern: Simulator vs Real AWS
  5. MiniStack: vendored fork & customizations
  6. Server-side MiniStack pool (parallel rollouts)
  7. Curriculum manager
  8. Reward shaping & TaskGrader
  9. Anti-reward-hacking — 8 defense layers
  10. Resource verifier
  11. Chaos engine
  12. Drift engine
  13. Hint provider
  14. Episode tracker
  15. Environment designer
  16. Task definitions (YAML schema)
  17. Security-posture audit examples
  18. Curriculum stats API
  19. Web playground

1. Architecture overview

┌─────────────────────────────── server/ process ───────────────────────────────┐
│                                                                               │
│   FastAPI app  (server/app.py)                                                │
│   ├── OpenEnv router  /reset  /step  /state  /schema  /ws  /health            │
│   ├── Web router      /web  /web/reset  /web/step  /web/state  /web/solution  │
│   └── env_factory ──► AwsRlEnvironment(strategy=…)                            │
│                          │                                                    │
│                          ├── EpisodeTracker          (per-episode state)      │
│                          ├── Curriculum              (priority + mastery)     │
│                          ├── EnvironmentDesigner     (setup commands)         │
│                          ├── HintProvider            (3-level hints)          │
│                          ├── ChaosEngine             (mid-episode mutations)  │
│                          ├── DriftEngine             (drift-task injection)   │
│                          ├── TaskGrader              (5-strategy dispatcher)  │
│                          ├── ResourceVerifier        (ground-truth state)     │
│                          └── EnvironmentStrategy ──► SimulatorStrategy        │
│                                                  ╲   (talks to MiniStack)     │
│                                                   ╲  AwsStrategy              │
│                                                      (talks to real AWS)      │
└───────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                          MiniStack process(es) on :4566+
                          (own port per pool slot when AWS_RL_ENV_POOL_SIZE > 1)


2. HTTP / WebSocket endpoints

OpenEnv-compatible (created via openenv.core.env_server.http_server.create_app):

| Method | Path | Purpose |
|---|---|---|
| POST | /reset | Wipe infra, pick next task from curriculum, return observation |
| POST | /step | Execute action, grade, optionally inject chaos, return obs |
| GET | /state | Full AwsRlState snapshot (current task, tracker, infra state) |
| GET | /schema | JSON schemas for AwsRlAction / AwsRlObservation |
| GET | /health | Liveness probe |
| WS | /ws | Persistent session (one MiniStack acquired per connection) |

Web playground (always mounted; backed by a dedicated lazy MiniStack — see §6):

| Method | Path | Purpose |
|---|---|---|
| GET | / | Redirect → /web |
| GET | /web | HTML playground (Jinja2 template index.html) |
| POST | /web/reset | Stateful reset for the playground's shared env |
| POST | /web/step | Stateful step for the playground's shared env |
| GET | /web/state | Current AwsRlState for the shared env |
| GET | /web/solution | Reveal next canonical solution command (debug aid) |

Auto-generated docs: /docs (Swagger), /redoc (ReDoc).


3. Episode lifecycle

  1. reset()

    1. EnvironmentStrategy.reset_environment() — wipes simulator state (no-op for real AWS)
    2. Curriculum.next_task() — picks the next task (see §7 priority scoring)
    3. EnvironmentDesigner.provision(task.setup_commands) — runs preflight CLI commands to create the broken / insecure infra the agent must fix (used by SRE, drift, security-posture tasks)
    4. DriftEngine.inject(task) — for drift tasks, randomly applies 2–3 mutations from task.possible_drifts
    5. EpisodeTracker.start(task) — fresh tracker
    6. Returns initial AwsRlObservation with the masked TaskInfo (task description but not success criteria)
  2. step(action)

    1. Validate — only commands starting with aws are accepted (see §9 layer 4)
    2. Intercept hint requests — aws help --task-hint returns next-level hint, increments hints_used, never reaches the simulator
    3. EnvironmentStrategy.execute(command) — runs the AWS CLI invocation, returns stdout / stderr / exit_code
    4. EpisodeTracker.record(...) — parses command, dedup-checks, updates partial_progress
    5. TaskGrader.grade(...) — returns shaped reward (see §8)
    6. ChaosEngine.maybe_inject(...) — at tier-scaled probability, executes a destructive mutation on a resource the agent just touched
    7. Curriculum.record_step(...) — accumulates step-level signal
    8. Returns updated AwsRlObservation
  3. Termination

    • obs.task_achieved == True, or
    • step_count >= MAX_STEPS (default 15, configurable via env var)
    • On terminate: Curriculum.record_result(task, achieved, reward) updates per-task mastery and may promote the agent's tier

4. Strategy pattern: Simulator vs Real AWS

The environment supports two backends, swapped via the BACKEND_TYPE env var (default simulator):

SimulatorStrategy — services/simulator_strategy.py

  • Talks to a MiniStack instance over HTTP (AWS_INFRA_URL, default http://localhost:4566)
  • AWS CLI invocations are subprocessed with AWS_ENDPOINT_URL set so they hit MiniStack
  • reset_environment() calls MiniStack's /_ministack/reset endpoint to wipe state
  • get_state() reads the custom /_ministack/state endpoint (see §5) — one HTTP call returns the entire infra inventory used by ResourceVerifier

AwsStrategy — services/aws_strategy.py

  • Uses ambient AWS credentials (whatever the standard AWS CLI credential chain finds)
  • No AWS_ENDPOINT_URL override — commands hit real AWS
  • reset_environment() is a no-op (we cannot wipe a real AWS account; expert-level task scenarios assume a clean / sandboxed sub-account)
  • Useful for end-to-end demonstrations, less so for RL training

Switching backends:

export BACKEND_TYPE=aws  # or "simulator" (default)
make run

The factory in server/app.py wires the right strategy at startup.
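Sketched out, the pattern is an abstract interface with two concrete backends and a factory keyed on BACKEND_TYPE. Class and method names below come from this doc; the method bodies are stubbed for illustration, since the real classes shell out to the AWS CLI:

```python
import os
from abc import ABC, abstractmethod


class EnvironmentStrategy(ABC):
    """Interface both backends implement (names from this doc)."""

    @abstractmethod
    def execute(self, command: str) -> dict: ...

    @abstractmethod
    def reset_environment(self) -> None: ...


class SimulatorStrategy(EnvironmentStrategy):
    def __init__(self, endpoint_url: str = "http://localhost:4566"):
        self.endpoint_url = endpoint_url  # MiniStack gateway

    def execute(self, command: str) -> dict:
        # Real code subprocesses the AWS CLI with AWS_ENDPOINT_URL set;
        # stubbed here for illustration.
        return {"stdout": "", "stderr": "", "exit_code": 0}

    def reset_environment(self) -> None:
        pass  # real code POSTs to /_ministack/reset


class AwsStrategy(EnvironmentStrategy):
    def execute(self, command: str) -> dict:
        return {"stdout": "", "stderr": "", "exit_code": 0}  # hits real AWS

    def reset_environment(self) -> None:
        pass  # deliberate no-op against a real account


def make_strategy() -> EnvironmentStrategy:
    """Pick the backend the same way the factory does: BACKEND_TYPE env var."""
    if os.environ.get("BACKEND_TYPE", "simulator") == "aws":
        return AwsStrategy()
    return SimulatorStrategy()
```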


5. MiniStack: vendored fork & customizations

Why this matters: the simulator that the grader queries is not a black-box pip dependency — it's vendored in-tree as a git subtree at aws_infra/ so we can extend it. The custom endpoints we added there are how ResourceVerifier and the grader can read full infra state in a single round-trip.

Vendored as a git subtree

aws_infra/ was imported via git subtree add in commit 2c38c0b "Bring mini stack to local" (PR #5). Upstream is the public MiniStack project. The full upstream README is preserved at aws_infra/README.md (81 KB).

Why we vendored instead of taking a pip dependency:

  1. Custom endpoints: we needed JSON state-introspection endpoints (/_ministack/state, /_ministack/actions) that upstream did not ship. These are the integration seams between our env grader and the simulator.
  2. Reproducible builds: the Docker image ships a specific MiniStack revision; no runtime network fetch, identical behavior across environments.
  3. Service-coverage extensions: occasional patches to individual service handlers (e.g. RDS state retrieval used by ResourceVerifier).

Custom modifications on top of upstream

Each modification is a separate, cleanly-cherry-pickable commit so future upstream syncs are low-conflict.

| Commit | Title | What it adds |
|---|---|---|
| a648c3a | feat: Add support for service state retrieval and action listing across multiple AWS services | /_ministack/state returns the entire infra inventory as JSON in one call (the grader's primary read path). /_ministack/actions lists every supported operation per service — used by tooling and tests. |
| a00e981 | chor: Small Fixes | Tightening / typo fixes on top of a648c3a. |
| af2e945 | Sync MiniStack with latest changes | Periodic upstream sync. Replays our custom commits cleanly because they are isolated and well-scoped. |
| 579597b | Sync MiniStack with latest changes | Subsequent upstream sync. |

To inspect any of these:

git show a648c3a                     # see the full diff for the state endpoint
git log --oneline -- aws_infra/      # see only the aws_infra/ history

Build integration

  • aws_infra/pyproject.toml declares MiniStack as its own package; we install it as an editable dependency via make install-all.
  • The Dockerfile stages MiniStack explicitly so the resulting container has no external network requirement at runtime.
  • The aws_infra/Makefile provides make build and make test targets if you want to work on MiniStack itself.
  • aws_infra/docker-compose.yml lets you run MiniStack alone for debugging.

Upstream sync workflow

# From the repo root
git subtree pull --prefix=aws_infra <upstream-remote> main --squash
# Resolve any conflicts (rare, because our patches live in identifiable commits)
# Test:
pytest tests/ -k "verifier or grader"

6. Server-side MiniStack pool (parallel rollouts)

Why: GRPO training generates G=8 rollouts per step on the same task and computes group-relative advantages. To run those 8 rollouts truly in parallel without state bleed, every rollout needs its own AWS world. The server-side pool makes that possible.

Design — server/app.py:75–138

When the server boots, make_env_factory(POOL_SIZE, BASE_PORT, BACKEND_TYPE) decides which factory to install:

| Mode | What gets created |
|---|---|
| BACKEND_TYPE=aws | No pool. All sessions share AwsStrategy. Pool would be meaningless on real AWS. |
| AWS_RL_ENV_POOL_SIZE=1 (default) | No pool object; one shared SimulatorStrategy on the default port. |
| AWS_RL_ENV_POOL_SIZE=N (N>1, simulator) | A MiniStackPool (thread-safe free-list of ports BASE..BASE+N-1). Each WebSocket session calls pool.acquire() to get its own MiniStack port; on disconnect env.close() triggers pool.release(port). |

The pool's acquire() raises RuntimeError("MiniStack pool exhausted") if a 9th client tries to connect when POOL_SIZE=8. OpenEnv's create_app(..., max_concurrent_envs=POOL_SIZE) enforces the same cap upstream so callers see a clean 503 instead.
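The free-list semantics are small enough to sketch (illustrative only; the real MiniStackPool in server/app.py also owns the MiniStack processes behind the ports):

```python
import threading


class MiniStackPool:
    """Thread-safe free-list of MiniStack ports [base_port, base_port + size)."""

    def __init__(self, base_port: int = 4566, size: int = 8):
        self._free = list(range(base_port, base_port + size))
        self._lock = threading.Lock()

    def acquire(self) -> int:
        """Hand out a dedicated port, or fail loudly when all are in use."""
        with self._lock:
            if not self._free:
                raise RuntimeError("MiniStack pool exhausted")
            return self._free.pop(0)

    def release(self, port: int) -> None:
        """Return a port to the free list (called from env.close())."""
        with self._lock:
            self._free.append(port)
```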

The Dockerfile launches N MiniStacks

The container's entrypoint starts POOL_SIZE MiniStack processes on ports 4566..4566+POOL_SIZE-1 before the FastAPI server is ready to accept connections. Each MiniStack runs the same image but has its own in-memory state — so the 8 rollouts cannot accidentally see each other's S3 buckets, IAM roles, etc.

Web playground gets its own MiniStack (lazy, on a constant port)

The pool owns [BASE..BASE+N-1] for WebSocket sessions. The web playground's shared _env cannot share those ports — a /web/step would clobber whichever rollout currently holds the same MiniStack. Instead, the web UI uses a dedicated MiniStack on a constant port outside the pool's range (AWS_RL_ENV_WEB_MINISTACK_PORT, default 4565). The pool is constructed as range(BASE, BASE+N), so pool.acquire() can never hand out the web port.

That dedicated MiniStack is spawned lazily by the FastAPI server on the first /web/* request (subprocess.Popen(["ministack", "-d"], env={"GATEWAY_PORT": "4565", ...})). Training-only deployments — the common case — pay zero cost: the extra MiniStack only exists if a user actually opens the playground. First request takes ~1–3s for the bind; subsequent requests are fast (cached _env). A startup assertion refuses to boot if AWS_RL_ENV_WEB_MINISTACK_PORT falls inside the pool's range.

POOL_SIZE=1 keeps the legacy single-MiniStack path: the web env shares :4566 with the lone pool MiniStack — no extra process, no extra port.

Configuration

| Env var | Default | Purpose |
|---|---|---|
| AWS_RL_ENV_POOL_SIZE | 1 | Number of MiniStack instances + WebSocket session capacity |
| AWS_RL_ENV_MINISTACK_BASE_PORT | 4566 | First MiniStack port; pool covers [BASE, BASE + N) |
| AWS_RL_ENV_WEB_MINISTACK_PORT | 4565 | Web playground's dedicated MiniStack port (lazy spawn; must lie outside the pool's range when POOL_SIZE>1) |
| BACKEND_TYPE | simulator | simulator (default, MiniStack) or aws (real AWS, pool disabled) |

Cross-link

The client side of this pool — the GrpoPool and MultiTurnEnvPool that open N persistent WebSocket connections and run rollouts concurrently — is documented in scripts/README.md. Read that doc for the full multi-turn + multi-rollout walkthrough.


7. Curriculum manager

Curriculum progression — 5 tiers, priority scoring formula, mastery + spaced rep + fast-track

services/curriculum.py — 536 LOC. Adaptive task selection with mastery tracking, spaced repetition, and tier promotion.

Per-tier configuration

| Tier | min_episodes | advance_rate | mastery_window | mastery_threshold | fast_track_rate | chaos_probability |
|---|---|---|---|---|---|---|
| warmup | 5 | 0.6 | 10 | 0.7 | 0.9 | 0.0 |
| beginner | 10 | 0.65 | 10 | 0.7 | 0.9 | 0.0 |
| intermediate | 15 | 0.65 | 10 | 0.7 | 0.9 | 0.10 |
| advanced | 15 | 0.7 | 10 | 0.7 | 0.9 | 0.20 |
| expert | 20 | 0.7 | 10 | 0.7 | 0.9 | 0.30 |

Priority scoring

For each episode the curriculum picks the highest-scored task within the agent's current tier:

score = novelty_bonus          # +100 if never attempted
      + weakness_weight        # +50 × (1 − task_success_rate)
      + spaced_rep_bonus       # +30 if a graduated task is "due" for re-test
      − recency_penalty        # −20 if attempted in the last 2 episodes

This single formula simultaneously enforces exploration (novelty), targets weak spots (weakness), prevents forgetting (spaced rep), and avoids rut behavior (recency). No hand-coded scheduling — it falls out of the score.
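As a runnable sketch (the weights are the documented constants; whether the novelty and weakness terms stack for a never-attempted task is an assumption of this illustration):

```python
def priority_score(attempts: int, success_rate: float,
                   spaced_rep_due: bool, attempted_recently: bool) -> float:
    """Score one candidate task; the curriculum picks the in-tier argmax."""
    score = 0.0
    if attempts == 0:
        score += 100                      # novelty bonus
    score += 50 * (1 - success_rate)      # weakness weight
    if spaced_rep_due:
        score += 30                       # spaced-repetition bonus
    if attempted_recently:                # within the last 2 episodes
        score -= 20                       # recency penalty
    return score
```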

Mastery model

  • Window: the last 10 episodes for each task
  • Threshold: a task graduates when its weighted success rate crosses 0.7
  • Decay: 0.85 exponential — recent results count for more
  • Un-graduation: if a graduated task drops back below threshold, it loses graduation and re-enters the rotation
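One way to compute that decayed window (a sketch; the exact weighting in services/curriculum.py may differ, this just illustrates "0.85 exponential, recent results count for more"):

```python
def weighted_success_rate(results, decay: float = 0.85, window: int = 10):
    """Decay-weighted success rate over the last `window` results.

    `results` is a list of booleans, most recent last (weighted highest).
    """
    recent = results[-window:]
    if not recent:
        return 0.0
    weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
    return sum(w for w, ok in zip(weights, recent) if ok) / sum(weights)
```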

Spaced repetition

Graduated tasks resurface at intervals [3, 6, 12, 24, 48] episodes. Pass on re-test → interval doubles (capped at 48). Fail → interval resets to 3. The +30 priority bonus in the scoring formula is what surfaces them.
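The interval rule in code form (ladder and cap from the paragraph above):

```python
INTERVALS = [3, 6, 12, 24, 48]   # episodes between re-tests


def next_interval(current: int, passed: bool) -> int:
    """Spaced-repetition update after a re-test of a graduated task."""
    if not passed:
        return INTERVALS[0]                  # fail: reset to 3
    return min(current * 2, INTERVALS[-1])   # pass: double, capped at 48
```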

Tier promotion

Two paths:

  • Standard: tier_episodes >= min_episodes and tier_success_rate >= advance_rate
  • Fast-track: 3 consecutive episodes at ≥ fast_track_rate (0.9) — bypasses the minimum

Demotion is not supported — the agent's "ratchet" only goes up. (Mastery on individual tasks does decay; the tier does not.)
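The two paths, sketched (defaults shown are the advanced-tier values from the table above; treating the fast-track streak as three trailing episode rewards ≥ 0.9 is an assumption of this sketch):

```python
def should_promote(tier_episodes: int, tier_success_rate: float,
                   recent_rewards: list, min_episodes: int = 15,
                   advance_rate: float = 0.7,
                   fast_track_rate: float = 0.9) -> bool:
    """Standard threshold OR a fast-track streak; there is no demotion path."""
    standard = (tier_episodes >= min_episodes
                and tier_success_rate >= advance_rate)
    fast_track = (len(recent_rewards) >= 3
                  and all(r >= fast_track_rate for r in recent_rewards[-3:]))
    return standard or fast_track
```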

Notable APIs

  • Curriculum.next_task() -> Task — selection
  • Curriculum.record_result(task, achieved, reward) — episode-level callback
  • Curriculum.get_task_by_id(task_id) -> Task — used by the GRPO validation harness for frozen held-out tasks
  • Curriculum.get_stats() -> dict — see §18

8. Reward shaping & TaskGrader

services/task_grader.py — 264 LOC. The grader is the single source of reward truth.

Reward formula

if task_achieved:
    reward = 1.0
    if survived_chaos:    reward *= 1.05      # ≤ 1.05 cap
else:
    reward = partial_progress * 0.8           # ≤ 0.8 from steps alone
    if progress_increased: reward += 0.1      # dense progress signal
    if command_failed:     reward *= 0.5      # error penalty
    reward -= 0.1 * rollback_count            # create→delete pairs
    reward += 0.02 * idempotent_retries       # graceful "already exists"
    reward = clamp(reward, 0.0, 0.99)         # 1.0 reserved for completion

reward *= 0.85 ** hints_used                  # hint decay applied last

This is dense by design β€” the agent gets meaningful feedback on every step, not just at episode end.
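The pseudocode translates directly into runnable form (same constants; the clamp written out explicitly):

```python
def shape_reward(task_achieved: bool, survived_chaos: bool,
                 partial_progress: float, progress_increased: bool,
                 command_failed: bool, rollback_count: int,
                 idempotent_retries: int, hints_used: int) -> float:
    """Shaped step reward, mirroring the formula above."""
    if task_achieved:
        reward = 1.0
        if survived_chaos:
            reward *= 1.05                      # chaos-survival bonus
    else:
        reward = partial_progress * 0.8
        if progress_increased:
            reward += 0.1
        if command_failed:
            reward *= 0.5
        reward -= 0.1 * rollback_count
        reward += 0.02 * idempotent_retries
        reward = max(0.0, min(reward, 0.99))    # 1.0 reserved for completion
    return reward * 0.85 ** hints_used          # hint decay applied last
```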

Five grading strategies (dispatcher pattern)

TaskGrader.grade() dispatches on task.success_criteria.grading_strategy:

| Tier | Strategy | Mechanism | Partial-progress source |
|---|---|---|---|
| Warmup | command_match | Latest command contains correct service + operation | Binary 0 or 1.0 |
| Beginner | resource_creation | Command match (0.5) + ResourceVerifier confirms exact resource exists in state (1.0) | Two-stage (0.5 → 1.0) |
| Intermediate | multi_step | Ordered list of (operation, resource) pairs; credit each new step | completed_steps / total_steps |
| Advanced | multi_step + services | Same as multi_step and all services_required must be touched | completed_steps / total_steps (capped until services satisfied) |
| Expert | state_checks | ResourceVerifier runs arbitrary AWS CLI commands at grading time and asserts on output | 0.7 × steps + 0.3 × state_checks |

State-check assertions support two forms:

  • output_contains: <substring> — substring match on stdout
  • json_path: <jq-style path> + expected: <value> — JSON extraction with expected value
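A minimal evaluator for the two forms (keys mirror the YAML fields above; the json_path handling here is a simplified dot-path walk, not full jq):

```python
import json


def check_assertion(stdout: str, assertion: dict) -> bool:
    """Evaluate one state-check assertion against captured stdout."""
    if "output_contains" in assertion:
        return assertion["output_contains"] in stdout
    if "json_path" in assertion:
        value = json.loads(stdout)
        for key in assertion["json_path"].lstrip(".").split("."):
            value = value[key]                 # walk nested objects
        return value == assertion["expected"]
    return False
```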

This per-tier polymorphism is critical: a single grading rule would be too lax for warmup or too crude for SRE tasks.

Chaos survival bonus

If ChaosEngine injected a mutation during the episode and the agent still completed, reward is 1.05 instead of 1.0 (5% bonus) — and that bonus stacks under hint decay (so the agent that solves a chaotic task without hints gets the maximum).

Rollback penalty & idempotency bonus

  • Rollback (-0.1 per pair): EpisodeTracker.detect_rollbacks() scans the command history for (create-X, … , delete-X) pairs on the same resource. Production-style waste — heavily penalized.
  • Idempotency (+0.02): if a command fails with a known "already exists" pattern (BucketAlreadyExists, ResourceInUseException, etc.) and the next command continues productively, the agent is rewarded for graceful retry behavior.

This is the first RL environment we know of that rewards operational discipline directly.


9. Anti-reward-hacking — 8 defense layers

The agent's only loss surface is the reward signal. We harden it so that the cheapest path to a high reward is actually doing the task — not gaming the grader.

Layer 1 — Ground-truth verification via MiniStack

The grader never trusts agent command output. For every resource check it asks ResourceVerifier to query MiniStack (or AWS) directly. Even if the agent crafts a perfectly-formed fake JSON response, the grader looks at server-side state.

ResourceVerifier covers 20+ services with bespoke verification methods (S3, DynamoDB, Lambda, SQS, SNS, IAM, Secrets Manager, API Gateway, Cognito, RDS, EFS, ElastiCache, EC2, Step Functions, Glue, Athena, EMR, ECS, EKS, EventBridge, Kinesis, …).

Layer 2 — Deduplication

EpisodeTracker.has_executed_operation() records every (operation, resource) pair that earned credit. Re-running the same successful command does not re-earn partial_progress. Each unique operation pays out exactly once.
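The pay-out-once rule fits in a tiny ledger (a sketch; the class name is hypothetical, the real logic lives in EpisodeTracker):

```python
class CreditLedger:
    """Each (operation, resource) pair earns partial-progress credit once."""

    def __init__(self):
        self._credited = set()

    def credit(self, operation: str, resource: str, amount: float) -> float:
        key = (operation, resource)
        if key in self._credited:
            return 0.0                 # replaying a success pays nothing
        self._credited.add(key)
        return amount
```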

Layer 3 — Grader invisibility

The CLI commands run by ResourceVerifier and state_checks happen server-side and are not echoed into the agent's observation. The agent never sees which queries the grader is making, so it cannot reverse-engineer "fake outputs" that match the grader's expectations.

Layer 4 — Command allow-listing

step() rejects anything that doesn't start with aws (success=False, no execution). No shell metacharacters, no piping, no redirection, no escape from the AWS CLI sandbox.
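A sketch of that gate (the explicit metacharacter blacklist is an assumption about how the real validator rejects shell escapes):

```python
import shlex


def validate_action(command: str) -> bool:
    """Accept only a plain `aws …` invocation; reject everything else."""
    if any(ch in command for ch in "|;&><`$"):
        return False                  # no piping, redirection, substitution
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False                  # unbalanced quotes etc.
    return bool(tokens) and tokens[0] == "aws"
```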

Layer 5 — No verification reward

If the agent's command exactly matches one of the task's state_checks commands (e.g. aws s3api get-bucket-versioning --bucket app-config-store), it gets zero progress credit. Only mutating commands (create / put / update / delete) earn credit. Read-only auditing is freely allowed but not rewarded — exactly mirroring the grader's behavior.

Layer 6 — Monotonic progress

partial_progress only ever increases within an episode. It is clamped at 0.99; reaching 1.0 requires fully verified completion. The agent cannot lose progress, and repeating already-credited work earns nothing, so cycling strategies (create → delete → create) yield zero net gain.

Layer 7 — Resource-name validation

ResourceVerifier checks the exact resource name from the task definition. Creating my-test-bucket-2 does not satisfy a check for my-test-bucket. The agent cannot creatively name its way around the spec.

Layer 8 — State checks verify the final state

For expert SRE tasks, the grader runs the canonical state_checks commands at grading time against the live MiniStack. The grade is "what is true now?", not "what did the agent claim?". This is the single hardest layer to circumvent.

These layers compose: even if one is bypassed (e.g. a clever exact-match name), the others independently still produce the right reward.


10. Resource verifier

services/resource_verifier.py — 362 LOC.

  • Per-service verify_* methods for 20+ AWS services. Each method knows which API calls expose state for that service and how to read the response (e.g. verify_s3_bucket(name) calls s3api list-buckets, verify_dynamodb_table(name) calls dynamodb describe-table, etc.).
  • Single-shot state path: when called via SimulatorStrategy.get_state(), the verifier reads MiniStack's custom /_ministack/state endpoint (added in commit a648c3a, see §5) which returns the full infra inventory in one HTTP call. This is dramatically faster than iterating 20+ list APIs per grading pass.
  • State-check evaluator: handles output_contains (substring) and json_path + expected (JSON extraction with deep-path support) assertion types used by expert-tier tasks.
  • Live ground-truth source — the verifier never consumes the agent's stdout. Always fresh state from the simulator.

11. Chaos engine

services/chaos_engine.py — 168 LOC.

Probabilistically perturbs AWS resource state mid-episode. Tests whether the agent can detect and recover from unexpected drift β€” a critical SRE skill.

  • Tier-scaled probability: 0% warmup/beginner, 10% intermediate, 20% advanced, 30% expert
  • Service-scoped templates: a chaos roll only fires on services the current task is touching. Resource names are extracted from the agent's recent successful commands via service-specific regex (e.g. aws s3 mb s3://(\S+) → bucket name).
  • Five service templates: S3 policy / versioning changes, DynamoDB throughput modifications, Lambda configuration alterations, IAM detach-role-policy, SNS subscription mutations
  • Silent: chaos commands run server-side; the agent observes only the consequence (a state inconsistency), never the cause
  • Reward bonus: surviving chaos and completing the task pays 1.05 instead of 1.0

The combination of "tier-scaled probability" + "task-scoped resource selection" means chaos is rare for warmup tasks (0%) and frequent for SRE tasks (30%) — exactly where it matters.
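Putting the tier-scaled roll and service scoping together (a sketch; the template payload below is an invented placeholder, not a real chaos command):

```python
import random

# Tier-scaled injection probabilities from this section
CHAOS_PROBABILITY = {"warmup": 0.0, "beginner": 0.0,
                     "intermediate": 0.10, "advanced": 0.20, "expert": 0.30}


def maybe_inject_chaos(tier, task_services, templates, rng):
    """Return chaos commands to run server-side, or None for no injection."""
    eligible = [s for s in task_services if s in templates]
    if not eligible or rng.random() >= CHAOS_PROBABILITY[tier]:
        return None
    return templates[rng.choice(eligible)]
```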


12. Drift engine

services/drift_engine.py — 67 LOC.

Specialised for the 6 drift-detection expert tasks defined in services/tasks/drift.yaml.

  • Each drift task ships a pool of possible_drifts (each a small list of CLI commands that mutates a resource away from the desired spec).
  • On reset(), the engine randomly selects 2–3 drifts from that pool and applies them after the setup-command phase.
  • The agent sees a desired_state_spec (natural language) and must audit the environment, identify which resources drifted, and fix only those.
  • Random selection per episode means no memorization — the agent must reason about desired vs actual state, not recall a fix script.
  • Examples: S3 versioning/encryption drift, DynamoDB throughput changes, SNS subscription modifications, Lambda env-var tampering.

13. Hint provider

services/hint_provider.py — 137 LOC.

Three-level progressive hints, requested via the special action aws help --task-hint:

| Level | What it reveals | Example |
|---|---|---|
| 1 | Required AWS services | "You'll need IAM and Lambda" |
| 2 | Operation sequence | "Start with create-role, then put-role-policy" |
| 3 | Near-complete command structure | "Use: aws iam create-role --role-name …" |

  • Hints are auto-derived from the SuccessCriteria fields (services list, ordered steps, operation names) — no hand-written hint text per task.
  • Reward decay: final_reward *= 0.85 ** hints_used. With three hints (max), the agent caps at 0.85³ ≈ 0.614 of normal reward.
  • The hint command is intercepted before reaching MiniStack so it does not consume an episode step nor affect simulator state.

14. Episode tracker

services/episode_tracker.py — 241 LOC.

Single source of per-episode state. Maintains:

  • Step count, hint count, command history (raw + parsed)
  • partial_progress: float ∈ [0, 1] (monotonic — see anti-hack layer 6)
  • credited_operations: set[(operation, resource)] (for dedup — anti-hack layer 2)
  • Rollback detection: scans history for (create-X, …, delete-X) pairs on same resource
  • Idempotency detection: looks for known "already exists" error patterns

Parses each AWS CLI invocation into a structured tuple (service, operation, resource_name) for downstream services to query without re-parsing.
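A simplified version of that parse (the real tracker's resource-extraction rules are service-specific and richer than this heuristic):

```python
import re


def parse_cli(command: str):
    """Parse `aws <service> <operation> …` into (service, operation, resource)."""
    tokens = command.split()
    if len(tokens) < 3 or tokens[0] != "aws":
        return None
    service, operation = tokens[1], tokens[2]
    # Heuristic: try a --*-name/--bucket style flag first, then an s3:// URI.
    m = (re.search(r"--(?:[a-z-]*name|bucket|queue|topic)[= ](\S+)", command)
         or re.search(r"s3://([^/\s]+)", command))
    return service, operation, (m.group(1) if m else None)
```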


15. Environment designer

services/environment_designer.py — 99 LOC.

Provisioning helper for SRE / security-posture / drift tasks. A task can declare setup_commands: list[SetupCommand] — these are executed (server-side) before the agent starts so the world begins in a deliberately broken / insecure / over-provisioned state. Examples:

  • "Public S3 bucket lockdown" (§17): creates public-assets with a wide-open bucket policy
  • "IAM least-privilege": creates app-role with Action: * / Resource: *
  • Drift tasks: provision the correct infra so the drift engine can mutate it

Setup failures abort the reset — partial setup is never exposed to the agent.


16. Task definitions (YAML schema)

services/tasks/ — one YAML file per tier:

Sample task:

- task_id: 42
  description: Create an S3 bucket named my-app-data and enable versioning on it.
  difficulty: intermediate
  success_criteria:
    grading_strategy: multi_step
    steps:
      - operation: create-bucket
        resource: my-app-data
      - operation: put-bucket-versioning
        resource: my-app-data
    services: [s3]
  setup_commands: []
  possible_drifts: []

Expert / drift tasks add state_checks, desired_state_spec, and setup_commands.
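The schema maps naturally onto dataclasses (a sketch; field names come from the sample above, while the real models may use Pydantic and carry more fields):

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    operation: str
    resource: str


@dataclass
class SuccessCriteria:
    grading_strategy: str
    steps: list = field(default_factory=list)
    services: list = field(default_factory=list)


@dataclass
class Task:
    task_id: int
    description: str
    difficulty: str
    success_criteria: SuccessCriteria
    setup_commands: list = field(default_factory=list)
    possible_drifts: list = field(default_factory=list)


def task_from_dict(raw: dict) -> Task:
    """Build a Task from one parsed YAML entry."""
    sc = raw["success_criteria"]
    criteria = SuccessCriteria(
        grading_strategy=sc["grading_strategy"],
        steps=[Step(**s) for s in sc.get("steps", [])],
        services=sc.get("services", []),
    )
    return Task(task_id=raw["task_id"], description=raw["description"],
                difficulty=raw["difficulty"], success_criteria=criteria,
                setup_commands=raw.get("setup_commands", []),
                possible_drifts=raw.get("possible_drifts", []))
```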


17. Security-posture audit examples

These three expert-tier tasks test reasoning about configuration state — the infra is functional but insecure. The agent must read existing config and recognize the vulnerability.

Public S3 bucket lockdown

  • Setup: bucket public-assets is provisioned with a bucket policy granting Principal: * access
  • Task: replace the policy so only IAM role app-role can s3:GetObject
  • State checks: bucket policy denies Principal: *, allows only app-role

IAM least privilege

  • Setup: role app-role exists with an inline policy Action: *, Resource: *
  • Task: replace with a least-privilege policy allowing only dynamodb:GetItem and dynamodb:PutItem on the users table
  • State checks: policy document matches the expected ARN-scoped permissions

Lambda secret rotation

  • Setup: Lambda data-processor has env var DB_PASSWORD=hunter2 (plaintext)
  • Task: create a Secrets Manager secret, add SECRET_ARN env var, remove DB_PASSWORD
  • State checks: secret exists, Lambda has SECRET_ARN, no DB_PASSWORD remains

These are not hypothetical scenarios — they're the most common cloud-misconfiguration findings in real audits.


18. Curriculum stats API

Curriculum.get_stats() returns:

{
    "episode_count": 42,
    "tier": "intermediate",
    "tier_episodes": 12,
    "tier_success_rate": 0.75,
    "graduated_tasks": [0, 2, 4],
    "weak_spots": [11, 12],
    "skill_profile": {0: 0.95, 1: 0.8, ...},   # per-task weighted success
    "spaced_rep_due": [0, 2],                   # graduated tasks due for re-test
    "avg_reward_last_10": 0.65,
}

Useful for:

  • Dashboarding training progress
  • Logging into the GRPO EpisodeLogger CSV (see train_grpo.py:635)
  • Driving the web playground's progress bar

19. Web playground

Always mounted at http://localhost:8000/web. When POOL_SIZE>1 the playground is backed by a dedicated lazy-spawned MiniStack on AWS_RL_ENV_WEB_MINISTACK_PORT (default 4565) — see §6. First request takes ~1–3s while that MiniStack binds; subsequent requests are fast.

  • HTML: server/templates/index.html
  • Static assets: server/static/ — CSS, JS, and 40 AWS service icons in server/static/img/aws/
  • The playground talks to /web/reset, /web/step, /web/state, and /web/solution (the last one reveals the next canonical solution command — handy for demos and debugging task definitions).

The playground runs a single shared environment instance on its own MiniStack (or, with POOL_SIZE=1, the lone pool MiniStack on :4566). It is intentionally separate from the per-WebSocket sessions used during training so a curious user clicking around the web UI cannot interfere with an active GRPO rollout.


See also