server/ – AWS RL Environment Internals
This directory implements the OpenEnv-compatible FastAPI server that powers the AWS RL Environment. The server exposes HTTP and WebSocket endpoints to a training agent, executes AWS CLI commands against a backing simulator (or real AWS), runs a reward / curriculum stack, and returns shaped observations.
If you only have time for the headline numbers, read the main README. This document is the reference for how the environment actually works: every defended invariant, every edge case, every config knob.
Table of contents
- Architecture overview
- HTTP / WebSocket endpoints
- Episode lifecycle
- Strategy pattern: Simulator vs Real AWS
- MiniStack: vendored fork & customizations
- Server-side MiniStack pool (parallel rollouts)
- Curriculum manager
- Reward shaping & TaskGrader
- Anti-reward-hacking – 8 defense layers
- Resource verifier
- Chaos engine
- Drift engine
- Hint provider
- Episode tracker
- Environment designer
- Task definitions (YAML schema)
- Security-posture audit examples
- Curriculum stats API
- Web playground
1. Architecture overview
```
┌─────────────────────────────── server/ process ───────────────────────────────┐
│
│  FastAPI app (server/app.py)
│   ├── OpenEnv router   /reset /step /state /schema /ws /health
│   ├── Web router       /web /web/reset /web/step /web/state /web/solution
│   └── env_factory ──► AwsRlEnvironment(strategy=…)
│                          ├── EpisodeTracker       (per-episode state)
│                          ├── Curriculum           (priority + mastery)
│                          ├── EnvironmentDesigner  (setup commands)
│                          ├── HintProvider         (3-level hints)
│                          ├── ChaosEngine          (mid-episode mutations)
│                          ├── DriftEngine          (drift-task injection)
│                          ├── TaskGrader           (5-strategy dispatcher)
│                          ├── ResourceVerifier     (ground-truth state)
│                          └── EnvironmentStrategy ──► SimulatorStrategy (talks to MiniStack)
│                                                      AwsStrategy       (talks to real AWS)
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
                         MiniStack process(es) on :4566+
           (own port per pool slot when AWS_RL_ENV_POOL_SIZE > 1)
```
Files:
- `server/app.py` – FastAPI app, OpenEnv integration, MiniStack pool, web routes
- `server/aws_rl_env_environment.py` – main `AwsRlEnvironment` orchestrator
- `server/services/` – pluggable services (one concern per file, listed in §7–§16)
- `server/services/tasks/` – YAML task definitions, one file per tier
- `server/templates/index.html` – playground HTML
- `server/static/` – playground JS/CSS, 40 AWS service icons
2. HTTP / WebSocket endpoints
OpenEnv-compatible (created via `openenv.core.env_server.http_server.create_app`):

| Method | Path | Purpose |
|---|---|---|
| POST | `/reset` | Wipe infra, pick next task from curriculum, return observation |
| POST | `/step` | Execute action, grade, optionally inject chaos, return obs |
| GET | `/state` | Full `AwsRlState` snapshot (current task, tracker, infra state) |
| GET | `/schema` | JSON schemas for `AwsRlAction` / `AwsRlObservation` |
| GET | `/health` | Liveness probe |
| WS | `/ws` | Persistent session (one MiniStack acquired per connection) |

Web playground (always mounted; backed by a dedicated lazy MiniStack – see §6):

| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Redirect → `/web` |
| GET | `/web` | HTML playground (Jinja2 template `index.html`) |
| POST | `/web/reset` | Stateful reset for the playground's shared env |
| POST | `/web/step` | Stateful step for the playground's shared env |
| GET | `/web/state` | Current `AwsRlState` for the shared env |
| GET | `/web/solution` | Reveal next canonical solution command (debug aid) |

Auto-generated docs: `/docs` (Swagger), `/redoc` (ReDoc).
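For a quick smoke test outside the OpenEnv client, the endpoints can be exercised directly over HTTP. A minimal sketch with `requests`, assuming the server runs on `localhost:8000`; the `{"action": {"command": ...}}` payload shape is a guess for illustration, so check `GET /schema` for the authoritative `AwsRlAction` / `AwsRlObservation` shapes.

```python
# Minimal smoke-test sketch; payload keys are illustrative, not authoritative.
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/health").json())            # liveness probe
print(requests.get(f"{BASE}/schema").json())            # action/observation JSON schemas

obs = requests.post(f"{BASE}/reset", json={}).json()    # start an episode
print(obs)

step = requests.post(                                    # hypothetical action shape
    f"{BASE}/step",
    json={"action": {"command": "aws s3 ls"}},
).json()
print(step)
```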
3. Episode lifecycle
`reset()`

1. `EnvironmentStrategy.reset_environment()` – wipes simulator state (no-op for real AWS)
2. `Curriculum.next_task()` – picks the next task (see §7 priority scoring)
3. `EnvironmentDesigner.provision(task.setup_commands)` – runs preflight CLI commands to create the broken / insecure infra the agent must fix (used by SRE, drift, security-posture tasks)
4. `DriftEngine.inject(task)` – for drift tasks, randomly applies 2–3 mutations from `task.possible_drifts`
5. `EpisodeTracker.start(task)` – fresh tracker
6. Returns the initial `AwsRlObservation` with the masked `TaskInfo` (task description but not success criteria)

`step(action)`

1. Validate – only commands starting with `aws` are accepted (see §9 layer 4)
2. Intercept hint requests – `aws help --task-hint` returns the next-level hint, increments `hints_used`, and never reaches the simulator
3. `EnvironmentStrategy.execute(command)` – runs the AWS CLI invocation, returns stdout / stderr / exit_code
4. `EpisodeTracker.record(...)` – parses the command, dedup-checks, updates `partial_progress`
5. `TaskGrader.grade(...)` – returns the shaped reward (see §8)
6. `ChaosEngine.maybe_inject(...)` – at tier-scaled probability, executes a destructive mutation on a resource the agent just touched
7. `Curriculum.record_step(...)` – accumulates step-level signal
8. Returns the updated `AwsRlObservation`

Termination

- `obs.task_achieved == True`, or
- `step_count >= MAX_STEPS` (default 15, configurable via env var)
- On terminate: `Curriculum.record_result(task, achieved, reward)` updates per-task mastery and may promote the agent's tier
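Putting the step pipeline together, a condensed sketch of the orchestration above (method and attribute names are simplified stand-ins; the real orchestrator in `server/aws_rl_env_environment.py` differs in detail):

```python
# Sketch only: simplified names, no error handling; mirrors the numbered step() flow above.
def step(self, action):
    command = action.command.strip()
    if not command.startswith("aws"):                        # layer-4 allow-list
        return self._observe(success=False, reward=0.0)
    if command.startswith("aws help --task-hint"):           # hint path, never hits the simulator
        return self._observe(hint=self.hint_provider.next_hint(), reward=0.0)

    result = self.strategy.execute(command)                  # AWS CLI against MiniStack / AWS
    self.tracker.record(command, result)                     # parse, dedup, update progress
    reward = self.grader.grade(self.task, self.tracker)      # shaped reward (§8)
    self.chaos_engine.maybe_inject(self.task, self.tracker)  # tier-scaled chaos roll
    self.curriculum.record_step(self.task, reward)
    done = self.tracker.task_achieved or self.tracker.step_count >= self.max_steps
    return self._observe(reward=reward, done=done)
```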
4. Strategy pattern: Simulator vs Real AWS
The environment supports two backends, swapped via the `BACKEND_TYPE` env var (default `simulator`):

SimulatorStrategy – `services/simulator_strategy.py`

- Talks to a MiniStack instance over HTTP (`AWS_INFRA_URL`, default `http://localhost:4566`)
- AWS CLI invocations are subprocessed with `AWS_ENDPOINT_URL` set so they hit MiniStack
- `reset_environment()` calls MiniStack's `/_ministack/reset` endpoint to wipe state
- `get_state()` reads the custom `/_ministack/state` endpoint (see §5); one HTTP call returns the entire infra inventory used by `ResourceVerifier`

AwsStrategy – `services/aws_strategy.py`

- Uses ambient AWS credentials (whatever the standard AWS CLI credential chain finds)
- No `AWS_ENDPOINT_URL` override – commands hit real AWS
- `reset_environment()` is a no-op (we cannot wipe a real AWS account; expert-level task scenarios assume a clean / sandboxed sub-account)
- Useful for end-to-end demonstrations, less so for RL training
Switching backends:
```bash
export BACKEND_TYPE=aws   # or "simulator" (default)
make run
```
The factory in server/app.py wires the right strategy at startup.
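The seam between the two backends is small. A sketch of the assumed interface (the real classes in `services/simulator_strategy.py` / `services/aws_strategy.py` carry more detail, and the HTTP method for `/_ministack/reset` is assumed here):

```python
import os
import subprocess
from abc import ABC, abstractmethod

import requests

class EnvironmentStrategy(ABC):
    @abstractmethod
    def execute(self, command: str) -> dict: ...
    @abstractmethod
    def reset_environment(self) -> None: ...

class SimulatorStrategy(EnvironmentStrategy):
    def __init__(self, endpoint: str = "http://localhost:4566"):
        self.endpoint = endpoint

    def execute(self, command: str) -> dict:
        # Route the AWS CLI at MiniStack instead of real AWS.
        env = {**os.environ, "AWS_ENDPOINT_URL": self.endpoint}
        proc = subprocess.run(command.split(), env=env, capture_output=True, text=True)
        return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}

    def reset_environment(self) -> None:
        requests.post(f"{self.endpoint}/_ministack/reset")   # wipe simulator state

class AwsStrategy(EnvironmentStrategy):
    def execute(self, command: str) -> dict:
        proc = subprocess.run(command.split(), capture_output=True, text=True)  # ambient creds
        return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}

    def reset_environment(self) -> None:
        pass   # cannot wipe a real AWS account
```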
5. MiniStack: vendored fork & customizations
Why this matters: the simulator that the grader queries is not a black-box pip dependency – it's vendored in-tree as a git subtree at `aws_infra/` so we can extend it. The custom endpoints we added there are how `ResourceVerifier` and the grader can read full infra state in a single round-trip.
Vendored as a git subtree
`aws_infra/` was imported via `git subtree add` in commit `2c38c0b` "Bring mini stack to local" (PR #5). Upstream is the public MiniStack project. The full upstream README is preserved at `aws_infra/README.md` (81 KB).
Why we vendored instead of taking a pip dependency:
- Custom endpoints: we needed JSON state-introspection endpoints (`/_ministack/state`, `/_ministack/actions`) that upstream did not ship. These are the integration seams between our env grader and the simulator.
- Reproducible builds: the Docker image ships a specific MiniStack revision; no runtime network fetch, identical behavior across environments.
- Service-coverage extensions: occasional patches to individual service handlers (e.g. RDS state retrieval used by `ResourceVerifier`).
Custom modifications on top of upstream
Each modification is a separate, cleanly-cherry-pickable commit so future upstream syncs are low-conflict.
| Commit | Title | What it adds |
|---|---|---|
| `a648c3a` | feat: Add support for service state retrieval and action listing across multiple AWS services | `/_ministack/state` returns the entire infra inventory as JSON in one call (the grader's primary read path). `/_ministack/actions` lists every supported operation per service – used by tooling and tests. |
| `a00e981` | chor: Small Fixes | Tightening / typo fixes on top of `a648c3a`. |
| `af2e945` | Sync MiniStack with latest changes | Periodic upstream sync. Replays our custom commits cleanly because they are isolated and well-scoped. |
| `579597b` | Sync MiniStack with latest changes | Subsequent upstream sync. |
To inspect any of these:
```bash
git show a648c3a                   # see the full diff for the state endpoint
git log --oneline -- aws_infra/    # see only the aws_infra/ history
```
Build integration
- `aws_infra/pyproject.toml` declares MiniStack as its own package; we install it as an editable dependency via `make install-all`.
- The Dockerfile stages MiniStack explicitly so the resulting container has no external network requirement at runtime.
- The `aws_infra/Makefile` provides `make build` and `make test` targets if you want to work on MiniStack itself.
- `aws_infra/docker-compose.yml` lets you run MiniStack alone for debugging.
Upstream sync workflow
```bash
# From the repo root
git subtree pull --prefix=aws_infra <upstream-remote> main --squash
# Resolve any conflicts (rare, because our patches live in identifiable commits)
# Test:
pytest tests/ -k "verifier or grader"
```
6. Server-side MiniStack pool (parallel rollouts)
Why: GRPO training generates `G=8` rollouts per step on the same task and computes group-relative advantages. To run those 8 rollouts truly in parallel without state bleed, every rollout needs its own AWS world. The server-side pool makes that possible.
Design – `server/app.py:75–138`
When the server boots, `make_env_factory(POOL_SIZE, BASE_PORT, BACKEND_TYPE)` decides which factory to install:

| Mode | What gets created |
|---|---|
| `BACKEND_TYPE=aws` | No pool. All sessions share `AwsStrategy`. A pool would be meaningless on real AWS. |
| `AWS_RL_ENV_POOL_SIZE=1` (default) | No pool object; one shared `SimulatorStrategy` on the default port. |
| `AWS_RL_ENV_POOL_SIZE=N` (N>1, simulator) | A `MiniStackPool` (thread-safe free-list of ports `BASE..BASE+N-1`). Each WebSocket session calls `pool.acquire()` to get its own MiniStack port; on disconnect `env.close()` triggers `pool.release(port)`. |

The pool's `acquire()` raises `RuntimeError("MiniStack pool exhausted")` if a 9th client tries to connect when `POOL_SIZE=8`. OpenEnv's `create_app(..., max_concurrent_envs=POOL_SIZE)` enforces the same cap upstream so callers see a clean 503 instead.
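The pool itself is just a thread-safe free-list of ports. A sketch of the idea (the real `MiniStackPool` in `server/app.py` may differ in detail):

```python
import threading

class MiniStackPool:
    def __init__(self, base_port: int, size: int):
        self._free = list(range(base_port, base_port + size))   # one port per MiniStack
        self._lock = threading.Lock()

    def acquire(self) -> int:
        with self._lock:
            if not self._free:
                raise RuntimeError("MiniStack pool exhausted")
            return self._free.pop()

    def release(self, port: int) -> None:
        with self._lock:
            self._free.append(port)
```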
The Dockerfile launches N MiniStacks
The container's entrypoint starts `POOL_SIZE` MiniStack processes on ports `4566..4566+POOL_SIZE-1` before the FastAPI server is ready to accept connections. Each MiniStack runs the same image but has its own in-memory state, so the 8 rollouts cannot accidentally see each other's S3 buckets, IAM roles, etc.
Web playground gets its own MiniStack (lazy, on a constant port)
The pool owns `[BASE..BASE+N-1]` for WebSocket sessions. The web playground's shared `_env` cannot share those ports: a `/web/step` would clobber whichever rollout currently holds the same MiniStack. Instead, the web UI uses a dedicated MiniStack on a constant port outside the pool's range (`AWS_RL_ENV_WEB_MINISTACK_PORT`, default 4565). The pool is constructed as `range(BASE, BASE+N)`, so `pool.acquire()` can never hand out the web port.
That dedicated MiniStack is spawned lazily by the FastAPI server on the first `/web/*` request (`subprocess.Popen(["ministack", "-d"], env={"GATEWAY_PORT": "4565", ...})`). Training-only deployments, the common case, pay zero cost: the extra MiniStack only exists if a user actually opens the playground. The first request takes ~1–3s for the bind; subsequent requests are fast (cached `_env`). A startup assertion refuses to boot if `AWS_RL_ENV_WEB_MINISTACK_PORT` falls inside the pool's range.
`POOL_SIZE=1` keeps the legacy single-MiniStack path: the web env shares `:4566` with the lone pool MiniStack – no extra process, no extra port.
Configuration
| Env var | Default | Purpose |
|---|---|---|
| `AWS_RL_ENV_POOL_SIZE` | `1` | Number of MiniStack instances + WebSocket session capacity |
| `AWS_RL_ENV_MINISTACK_BASE_PORT` | `4566` | First MiniStack port; pool covers `[BASE, BASE + N)` |
| `AWS_RL_ENV_WEB_MINISTACK_PORT` | `4565` | Web playground's dedicated MiniStack port (lazy spawn; must lie outside the pool's range when `POOL_SIZE>1`) |
| `BACKEND_TYPE` | `simulator` | `simulator` (default, MiniStack) or `aws` (real AWS, pool disabled) |
Cross-link
The client side of this pool (the `GrpoPool` and `MultiTurnEnvPool` that open N persistent WebSocket connections and run rollouts concurrently) is documented in `scripts/README.md`. Read that doc for the full multi-turn + multi-rollout walkthrough.
7. Curriculum manager
`services/curriculum.py` (536 LOC). Adaptive task selection with mastery tracking, spaced repetition, and tier promotion.
Per-tier configuration
| Tier | min_episodes | advance_rate | mastery_window | mastery_threshold | fast_track_rate | chaos_probability |
|---|---|---|---|---|---|---|
| warmup | 5 | 0.6 | 10 | 0.7 | 0.9 | 0.0 |
| beginner | 10 | 0.65 | 10 | 0.7 | 0.9 | 0.0 |
| intermediate | 15 | 0.65 | 10 | 0.7 | 0.9 | 0.10 |
| advanced | 15 | 0.7 | 10 | 0.7 | 0.9 | 0.20 |
| expert | 20 | 0.7 | 10 | 0.7 | 0.9 | 0.30 |
Priority scoring
For each episode the curriculum picks the highest-scored task within the agent's current tier:
```
score = novelty_bonus        # +100 if never attempted
      + weakness_weight      # +50 × (1 − task_success_rate)
      + spaced_rep_bonus     # +30 if a graduated task is "due" for re-test
      − recency_penalty      # −20 if attempted in the last 2 episodes
```
This single formula simultaneously enforces exploration (novelty), targets weak spots (weakness), prevents forgetting (spaced rep), and avoids rut behavior (recency). No hand-coded scheduling: it all falls out of the score.
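In code, the score is a few additions. A sketch using the constants above (the data structures are hypothetical: `history` maps task id to a list of 0/1 outcomes, `graduated_due` and `recent` are sets of task ids):

```python
def priority_score(task_id, history, graduated_due, recent):
    attempts = history.get(task_id, [])
    score = 0.0
    if not attempts:
        score += 100                              # novelty bonus
    else:
        success_rate = sum(attempts) / len(attempts)
        score += 50 * (1 - success_rate)          # weakness weight
    if task_id in graduated_due:
        score += 30                               # spaced-repetition re-test is due
    if task_id in recent:
        score -= 20                               # attempted in the last 2 episodes
    return score

# next_task() then reduces to something like:
# max(tier_tasks, key=lambda t: priority_score(t.task_id, history, due, recent))
```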
Mastery model
- Window: the last 10 episodes for each task
- Threshold: a task graduates when its weighted success rate crosses 0.7
- Decay: `0.85` exponential – recent results count for more
- Un-graduation: if a graduated task drops back below threshold, it loses graduation and re-enters the rotation
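A sketch of the decayed mastery check implied by those numbers (window 10, decay 0.85, threshold 0.7); the exact weighting in `services/curriculum.py` may differ:

```python
def weighted_success_rate(outcomes, decay=0.85, window=10):
    # outcomes: per-episode 0/1 results for one task, oldest first
    recent = outcomes[-window:]
    if not recent:
        return 0.0
    weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
    return sum(w * r for w, r in zip(weights, recent)) / sum(weights)

graduated = weighted_success_rate([0, 1, 0, 1, 1, 1, 1, 1]) >= 0.7   # most recent counts most
```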
Spaced repetition
Graduated tasks resurface at intervals [3, 6, 12, 24, 48] episodes. Pass on re-test → interval doubles (capped at 48). Fail → interval resets to 3. The +30 priority bonus in the scoring formula is what surfaces them.
Tier promotion
Two paths:
- Standard: `tier_episodes >= min_episodes` and `tier_success_rate >= advance_rate`
- Fast-track: 3 consecutive episodes at ≥ `fast_track_rate` (0.9) bypasses the minimum
Demotion is not supported: the agent's "ratchet" only goes up. (Mastery on individual tasks does decay; the tier does not.)
Notable APIs
- `Curriculum.next_task() -> Task` – selection
- `Curriculum.record_result(task, achieved, reward)` – episode-level callback
- `Curriculum.get_task_by_id(task_id) -> Task` – used by the GRPO validation harness for frozen held-out tasks
- `Curriculum.get_stats() -> dict` – see §18
8. Reward shaping & TaskGrader
`services/task_grader.py` (264 LOC). The grader is the single source of reward truth.
Reward formula
```
if task_achieved:
    reward = 1.0
    if survived_chaos: reward *= 1.05            # ≤ 1.05 cap
else:
    reward = partial_progress * 0.8              # ≤ 0.8 from steps alone
    if progress_increased: reward += 0.1         # dense progress signal
    if command_failed:     reward *= 0.5         # error penalty
    reward -= 0.1 * rollback_count               # create→delete pairs
    reward += 0.02 * idempotent_retries          # graceful "already exists"
    reward = clamp(reward, 0.0, 0.99)            # 1.0 reserved for completion
reward *= 0.85 ** hints_used                     # hint decay applied last
```
This is dense by design: the agent gets meaningful feedback on every step, not just at episode end.
Five grading strategies (dispatcher pattern)
`TaskGrader.grade()` dispatches on `task.success_criteria.grading_strategy`:
| Tier | Strategy | Mechanism | Partial-progress source |
|---|---|---|---|
| Warmup | `command_match` | Latest command contains correct service + operation | Binary 0 or 1.0 |
| Beginner | `resource_creation` | Command match (0.5) + `ResourceVerifier` confirms exact resource exists in state (1.0) | Two-stage (0.5 → 1.0) |
| Intermediate | `multi_step` | Ordered list of (operation, resource) pairs; credit each new step | completed_steps / total_steps |
| Advanced | `multi_step` + services | Same as `multi_step` and all `services_required` must be touched | completed_steps / total_steps (capped until services satisfied) |
| Expert | `state_checks` | `ResourceVerifier` runs arbitrary AWS CLI commands at grading time and asserts on output | 0.7 × steps + 0.3 × state_checks |
State-check assertions support two forms:
- `output_contains: <substring>` – substring match on stdout
- `json_path: <jq-style path>` + `expected: <value>` – JSON extraction with expected value
This per-tier polymorphism is critical: a single grading rule would be too lax for warmup or too crude for SRE tasks.
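A condensed sketch of that dispatch (helper names like `matched_command`, `resource_exists`, and `run_state_checks` are hypothetical; the real `TaskGrader` also applies the reward shaping from the formula above):

```python
def grade(task, tracker, verifier):
    crit = task.success_criteria
    strategy = crit.grading_strategy

    if strategy == "command_match":                      # warmup: binary
        return 1.0 if tracker.matched_command(crit) else 0.0

    if strategy == "resource_creation":                  # beginner: two-stage
        if not tracker.matched_command(crit):
            return 0.0
        return 1.0 if verifier.resource_exists(crit) else 0.5

    if strategy == "multi_step":                         # intermediate / advanced
        done = tracker.completed_steps(crit.steps)
        return done / len(crit.steps)

    if strategy == "state_checks":                       # expert: steps + live state
        steps_frac = tracker.completed_steps(crit.steps) / len(crit.steps)
        checks_frac = verifier.run_state_checks(crit.state_checks)
        return 0.7 * steps_frac + 0.3 * checks_frac

    raise ValueError(f"unknown grading strategy: {strategy}")
```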
Chaos survival bonus
If ChaosEngine injected a mutation during the episode and the agent still completed, reward is 1.05 instead of 1.0 (5% bonus), and that bonus stacks under hint decay (so the agent that solves a chaotic task without hints gets the maximum).
Rollback penalty & idempotency bonus
- Rollback (`-0.1` per pair): `EpisodeTracker.detect_rollbacks()` scans the command history for `(create-X, …, delete-X)` pairs on the same resource. Production-style waste – heavily penalized.
- Idempotency (`+0.02`): if a command fails with a known "already exists" pattern (`BucketAlreadyExists`, `ResourceInUseException`, etc.) and the next command continues productively, the agent is rewarded for graceful retry behavior.
This is the first RL environment we know of that rewards operational discipline directly.
9. Anti-reward-hacking – 8 defense layers
The agent's only loss surface is the reward signal. We harden it so that the cheapest path to a high reward is actually doing the task β not gaming the grader.
Layer 1 – Ground-truth verification via MiniStack
The grader never trusts agent command output. For every resource check it asks ResourceVerifier to query MiniStack (or AWS) directly. Even if the agent crafts a perfectly-formed fake JSON response, the grader looks at server-side state.
`ResourceVerifier` covers 20+ services with bespoke verification methods (S3, DynamoDB, Lambda, SQS, SNS, IAM, Secrets Manager, API Gateway, Cognito, RDS, EFS, ElastiCache, EC2, Step Functions, Glue, Athena, EMR, ECS, EKS, EventBridge, Kinesis, …).
Layer 2 – Deduplication
EpisodeTracker.has_executed_operation() records every (operation, resource) pair that earned credit. Re-running the same successful command does not re-earn partial_progress. Each unique operation pays out exactly once.
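The mechanism is just a set of already-credited pairs; a sketch:

```python
credited: set[tuple[str, str]] = set()

def credit(operation: str, resource: str, increment: float) -> float:
    key = (operation, resource)
    if key in credited:
        return 0.0            # re-running a successful command earns nothing
    credited.add(key)
    return increment          # each unique (operation, resource) pays out exactly once
```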
Layer 3 – Grader invisibility
The CLI commands run by ResourceVerifier and state_checks happen server-side and are not echoed into the agent's observation. The agent never sees which queries the grader is making, so it cannot reverse-engineer "fake outputs" that match the grader's expectations.
Layer 4 – Command allow-listing
`step()` rejects anything that doesn't start with `aws` (`success=False`, no execution). No shell metacharacters, no piping, no redirection, no escape from the AWS CLI sandbox.
Layer 5 – No verification reward
If the agent's command exactly matches one of the task's `state_checks` commands (e.g. `aws s3api get-bucket-versioning --bucket app-config-store`), it gets zero progress credit. Only mutating commands (create / put / update / delete) earn credit. Read-only auditing is freely allowed but not rewarded – exactly mirroring the grader's own behavior.
Layer 6 – Monotonic progress
`partial_progress` only ever increases within an episode. It is clamped at 0.99; reaching 1.0 requires fully verified completion. The agent never loses progress on a delete, but it also cannot re-earn credit that was already paid, so cycling strategies (create → delete → create) yield zero net gain.
Layer 7 – Resource-name validation
`ResourceVerifier` checks the exact resource name from the task definition. Creating `my-test-bucket-2` does not satisfy a check for `my-test-bucket`. The agent cannot creatively name its way around the spec.
Layer 8 – State checks verify the final state
For expert SRE tasks, the grader runs the canonical state_checks commands at grading time against the live MiniStack. The grade is "what is true now?", not "what did the agent claim?". This is the single hardest layer to circumvent.
These layers compose: even if one is bypassed (e.g. a clever exact-match name), the others independently still produce the right reward.
10. Resource verifier
`services/resource_verifier.py` (362 LOC).
- Per-service `verify_*` methods for 20+ AWS services. Each method knows which API calls expose state for that service and how to read the response (e.g. `verify_s3_bucket(name)` calls `s3api list-buckets`, `verify_dynamodb_table(name)` calls `dynamodb describe-table`, etc.).
- Single-shot state path: when called via `SimulatorStrategy.get_state()`, the verifier reads MiniStack's custom `/_ministack/state` endpoint (added in commit `a648c3a`, see §5), which returns the full infra inventory in one HTTP call. This is dramatically faster than iterating 20+ list APIs per grading pass.
- State-check evaluator: handles `output_contains` (substring) and `json_path` + `expected` (JSON extraction with deep-path support) assertion types used by expert-tier tasks.
- Live ground-truth source – the verifier never consumes the agent's stdout. Always fresh state from the simulator.
11. Chaos engine
`services/chaos_engine.py` (168 LOC).
Probabilistically perturbs AWS resource state mid-episode. Tests whether the agent can detect and recover from unexpected drift β a critical SRE skill.
- Tier-scaled probability: 0% warmup/beginner, 10% intermediate, 20% advanced, 30% expert
- Service-scoped templates: a chaos roll only fires on services the current task is touching. Resource names are extracted from the agent's recent successful commands via service-specific regex (e.g. `aws s3 mb s3://(\S+)` → bucket name).
- Five service templates: S3 policy / versioning changes, DynamoDB throughput modifications, Lambda configuration alterations, IAM detach-role-policy, SNS subscription mutations
- Silent: chaos commands run server-side; the agent observes only the consequence (a state inconsistency), never the cause
- Reward bonus: surviving chaos and completing the task pays `1.05` instead of `1.0`
The combination of "tier-scaled probability" + "task-scoped resource selection" means chaos is absent for warmup tasks (0%) and frequent for SRE tasks (30%) – exactly where it matters.
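A sketch of the roll-then-mutate flow (probabilities from the tier table; the S3 regex and the specific chaos command are illustrative examples of the service templates):

```python
import random
import re

CHAOS_PROBABILITY = {"warmup": 0.0, "beginner": 0.0,
                     "intermediate": 0.10, "advanced": 0.20, "expert": 0.30}

def maybe_inject(tier: str, recent_commands: list[str], execute) -> None:
    if random.random() >= CHAOS_PROBABILITY[tier]:
        return
    # Scope chaos to a resource the agent just touched.
    for cmd in reversed(recent_commands):
        match = re.search(r"aws s3 mb s3://(\S+)", cmd)
        if match:
            bucket = match.group(1)
            execute(f"aws s3api put-bucket-versioning --bucket {bucket} "
                    f"--versioning-configuration Status=Suspended")   # silent, server-side
            return
```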
12. Drift engine
`services/drift_engine.py` (67 LOC).
Specialised for the 6 drift-detection expert tasks defined in services/tasks/drift.yaml.
- Each drift task ships a pool of `possible_drifts` (each a small list of CLI commands that mutates a resource away from the desired spec).
- On `reset()`, the engine randomly selects 2–3 drifts from that pool and applies them after the setup-command phase.
- The agent sees a `desired_state_spec` (natural language) and must audit the environment, identify which resources drifted, and fix only those.
- Examples: S3 versioning/encryption drift, DynamoDB throughput changes, SNS subscription modifications, Lambda env-var tampering.
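The selection logic is deliberately simple; a sketch, assuming each entry in `possible_drifts` is a list of CLI commands as described above:

```python
import random

def inject_drifts(possible_drifts: list[list[str]], execute) -> list[list[str]]:
    chosen = random.sample(possible_drifts, k=random.randint(2, 3))   # 2-3 drifts per episode
    for drift in chosen:
        for command in drift:
            execute(command)          # mutate the provisioned infra away from the spec
    return chosen                     # kept server-side; never shown to the agent
```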
13. Hint provider
`services/hint_provider.py` (137 LOC).
Three-level progressive hints, requested via the special action `aws help --task-hint`:
| Level | What it reveals | Example |
|---|---|---|
| 1 | Required AWS services | "You'll need IAM and Lambda" |
| 2 | Operation sequence | "Start with create-role, then put-role-policy" |
| 3 | Near-complete command structure | "Use: `aws iam create-role --role-name …`" |
- Hints are auto-derived from the `SuccessCriteria` fields (services list, ordered steps, operation names) – no hand-written hint text per task.
- Reward decay: `final_reward *= 0.85 ** hints_used`. With three hints (max), the agent caps at 0.85³ ≈ 0.614 of normal reward.
- The hint command is intercepted before reaching MiniStack, so it does not consume an episode step nor affect simulator state.
14. Episode tracker
`services/episode_tracker.py` (241 LOC).
Single source of per-episode state. Maintains:
- Step count, hint count, command history (raw + parsed)
- `partial_progress: float ∈ [0, 1]` (monotonic – see anti-hack layer 6)
- `credited_operations: set[(operation, resource)]` (for dedup – anti-hack layer 2)
- Rollback detection: scans history for `(create-X, …, delete-X)` pairs on the same resource
- Idempotency detection: looks for known "already exists" error patterns
Parses each AWS CLI invocation into a structured tuple (service, operation, resource_name) for downstream services to query without re-parsing.
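A sketch of the rollback scan over that parsed history (operation-name matching is simplified; the real detector handles more verb forms):

```python
def detect_rollbacks(parsed_history: list[tuple[str, str, str]]) -> int:
    # parsed_history: (service, operation, resource) tuples in execution order
    rollbacks = 0
    created: set[tuple[str, str]] = set()
    for service, operation, resource in parsed_history:
        if operation.startswith("create") or operation == "mb":
            created.add((service, resource))
        elif operation.startswith("delete") or operation == "rb":
            if (service, resource) in created:
                rollbacks += 1                    # create → … → delete on the same resource
                created.discard((service, resource))
    return rollbacks
```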
15. Environment designer
`services/environment_designer.py` (99 LOC).
Provisioning helper for SRE / security-posture / drift tasks. A task can declare `setup_commands: list[SetupCommand]`; these are executed (server-side) before the agent starts so the world begins in a deliberately broken / insecure / over-provisioned state. Examples:
- "Public S3 bucket lockdown" (Β§17): creates
public-assetswith a wide-open bucket policy - "IAM least-privilege": creates
app-rolewithAction: */Resource: * - Drift tasks: provision the correct infra so the drift engine can mutate it
Setup failures abort the reset; partial setup is never exposed to the agent.
16. Task definitions (YAML schema)
`services/tasks/` – one YAML file per tier:
- `warmup.yaml` – 25 listing tasks
- `beginner.yaml` – 25 single-resource creation tasks
- `intermediate.yaml` – 25 multi-step workflows
- `advanced.yaml` – 25 cross-service architectures
- `expert.yaml` – 24 SRE / security tasks
- `drift.yaml` – 9 drift detection tasks
Sample task:
```yaml
- task_id: 42
  description: Create an S3 bucket named my-app-data and enable versioning on it.
  difficulty: intermediate
  success_criteria:
    grading_strategy: multi_step
    steps:
      - operation: create-bucket
        resource: my-app-data
      - operation: put-bucket-versioning
        resource: my-app-data
    services: [s3]
  setup_commands: []
  possible_drifts: []
```
Expert / drift tasks add `state_checks`, `desired_state_spec`, and `setup_commands`.
17. Security-posture audit examples
These three expert-tier tasks test reasoning about configuration state: the infra is functional but insecure. The agent must read the existing config and recognize the vulnerability.
Public S3 bucket lockdown
- Setup: bucket `public-assets` is provisioned with a bucket policy granting `Principal: *` access
- Task: replace the policy so only IAM role `app-role` can `s3:GetObject`
- State checks: bucket policy denies `Principal: *`, allows only `app-role`
IAM least privilege
- Setup: role `app-role` exists with an inline policy `Action: *`, `Resource: *`
- Task: replace with a least-privilege policy allowing only `dynamodb:GetItem` and `dynamodb:PutItem` on the users table
- State checks: policy document matches the expected ARN-scoped permissions
Lambda secret rotation
- Setup: Lambda `data-processor` has env var `DB_PASSWORD=hunter2` (plaintext)
- Task: create a Secrets Manager secret, add a `SECRET_ARN` env var, remove `DB_PASSWORD`
- State checks: secret exists, Lambda has `SECRET_ARN`, no `DB_PASSWORD` remains
These are not hypothetical scenarios: they're the most common cloud-misconfiguration findings in real audits.
18. Curriculum stats API
`Curriculum.get_stats()` returns:

```python
{
    "episode_count": 42,
    "tier": "intermediate",
    "tier_episodes": 12,
    "tier_success_rate": 0.75,
    "graduated_tasks": [0, 2, 4],
    "weak_spots": [11, 12],
    "skill_profile": {0: 0.95, 1: 0.8, ...},   # per-task weighted success
    "spaced_rep_due": [0, 2],                  # graduated tasks due for re-test
    "avg_reward_last_10": 0.65,
}
```
Useful for:
- Dashboarding training progress
- Logging into the GRPO `EpisodeLogger` CSV (see `train_grpo.py:635`)
- Driving the web playground's progress bar
19. Web playground
Always mounted at `http://localhost:8000/web`. When `POOL_SIZE>1` the playground is backed by a dedicated lazy-spawned MiniStack on `AWS_RL_ENV_WEB_MINISTACK_PORT` (default 4565) – see §6. The first request takes ~1–3s while that MiniStack binds; subsequent requests are fast.
- HTML: `server/templates/index.html`
- Static assets: `server/static/` – CSS, JS, and 40 AWS service icons in `server/static/img/aws/`
- The playground talks to `/web/reset`, `/web/step`, `/web/state`, and `/web/solution` (the last one reveals the next canonical solution command – handy for demos and debugging task definitions).
The playground runs a single shared environment instance on its own MiniStack (or, with POOL_SIZE=1, the lone pool MiniStack on :4566). It is intentionally separate from the per-WebSocket sessions used during training so a curious user clicking around the web UI cannot interfere with an active GRPO rollout.
See also
- Main README – project overview, results, Colab links
- scripts/README.md – client-side parallel rollout pool (`GrpoPool`, `MultiTurnEnvPool`, asyncio orchestration)
- train/README.md – SFT + GRPO training pipeline
- data/README.md – dataset generation + base-model selection
- aws_infra/README.md – vendored MiniStack upstream docs (81 KB)