Sandeep Suresh

coenv — Project Documentation

Meta × Hugging Face OpenEnv RL Hackathon


Table of Contents

  1. What Is This Project?
  2. Why Kubernetes?
  3. How It Works — The Big Picture
  4. The Three Layers Explained
  5. Team Ownership
  6. Full Project Directory Structure
  7. The Three Tasks (Easy → Medium → Hard)
  8. Reward & Grading Design
  9. The Complete Episode Flow
  10. OpenEnv Spec Compliance Checklist
  11. Submission Checklist
  12. Key Technical Decisions

1. What Is This Project?

coenv is a Reinforcement Learning environment that simulates real-world Kubernetes cluster operations. An AI agent (LLM) is placed inside a broken or degraded Kubernetes cluster and must figure out the right sequence of operations to fix it — just like a real Site Reliability Engineer (SRE) would.

This is built for the Meta × Hugging Face OpenEnv RL Hackathon, which requires:

  • A real-world task simulation (not games or toys)
  • Full OpenEnv interface implementation (step(), reset(), state())
  • At least 3 tasks with programmatic graders (easy → medium → hard)
  • A meaningful reward function that gives partial credit throughout the episode
  • A working inference.py that runs an LLM agent and logs structured output
  • Deployment on Hugging Face Spaces with a working Dockerfile

In simple terms: We fake a Kubernetes cluster in Python memory, break it in specific ways, and then let an LLM try to fix it step by step — scoring it on how well it does.


2. Why Kubernetes?

Kubernetes (k8s) is the industry-standard container orchestration system used by virtually every tech company running production software. Managing it is genuinely difficult and is a daily job for SREs and DevOps engineers worldwide.

Why it's a perfect RL environment:

| RL Concept | Kubernetes Equivalent |
|---|---|
| State | Cluster state (pod statuses, node health, resource usage) |
| Action | kubectl commands (scale, patch, delete, restart) |
| Reward | How close the cluster is to a healthy target state |
| Episode | One incident recovery scenario |
| Done | All SLOs restored / all pods healthy |

Why it's novel for OpenEnv: None of Meta's reference environments (calendar, REPL, browser, CARLA, reasoning gym) touch infrastructure operations. This fills a real gap.

Why it's practical: Companies would immediately use an environment like this to train or evaluate agents that assist SREs — the real-world utility score (30% of judging) is very high.


3. How It Works — The Big Picture

Think of the project as three concentric layers:

┌─────────────────────────────────────────────────────────┐
│                   LAYER 1 — RL ENVIRONMENT               │
│  inference.py  ←→  main.py (FastAPI)  ←→  tasks/graders │
│                      (Sandeep)                           │
├─────────────────────────────────────────────────────────┤
│                LAYER 2 — SIMULATION ENGINE               │
│       world.py  ←→  models.py  ←→  conditions/          │
│                        (You)                             │
├─────────────────────────────────────────────────────────┤
│                LAYER 3 — ACTION SPACE                    │
│    worker.py  ←→  executor.py  ←→  actions/  ←→  validator│
│                    (Third Person)                        │
└─────────────────────────────────────────────────────────┘

Layer 1 (Sandeep) is what the judges see — the API endpoints, the inference script, the task definitions, the graders, the README.

Layer 2 (You) is the fake Kubernetes cluster. It holds the state of the cluster, knows how pods transition between statuses, and can inject failures. Everything sits in Python dictionaries — no real Kubernetes cluster runs.

Layer 3 (Third Person) is the action space — the specific operations the LLM agent is allowed to perform, and the validation/execution bridge that translates those actions into state changes in the simulator.


4. The Three Layers Explained

Layer 1 — RL Environment (Sandeep)

This layer is the public contract of the project. It's what OpenEnv's validate command checks, what the judges' scripts call, and what the LLM agent talks to.

main.py — FastAPI application

The central API server. It exposes exactly three mandatory endpoints:

  • POST /reset — Starts a new episode. Sets up a broken cluster using one of the condition injectors. Returns the initial ClusterObservation (what the agent sees first).
  • POST /step — Receives an action from the agent. Validates it, executes it on the simulated cluster, advances time by one tick, and returns the new observation + reward + done flag + info.
  • GET /state — Returns the full current cluster state. Used for debugging and grading.
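The three endpoints map directly onto methods of a single environment object. A minimal pure-Python sketch of that contract follows — class, field, and method names here are illustrative, not the project's actual code:

```python
from typing import Any, Dict, Tuple

class MiniEnv:
    """Minimal sketch of the contract behind /reset, /step, and /state."""

    def reset(self) -> Dict[str, Any]:
        # Start an episode with one deliberately broken pod.
        self.state = {
            "pods": [{"name": "frontend-1", "status": "CrashLoopBackOff"}],
            "step": 0,
        }
        return self.state

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        self.state["step"] += 1
        if action.get("action_type") == "delete_pod":
            # The simulated deployment controller recreates the pod healthy.
            self.state["pods"][0]["status"] = "Running"
        done = all(p["status"] == "Running" for p in self.state["pods"])
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

    def get_state(self) -> Dict[str, Any]:
        return self.state
```

In main.py, each FastAPI route body would delegate to the corresponding method and return the result as JSON.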

inference.py — LLM agent runner

This is the script the hackathon validators actually run. It:

  1. Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment variables
  2. Calls /reset to start an episode
  3. Feeds the observation to the LLM using the OpenAI client
  4. Parses the LLM's response as a structured action
  5. Calls /step with that action
  6. Prints structured stdout logs after every step:
    [START] task=pod-recovery env=coenv model=Qwen3-VL-30B
    [STEP] step=1 action=delete_pod('frontend-7d9f-xkp2') reward=0.20 done=false error=null
    [STEP] step=2 action=scale('frontend',3) reward=0.60 done=false error=null
    [END] success=true steps=2 rewards=0.20,0.60
    
  7. Repeats until done=true or max_steps is reached
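Because the validators parse these lines, the formatting must be exact: lowercase booleans, a literal `null` when there is no error. A small helper along these lines — hypothetical, but matching the `[STEP]` format shown above — keeps the output consistent:

```python
def format_step_log(step: int, action_repr: str, reward: float,
                    done: bool, error: str = None) -> str:
    # Lowercase booleans and "null" for absent errors, per the [STEP] format.
    done_s = "true" if done else "false"
    err_s = error if error is not None else "null"
    return f"[STEP] step={step} action={action_repr} reward={reward:.2f} done={done_s} error={err_s}"
```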

openenv.yaml — Spec metadata

Required for openenv validate to pass. Contains:

  • Environment name, version, description
  • List of task IDs with difficulty labels
  • References to the action schema and observation schema

classes/tasks/ — Task definitions

Three Python files, each defining one task:

  • What the broken state looks like (which condition to inject)
  • What the agent's objective is (in plain English, passed to the LLM as a prompt)
  • What counts as success
  • Maximum number of steps allowed

classes/graders/ — Reward graders

Three Python files, each implementing a grade(world_state, step, max_steps) -> float function. Graders must be fully deterministic — the same world state always returns the same score. They implement partial credit: a grader doesn't just say "fixed or not fixed" but scores partial progress (e.g., 2 out of 5 pods fixed = 0.4).

Dockerfile

Single-stage Python container. Installs requirements.txt, copies the project, exposes port 8000, runs uvicorn main:app. Must build and run cleanly — this is a hard pass/fail gate.

README.md

Mandatory documentation. Must include: environment overview, motivation, action space definition, observation space definition, task descriptions with difficulty labels, setup instructions, baseline scores table.


Layer 2 — Simulation Engine (You)

This is the most important layer technically. It's what makes the environment believable. Since we cannot run a real Kubernetes cluster inside a 2 vCPU / 8 GB HF Space container, the entire cluster is simulated as an in-memory Python object.

classes/world.py — The cluster simulator

This is the brain of the project. It maintains the complete cluster state as a Python dictionary, structured like a real Kubernetes API response:

cluster_state = {
    "nodes": [
        {"name": "node-1", "status": "Ready", "cpu_capacity": 4, "mem_capacity": 8192},
        {"name": "node-2", "status": "NotReady", "cpu_capacity": 4, "mem_capacity": 8192}
    ],
    "deployments": [
        {"name": "frontend", "desired_replicas": 3, "available_replicas": 1, "image": "nginx:1.21"}
    ],
    "pods": [
        {"name": "frontend-7d9f-xkp2", "status": "CrashLoopBackOff", "node": "node-1", "restarts": 7},
        {"name": "frontend-7d9f-ab3c", "status": "Running", "node": "node-1", "restarts": 0},
        {"name": "frontend-7d9f-mn8x", "status": "Pending", "node": None, "restarts": 0}
    ],
    "services": [...],
    "configmaps": [...],
    "hpa": [...]
}

Key methods:

  • reset(condition) — Wipes state, injects a failure condition, returns initial observation
  • get_pods(namespace, selector) — Returns filtered pod list (mimics kubectl get pods)
  • apply_patch(resource_type, name, patch) — Applies a patch to a resource
  • scale(deployment_name, replicas) — Changes replica count
  • delete_pod(pod_name) — Removes a pod (it gets recreated by the deployment controller on next tick)
  • tick() — Advances simulated time by one step. Pods in CrashLoopBackOff increment their restart counter. Pending pods on ready nodes eventually transition to Running. Dead nodes stay dead unless drained.
  • get_observation() — Serialises the current state into a ClusterObservation Pydantic model
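As an illustration of the transition rules tick() applies, here is a simplified sketch (the real method would handle more statuses, backoff timing, and node capacity):

```python
def tick(cluster_state: dict) -> dict:
    """One simulated time step: crash-looping pods keep restarting,
    Pending pods get scheduled if a Ready node exists."""
    ready_nodes = sorted(
        n["name"] for n in cluster_state["nodes"] if n["status"] == "Ready"
    )
    for pod in cluster_state["pods"]:
        if pod["status"] == "CrashLoopBackOff":
            pod["restarts"] += 1          # keeps crashing until the root cause is patched
        elif pod["status"] == "Pending" and ready_nodes:
            pod["node"] = ready_nodes[0]  # naive scheduler: first ready node
            pod["status"] = "Running"
    return cluster_state
```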

classes/models.py — Pydantic typed models

All data structures are defined here. This is mandatory for OpenEnv spec compliance — typed models enforce the action/observation contract.

from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel

class PodStatus(BaseModel):
    name: str
    status: Literal["Running", "Pending", "CrashLoopBackOff", "OOMKilled", "Terminating", "Unknown"]
    node: Optional[str]
    restarts: int
    cpu_usage: float
    mem_usage: float

class NodeStatus(BaseModel):
    name: str
    status: Literal["Ready", "NotReady", "SchedulingDisabled"]
    cpu_capacity: float
    mem_capacity: float
    cpu_usage: float
    mem_usage: float

class ClusterObservation(BaseModel):
    nodes: List[NodeStatus]
    pods: List[PodStatus]
    deployments: List[DeploymentStatus]
    services: List[ServiceStatus]
    events: List[ClusterEvent]          # recent k8s events (error messages, warnings)
    step: int
    objective: str                      # plain English description of what to fix

class RewardSignal(BaseModel):
    reward: float                       # 0.0 to 1.0 incremental reward this step
    cumulative: float                   # total reward so far
    done: bool
    info: Dict[str, Any]               # breakdown: why this reward was given

classes/conditions/ — Failure injectors

Each condition is a Python class with a single inject(cluster_state) -> cluster_state method that takes a healthy cluster and returns a broken one. This is how each task starts with a specific failure scenario:

  • crash_loop.py — Sets 3 pods to CrashLoopBackOff with high restart counts. Simulates a bad image tag or missing environment variable.
  • oom_kill.py — Sets pods to OOMKilled. Memory limits are set too low in the deployment spec. Pods keep restarting.
  • node_failure.py — Sets one node to NotReady. All pods on that node go to Unknown. New pods are Pending (no space to schedule).
  • cascade_failure.py — Combines multiple failures: one OOMKilled service causes downstream 503s in two dependent services, creating a cascading failure across 3 deployments.
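The injector pattern is easy to sketch. Something in the spirit of crash_loop.py might look like this (illustrative only — the real injector sets more fields, and the deployment env is stored differently):

```python
def inject_crash_loop(cluster_state: dict, count: int = 3) -> dict:
    """Turn a healthy cluster into the crash-loop scenario."""
    for pod in cluster_state["pods"][:count]:
        pod["status"] = "CrashLoopBackOff"
        pod["restarts"] = 7
    for dep in cluster_state["deployments"]:
        # Plant the root cause the agent must diagnose and patch.
        dep.setdefault("env", {})["DB_HOST"] = "wrong-host.internal"
    return cluster_state
```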

classes/utils.py — Probability and simulation helpers

Utility functions that make the simulation feel realistic:

  • sample_cpu_usage(base_load, noise_factor) — Returns a slightly randomised CPU % (real clusters are never exactly at baseline)
  • sample_latency(healthy_latency, degradation_factor) — Simulates p95 request latency under load
  • should_pod_recover(restarts, backoff_seconds) — Determines if a CrashLoopBackOff pod would naturally recover (it usually won't — that's the point)
  • generate_cluster_events(pod_list) — Creates realistic k8s event messages like "Back-off restarting failed container" or "OOMKilled: container exceeded memory limit"
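A sketch of how one of these helpers might be implemented (the signature matches the list above; the noise model and clamping are assumptions):

```python
import random

def sample_cpu_usage(base_load: float, noise_factor: float,
                     rng: random.Random = None) -> float:
    """Return base_load plus bounded uniform noise, clamped to [0, 100] percent."""
    rng = rng or random.Random()
    noisy = base_load + rng.uniform(-noise_factor, noise_factor)
    return max(0.0, min(100.0, noisy))
```

Passing an explicit `rng` keeps the simulation reproducible in tests while still looking noisy to the agent.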

config.json — Cluster defaults

Single source of truth for all simulation parameters:

{
  "cluster": {
    "num_nodes": 3,
    "cpu_per_node": 4,
    "mem_per_node_gb": 8
  },
  "tasks": {
    "pod_recovery": { "max_steps": 15, "success_threshold": 0.9 },
    "autoscaling":  { "max_steps": 20, "success_threshold": 0.85 },
    "incident":     { "max_steps": 30, "success_threshold": 0.80 }
  },
  "simulation": {
    "tick_interval_seconds": 30,
    "crash_backoff_max_seconds": 300,
    "hpa_cooldown_seconds": 180
  }
}

Layer 3 — Action Space & Workers (Third Person)

This layer defines what the LLM is allowed to do, makes sure it's valid, and executes it against the simulator.

classes/actions/ — Typed action definitions

Each action is a Pydantic model. The LLM must output one of these (Sandeep's inference.py prompts it to respond in JSON matching one of these schemas):

from typing import Any, Dict, Literal
from pydantic import BaseModel

class ScaleAction(BaseModel):
    action_type: Literal["scale"]
    deployment: str          # e.g. "frontend"
    replicas: int            # e.g. 3

class DeletePodAction(BaseModel):
    action_type: Literal["delete_pod"]
    pod_name: str            # e.g. "frontend-7d9f-xkp2"

class PatchAction(BaseModel):
    action_type: Literal["patch"]
    resource_type: str       # "deployment" | "configmap" | "service"
    name: str
    patch: Dict[str, Any]   # the fields to update

class RolloutRestartAction(BaseModel):
    action_type: Literal["rollout_restart"]
    deployment: str

class SetHPAAction(BaseModel):
    action_type: Literal["set_hpa"]
    deployment: str
    min_replicas: int
    max_replicas: int
    cpu_target_percent: int

class DrainNodeAction(BaseModel):
    action_type: Literal["drain_node"]
    node_name: str

class DescribeAction(BaseModel):
    action_type: Literal["describe"]
    resource_type: str
    name: str               # "investigation" action — no state change, returns detail
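Routing the LLM's JSON output to the right action can be sketched without Pydantic — the real code would validate against the models above, but a hypothetical field registry shows the idea:

```python
import json

# Hypothetical registry: action_type -> required fields. The real project
# validates against the Pydantic action models above instead.
ACTION_FIELDS = {
    "scale": {"deployment", "replicas"},
    "delete_pod": {"pod_name"},
    "rollout_restart": {"deployment"},
}

def parse_action(raw: str) -> dict:
    data = json.loads(raw)
    required = ACTION_FIELDS.get(data.get("action_type"))
    if required is None or not required <= data.keys():
        raise ValueError(f"malformed action: {raw!r}")
    return data
```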

classes/validator.py — Action validation

Before any action touches the world state, the validator checks it:

  • Does the target resource exist? (Can't delete a pod that doesn't exist)
  • Is the scale value sane? (Can't scale to 0 or to 1000 replicas)
  • Is the node already drained? (Can't drain twice)
  • Is the deployment name a real deployment?

If validation fails, it returns an error string. This flows directly into the [STEP] error= field in stdout logs. The step still counts against the agent's limit — bad actions are penalised by wasting steps.
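A single check from the validator might look like this (a sketch — the bounds and error wording are assumptions):

```python
from typing import Optional

def validate_scale(deployment: str, replicas: int, cluster_state: dict) -> Optional[str]:
    """Return an error string (surfaced in the [STEP] error= field) or None if valid."""
    known = {d["name"] for d in cluster_state["deployments"]}
    if deployment not in known:
        return f"deployment '{deployment}' does not exist"
    if not 1 <= replicas <= 100:   # assumed sanity bounds: no scale-to-zero, no runaway
        return f"replicas={replicas} outside allowed range 1-100"
    return None
```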

classes/executor.py — Action execution bridge

Maps each validated action type to the correct world.py method call:

def execute(action: KubeAction, world: World) -> ExecutionResult:
    if action.action_type == "scale":
        world.scale(action.deployment, action.replicas)
    elif action.action_type == "delete_pod":
        world.delete_pod(action.pod_name)
    elif action.action_type == "rollout_restart":
        world.rollout_restart(action.deployment)
    ...
    world.tick()   # always advance time after an action
    return ExecutionResult(observation=world.get_observation(), ...)

classes/worker.py — Agent episode loop

Manages the full lifecycle of a single episode. Sandeep's inference.py calls this:

class Worker:
    def run_episode(self, task_id, world, max_steps) -> EpisodeResult:
        obs = world.reset(task=task_id)
        rewards = []
        done = False
        for step in range(1, max_steps + 1):
            action = self.get_action(obs)          # calls LLM
            result = executor.execute(action, world)
            obs = result.observation               # feed the new observation back to the LLM
            rewards.append(result.reward)
            if result.done:
                done = True
                break
        return EpisodeResult(rewards=rewards, steps=step, success=done)

5. Team Ownership

| Module | Owner | Why It's Their Responsibility |
|---|---|---|
| main.py | Sandeep | He owns the public API contract |
| inference.py | Sandeep | He owns the hackathon submission script |
| openenv.yaml | Sandeep | He owns spec compliance |
| Dockerfile | Sandeep | He owns deployment |
| README.md | Sandeep | He owns documentation |
| classes/tasks/ | Sandeep | He defines what success looks like |
| classes/graders/ | Sandeep | He owns the scoring logic |
| classes/world.py | You | You own the cluster simulator |
| classes/models.py | You | You own all typed data models |
| classes/utils.py | You | You own simulation helpers |
| classes/conditions/ | You | You own failure injection |
| config.json | You | You own all parameters |
| classes/worker.py | Third person | They own the episode loop |
| classes/actions/ | Third person | They own the action space |
| classes/executor.py | Third person | They own action execution |
| classes/validator.py | Third person | They own action validation |
| tests/ | All three | Each writes tests for their own module |

6. Full Project Directory Structure

coenv/
├── .dockerignore                  # Docker build exclusions
├── __init__.py                    # Module exports
├── README.md                      # Project documentation
├── openenv.yaml                   # OpenEnv manifest
├── pyproject.toml                 # Project metadata and dependencies
├── uv.lock                        # Locked dependencies
├── client.py                      # CoEnv client / inference-side runner
├── models.py                      # Shared action and observation models
├── config.json                    # Cluster defaults and simulation params
├── mkdocs.yml                     # Docs site configuration
├── tests/                         # End-to-end and unit tests
│   ├── test_environment.py        # From test_world.py
│   ├── test_conditions.py
│   ├── test_models.py
│   ├── test_actions.py
│   ├── test_executor.py
│   ├── test_graders.py
│   ├── test_tasks.py
│   └── test_integration.py        # End-to-end reset→step→state flow
└── server/
    ├── __init__.py                # Server module exports
    ├── coenv_environment.py       # Core environment logic
    ├── app.py                     # FastAPI app exposing /reset /step /state
    ├── Dockerfile                 # Container image definition
    ├── utils.py                   # Simulation helpers
    ├── validator.py               # Action validation
    ├── executor.py                # Action execution bridge
    ├── worker.py                  # Episode loop manager
    ├── tasks/
    │   ├── __init__.py
    │   ├── task_pod_recovery.py
    │   ├── task_autoscaling.py
    │   └── task_incident.py
    ├── graders/
    │   ├── __init__.py
    │   ├── grader_pod_recovery.py
    │   ├── grader_autoscaling.py
    │   └── grader_incident.py
    ├── conditions/
    │   ├── __init__.py
    │   ├── crash_loop.py
    │   ├── oom_kill.py
    │   ├── node_failure.py
    │   └── cascade_failure.py
    └── actions/
        ├── __init__.py
        ├── scale_action.py
        ├── patch_action.py
        ├── delete_pod_action.py
        ├── rollout_action.py
        ├── hpa_action.py
        ├── drain_action.py
        └── describe_action.py

7. The Three Tasks (Easy → Medium → Hard)

Task 1 — Pod Recovery (Easy)

What's broken: A frontend deployment has 3 pods stuck in CrashLoopBackOff. The restart count is climbing. The root cause is a wrong environment variable in the deployment spec pointing to a database host that doesn't exist.

What the agent must do:

  1. Observe the broken pods and read the k8s events (which mention a connection refused error)
  2. Identify the bad DB_HOST environment variable using a describe or patch inspect action
  3. Patch the deployment with the correct DB_HOST value
  4. Optionally delete the crash-looping pods to speed up recovery (they'll get recreated with the new config)
  5. Verify all 3 pods reach Running state

Objective string shown to agent: "The frontend deployment is crash-looping. Diagnose and fix the root cause so that all pods reach Running state."

Max steps: 15
Success threshold: All 3 pods in Running state (score ≥ 0.9)

Partial rewards:

  • +0.1 for each pod that stops crash-looping
  • +0.2 for correctly patching the environment variable
  • +0.3 bonus for all pods Running within 10 steps

Task 2 — HPA Autoscaling Under Traffic Spike (Medium)

What's broken: The cluster is healthy but receiving 10× normal traffic. The deployment has no HPA configured, is running on fixed 2 replicas, and is already at 95% CPU. Request latency is climbing past the SLO threshold.

What the agent must do:

  1. Observe high CPU usage and rising latency in the observation
  2. Immediately scale up the deployment to handle current load
  3. Configure a HorizontalPodAutoscaler (HPA) with appropriate min/max replicas and CPU target
  4. Set correct CPU resource requests/limits on the deployment so HPA has a baseline to work with
  5. Verify that latency drops back below the SLO threshold

Objective string shown to agent: "Traffic has spiked 10×. The api-server deployment is overloaded. Configure autoscaling and ensure p95 latency stays below 500ms."

Max steps: 20
Success threshold: p95 latency < 500ms, HPA configured, replicas ≥ 4 (score ≥ 0.85)

Partial rewards:

  • +0.15 for scaling up replicas immediately (within 3 steps)
  • +0.20 for configuring HPA correctly
  • +0.25 for latency dropping below 1000ms
  • +0.30 for latency dropping below 500ms (SLO met)
  • -0.10 penalty for scaling beyond 12 replicas unnecessarily (resource waste)

Task 3 — Multi-Service Cascading Incident (Hard)

What's broken: The auth-service deployment has pods getting OOMKilled because memory limits are set 4× too low relative to actual usage. This causes the api-gateway to fail authentication checks and return 503s. Downstream, the data-processor service is also throwing errors because it depends on the gateway. Three services are degraded simultaneously.

What the agent must do:

  1. Identify the blast radius — which services are affected and why
  2. Investigate auth-service to find the OOMKill root cause (memory limits too low)
  3. Patch auth-service deployment with correct memory limits
  4. Rollout restart auth-service so new pods come up with correct limits
  5. Drain the partially-failed node where most OOMKilled pods were running, to force clean rescheduling
  6. Verify api-gateway 503 errors stop (automatically once auth recovers)
  7. Verify data-processor error rate drops (automatically once gateway recovers)
  8. Confirm all three services are fully healthy

Objective string shown to agent: "A cascading incident has degraded auth-service, api-gateway, and data-processor. Identify the root cause and restore all three services to healthy state without data loss."

Max steps: 30
Success threshold: All 3 services healthy, error rate < 0.1% (score ≥ 0.80)

Partial rewards:

  • +0.10 for correctly identifying auth-service as the root cause (within 5 steps)
  • +0.15 for patching memory limits correctly
  • +0.15 for auth-service pods reaching Running
  • +0.20 for api-gateway 503s stopping
  • +0.20 for data-processor errors resolving
  • +0.10 for draining the bad node cleanly
  • -0.15 penalty for deleting services or breaking healthy components

8. Reward & Grading Design

The grading philosophy follows what the problem statement (PS) requires: a reward signal over the full trajectory, not just at the end.

Reward Principles

Partial progress is always rewarded. If the agent fixes 1 out of 3 broken pods, it gets 1/3 of the maximum reward for that milestone — not zero.

Speed bonus. Fixing the issue in fewer steps earns a small bonus. This incentivises efficient reasoning.

Waste penalty. Unnecessary destructive actions (scaling to 0, deleting healthy pods, draining a healthy node) subtract from the reward. This teaches the agent to be surgical.

Idempotency. Repeating the same correct action doesn't give extra reward but doesn't penalise either (except for wasted steps).
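These principles combine naturally if the per-step reward is defined as the change in the grader's cumulative score — a common shaping pattern, shown here as a sketch:

```python
def step_reward(prev_score: float, new_score: float) -> float:
    # Partial progress pays out immediately; repeating an already-applied
    # correct action changes the score by 0 (idempotency); destructive
    # actions that lower the grader score come out negative (waste penalty).
    return new_score - prev_score
```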

Grader Implementation Pattern

Each grader implements:

def grade(world_state: dict, step: int, max_steps: int) -> float:
    score = 0.0
    pods = world_state["pods"]
    total_expected = len(pods)

    # Milestone 1: Partial progress
    running_pods = [p for p in pods if p["status"] == "Running"]
    score += (len(running_pods) / total_expected) * 0.5

    # Milestone 2: Full success
    if len(running_pods) == total_expected:
        score += 0.4

    # Speed bonus
    efficiency = 1.0 - (step / max_steps)
    score += efficiency * 0.1

    return min(score, 1.0)  # always clamp to [0, 1]

9. The Complete Episode Flow

Here is the full step-by-step flow of one complete episode, from start to finish:

1. JUDGE / VALIDATOR runs:
   python inference.py

2. inference.py reads env vars:
   API_BASE_URL, MODEL_NAME, HF_TOKEN

3. inference.py calls:
   POST /reset  { "task": "pod_recovery" }

4. main.py receives /reset:
   → Calls task_pod_recovery.get_condition()  →  crash_loop.inject(cluster_state)
   → world.reset(broken_state)
   → Returns ClusterObservation (3 CrashLoopBackOff pods, events, objective string)

5. stdout prints:
   [START] task=pod-recovery env=coenv model=Qwen3-30B

6. inference.py builds LLM prompt:
   "You are an SRE. Current cluster state: [observation JSON].
    Objective: Fix the frontend deployment crash loop.
    Respond with a JSON action from the available action types."

7. LLM responds:
   { "action_type": "describe", "resource_type": "deployment", "name": "frontend" }

8. inference.py calls:
   POST /step  { action }

9. main.py receives /step:
   → validator.validate(action, world)  →  OK
   → executor.execute(action, world)
   → world.tick()
   → grader.grade(world.state, step=1) → reward=0.00 (just investigating)
   → Returns observation, reward=0.00, done=false, info={...}

10. stdout prints:
    [STEP] step=1 action=describe('deployment','frontend') reward=0.00 done=false error=null

11. LLM sees deployment spec, notices DB_HOST=wrong-host.internal
    LLM responds: { "action_type": "patch", "resource_type": "deployment",
                    "name": "frontend",
                    "patch": {"env": [{"name": "DB_HOST", "value": "db.prod.internal"}]} }

12. POST /step  { patch action }
    → executor patches deployment in world state
    → world.tick() — pods begin restarting with new config
    → grader → reward=0.20 (correct patch applied)

13. [STEP] step=2 action=patch('frontend',{env...}) reward=0.20 done=false error=null

14. LLM responds: { "action_type": "delete_pod", "pod_name": "frontend-7d9f-xkp2" }
    → world deletes pod, recreates with correct env, status → Running
    → grader → reward=0.40

15. Repeat for remaining 2 pods...

16. All 3 pods Running. grader → reward=1.0, done=true

17. stdout prints:
    [END] success=true steps=8 rewards=0.00,0.20,0.40,0.55,0.70,0.85,0.95,1.00

10. OpenEnv Spec Compliance Checklist

| Requirement | File | Status |
|---|---|---|
| Typed Observation model | classes/models.py (ClusterObservation) | Required |
| Typed Action model | classes/models.py (KubeAction) | Required |
| Typed Reward model | classes/models.py (RewardSignal) | Required |
| step(action) → (obs, reward, done, info) | main.py (POST /step) | Required |
| reset() → initial_observation | main.py (POST /reset) | Required |
| state() → current_state | main.py (GET /state) | Required |
| openenv.yaml with metadata | openenv.yaml | Required |
| openenv validate passes | Tested via pre-validation script | Required |
| Min 3 tasks | classes/tasks/ (3 files) | Required |
| Easy → medium → hard difficulty | task_pod_recovery / task_autoscaling / task_incident | Required |
| Graders return 0.0–1.0 | classes/graders/ (3 graders) | Required |
| Graders are deterministic | Pure functions, no randomness | Required |
| Partial reward signals | All 3 graders implement milestone scoring | Required |
| Penalise bad actions | validator.py + grader penalty terms | Required |
| inference.py in root | inference.py | Required |
| [START] log line | inference.py (log_start()) | Required |
| [STEP] log per step | inference.py (log_step()) | Required |
| [END] log always emitted | inference.py (finally: log_end()) | Required |
| Reads API_BASE_URL with default | inference.py | Required |
| Reads MODEL_NAME with default | inference.py | Required |
| Reads HF_TOKEN (no default) | inference.py | Required |
| Uses OpenAI client | from openai import OpenAI | Required |
| Dockerfile builds cleanly | Dockerfile | Required |
| HF Space deploys and responds | Deployed on Hugging Face | Required |
| Inference runs in < 20 min | Max 30 steps × ~20s/step = ~10 min | Required |
| Runs in 2 vCPU / 8 GB RAM | Pure Python in-memory sim, no real k8s | Required |
| README with all required sections | README.md | Required |

11. Submission Checklist

Before submitting, verify all of these:

  • inference.py is in the root directory (not inside classes/)
  • inference.py has default values for API_BASE_URL and MODEL_NAME
  • inference.py raises ValueError if HF_TOKEN is missing
  • [START], [STEP], [END] format matches the spec exactly (field names, order, lowercase booleans)
  • openenv validate passes locally
  • docker build completes without errors
  • docker run starts the server and responds to GET /state
  • HF Space is in Running state (not Building, not Stopped)
  • All 3 tasks can be reset and stepped without crashing
  • All 3 graders return a float between 0.0 and 1.0
  • Running inference.py end-to-end completes in under 20 minutes
  • README.md includes baseline scores table
  • tests/test_integration.py passes cleanly

12. Key Technical Decisions

Why a simulated cluster, not a real one?

Running kind or minikube inside a Hugging Face Space container with 2 vCPU / 8 GB RAM is not feasible. The Kubernetes control plane alone (etcd + apiserver + scheduler + controller-manager) consumes ~1.5–2 GB RAM before any workloads run. An in-memory Python simulator is the only viable approach within the hardware constraints. It is also faster (no scheduling latency), fully deterministic (same input = same output), and easier to test.

Why a constrained action space?

Free-form kubectl text strings are nearly impossible to grade deterministically. By defining ~7 typed Pydantic action models, we make the action space clear to the LLM (easier to prompt), easy to validate (Pydantic does the type checking), and easy to grade (executor calls predictable world methods). This also keeps the action space small enough that the LLM can reason about it effectively without getting lost in kubectl's hundreds of sub-commands.

Why FastAPI?

OpenEnv environments are expected to be HTTP servers. FastAPI gives automatic OpenAPI documentation (at /docs), Pydantic integration for request/response validation, async support for when we need it, and a clean decorator syntax that makes main.py easy to read. It is also trivial to run with uvicorn inside a Docker container.

Why partial rewards matter for the hackathon

The PS explicitly states: "The reward function must provide feedback throughout the task trajectory, not just at completion." Binary rewards (0 until success, then 1) are explicitly penalised in the environment design score. Our graders implement milestone-based partial rewards, which also makes the environment more useful for actual RL training — sparse rewards make training slow and unstable.


coenv — Meta × Hugging Face OpenEnv RL Hackathon
Team: Sandeep (RL environment) · You (Simulation) · Third Person (Actions & Workers)