Sandeep Suresh

coenv — Project Documentation

Meta × Hugging Face OpenEnv RL Hackathon


Table of Contents

  1. What Is This Project?
  2. Why Kubernetes?
  3. How It Works — The Big Picture
  4. The Three Layers Explained
  5. Team Ownership
  6. Full Project Directory Structure
  7. The Three Tasks (Easy → Medium → Hard)
  8. Reward & Grading Design
  9. The Complete Episode Flow
  10. OpenEnv Spec Compliance Checklist
  11. Submission Checklist
  12. Key Technical Decisions

1. What Is This Project?

coenv is a Reinforcement Learning environment that simulates real-world Kubernetes cluster operations. An AI agent (LLM) is placed inside a broken or degraded Kubernetes cluster and must figure out the right sequence of operations to fix it — just like a real Site Reliability Engineer (SRE) would.

This is built for the Meta × Hugging Face OpenEnv RL Hackathon, which requires:

  • A real-world task simulation (not games or toys)
  • Full OpenEnv interface implementation (step(), reset(), state())
  • At least 3 tasks with programmatic graders (easy → medium → hard)
  • A meaningful reward function that gives partial credit throughout the episode
  • A working inference.py that runs an LLM agent and logs structured output
  • Deployment on Hugging Face Spaces with a working Dockerfile

In simple terms: We fake a Kubernetes cluster in Python memory, break it in specific ways, and then let an LLM try to fix it step by step — scoring it on how well it does.


2. Why Kubernetes?

Kubernetes (k8s) is the industry-standard container orchestration system used by virtually every tech company running production software. Managing it is genuinely difficult and is a daily job for SREs and DevOps engineers worldwide.

Why it's a perfect RL environment:

| RL Concept | Kubernetes Equivalent |
|---|---|
| State | Cluster state (pod statuses, node health, resource usage) |
| Action | kubectl commands (scale, patch, delete, restart) |
| Reward | How close the cluster is to a healthy target state |
| Episode | One incident recovery scenario |
| Done | All SLOs restored / all pods healthy |

Why it's novel for OpenEnv: None of Meta's reference environments (calendar, REPL, browser, CARLA, reasoning gym) touch infrastructure operations. This fills a real gap.

Why it's practical: Companies would immediately use an environment like this to train or evaluate agents that assist SREs — the real-world utility score (30% of judging) is very high.


3. How It Works — The Big Picture

Think of the project as three concentric layers:

┌─────────────────────────────────────────────────────────┐
│                   LAYER 1 — RL ENVIRONMENT               │
│  inference.py  ←→  main.py (FastAPI)  ←→  tasks/graders │
│                      (Sandeep)                           │
├─────────────────────────────────────────────────────────┤
│                LAYER 2 — SIMULATION ENGINE               │
│       world.py  ←→  models.py  ←→  conditions/          │
│                        (You)                             │
├─────────────────────────────────────────────────────────┤
│                LAYER 3 — ACTION SPACE                    │
│    worker.py  ←→  executor.py  ←→  actions/  ←→  validator│
│                    (Third Person)                        │
└─────────────────────────────────────────────────────────┘

Layer 1 (Sandeep) is what the judges see — the API endpoints, the inference script, the task definitions, the graders, the README.

Layer 2 (You) is the fake Kubernetes cluster. It holds the state of the cluster, knows how pods transition between statuses, and can inject failures. Everything sits in Python dictionaries — no real Kubernetes cluster runs.

Layer 3 (Third Person) is the action space — the specific operations the LLM agent is allowed to perform, and the validation/execution bridge that translates those actions into state changes in the simulator.


4. The Three Layers Explained

Layer 1 — RL Environment (Sandeep)

This layer is the public contract of the project. It's what OpenEnv's validate command checks, what the judges' scripts call, and what the LLM agent talks to.

main.py — FastAPI application

The central API server. It exposes exactly three mandatory endpoints:

  • POST /reset — Starts a new episode. Sets up a broken cluster using one of the condition injectors. Returns the initial ClusterObservation (what the agent sees first).
  • POST /step — Receives an action from the agent. Validates it, executes it on the simulated cluster, advances time by one tick, and returns the new observation + reward + done flag + info.
  • GET /state — Returns the full current cluster state. Used for debugging and grading.
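The three endpoints map directly onto methods of a single environment object. A minimal pure-Python sketch of that contract follows — class, field, and method names here are illustrative, not the project's actual code:

```python
from typing import Any, Dict, Tuple

class MiniEnv:
    """Minimal sketch of the contract behind /reset, /step, and /state."""

    def reset(self) -> Dict[str, Any]:
        # Start an episode with one deliberately broken pod.
        self.state = {
            "pods": [{"name": "frontend-1", "status": "CrashLoopBackOff"}],
            "step": 0,
        }
        return self.state

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        self.state["step"] += 1
        if action.get("action_type") == "delete_pod":
            # The simulated deployment controller recreates the pod healthy.
            self.state["pods"][0]["status"] = "Running"
        done = all(p["status"] == "Running" for p in self.state["pods"])
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

    def get_state(self) -> Dict[str, Any]:
        return self.state
```

In main.py, each FastAPI route body would delegate to the corresponding method and return the result as JSON.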

inference.py — LLM agent runner

This is the script the hackathon validators actually run. It:

  1. Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment variables
  2. Calls /reset to start an episode
  3. Feeds the observation to the LLM using the OpenAI client
  4. Parses the LLM's response as a structured action
  5. Calls /step with that action
  6. Prints structured stdout logs after every step:
    [START] task=pod-recovery env=coenv model=Qwen3-VL-30B
    [STEP] step=1 action=delete_pod('frontend-7d9f-xkp2') reward=0.20 done=false error=null
    [STEP] step=2 action=scale('frontend',3) reward=0.60 done=false error=null
    [END] success=true steps=2 rewards=0.20,0.60
    
  7. Repeats until done=true or max_steps is reached
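Because the validators parse these lines, the formatting must be exact: lowercase booleans, a literal `null` when there is no error. A small helper along these lines — hypothetical, but matching the `[STEP]` format shown above — keeps the output consistent:

```python
def format_step_log(step: int, action_repr: str, reward: float,
                    done: bool, error: str = None) -> str:
    # Lowercase booleans and "null" for absent errors, per the [STEP] format.
    done_s = "true" if done else "false"
    err_s = error if error is not None else "null"
    return f"[STEP] step={step} action={action_repr} reward={reward:.2f} done={done_s} error={err_s}"
```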

openenv.yaml — Spec metadata

Required for openenv validate to pass. Contains:

  • Environment name, version, description
  • List of task IDs with difficulty labels
  • References to the action schema and observation schema

classes/tasks/ — Task definitions

Three Python files, each defining one task:

  • What the broken state looks like (which condition to inject)
  • What the agent's objective is (in plain English, passed to the LLM as a prompt)
  • What counts as success
  • Maximum number of steps allowed

classes/graders/ — Reward graders

Three Python files, each implementing a grade(world_state, step, max_steps) -> float function. Graders must be fully deterministic — the same world state always returns the same score. They implement partial credit: a grader doesn't just say "fixed or not fixed" but scores partial progress (e.g., 2 out of 5 pods fixed = 0.4).

Dockerfile

Single-stage Python container. Installs requirements.txt, copies the project, exposes port 8000, runs uvicorn main:app. Must build and run cleanly — this is a hard pass/fail gate.

README.md

Mandatory documentation. Must include: environment overview, motivation, action space definition, observation space definition, task descriptions with difficulty labels, setup instructions, baseline scores table.


Layer 2 — Simulation Engine (You)

This is the most important layer technically. It's what makes the environment believable. Since we cannot run a real Kubernetes cluster inside a 2 vCPU / 8 GB HF Space container, the entire cluster is simulated as an in-memory Python object.

classes/world.py — The cluster simulator

This is the brain of the project. It maintains the complete cluster state as a Python dictionary, structured like a real Kubernetes API response:

cluster_state = {
    "nodes": [
        {"name": "node-1", "status": "Ready", "cpu_capacity": 4, "mem_capacity": 8192},
        {"name": "node-2", "status": "NotReady", "cpu_capacity": 4, "mem_capacity": 8192}
    ],
    "deployments": [
        {"name": "frontend", "desired_replicas": 3, "available_replicas": 1, "image": "nginx:1.21"}
    ],
    "pods": [
        {"name": "frontend-7d9f-xkp2", "status": "CrashLoopBackOff", "node": "node-1", "restarts": 7},
        {"name": "frontend-7d9f-ab3c", "status": "Running", "node": "node-1", "restarts": 0},
        {"name": "frontend-7d9f-mn8x", "status": "Pending", "node": None, "restarts": 0}
    ],
    "services": [...],
    "configmaps": [...],
    "hpa": [...]
}

Key methods:

  • reset(condition) — Wipes state, injects a failure condition, returns initial observation
  • get_pods(namespace, selector) — Returns filtered pod list (mimics kubectl get pods)
  • apply_patch(resource_type, name, patch) — Applies a patch to a resource
  • scale(deployment_name, replicas) — Changes replica count
  • delete_pod(pod_name) — Removes a pod (it gets recreated by the deployment controller on next tick)
  • tick() — Advances simulated time by one step. Pods in CrashLoopBackOff increment their restart counter. Pending pods on ready nodes eventually transition to Running. Dead nodes stay dead unless drained.
  • get_observation() — Serialises the current state into a ClusterObservation Pydantic model
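As an illustration of the transition rules tick() applies, here is a simplified sketch (the real method would handle more statuses, backoff timing, and node capacity):

```python
def tick(cluster_state: dict) -> dict:
    """One simulated time step: crash-looping pods keep restarting,
    Pending pods get scheduled if a Ready node exists."""
    ready_nodes = sorted(
        n["name"] for n in cluster_state["nodes"] if n["status"] == "Ready"
    )
    for pod in cluster_state["pods"]:
        if pod["status"] == "CrashLoopBackOff":
            pod["restarts"] += 1          # keeps crashing until the root cause is patched
        elif pod["status"] == "Pending" and ready_nodes:
            pod["node"] = ready_nodes[0]  # naive scheduler: first ready node
            pod["status"] = "Running"
    return cluster_state
```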

classes/models.py — Pydantic typed models

All data structures are defined here. This is mandatory for OpenEnv spec compliance — typed models enforce the action/observation contract.

from typing import Any, Dict, List, Literal, Optional
from pydantic import BaseModel

class PodStatus(BaseModel):
    name: str
    status: Literal["Running", "Pending", "CrashLoopBackOff", "OOMKilled", "Terminating", "Unknown"]
    node: Optional[str]
    restarts: int
    cpu_usage: float
    mem_usage: float

class NodeStatus(BaseModel):
    name: str
    status: Literal["Ready", "NotReady", "SchedulingDisabled"]
    cpu_capacity: float
    mem_capacity: float
    cpu_usage: float
    mem_usage: float

class ClusterObservation(BaseModel):
    nodes: List[NodeStatus]
    pods: List[PodStatus]
    deployments: List[DeploymentStatus]
    services: List[ServiceStatus]
    events: List[ClusterEvent]          # recent k8s events (error messages, warnings)
    step: int
    objective: str                      # plain English description of what to fix

class RewardSignal(BaseModel):
    reward: float                       # 0.0 to 1.0 incremental reward this step
    cumulative: float                   # total reward so far
    done: bool
    info: Dict[str, Any]               # breakdown: why this reward was given

classes/conditions/ — Failure injectors

Each condition is a Python class with a single inject(cluster_state) -> cluster_state method that takes a healthy cluster and returns a broken one. This is how each task starts with a specific failure scenario:

  • crash_loop.py — Sets 3 pods to CrashLoopBackOff with high restart counts. Simulates a bad image tag or missing environment variable.
  • oom_kill.py — Sets pods to OOMKilled. Memory limits are set too low in the deployment spec. Pods keep restarting.
  • node_failure.py — Sets one node to NotReady. All pods on that node go to Unknown. New pods are Pending (no space to schedule).
  • cascade_failure.py — Combines multiple failures: one OOMKilled service causes downstream 503s in two dependent services, creating a cascading failure across 3 deployments.
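The injector pattern is easy to sketch. Something in the spirit of crash_loop.py might look like this (illustrative only — the real injector sets more fields, and the deployment env is stored differently):

```python
def inject_crash_loop(cluster_state: dict, count: int = 3) -> dict:
    """Turn a healthy cluster into the crash-loop scenario."""
    for pod in cluster_state["pods"][:count]:
        pod["status"] = "CrashLoopBackOff"
        pod["restarts"] = 7
    for dep in cluster_state["deployments"]:
        # Plant the root cause the agent must diagnose and patch.
        dep.setdefault("env", {})["DB_HOST"] = "wrong-host.internal"
    return cluster_state
```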

classes/utils.py — Probability and simulation helpers

Utility functions that make the simulation feel realistic:

  • sample_cpu_usage(base_load, noise_factor) — Returns a slightly randomised CPU % (real clusters are never exactly at baseline)
  • sample_latency(healthy_latency, degradation_factor) — Simulates p95 request latency under load
  • should_pod_recover(restarts, backoff_seconds) — Determines if a CrashLoopBackOff pod would naturally recover (it usually won't — that's the point)
  • generate_cluster_events(pod_list) — Creates realistic k8s event messages like "Back-off restarting failed container" or "OOMKilled: container exceeded memory limit"
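A sketch of how one of these helpers might be implemented (the signature matches the list above; the noise model and clamping are assumptions):

```python
import random

def sample_cpu_usage(base_load: float, noise_factor: float,
                     rng: random.Random = None) -> float:
    """Return base_load plus bounded uniform noise, clamped to [0, 100] percent."""
    rng = rng or random.Random()
    noisy = base_load + rng.uniform(-noise_factor, noise_factor)
    return max(0.0, min(100.0, noisy))
```

Passing an explicit `rng` keeps the simulation reproducible in tests while still looking noisy to the agent.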

config.json — Cluster defaults

Single source of truth for all simulation parameters:

{
  "cluster": {
    "num_nodes": 3,
    "cpu_per_node": 4,
    "mem_per_node_gb": 8
  },
  "tasks": {
    "pod_recovery": { "max_steps": 15, "success_threshold": 0.9 },
    "autoscaling":  { "max_steps": 20, "success_threshold": 0.85 },
    "incident":     { "max_steps": 30, "success_threshold": 0.80 }
  },
  "simulation": {
    "tick_interval_seconds": 30,
    "crash_backoff_max_seconds": 300,
    "hpa_cooldown_seconds": 180
  }
}

Layer 3 — Action Space & Workers (Third Person)

This layer defines what the LLM is allowed to do, makes sure it's valid, and executes it against the simulator.

classes/actions/ — Typed action definitions

Each action is a Pydantic model. The LLM must output one of these (Sandeep's inference.py prompts it to respond in JSON matching one of these schemas):

from typing import Any, Dict, Literal
from pydantic import BaseModel

class ScaleAction(BaseModel):
    action_type: Literal["scale"]
    deployment: str          # e.g. "frontend"
    replicas: int            # e.g. 3

class DeletePodAction(BaseModel):
    action_type: Literal["delete_pod"]
    pod_name: str            # e.g. "frontend-7d9f-xkp2"

class PatchAction(BaseModel):
    action_type: Literal["patch"]
    resource_type: str       # "deployment" | "configmap" | "service"
    name: str
    patch: Dict[str, Any]   # the fields to update

class RolloutRestartAction(BaseModel):
    action_type: Literal["rollout_restart"]
    deployment: str

class SetHPAAction(BaseModel):
    action_type: Literal["set_hpa"]
    deployment: str
    min_replicas: int
    max_replicas: int
    cpu_target_percent: int

class DrainNodeAction(BaseModel):
    action_type: Literal["drain_node"]
    node_name: str

class DescribeAction(BaseModel):
    action_type: Literal["describe"]
    resource_type: str
    name: str               # "investigation" action — no state change, returns detail
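Routing the LLM's JSON output to the right action can be sketched without Pydantic — the real code would validate against the models above, but a hypothetical field registry shows the idea:

```python
import json

# Hypothetical registry: action_type -> required fields. The real project
# validates against the Pydantic action models above instead.
ACTION_FIELDS = {
    "scale": {"deployment", "replicas"},
    "delete_pod": {"pod_name"},
    "rollout_restart": {"deployment"},
}

def parse_action(raw: str) -> dict:
    data = json.loads(raw)
    required = ACTION_FIELDS.get(data.get("action_type"))
    if required is None or not required <= data.keys():
        raise ValueError(f"malformed action: {raw!r}")
    return data
```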

classes/validator.py — Action validation

Before any action touches the world state, the validator checks it:

  • Does the target resource exist? (Can't delete a pod that doesn't exist)
  • Is the scale value sane? (Can't scale to 0 or to 1000 replicas)
  • Is the node already drained? (Can't drain twice)
  • Is the deployment name a real deployment?

If validation fails, it returns an error string. This flows directly into the [STEP] error= field in stdout logs. The step still counts against the agent's limit — bad actions are penalised by wasting steps.
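A single check from the validator might look like this (a sketch — the bounds and error wording are assumptions):

```python
from typing import Optional

def validate_scale(deployment: str, replicas: int, cluster_state: dict) -> Optional[str]:
    """Return an error string (surfaced in the [STEP] error= field) or None if valid."""
    known = {d["name"] for d in cluster_state["deployments"]}
    if deployment not in known:
        return f"deployment '{deployment}' does not exist"
    if not 1 <= replicas <= 100:   # assumed sanity bounds: no scale-to-zero, no runaway
        return f"replicas={replicas} outside allowed range 1-100"
    return None
```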

classes/executor.py — Action execution bridge

Maps each validated action type to the correct world.py method call:

def execute(action: KubeAction, world: World) -> ExecutionResult:
    if action.action_type == "scale":
        world.scale(action.deployment, action.replicas)
    elif action.action_type == "delete_pod":
        world.delete_pod(action.pod_name)
    elif action.action_type == "rollout_restart":
        world.rollout_restart(action.deployment)
    ...
    world.tick()   # always advance time after an action
    return ExecutionResult(observation=world.get_observation(), ...)

classes/worker.py — Agent episode loop

Manages the full lifecycle of a single episode. Sandeep's inference.py calls this:

class Worker:
    def run_episode(self, task_id, world, max_steps) -> EpisodeResult:
        obs = world.reset(task=task_id)
        rewards = []
        done = False
        for step in range(1, max_steps + 1):
            action = self.get_action(obs)          # calls LLM
            result = executor.execute(action, world)
            obs = result.observation               # feed the new observation back to the LLM
            rewards.append(result.reward)
            if result.done:
                done = True
                break
        return EpisodeResult(rewards=rewards, steps=step, success=done)

5. Team Ownership

| Module | Owner | Why It's Their Responsibility |
|---|---|---|
| main.py | Sandeep | He owns the public API contract |
| inference.py | Sandeep | He owns the hackathon submission script |
| openenv.yaml | Sandeep | He owns spec compliance |
| Dockerfile | Sandeep | He owns deployment |
| README.md | Sandeep | He owns documentation |
| classes/tasks/ | Sandeep | He defines what success looks like |
| classes/graders/ | Sandeep | He owns the scoring logic |
| classes/world.py | You | You own the cluster simulator |
| classes/models.py | You | You own all typed data models |
| classes/utils.py | You | You own simulation helpers |
| classes/conditions/ | You | You own failure injection |
| config.json | You | You own all parameters |
| classes/worker.py | Third person | They own the episode loop |
| classes/actions/ | Third person | They own the action space |
| classes/executor.py | Third person | They own action execution |
| classes/validator.py | Third person | They own action validation |
| tests/ | All three | Each writes tests for their own module |

6. Full Project Directory Structure

coenv/
├── .dockerignore                  # Docker build exclusions
├── __init__.py                    # Module exports
├── README.md                      # Project documentation
├── openenv.yaml                   # OpenEnv manifest
├── pyproject.toml                 # Project metadata and dependencies
├── uv.lock                        # Locked dependencies
├── client.py                      # CoEnv client / inference-side runner
├── models.py                      # Shared action and observation models
├── config.json                    # Cluster defaults and simulation params
├── mkdocs.yml                     # Docs site configuration
├── tests/                         # End-to-end and unit tests
│   ├── test_environment.py        # From test_world.py
│   ├── test_conditions.py
│   ├── test_models.py
│   ├── test_actions.py
│   ├── test_executor.py
│   ├── test_graders.py
│   ├── test_tasks.py
│   └── test_integration.py        # End-to-end reset→step→state flow
└── server/
    ├── __init__.py                # Server module exports
    ├── coenv_environment.py       # Core environment logic
    ├── app.py                     # FastAPI app exposing /reset /step /state
    ├── Dockerfile                 # Container image definition
    ├── utils.py                   # Simulation helpers
    ├── validator.py               # Action validation
    ├── executor.py                # Action execution bridge
    ├── worker.py                  # Episode loop manager
    ├── tasks/
    │   ├── __init__.py
    │   ├── task_pod_recovery.py
    │   ├── task_autoscaling.py
    │   └── task_incident.py
    ├── graders/
    │   ├── __init__.py
    │   ├── grader_pod_recovery.py
    │   ├── grader_autoscaling.py
    │   └── grader_incident.py
    ├── conditions/
    │   ├── __init__.py
    │   ├── crash_loop.py
    │   ├── oom_kill.py
    │   ├── node_failure.py
    │   └── cascade_failure.py
    └── actions/
        ├── __init__.py
        ├── scale_action.py
        ├── patch_action.py
        ├── delete_pod_action.py
        ├── rollout_action.py
        ├── hpa_action.py
        ├── drain_action.py
        └── describe_action.py

7. The Three Tasks (Easy → Medium → Hard)

Task 1 — Pod Recovery (Easy)

What's broken: A frontend deployment has 3 pods stuck in CrashLoopBackOff. The restart count is climbing. The root cause is a wrong environment variable in the deployment spec pointing to a database host that doesn't exist.

What the agent must do:

  1. Observe the broken pods and read the k8s events (which mention a connection refused error)
  2. Identify the bad DB_HOST environment variable using a describe or patch inspect action
  3. Patch the deployment with the correct DB_HOST value
  4. Optionally delete the crash-looping pods to speed up recovery (they'll get recreated with the new config)
  5. Verify all 3 pods reach Running state

Objective string shown to agent: "The frontend deployment is crash-looping. Diagnose and fix the root cause so that all pods reach Running state."

Max steps: 15
Success threshold: All 3 pods in Running state (score ≥ 0.9)

Partial rewards:

  • +0.1 for each pod that stops crash-looping
  • +0.2 for correctly patching the environment variable
  • +0.3 bonus for all pods Running within 10 steps

Task 2 — HPA Autoscaling Under Traffic Spike (Medium)

What's broken: The cluster is healthy but receiving 10× normal traffic. The deployment has no HPA configured, is running on fixed 2 replicas, and is already at 95% CPU. Request latency is climbing past the SLO threshold.

What the agent must do:

  1. Observe high CPU usage and rising latency in the observation
  2. Immediately scale up the deployment to handle current load
  3. Configure a HorizontalPodAutoscaler (HPA) with appropriate min/max replicas and CPU target
  4. Set correct CPU resource requests/limits on the deployment so HPA has a baseline to work with
  5. Verify that latency drops back below the SLO threshold

Objective string shown to agent: "Traffic has spiked 10×. The api-server deployment is overloaded. Configure autoscaling and ensure p95 latency stays below 500ms."

Max steps: 20
Success threshold: p95 latency < 500ms, HPA configured, replicas ≥ 4 (score ≥ 0.85)

Partial rewards:

  • +0.15 for scaling up replicas immediately (within 3 steps)
  • +0.20 for configuring HPA correctly
  • +0.25 for latency dropping below 1000ms
  • +0.30 for latency dropping below 500ms (SLO met)
  • -0.10 penalty for scaling beyond 12 replicas unnecessarily (resource waste)

Task 3 — Multi-Service Cascading Incident (Hard)

What's broken: The auth-service deployment has pods getting OOMKilled because memory limits are set 4× too low relative to actual usage. This causes the api-gateway to fail authentication checks and return 503s. Downstream, the data-processor service is also throwing errors because it depends on the gateway. Three services are degraded simultaneously.

What the agent must do:

  1. Identify the blast radius — which services are affected and why
  2. Investigate auth-service to find the OOMKill root cause (memory limits too low)
  3. Patch auth-service deployment with correct memory limits
  4. Rollout restart auth-service so new pods come up with correct limits
  5. Drain the partially-failed node where most OOMKilled pods were running, to force clean rescheduling
  6. Verify api-gateway 503 errors stop (automatically once auth recovers)
  7. Verify data-processor error rate drops (automatically once gateway recovers)
  8. Confirm all three services are fully healthy

Objective string shown to agent: "A cascading incident has degraded auth-service, api-gateway, and data-processor. Identify the root cause and restore all three services to healthy state without data loss."

Max steps: 30
Success threshold: All 3 services healthy, error rate < 0.1% (score ≥ 0.80)

Partial rewards:

  • +0.10 for correctly identifying auth-service as the root cause (within 5 steps)
  • +0.15 for patching memory limits correctly
  • +0.15 for auth-service pods reaching Running
  • +0.20 for api-gateway 503s stopping
  • +0.20 for data-processor errors resolving
  • +0.10 for draining the bad node cleanly
  • -0.15 penalty for deleting services or breaking healthy components

8. Reward & Grading Design

The grading philosophy follows what the problem statement (PS) requires: a reward signal over the full trajectory, not just at the end.

Reward Principles

Partial progress is always rewarded. If the agent fixes 1 out of 3 broken pods, it gets 1/3 of the maximum reward for that milestone — not zero.

Speed bonus. Fixing the issue in fewer steps earns a small bonus. This incentivises efficient reasoning.

Waste penalty. Unnecessary destructive actions (scaling to 0, deleting healthy pods, draining a healthy node) subtract from the reward. This teaches the agent to be surgical.

Idempotency. Repeating the same correct action doesn't give extra reward but doesn't penalise either (except for wasted steps).
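These principles combine naturally if the per-step reward is defined as the change in the grader's cumulative score — a common shaping pattern, shown here as a sketch:

```python
def step_reward(prev_score: float, new_score: float) -> float:
    # Partial progress pays out immediately; repeating an already-applied
    # correct action changes the score by 0 (idempotency); destructive
    # actions that lower the grader score come out negative (waste penalty).
    return new_score - prev_score
```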

Grader Implementation Pattern

Each grader implements:

def grade(world_state: dict, step: int, max_steps: int) -> float:
    score = 0.0
    pods = world_state["pods"]
    total_expected = len(pods)

    # Milestone 1: Partial progress
    running_pods = [p for p in pods if p["status"] == "Running"]
    score += (len(running_pods) / total_expected) * 0.5

    # Milestone 2: Full success
    if len(running_pods) == total_expected:
        score += 0.4

    # Speed bonus
    efficiency = 1.0 - (step / max_steps)
    score += efficiency * 0.1

    return min(score, 1.0)  # always clamp to [0, 1]

9. The Complete Episode Flow

Here is the full step-by-step flow of one complete episode, from start to finish:

1. JUDGE / VALIDATOR runs:
   python inference.py

2. inference.py reads env vars:
   API_BASE_URL, MODEL_NAME, HF_TOKEN

3. inference.py calls:
   POST /reset  { "task": "pod_recovery" }

4. main.py receives /reset:
   → Calls task_pod_recovery.get_condition()  →  crash_loop.inject(cluster_state)
   → world.reset(broken_state)
   → Returns ClusterObservation (3 CrashLoopBackOff pods, events, objective string)

5. stdout prints:
   [START] task=pod-recovery env=coenv model=Qwen3-30B

6. inference.py builds LLM prompt:
   "You are an SRE. Current cluster state: [observation JSON].
    Objective: Fix the frontend deployment crash loop.
    Respond with a JSON action from the available action types."

7. LLM responds:
   { "action_type": "describe", "resource_type": "deployment", "name": "frontend" }

8. inference.py calls:
   POST /step  { action }

9. main.py receives /step:
   → validator.validate(action, world)  →  OK
   → executor.execute(action, world)
   → world.tick()
   → grader.grade(world.state, step=1) → reward=0.00 (just investigating)
   → Returns observation, reward=0.00, done=false, info={...}

10. stdout prints:
    [STEP] step=1 action=describe('deployment','frontend') reward=0.00 done=false error=null

11. LLM sees deployment spec, notices DB_HOST=wrong-host.internal
    LLM responds: { "action_type": "patch", "resource_type": "deployment",
                    "name": "frontend",
                    "patch": {"env": [{"name": "DB_HOST", "value": "db.prod.internal"}]} }

12. POST /step  { patch action }
    → executor patches deployment in world state
    → world.tick() — pods begin restarting with new config
    → grader → reward=0.20 (correct patch applied)

13. [STEP] step=2 action=patch('frontend',{env...}) reward=0.20 done=false error=null

14. LLM responds: { "action_type": "delete_pod", "pod_name": "frontend-7d9f-xkp2" }
    → world deletes pod, recreates with correct env, status → Running
    → grader → reward=0.40

15. Repeat for remaining 2 pods...

16. All 3 pods Running. grader → reward=1.0, done=true

17. stdout prints:
    [END] success=true steps=8 rewards=0.00,0.20,0.40,0.55,0.70,0.85,0.95,1.00

10. OpenEnv Spec Compliance Checklist

| Requirement | File | Status |
|---|---|---|
| Typed Observation model | classes/models.py (ClusterObservation) | Required |
| Typed Action model | classes/models.py (KubeAction) | Required |
| Typed Reward model | classes/models.py (RewardSignal) | Required |
| step(action) → (obs, reward, done, info) | main.py (POST /step) | Required |
| reset() → initial_observation | main.py (POST /reset) | Required |
| state() → current_state | main.py (GET /state) | Required |
| openenv.yaml with metadata | openenv.yaml | Required |
| openenv validate passes | Tested via pre-validation script | Required |
| Min 3 tasks | classes/tasks/ (3 files) | Required |
| Easy → medium → hard difficulty | task_pod_recovery / task_autoscaling / task_incident | Required |
| Graders return 0.0–1.0 | classes/graders/ (3 graders) | Required |
| Graders are deterministic | Pure functions, no randomness | Required |
| Partial reward signals | All 3 graders implement milestone scoring | Required |
| Penalise bad actions | validator.py + grader penalty terms | Required |
| inference.py in root | inference.py | Required |
| [START] log line | inference.py (log_start()) | Required |
| [STEP] log per step | inference.py (log_step()) | Required |
| [END] log always emitted | inference.py (finally: log_end()) | Required |
| Reads API_BASE_URL with default | inference.py | Required |
| Reads MODEL_NAME with default | inference.py | Required |
| Reads HF_TOKEN (no default) | inference.py | Required |
| Uses OpenAI client | from openai import OpenAI | Required |
| Dockerfile builds cleanly | Dockerfile | Required |
| HF Space deploys and responds | Deployed on Hugging Face | Required |
| Inference runs in < 20 min | Max 30 steps × ~20s/step = ~10 min | Required |
| Runs in 2 vCPU / 8 GB RAM | Pure Python in-memory sim, no real k8s | Required |
| README with all required sections | README.md | Required |

11. Submission Checklist

Before submitting, verify all of these:

  • inference.py is in the root directory (not inside classes/)
  • inference.py has default values for API_BASE_URL and MODEL_NAME
  • inference.py raises ValueError if HF_TOKEN is missing
  • [START], [STEP], [END] format matches the spec exactly (field names, order, lowercase booleans)
  • openenv validate passes locally
  • docker build completes without errors
  • docker run starts the server and responds to GET /state
  • HF Space is in Running state (not Building, not Stopped)
  • All 3 tasks can be reset and stepped without crashing
  • All 3 graders return a float between 0.0 and 1.0
  • Running inference.py end-to-end completes in under 20 minutes
  • README.md includes baseline scores table
  • tests/test_integration.py passes cleanly

12. Key Technical Decisions

Why a simulated cluster, not a real one?

Running kind or minikube inside a Hugging Face Space container with 2 vCPU / 8 GB RAM is not feasible. The Kubernetes control plane alone (etcd + apiserver + scheduler + controller-manager) consumes ~1.5–2 GB RAM before any workloads run. An in-memory Python simulator is the only viable approach within the hardware constraints. It is also faster (no scheduling latency), fully deterministic (same input = same output), and easier to test.

Why a constrained action space?

Free-form kubectl text strings are nearly impossible to grade deterministically. By defining ~7 typed Pydantic action models, we make the action space clear to the LLM (easier to prompt), easy to validate (Pydantic does the type checking), and easy to grade (executor calls predictable world methods). This also keeps the action space small enough that the LLM can reason about it effectively without getting lost in kubectl's hundreds of sub-commands.

Why FastAPI?

OpenEnv environments are expected to be HTTP servers. FastAPI gives automatic OpenAPI documentation (at /docs), Pydantic integration for request/response validation, async support for when we need it, and a clean decorator syntax that makes main.py easy to read. It is also trivial to run with uvicorn inside a Docker container.

Why partial rewards matter for the hackathon

The PS explicitly states: "The reward function must provide feedback throughout the task trajectory, not just at completion." Binary rewards (0 until success, then 1) are explicitly penalised in the environment design score. Our graders implement milestone-based partial rewards, which also makes the environment more useful for actual RL training — sparse rewards make training slow and unstable.


coenv — Meta × Hugging Face OpenEnv RL Hackathon
Team: Sandeep (RL environment) · You (Simulation) · Third Person (Actions & Workers)