Spaces:
No application file
coenv — Project Documentation
Meta × Hugging Face OpenEnv RL Hackathon
Table of Contents
- What Is This Project?
- Why Kubernetes?
- How It Works — The Big Picture
- The Three Layers Explained
- Team Ownership
- Full Project Directory Structure
- The Three Tasks (Easy → Medium → Hard)
- Reward & Grading Design
- The Complete Episode Flow
- OpenEnv Spec Compliance Checklist
- Submission Checklist
- Key Technical Decisions
1. What Is This Project?
coenv is a Reinforcement Learning environment that simulates real-world Kubernetes cluster operations. An AI agent (LLM) is placed inside a broken or degraded Kubernetes cluster and must figure out the right sequence of operations to fix it — just like a real Site Reliability Engineer (SRE) would.
This is built for the Meta × Hugging Face OpenEnv RL Hackathon, which requires:
- A real-world task simulation (not games or toys)
- Full OpenEnv interface implementation (
step(),reset(),state()) - At least 3 tasks with programmatic graders (easy → medium → hard)
- A meaningful reward function that gives partial credit throughout the episode
- A working
inference.pythat runs an LLM agent and logs structured output - Deployment on Hugging Face Spaces with a working Dockerfile
In simple terms: We fake a Kubernetes cluster in Python memory, break it in specific ways, and then let an LLM try to fix it step by step — scoring it on how well it does.
2. Why Kubernetes?
Kubernetes (k8s) is the industry-standard container orchestration system used by virtually every tech company running production software. Managing it is genuinely difficult and is a daily job for SREs and DevOps engineers worldwide.
Why it's a perfect RL environment:
| RL Concept | Kubernetes Equivalent |
|---|---|
| State | Cluster state (pod statuses, node health, resource usage) |
| Action | kubectl commands (scale, patch, delete, restart) |
| Reward | How close the cluster is to a healthy target state |
| Episode | One incident recovery scenario |
| Done | All SLOs restored / all pods healthy |
Why it's novel for OpenEnv: None of Meta's reference environments (calendar, REPL, browser, CARLA, reasoning gym) touch infrastructure operations. This fills a real gap.
Why it's practical: Companies would immediately use an environment like this to train or evaluate agents that assist SREs — the real-world utility score (30% of judging) is very high.
3. How It Works — The Big Picture
Think of the project as three concentric layers:
┌─────────────────────────────────────────────────────────┐
│ LAYER 1 — RL ENVIRONMENT │
│ inference.py ←→ main.py (FastAPI) ←→ tasks/graders │
│ (Sandeep) │
├─────────────────────────────────────────────────────────┤
│ LAYER 2 — SIMULATION ENGINE │
│ world.py ←→ models.py ←→ conditions/ │
│ (You) │
├─────────────────────────────────────────────────────────┤
│ LAYER 3 — ACTION SPACE │
│ worker.py ←→ executor.py ←→ actions/ ←→ validator│
│ (Third Person) │
└─────────────────────────────────────────────────────────┘
Layer 1 (Sandeep) is what the judges see — the API endpoints, the inference script, the task definitions, the graders, the README.
Layer 2 (You) is the fake Kubernetes cluster. It holds the state of the cluster, knows how pods transition between statuses, and can inject failures. Everything sits in Python dictionaries — no real Kubernetes cluster runs.
Layer 3 (Third Person) is the action space — the specific operations the LLM agent is allowed to perform, and the validation/execution bridge that translates those actions into state changes in the simulator.
4. The Three Layers Explained
Layer 1 — RL Environment (Sandeep)
This layer is the public contract of the project. It's what OpenEnv's validate command checks, what the judges' scripts call, and what the LLM agent talks to.
main.py — FastAPI application
The central API server. It exposes exactly three mandatory endpoints:
POST /reset— Starts a new episode. Sets up a broken cluster using one of the condition injectors. Returns the initialClusterObservation(what the agent sees first).POST /step— Receives an action from the agent. Validates it, executes it on the simulated cluster, advances time by one tick, and returns the new observation + reward + done flag + info.GET /state— Returns the full current cluster state. Used for debugging and grading.
inference.py — LLM agent runner
This is the script the hackathon validators actually run. It:
- Reads
API_BASE_URL,MODEL_NAME,HF_TOKENfrom environment variables - Calls
/resetto start an episode - Feeds the observation to the LLM using the OpenAI client
- Parses the LLM's response as a structured action
- Calls
/stepwith that action - Prints structured stdout logs after every step:
[START] task=pod-recovery env=coenv model=Qwen3-VL-30B [STEP] step=1 action=delete_pod('frontend-7d9f-xkp2') reward=0.20 done=false error=null [STEP] step=2 action=scale('frontend',3) reward=0.60 done=false error=null [END] success=true steps=2 rewards=0.20,0.60 - Repeats until
done=trueormax_stepsis reached
openenv.yaml — Spec metadata
Required for openenv validate to pass. Contains:
- Environment name, version, description
- List of task IDs with difficulty labels
- References to the action schema and observation schema
classes/tasks/ — Task definitions
Three Python files, each defining one task:
- What the broken state looks like (which condition to inject)
- What the agent's objective is (in plain English, passed to the LLM as a prompt)
- What counts as success
- Maximum number of steps allowed
classes/graders/ — Reward graders
Three Python files, each implementing a grade(world_state) -> float function. Graders must be fully deterministic — same world state always returns same score. They implement partial credit: a grader doesn't just say "fixed or not fixed" but scores partial progress (e.g., 2 out of 5 pods fixed = 0.4).
Dockerfile
Single-stage Python container. Installs requirements.txt, copies the project, exposes port 8000, runs uvicorn main:app. Must build and run cleanly — this is a hard pass/fail gate.
README.md
Mandatory documentation. Must include: environment overview, motivation, action space definition, observation space definition, task descriptions with difficulty labels, setup instructions, baseline scores table.
Layer 2 — Simulation Engine (You)
This is the most important layer technically. It's what makes the environment believable. Since we cannot run a real Kubernetes cluster inside a 2 vCPU / 8 GB HF Space container, the entire cluster is simulated as an in-memory Python object.
classes/world.py — The cluster simulator
This is the brain of the project. It maintains the complete cluster state as a Python dictionary, structured like a real Kubernetes API response:
cluster_state = {
"nodes": [
{"name": "node-1", "status": "Ready", "cpu_capacity": 4, "mem_capacity": 8192},
{"name": "node-2", "status": "NotReady", "cpu_capacity": 4, "mem_capacity": 8192}
],
"deployments": [
{"name": "frontend", "desired_replicas": 3, "available_replicas": 1, "image": "nginx:1.21"}
],
"pods": [
{"name": "frontend-7d9f-xkp2", "status": "CrashLoopBackOff", "node": "node-1", "restarts": 7},
{"name": "frontend-7d9f-ab3c", "status": "Running", "node": "node-1", "restarts": 0},
{"name": "frontend-7d9f-mn8x", "status": "Pending", "node": None, "restarts": 0}
],
"services": [...],
"configmaps": [...],
"hpa": [...]
}
Key methods:
reset(condition)— Wipes state, injects a failure condition, returns initial observationget_pods(namespace, selector)— Returns filtered pod list (mimicskubectl get pods)apply_patch(resource_type, name, patch)— Applies a patch to a resourcescale(deployment_name, replicas)— Changes replica countdelete_pod(pod_name)— Removes a pod (it gets recreated by the deployment controller on next tick)tick()— Advances simulated time by one step. Pods inCrashLoopBackOffincrement their restart counter. Pending pods on ready nodes eventually transition toRunning. Dead nodes stay dead unless drained.get_observation()— Serialises the current state into aClusterObservationPydantic model
classes/models.py — Pydantic typed models
All data structures are defined here. This is mandatory for OpenEnv spec compliance — typed models enforce the action/observation contract.
class PodStatus(BaseModel):
name: str
status: Literal["Running", "Pending", "CrashLoopBackOff", "OOMKilled", "Terminating", "Unknown"]
node: Optional[str]
restarts: int
cpu_usage: float
mem_usage: float
class NodeStatus(BaseModel):
name: str
status: Literal["Ready", "NotReady", "SchedulingDisabled"]
cpu_capacity: float
mem_capacity: float
cpu_usage: float
mem_usage: float
class ClusterObservation(BaseModel):
nodes: List[NodeStatus]
pods: List[PodStatus]
deployments: List[DeploymentStatus]
services: List[ServiceStatus]
events: List[ClusterEvent] # recent k8s events (error messages, warnings)
step: int
objective: str # plain English description of what to fix
class RewardSignal(BaseModel):
reward: float # 0.0 to 1.0 incremental reward this step
cumulative: float # total reward so far
done: bool
info: Dict[str, Any] # breakdown: why this reward was given
classes/conditions/ — Failure injectors
Each condition is a Python class with a single inject(cluster_state) -> cluster_state method that takes a healthy cluster and returns a broken one. This is how each task starts with a specific failure scenario:
crash_loop.py— Sets 3 pods toCrashLoopBackOffwith high restart counts. Simulates a bad image tag or missing environment variable.oom_kill.py— Sets pods toOOMKilled. Memory limits are set too low in the deployment spec. Pods keep restarting.node_failure.py— Sets one node toNotReady. All pods on that node go toUnknown. New pods arePending(no space to schedule).cascade_failure.py— Combines multiple failures: one OOMKilled service causes downstream 503s in two dependent services, creating a cascading failure across 3 deployments.
classes/utils.py — Probability and simulation helpers
Utility functions that make the simulation feel realistic:
sample_cpu_usage(base_load, noise_factor)— Returns a slightly randomised CPU % (real clusters are never exactly at baseline)sample_latency(healthy_latency, degradation_factor)— Simulates p95 request latency under loadshould_pod_recover(restarts, backoff_seconds)— Determines if aCrashLoopBackOffpod would naturally recover (it usually won't — that's the point)generate_cluster_events(pod_list)— Creates realistic k8s event messages like"Back-off restarting failed container"or"OOMKilled: container exceeded memory limit"
config.json — Cluster defaults
Single source of truth for all simulation parameters:
{
"cluster": {
"num_nodes": 3,
"cpu_per_node": 4,
"mem_per_node_gb": 8
},
"tasks": {
"pod_recovery": { "max_steps": 15, "success_threshold": 0.9 },
"autoscaling": { "max_steps": 20, "success_threshold": 0.85 },
"incident": { "max_steps": 30, "success_threshold": 0.80 }
},
"simulation": {
"tick_interval_seconds": 30,
"crash_backoff_max_seconds": 300,
"hpa_cooldown_seconds": 180
}
}
Layer 3 — Action Space & Workers (Third Person)
This layer defines what the LLM is allowed to do, makes sure it's valid, and executes it against the simulator.
classes/actions/ — Typed action definitions
Each action is a Pydantic model. The LLM must output one of these (Sandeep's inference.py prompts it to respond in JSON matching one of these schemas):
class ScaleAction(BaseModel):
action_type: Literal["scale"]
deployment: str # e.g. "frontend"
replicas: int # e.g. 3
class DeletePodAction(BaseModel):
action_type: Literal["delete_pod"]
pod_name: str # e.g. "frontend-7d9f-xkp2"
class PatchAction(BaseModel):
action_type: Literal["patch"]
resource_type: str # "deployment" | "configmap" | "service"
name: str
patch: Dict[str, Any] # the fields to update
class RolloutRestartAction(BaseModel):
action_type: Literal["rollout_restart"]
deployment: str
class SetHPAAction(BaseModel):
action_type: Literal["set_hpa"]
deployment: str
min_replicas: int
max_replicas: int
cpu_target_percent: int
class DrainNodeAction(BaseModel):
action_type: Literal["drain_node"]
node_name: str
class DescribeAction(BaseModel):
action_type: Literal["describe"]
resource_type: str
name: str # "investigation" action — no state change, returns detail
classes/validator.py — Action validation
Before any action touches the world state, the validator checks it:
- Does the target resource exist? (Can't delete a pod that doesn't exist)
- Is the scale value sane? (Can't scale to 0 or to 1000 replicas)
- Is the node already drained? (Can't drain twice)
- Is the deployment name a real deployment?
If validation fails, it returns an error string. This flows directly into the [STEP] error= field in stdout logs. The step still counts against the agent's limit — bad actions are penalised by wasting steps.
classes/executor.py — Action execution bridge
Maps each validated action type to the correct world.py method call:
def execute(action: KubeAction, world: World) -> ExecutionResult:
if action.action_type == "scale":
world.scale(action.deployment, action.replicas)
elif action.action_type == "delete_pod":
world.delete_pod(action.pod_name)
elif action.action_type == "rollout_restart":
world.rollout_restart(action.deployment)
...
world.tick() # always advance time after an action
return ExecutionResult(observation=world.get_observation(), ...)
classes/worker.py — Agent episode loop
Manages the full lifecycle of a single episode. Sandeep's inference.py calls this:
class Worker:
def run_episode(self, task_id, world, max_steps) -> EpisodeResult:
obs = world.reset(task=task_id)
rewards = []
for step in range(1, max_steps + 1):
action = self.get_action(obs) # calls LLM
result = executor.execute(action, world)
rewards.append(result.reward)
if result.done:
break
return EpisodeResult(rewards=rewards, steps=step, success=result.done)
5. Team Ownership
| Module | Owner | Why It's Their Responsibility |
|---|---|---|
main.py |
Sandeep | He owns the public API contract |
inference.py |
Sandeep | He owns the hackathon submission script |
openenv.yaml |
Sandeep | He owns spec compliance |
Dockerfile |
Sandeep | He owns deployment |
README.md |
Sandeep | He owns documentation |
classes/tasks/ |
Sandeep | He defines what success looks like |
classes/graders/ |
Sandeep | He owns the scoring logic |
classes/world.py |
You | You own the cluster simulator |
classes/models.py |
You | You own all typed data models |
classes/utils.py |
You | You own simulation helpers |
classes/conditions/ |
You | You own failure injection |
config.json |
You | You own all parameters |
classes/worker.py |
Third person | They own the episode loop |
classes/actions/ |
Third person | They own the action space |
classes/executor.py |
Third person | They own action execution |
classes/validator.py |
Third person | They own action validation |
tests/ |
All three | Each writes tests for their own module |
6. Full Project Directory Structure
coenv/
├── .dockerignore # Docker build exclusions
├── __init__.py # Module exports
├── README.md # Project documentation
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Project metadata and dependencies
├── uv.lock # Locked dependencies
├── client.py # CoEnv client / inference-side runner
├── models.py # Shared action and observation models
├── config.json # Cluster defaults and simulation params
├── mkdocs.yml # Docs site configuration
├── tests/ # End-to-end and unit tests
│ ├── test_environment.py # From test_world.py
│ ├── test_conditions.py # From test_conditions.py
│ ├── test_models.py # From test_models.py
│ ├── test_actions.py # From test_actions.py
│ ├── test_executor.py # From test_executor.py
│ ├── test_graders.py # From test_graders.py
│ ├── test_tasks.py # From test_tasks.py
│ └── test_integration.py # End-to-end reset→step→state flow
└── server/
├── __init__.py # Server module exports
├── coenv_environment.py # Core environment logic
├── app.py # FastAPI app exposing /reset /step /state
├── Dockerfile # Container image definition
├── utils.py # Simulation helpers
├── validator.py # Action validation
├── executor.py # Action execution bridge
├── worker.py # Episode loop manager
├── tasks/
│ ├── __init__.py
│ ├── task_pod_recovery.py
│ ├── task_autoscaling.py
│ └── task_incident.py
├── graders/
│ ├── __init__.py
│ ├── grader_pod_recovery.py
│ ├── grader_autoscaling.py
│ └── grader_incident.py
├── conditions/
│ ├── __init__.py
│ ├── crash_loop.py
│ ├── oom_kill.py
│ ├── node_failure.py
│ └── cascade_failure.py
└── actions/
├── __init__.py
├── scale_action.py
├── patch_action.py
├── delete_pod_action.py
├── rollout_action.py
├── hpa_action.py
├── drain_action.py
└── describe_action.py
7. The Three Tasks (Easy → Medium → Hard)
Task 1 — Pod Recovery (Easy)
What's broken: A frontend deployment has 3 pods stuck in CrashLoopBackOff. The restart count is climbing. The root cause is a wrong environment variable in the deployment spec pointing to a database host that doesn't exist.
What the agent must do:
- Observe the broken pods and read the k8s events (which mention a connection refused error)
- Identify the bad
DB_HOSTenvironment variable using adescribeorpatchinspect action - Patch the deployment with the correct
DB_HOSTvalue - Optionally delete the crash-looping pods to speed up recovery (they'll get recreated with the new config)
- Verify all 3 pods reach
Runningstate
Objective string shown to agent: "The frontend deployment is crash-looping. Diagnose and fix the root cause so that all pods reach Running state."
Max steps: 15
Success threshold: All 3 pods in Running state (score ≥ 0.9)
Partial rewards:
- +0.1 for each pod that stops crash-looping
- +0.2 for correctly patching the environment variable
- +0.3 bonus for all pods Running within 10 steps
Task 2 — HPA Autoscaling Under Traffic Spike (Medium)
What's broken: The cluster is healthy but receiving 10× normal traffic. The deployment has no HPA configured, is running on fixed 2 replicas, and is already at 95% CPU. Request latency is climbing past the SLO threshold.
What the agent must do:
- Observe high CPU usage and rising latency in the observation
- Immediately scale up the deployment to handle current load
- Configure a HorizontalPodAutoscaler (HPA) with appropriate min/max replicas and CPU target
- Set correct CPU resource requests/limits on the deployment so HPA has a baseline to work with
- Verify that latency drops back below the SLO threshold
Objective string shown to agent: "Traffic has spiked 10×. The api-server deployment is overloaded. Configure autoscaling and ensure p95 latency stays below 500ms."
Max steps: 20
Success threshold: p95 latency < 500ms, HPA configured, replicas ≥ 4 (score ≥ 0.85)
Partial rewards:
- +0.15 for scaling up replicas immediately (within 3 steps)
- +0.20 for configuring HPA correctly
- +0.25 for latency dropping below 1000ms
- +0.30 for latency dropping below 500ms (SLO met)
- -0.10 penalty for scaling beyond 12 replicas unnecessarily (resource waste)
Task 3 — Multi-Service Cascading Incident (Hard)
What's broken: The auth-service deployment has pods getting OOMKilled because memory limits are set 4× too low relative to actual usage. This causes the api-gateway to fail authentication checks and return 503s. Downstream, the data-processor service is also throwing errors because it depends on the gateway. Three services are degraded simultaneously.
What the agent must do:
- Identify the blast radius — which services are affected and why
- Investigate
auth-serviceto find the OOMKill root cause (memory limits too low) - Patch
auth-servicedeployment with correct memory limits - Rollout restart
auth-serviceso new pods come up with correct limits - Drain the partially-failed node where most OOMKilled pods were running, to force clean rescheduling
- Verify
api-gateway503 errors stop (automatically once auth recovers) - Verify
data-processorerror rate drops (automatically once gateway recovers) - Confirm all three services are fully healthy
Objective string shown to agent: "A cascading incident has degraded auth-service, api-gateway, and data-processor. Identify the root cause and restore all three services to healthy state without data loss."
Max steps: 30
Success threshold: All 3 services healthy, error rate < 0.1% (score ≥ 0.80)
Partial rewards:
- +0.10 for correctly identifying
auth-serviceas the root cause (within 5 steps) - +0.15 for patching memory limits correctly
- +0.15 for auth-service pods reaching Running
- +0.20 for api-gateway 503s stopping
- +0.20 for data-processor errors resolving
- +0.10 for draining the bad node cleanly
- -0.15 penalty for deleting services or breaking healthy components
8. Reward & Grading Design
The grading philosophy follows what the PS requires: reward signal over the full trajectory, not just at the end.
Reward Principles
Partial progress is always rewarded. If the agent fixes 1 out of 3 broken pods, it gets 1/3 of the maximum reward for that milestone — not zero.
Speed bonus. Fixing the issue in fewer steps earns a small bonus. This incentivises efficient reasoning.
Waste penalty. Unnecessary destructive actions (scaling to 0, deleting healthy pods, draining a healthy node) subtract from the reward. This teaches the agent to be surgical.
Idempotency. Repeating the same correct action doesn't give extra reward but doesn't penalise either (except for wasted steps).
Grader Implementation Pattern
Each grader implements:
def grade(world_state: dict, step: int, max_steps: int) -> float:
score = 0.0
# Milestone 1: Partial progress
running_pods = [p for p in world_state["pods"] if p["status"] == "Running"]
score += (len(running_pods) / total_expected_pods) * 0.5
# Milestone 2: Full success
if all(p["status"] == "Running" for p in world_state["pods"]):
score += 0.4
# Speed bonus
efficiency = 1.0 - (step / max_steps)
score += efficiency * 0.1
return min(score, 1.0) # always clamp to [0, 1]
9. The Complete Episode Flow
Here is the full step-by-step flow of one complete episode, from start to finish:
1. JUDGE / VALIDATOR runs:
python inference.py
2. inference.py reads env vars:
API_BASE_URL, MODEL_NAME, HF_TOKEN
3. inference.py calls:
POST /reset { "task": "pod_recovery" }
4. main.py receives /reset:
→ Calls task_pod_recovery.get_condition() → crash_loop.inject(cluster_state)
→ world.reset(broken_state)
→ Returns ClusterObservation (3 CrashLoopBackOff pods, events, objective string)
5. stdout prints:
[START] task=pod-recovery env=coenv model=Qwen3-30B
6. inference.py builds LLM prompt:
"You are an SRE. Current cluster state: [observation JSON].
Objective: Fix the frontend deployment crash loop.
Respond with a JSON action from the available action types."
7. LLM responds:
{ "action_type": "describe", "resource_type": "deployment", "name": "frontend" }
8. inference.py calls:
POST /step { action }
9. main.py receives /step:
→ validator.validate(action, world) → OK
→ executor.execute(action, world)
→ world.tick()
→ grader.grade(world.state, step=1) → reward=0.00 (just investigating)
→ Returns observation, reward=0.00, done=false, info={...}
10. stdout prints:
[STEP] step=1 action=describe('deployment','frontend') reward=0.00 done=false error=null
11. LLM sees deployment spec, notices DB_HOST=wrong-host.internal
LLM responds: { "action_type": "patch", "resource_type": "deployment",
"name": "frontend",
"patch": {"env": [{"name": "DB_HOST", "value": "db.prod.internal"}]} }
12. POST /step { patch action }
→ executor patches deployment in world state
→ world.tick() — pods begin restarting with new config
→ grader → reward=0.20 (correct patch applied)
13. [STEP] step=2 action=patch('frontend',{env...}) reward=0.20 done=false error=null
14. LLM responds: { "action_type": "delete_pod", "pod_name": "frontend-7d9f-xkp2" }
→ world deletes pod, recreates with correct env, status → Running
→ grader → reward=0.40
15. Repeat for remaining 2 pods...
16. All 3 pods Running. grader → reward=1.0, done=true
17. stdout prints:
[END] success=true steps=8 rewards=0.00,0.20,0.40,0.55,0.70,0.85,0.95,1.00
10. OpenEnv Spec Compliance Checklist
| Requirement | File | Status |
|---|---|---|
| Typed Observation model | classes/models.py → ClusterObservation |
Required |
| Typed Action model | classes/models.py → KubeAction |
Required |
| Typed Reward model | classes/models.py → RewardSignal |
Required |
step(action) → (obs, reward, done, info) |
main.py → POST /step |
Required |
reset() → initial_observation |
main.py → POST /reset |
Required |
state() → current_state |
main.py → GET /state |
Required |
openenv.yaml with metadata |
openenv.yaml |
Required |
openenv validate passes |
Tested via pre-validation script | Required |
| Min 3 tasks | classes/tasks/ — 3 files |
Required |
| Easy → medium → hard difficulty | task_pod_recovery / task_autoscaling / task_incident | Required |
| Graders return 0.0–1.0 | classes/graders/ — 3 graders |
Required |
| Graders are deterministic | Pure functions, no randomness | Required |
| Partial reward signals | All 3 graders implement milestone scoring | Required |
| Penalise bad actions | validator.py + grader penalty terms | Required |
inference.py in root |
inference.py |
Required |
[START] log line |
inference.py → log_start() |
Required |
[STEP] log per step |
inference.py → log_step() |
Required |
[END] log always emitted |
inference.py → finally: log_end() |
Required |
Reads API_BASE_URL with default |
inference.py |
Required |
Reads MODEL_NAME with default |
inference.py |
Required |
Reads HF_TOKEN (no default) |
inference.py |
Required |
| Uses OpenAI client | from openai import OpenAI |
Required |
Dockerfile builds cleanly |
Dockerfile |
Required |
| HF Space deploys and responds | Deployed on Hugging Face | Required |
| Inference runs in < 20 min | Max 30 steps × ~20s/step = ~10 min | Required |
| Runs in 2 vCPU / 8 GB RAM | Pure Python in-memory sim, no real k8s | Required |
| README with all required sections | README.md |
Required |
11. Submission Checklist
Before submitting, verify all of these:
-
inference.pyis in the root directory (not insideclasses/) -
inference.pyhas default values forAPI_BASE_URLandMODEL_NAME -
inference.pyraisesValueErrorifHF_TOKENis missing -
[START],[STEP],[END]format matches the spec exactly (field names, order, lowercase booleans) -
openenv validatepasses locally -
docker buildcompletes without errors -
docker runstarts the server and responds toGET /state - HF Space is in Running state (not Building, not Stopped)
- All 3 tasks can be reset and stepped without crashing
- All 3 graders return a float between 0.0 and 1.0
- Running
inference.pyend-to-end completes in under 20 minutes -
README.mdincludes baseline scores table -
tests/test_integration.pypasses cleanly
12. Key Technical Decisions
Why a simulated cluster, not a real one?
Running kind or minikube inside a Hugging Face Space container with 2 vCPU / 8 GB RAM is not feasible. The Kubernetes control plane alone (etcd + apiserver + scheduler + controller-manager) consumes ~1.5–2 GB RAM before any workloads run. An in-memory Python simulator is the only viable approach within the hardware constraints. It is also faster (no scheduling latency), fully deterministic (same input = same output), and easier to test.
Why a constrained action space?
Free-form kubectl text strings are nearly impossible to grade deterministically. By defining ~7 typed Pydantic action models, we make the action space clear to the LLM (easier to prompt), easy to validate (Pydantic does the type checking), and easy to grade (executor calls predictable world methods). This also keeps the action space small enough that the LLM can reason about it effectively without getting lost in kubectl's hundreds of sub-commands.
Why FastAPI?
OpenEnv environments are expected to be HTTP servers. FastAPI gives automatic OpenAPI documentation (at /docs), Pydantic integration for request/response validation, async support for when we need it, and a clean decorator syntax that makes main.py easy to read. It is also trivial to run with uvicorn inside a Docker container.
Why partial rewards matter for the hackathon
The PS explicitly states: "The reward function must provide feedback throughout the task trajectory, not just at completion." Binary rewards (0 until success, then 1) are explicitly penalised in the environment design score. Our graders implement milestone-based partial rewards, which also makes the environment more useful for actual RL training — sparse rewards make training slow and unstable.
coenv — Meta × Hugging Face OpenEnv RL Hackathon
Team: Sandeep (RL environment) · You (Simulation) · Third Person (Actions & Workers)