
┌─────────┐   action    ┌─────────────┐
│  AGENT  │ ──────────> │ ENVIRONMENT │
│  (dog)  │ <────────── │  (world)    │
└─────────┘ observation  └─────────────┘
              + reward

Agent: the AI that learns (an LLM, a neural network, etc.)
Environment: the world the agent lives in (our code!)

What is an "Environment" in code?

An environment is a Python class with three methods:

class MyEnvironment:
    def reset(self):
        """Start a new episode. Return the first observation."""
        ...

    def step(self, action):
        """Agent does something. Return what happened + reward."""
        ...

    def state(self):
        """Return metadata about the current episode."""
        ...

That's it. Those three methods are the entire interface between the agent and the world.
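To make the interface concrete, here is a minimal sketch of a toy environment with those three methods. The guessing game, the `ToyObservation` dataclass and the `GuessEnvironment` class are hypothetical and not part of this project; the sketch also uses a plain dataclass rather than the OpenEnv base types.

```python
from dataclasses import dataclass


@dataclass
class ToyObservation:
    """Hypothetical stand-in for an observation: message + reward + done flag."""
    message: str
    reward: float = 0.0
    done: bool = False


class GuessEnvironment:
    """Toy world (illustration only): the agent must guess a hidden integer."""

    def __init__(self, secret: int = 7):
        self._secret = secret
        self._steps = 0

    def reset(self) -> ToyObservation:
        # Start a new episode and describe the task to the agent.
        self._steps = 0
        return ToyObservation(message="Guess a number between 1 and 10.")

    def step(self, action: int) -> ToyObservation:
        # Process one guess and report what happened, plus a reward.
        self._steps += 1
        if action == self._secret:
            return ToyObservation("Correct!", reward=1.0, done=True)
        hint = "higher" if action < self._secret else "lower"
        return ToyObservation(f"Wrong, try {hint}.")

    def state(self) -> dict:
        # Metadata only: the secret itself is never exposed here.
        return {"step_count": self._steps}


env = GuessEnvironment(secret=7)
obs = env.reset()
guess = 5
while not obs.done:
    obs = env.step(guess)
    guess += 1  # naive agent: count upwards until correct
print(obs.message, env.state())
```

The agent only ever touches `reset()`, `step()` and `state()`; everything else about the world stays hidden behind them.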

What is OpenEnv?

OpenEnv is a standard for RL environments. Think of it like USB for hardware -- it doesn't matter what device you plug in, as long as it follows the USB spec. OpenEnv says:

Your reset() must return an Observation object
Your step() must accept an Action object and return an Observation
Your state property must return a State object
These objects must be Pydantic models (typed, validated Python objects)
You must have an openenv.yaml manifest file
You must serve your environment over HTTP (FastAPI)

Why bother with a standard? Because it means any agent can talk to any environment without custom glue code.

What does OUR environment do?

Our environment is called the Scientific Hypothesis Lab. Here's the idea:

The agent is a scientist. Each episode, it faces a hidden causal system (like "Beta = 2.0 * Alpha + 3.0"). The variables are abstract -- named things like Alpha, Beta, Gamma or V1, V2, V3 -- so the agent can't rely on pretrained knowledge of real-world physics. It must reason purely from experimental data.

Think of it like a detective game:

The "crime" is hidden causal rules between variables
The "clues" are noisy experimental results
The "solution" is a written hypothesis
The "score" is how close the hypothesis matches reality

This is a real-world task -- it models how actual scientists discover causal relationships. Using abstract variable names ensures the agent genuinely discovers rules rather than recalling them from training data.

Part 2: The OpenEnv Contract

Before we look at code, let's understand the contract every OpenEnv environment must fulfill.

The Three Methods

reset(**kwargs) -> Observation
    "Start fresh. Generate a new puzzle. Tell the agent what it sees."

step(action: Action) -> Observation
    "The agent did something. Process it. Tell the agent what happened."

state -> State  (property, not a method call)
    "Return metadata about the current episode. Never leak secrets."

The Three Data Types

Every OpenEnv environment defines three Pydantic models that inherit from base types:

Type	Base Class	Purpose	Who sees it
Action	`openenv.core.Action`	What the agent sends	Agent -> Environment
Observation	`openenv.core.Observation`	What comes back	Environment -> Agent
State	`openenv.core.State`	Episode metadata	Anyone (debugging)

The Observation base class always includes:

done: bool -- is the episode over?
reward: float | None -- how well did the agent do on this step?

The State base class always includes:

episode_id: str -- unique ID for this episode
step_count: int -- how many steps so far

The Manifest (openenv.yaml)

Every environment needs a tiny YAML file:

spec_version: 1           # Which version of the OpenEnv spec
name: hypothesis_lab      # Machine-readable name
type: space               # Deployed as an HF Space
runtime: fastapi           # HTTP framework used
app: server.app:app        # Python path to the ASGI app
port: 8000                 # Port the server listens on

This is like a package.json for your environment -- it tells the OpenEnv tooling how to find and run your code.

The Episode Lifecycle

Here's what one complete episode looks like:

1. Agent calls reset(noise_level="low", domain="system_alpha")
2. Environment generates a hidden world with random causal rules
3. Environment returns initial Observation (variable names, budget, instructions)

4. LOOP:
   a. Agent reads the observation
   b. Agent decides on an action (experiment or submit)
   c. Agent calls step(action)
   d. Environment processes the action
   e. Environment returns new Observation (results, reward)
   f. If observation.done == True, episode is over

5. Agent reads the state property to see final metadata
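The lifecycle above can be sketched as a small driver function. This is an illustration under stated assumptions: `run_episode`, `StubEnv`, `Obs` and the string actions are hypothetical names made up for this sketch, and the stub stands in for a real OpenEnv environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Obs:
    """Hypothetical minimal observation with the fields the loop needs."""
    message: str
    reward: float = 0.0
    done: bool = False


class StubEnv:
    """Stand-in environment: ends when the agent submits or the budget runs out."""

    def __init__(self, budget: int = 3):
        self._budget_total = budget
        self._budget = budget

    def reset(self, **kwargs) -> Obs:
        self._budget = self._budget_total
        return Obs(f"New episode, budget={self._budget}")

    def step(self, action: str) -> Obs:
        if action == "submit":
            return Obs("Hypothesis graded.", reward=1.0, done=True)
        self._budget -= 1
        return Obs(f"Experiment done, budget={self._budget}",
                   reward=0.2, done=self._budget == 0)


def run_episode(env, policy: Callable[[Obs], str], **reset_kwargs) -> float:
    """Reset, then act until the observation says done. Returns total reward."""
    obs = env.reset(**reset_kwargs)   # steps 1-3
    total = 0.0
    while not obs.done:               # step 4
        action = policy(obs)          # 4a-4b: read the observation, pick an action
        obs = env.step(action)        # 4c-4e: environment processes it
        total += obs.reward
    return total


# A trivial policy: run two experiments, then submit.
calls = {"n": 0}


def two_then_submit(obs: Obs) -> str:
    calls["n"] += 1
    return "experiment" if calls["n"] <= 2 else "submit"


total_reward = run_episode(StubEnv(budget=5), two_then_submit, noise_level="low")
print(f"total reward: {total_reward:.2f}")
```

Because the loop depends only on `reset()`, `step()` and `done`, the same driver would work against any environment that honours the contract.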

Part 3: Tour of Every File

Here is every file and what it does. Think of this as the map before we explore each room.

hypothesis_lab/
│
├── openenv.yaml                          # THE MANIFEST
│   "Hi, I'm an OpenEnv environment.      #   Points the framework
│    Here's how to find my server."        #   to server.app:app
│
├── models.py                             # THE LANGUAGE
│   "These are the words the agent        #   HypLabAction
│    and environment use to talk."         #   HypLabObservation
│                                         #   HypLabState
│
├── server/                               # THE BRAIN
│   ├── app.py                            #   HTTP server (thin wrapper)
│   ├── hypothesis_lab_environment.py     #   Core game logic
│   ├── causal_world.py                   #   Hidden puzzle generator
│   └── rubric.py                         #   Scoring engine
│
├── tasks/                                # THE EXAM
│   ├── task_easy.py                      #   Easy test + grader
│   ├── task_medium.py                    #   Medium test + grader
│   └── task_hard.py                      #   Hard test + grader
│
├── client.py                             # THE PHONE
│   "Typed Python client so agents        #   Wraps HTTP calls
│    don't need to speak raw HTTP."        #   into nice methods
│
├── baseline_inference.py                 # THE DEMO AGENT
│   "Here's a simple GPT agent that       #   Uses OpenAI API
│    can play the game. Not great,         #   Produces reproducible
│    but proves the game works."           #   scores on all 3 tasks
│
├── tests/                                # THE SAFETY NET
│   └── test_environment.py               #   39 tests covering
│                                         #   every component
│
├── Dockerfile                            # THE SHIPPING BOX
│   "Packages everything into a           #   Multi-stage build
│    container for deployment."            #   OpenEnv base image
│
├── pyproject.toml                        # THE SHOPPING LIST
│   "What Python packages we need."       #   Dependencies + metadata
│
└── README.md                             # THE COVER LETTER
    "What this environment is and         #   HF Spaces frontmatter
     how to use it."                      #   Action/observation docs

Now let's explore each room in detail.

Part 4: The Hidden World

File: server/causal_world.py

This is the puzzle the agent must solve. Every episode generates a fresh hidden world.

Core Concept: Causal Graphs

A causal graph is a set of variables connected by rules:

Alpha ──(quadratic)──> Beta ──(saturating)──> Gamma
 7.93      B = 0.5*A² + 1.2       G = 10*B / (3 + B)

The agent never sees this graph. It can only probe it through experiments.

Why Abstract Variable Names?

An earlier version of this environment used real-world names like "Temperature", "Pressure", "Volume". This created a serious problem: LLM agents have pretrained knowledge about how those variables relate (PV=nRT, supply/demand curves, etc.). The agent would use that prior knowledge instead of reasoning from experimental data -- which defeats the entire purpose.

Now variables are named things like Alpha, Beta, Gamma or V1, V2, V3 or Quant_A, Quant_B, Quant_C. The LLM has no prior about how "Alpha" relates to "Beta", so it must genuinely discover the relationship through experiments.

The Building Blocks

CausalRule -- one edge in the graph:

@dataclass
class CausalRule:
    cause: str          # "Alpha"
    effect: str         # "Beta"
    rule_type: str      # one of 8 types (see table below)
    params: dict        # {"a": 2.1, "b": 3.0}
    description: str    # "Beta = 2.1 * Alpha + 3.0"

    def evaluate(self, x: float) -> float:
        """Given x (the cause value), compute the effect value."""
        ...

There are eight single-parent rule types:

Rule	Formula	What it looks like	Why it's tricky
Linear	`y = a*x + b`	Straight line	Easy to identify
Threshold	`y = high if x > t else low`	Step function	Need to find the cutoff
Inverse	`y = a / x`	Hyperbola	Blows up near zero
Quadratic	`y = ax² + bx + c`	Parabola	Looks linear in narrow range
Exponential	`y = a * exp(k*x)`	Growth/decay curve	Looks linear locally
Logarithmic	`y = a * ln(x) + b`	Diminishing returns	Looks linear in mid-range
Saturating	`y = Vmax * x / (Km + x)`	Plateau	Looks linear for small x
Piecewise-linear	Two slopes with a knot	Bent line	Looks linear on each side

Many of these look similar with limited data. Quadratic, exponential, and saturating all resemble linear in a narrow range -- the agent must design experiments that discriminate between hypotheses (e.g., sampling at extremes to check for curvature).
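To see why the sampling range matters, here is a small sketch that measures how far each rule bends away from a straight line over a narrow and a wide range. The parameter values and the `curvature` helper are assumptions chosen for illustration; they are not taken from the project code.

```python
# Three candidate rules with hypothetical parameters.
def linear(x):
    return 2.0 * x + 1.0


def quadratic(x):
    return 0.5 * x ** 2 + 1.2


def saturating(x):
    return 10.0 * x / (3.0 + x)


def curvature(f, lo, hi):
    """Gap between f at the midpoint and the straight line through the endpoints.

    Zero for a perfectly linear rule; grows with the width of the range
    for curved rules.
    """
    mid = (lo + hi) / 2.0
    chord_mid = (f(lo) + f(hi)) / 2.0
    return abs(f(mid) - chord_mid)


for name, f in [("linear", linear), ("quadratic", quadratic),
                ("saturating", saturating)]:
    narrow = curvature(f, 4.0, 5.0)
    wide = curvature(f, 1.0, 10.0)
    print(f"{name:10s} narrow={narrow:.3f} wide={wide:.3f}")
```

With these parameters, the narrow-range gap for both curved rules falls below the medium noise level (sigma = 0.20), so the curvature would be buried in noise. Over the wide range the gap is several times larger than sigma, which is what makes sampling at the extremes informative.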

InteractionRule -- a multi-parent edge where the effect depends on two causes:

@dataclass
class InteractionRule:
    cause1: str         # "Alpha"
    cause2: str         # "Beta"
    effect: str         # "Gamma"
    interaction_type: str  # "additive", "multiplicative", "min", "max"

These are genuinely hard: the agent can't discover them by varying one variable at a time. It must realise that two parents jointly determine the effect.
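A short sketch shows why one-variable-at-a-time probing is misleading here. The `interact` function is a hypothetical, coefficient-free reading of the four interaction types; how the real InteractionRule combines its parents is not shown in this walkthrough.

```python
def interact(kind: str, x1: float, x2: float) -> float:
    """Hypothetical evaluation of a two-parent rule (no coefficients, no noise)."""
    if kind == "additive":
        return x1 + x2
    if kind == "multiplicative":
        return x1 * x2
    if kind == "min":
        return min(x1, x2)
    if kind == "max":
        return max(x1, x2)
    raise ValueError(f"unknown interaction type: {kind}")


def effect_of_raising_alpha(kind: str, beta: float) -> float:
    """Change in Gamma when Alpha goes from 1 to 5 with Beta held fixed."""
    return interact(kind, 5.0, beta) - interact(kind, 1.0, beta)


for kind in ("additive", "multiplicative", "min", "max"):
    low = effect_of_raising_alpha(kind, beta=0.5)
    high = effect_of_raising_alpha(kind, beta=8.0)
    print(f"{kind:15s} beta=0.5 -> {low:+.1f}   beta=8.0 -> {high:+.1f}")
```

Only the additive case gives the same answer at both Beta values. For the others, the apparent effect of Alpha depends on where Beta happens to sit, so an agent that never varies Beta can draw the wrong conclusion (including "Alpha has no effect").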

Try it yourself -- open a Python shell in the project directory:

from server.causal_world import CausalRule

rule = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="linear", params={"a": 2.0, "b": 3.0},
    description="Beta = 2.0 * Alpha + 3.0"
)

print(rule.evaluate(0))   # 3.0  (y = 2*0 + 3)
print(rule.evaluate(5))   # 13.0 (y = 2*5 + 3)
print(rule.evaluate(10))  # 23.0 (y = 2*10 + 3)

# Try a saturating rule
sat = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="saturating", params={"v_max": 10.0, "k_m": 3.0},
    description="Beta = 10 * Alpha / (3 + Alpha)"
)
print(sat.evaluate(1))    # 2.5  (still growing)
print(sat.evaluate(10))   # 7.69 (approaching plateau)
print(sat.evaluate(1000)) # ~10  (saturated)

CausalWorld -- the full hidden system

The CausalWorld holds all the variables, rules, interaction rules, and default values. It also tracks a confounder_sigma -- if > 0, a hidden variable injects correlated noise the agent can't explain.

It has four query methods -- one for each experiment type the agent can run:

world.query_intervention(cause, value, effect, sigma)
# "Set Alpha to 5.0. What does Beta become?" (+ noise + confounder)

world.query_correlation(cause, [1, 10, 5], effect, sigma)
# "Sweep Alpha from 1 to 10 in 5 steps. Show me Beta at each."

world.query_counterfactual(cause, delta, effect, sigma)
# "If Alpha increases by +3.0, what happens to Beta?"

world.query_passive(target, sigma)
# "Just show me what Beta is right now, without changing anything."

Every result has Gaussian noise added. If sigma=0.05, the noise is tiny (easy mode). If sigma=0.50, the noise is huge (hard mode). On top of that, ~27% of worlds also have hidden confounder noise.
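To get a feel for what those sigma values mean, here is a standalone sketch of a noisy measurement. It assumes the noise is simply added to the true value; `noisy_measurement` is a hypothetical helper, and the real query methods (which also handle confounders) may differ.

```python
import random
import statistics


def noisy_measurement(true_value: float, sigma: float, rng: random.Random) -> float:
    """Return the true value plus zero-mean Gaussian noise (assumed additive)."""
    return true_value + rng.gauss(0.0, sigma)


rng = random.Random(0)
true_beta = 13.0  # e.g. Beta = 2.0 * Alpha + 3.0 at Alpha = 5.0

for sigma in (0.05, 0.50):
    samples = [noisy_measurement(true_beta, sigma, rng) for _ in range(2000)]
    print(f"sigma={sigma:.2f}: mean={statistics.mean(samples):.3f}, "
          f"spread={statistics.stdev(samples):.3f}")
```

Averaging many samples recovers the true value in both cases, but the agent only gets a handful of experiments per episode, so at sigma=0.50 a single reading can easily be off by half a unit or more.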

Try it yourself:

from server.causal_world import generate_world

world = generate_world(n_variables=3, domain="system_alpha", seed=42)
print("Variables:", world.variables)
print("Ground truth:")
print(world.ground_truth_summary())

# Check for interactions and confounders
print(f"\nInteraction rules: {len(world.interactions)}")
print(f"Confounder sigma: {world.confounder_sigma}")

# Run an experiment
cause, effect = world.variables[0], world.variables[1]
result = world.query_intervention(cause, 5.0, effect, sigma=0.05)
print(f"\nSet {cause}=5.0, observed {effect}={result:.4f}")

The generate_world() Function

This is the factory that builds a fresh puzzle:

Pick a domain (system_alpha/beta/gamma/delta) -- this only changes the context prompt
Pick an abstract variable pool (Greek letters, V1-V5, Quant_A-E, etc.)
Choose N variables and connect them with random rules (8 possible types)
Add extra random edges with 30% probability
Optionally replace some single-parent rules with multi-parent interaction rules (~40% chance when n >= 3)
Optionally add a hidden confounder (~30% chance when n >= 3)
Compute default values for all variables

Domains and Variable Pools

Domains provide different narrative prompts but use the same abstract variable names:

DOMAIN_LABELS = {
    "system_alpha": {"context": "You are studying an unknown dynamical system..."},
    "system_beta":  {"context": "You are investigating a black-box system..."},
    "system_gamma": {"context": "You are analysing an opaque process..."},
    "system_delta": {"context": "You are probing a simulated environment..."},
}

ABSTRACT_VAR_POOLS = [
    ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    ["Zeta", "Eta", "Theta", "Iota", "Kappa"],
    ["V1", "V2", "V3", "V4", "V5"],
    ["Rho", "Sigma", "Tau", "Upsilon", "Phi"],
    # ... more pools
]

Each episode randomly selects a pool, so the agent can't even memorise variable-name-to-position mappings across episodes.

Part 5: The Reward Engine

File: server/rubric.py

The reward function is arguably the most important part of any RL environment. A bad reward function trains bad agents. Let's understand every piece.

Two Kinds of Rewards

Our environment gives rewards at two different times:

Per-step rewards (during the episode):

Every experiment gives information gain reward
Redundant experiments get penalized

End-of-episode rewards (when the agent submits its hypothesis):

Accuracy, precision, calibration, efficiency, contradiction checks

Per-Step: InfoGainTracker

This tracks which variable pairs (edges) the agent has probed:

tracker = InfoGainTracker()

# First time probing Alpha -> Beta: +0.20
reward, redundant = tracker.record_and_score("Alpha", "Beta", "intervention", 5.0)
# reward = 0.20, redundant = False

# Second time, different experiment type (triangulation!): +0.25
reward, redundant = tracker.record_and_score("Alpha", "Beta", "correlation", [1,10,5])
# reward = 0.25, redundant = False  (BONUS for using different experiment type!)

# Third time: only +0.05
# Fourth time: -0.10 (PENALTY)

The reward schedule:

Visit #	Same type	Different type	Purpose
1st	+0.20	+0.20	Reward exploration
2nd	+0.12	+0.25	Reward triangulation
3rd	+0.05	+0.05	Diminishing returns
4th+	-0.10	-0.10	Punish redundancy

Why this design? In real science, repeating the exact same experiment is wasteful. But using a different method to study the same relationship (triangulation) is valuable because it confirms findings. Our reward function teaches the agent this lesson.
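The schedule in the table is simple enough to re-implement in a few lines. The class below is a hypothetical sketch, not the project's InfoGainTracker: it assumes "different type" means different from every earlier visit to that edge, and that only the penalised visits (4th onward) are flagged as redundant.

```python
from collections import defaultdict


class SimpleInfoGainTracker:
    """Hypothetical re-implementation of the per-step reward schedule."""

    def __init__(self):
        # (cause, effect) -> list of experiment types used so far
        self._visits = defaultdict(list)

    def record_and_score(self, cause, effect, experiment_type, value=None):
        # `value` mirrors the original signature but is unused in this sketch.
        seen = self._visits[(cause, effect)]
        visit_number = len(seen) + 1
        if visit_number == 1:
            reward = 0.20                    # reward exploration
        elif visit_number == 2:
            # triangulation bonus only if the method is new for this edge
            reward = 0.25 if experiment_type not in seen else 0.12
        elif visit_number == 3:
            reward = 0.05                    # diminishing returns
        else:
            reward = -0.10                   # punish redundancy
        seen.append(experiment_type)
        # Assumption: only penalised visits count as redundant.
        return reward, visit_number >= 4


tracker = SimpleInfoGainTracker()
print(tracker.record_and_score("Alpha", "Beta", "intervention", 5.0))
print(tracker.record_and_score("Alpha", "Beta", "correlation", [1, 10, 5]))
print(tracker.record_and_score("Alpha", "Gamma", "intervention", 5.0))
```

Note that the counter is per edge: probing Alpha -> Gamma for the first time earns the full exploration reward even after Alpha -> Beta has been visited.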

Try it yourself:

from server.rubric import InfoGainTracker

tracker = InfoGainTracker()
for i in range(5):
    reward, redundant = tracker.record_and_score("A", "B", "intervention", 1.0)
    print(f"Visit {i+1}: reward={reward:+.2f}, redundant={redundant}")

print(f"\nCumulative info gain: {tracker.cumulative_gain:.2f}")
print(f"Redundant experiments: {tracker.redundant_count}")

End-of-Episode: score_hypothesis()

When the agent submits, five scoring components fire:

1. Accuracy Score (0.0 - 1.0)

How much of the ground truth did the agent discover?

For single-parent rules, the scorer checks:

Did the hypothesis mention both the cause and effect variable names? (+0.4 per rule)
Did it identify the relationship type (linear, quadratic, saturating, etc.)? (+0.3 per rule)
Did it include the correct numerical parameters? (+0.3 per rule)

For interaction rules, the scorer checks:

Did the hypothesis mention the effect and at least one cause? (+0.3)
Did it mention both causes? (+0.2 additional)
Did it identify the interaction type (additive, multiplicative, etc.)? (+0.5)

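Example: if the ground truth is Beta = 2.0 * Alpha + 3.0 and the agent writes "Beta increases linearly with Alpha at a slope of 2.0", it names both variables, identifies the linear type, and gets the slope right, so it scores well on all three checks. Stating the intercept (3.0) as well would complete the parameter check.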

Each of the 8 rule types has its own set of keywords the scorer recognises (e.g. "saturating", "plateau", "asymptote" for saturating rules; "quadratic", "squared", "parabola" for quadratic).
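Here is a rough sketch of how keyword-based scoring of a single-parent rule could work. It is not the project's scorer: the linear keyword list, the partial credit for parameters, the 10% tolerance and the `score_single_rule` name are all assumptions made for illustration.

```python
import re

# The saturating and quadratic keywords come from the text above;
# the linear entry is an assumption made for this sketch.
RULE_KEYWORDS = {
    "linear": ["linear", "linearly", "proportional"],
    "quadratic": ["quadratic", "squared", "parabola"],
    "saturating": ["saturating", "plateau", "asymptote"],
}


def score_single_rule(hypothesis, cause, effect, rule_type, params, tol=0.1):
    """Score one single-parent rule against a free-text hypothesis (0.0 - 1.0)."""
    text = hypothesis.lower()
    score = 0.0

    # +0.4: both variable names are mentioned
    if cause.lower() in text and effect.lower() in text:
        score += 0.4

    # +0.3: any keyword for the true rule type appears
    if any(word in text for word in RULE_KEYWORDS.get(rule_type, [])):
        score += 0.3

    # +0.3: scaled by the fraction of true parameters found among the numbers
    # (partial credit and the tolerance are assumptions of this sketch)
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    if params:
        matched = sum(
            any(abs(n - p) <= tol * max(abs(p), 1.0) for n in numbers)
            for p in params.values()
        )
        score += 0.3 * matched / len(params)

    return round(score, 2)


truth = {"a": 2.0, "b": 3.0}
vague = score_single_rule("Alpha affects Beta", "Alpha", "Beta", "linear", truth)
partial = score_single_rule(
    "Beta increases linearly with Alpha at a slope of 2.0",
    "Alpha", "Beta", "linear", truth)
exact = score_single_rule(
    "Beta = 2.0 * Alpha + 3.0, a linear relationship",
    "Alpha", "Beta", "linear", truth)
print(vague, partial, exact)
```

One known weakness of this naive number extraction: digits inside variable names such as V1 would also be picked up as numbers, so a real scorer needs to be more careful.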

2. Precision Bonus (+0.10)

Does the hypothesis contain actual numbers? "Alpha affects Beta" scores 0. "Beta = 2.0 * Alpha + 3.0" scores +0.10. This rewards agents that make falsifiable, quantitative claims instead of vague hand-waving.

3. Calibration Score (0.0 - 0.20)

When the agent submits, it also reports a confidence level (0.0 to 1.0). Calibration measures how well that confidence matches the actual accuracy:

calibration = 0.20 * (1 - |confidence - accuracy| / 0.5)

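If the agent says confidence=0.9 but accuracy=0.2, that's overconfident: the gap of 0.7 exceeds 0.5, so the formula goes negative and the score sits at the bottom of the 0.0 - 0.20 range. If confidence=0.3 and accuracy=0.2, the gap is only 0.1, giving 0.20 * (1 - 0.1 / 0.5) = 0.16, which is well-calibrated. This teaches agents to know what they don't know.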
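The calibration formula is easy to check by hand or in code. The sketch below assumes the result is floored at 0, which is inferred from the stated 0.0 - 0.20 range rather than confirmed from the rubric source; `calibration_score` here is a standalone helper, not the project's function.

```python
def calibration_score(confidence: float, accuracy: float) -> float:
    """0.20 * (1 - |confidence - accuracy| / 0.5), floored at 0 (assumed)."""
    raw = 0.20 * (1.0 - abs(confidence - accuracy) / 0.5)
    return round(max(0.0, raw), 4)


print(calibration_score(0.9, 0.2))  # overconfident
print(calibration_score(0.3, 0.2))  # well calibrated
print(calibration_score(0.5, 0.5))  # perfect match
```

Any gap of 0.5 or more between confidence and accuracy earns nothing, so a badly miscalibrated agent gains no calibration reward whether it is overconfident or underconfident.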

4. Efficiency Bonus (+0.15)

If the agent submits early (30%+ budget remaining) with decent accuracy (60%+), it gets a bonus. This rewards agents that don't waste time running unnecessary experiments.

5. Contradiction Penalty (-0.50)

If the hypothesis contradicts the experimental setup (e.g., claiming "all variables are independent" or "no causal relationship exists"), it gets a harsh penalty. This teaches agents not to give up without trying.
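The last two components are simple threshold checks, sketched below. The function names are hypothetical, the phrase list contains only the two examples quoted above (the real rubric may match more), and treating the thresholds as inclusive is an assumption based on the "30%+" and "60%+" wording.

```python
CONTRADICTION_PHRASES = (
    "all variables are independent",
    "no causal relationship exists",
)


def efficiency_bonus(accuracy, budget_remaining, budget_total):
    """+0.15 if at least 30% of the budget is left and accuracy is at least 0.6."""
    if budget_total <= 0:
        return 0.0
    early = budget_remaining / budget_total >= 0.30
    return 0.15 if early and accuracy >= 0.60 else 0.0


def contradiction_penalty(hypothesis):
    """-0.50 if the hypothesis denies that any causal structure exists."""
    text = hypothesis.lower()
    return -0.50 if any(p in text for p in CONTRADICTION_PHRASES) else 0.0


print(efficiency_bonus(accuracy=0.8, budget_remaining=4, budget_total=10))
print(efficiency_bonus(accuracy=0.8, budget_remaining=2, budget_total=10))
print(contradiction_penalty("I believe all variables are independent."))
```

Note that the efficiency bonus needs both conditions: submitting early with a poor hypothesis earns nothing, so the agent can't farm the bonus by giving up immediately.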

Try it yourself:

import numpy as np
from server.causal_world import CausalWorld, CausalRule
from server.rubric import score_hypothesis

rule = CausalRule("Alpha", "Beta", "linear",
                  {"a": 2.0, "b": 3.0},
                  "Beta = 2.0 * Alpha + 3.0")

world = CausalWorld(
    domain="system_alpha",
    variables=["Alpha", "Beta"],
    units={"Alpha": "units", "Beta": "units"},
    rules=[rule],
    default_values={"Alpha": 5.0, "Beta": 13.0},
    rng=np.random.default_rng(0),
)

# Good hypothesis
result = score_hypothesis(
    "Beta = 2.0 * Alpha + 3.0. Linear relationship.",
    ["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
    world=world,
    budget_remaining=4,
    budget_total=10,
)
print(f"Accuracy:     {result.accuracy_score:.2f}")
print(f"Precision:    {result.precision_bonus:.2f}")
print(f"Calibration:  {result.calibration_score:.2f}")
print(f"Efficiency:   {result.efficiency_bonus:.2f}")
print(f"Contradiction:{result.contradiction_penalty:.2f}")
print(f"TOTAL:        {result.total:.2f}")
print(f"\nFeedback: {result.feedback}")

Part 6: The Environment Core

File: server/hypothesis_lab_environment.py

This is the central nervous system. It ties together the hidden world, the rubric, and the data models.

The Class Structure

class HypothesisLabEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS = True  # Multiple agents can play at once

    def __init__(self, **kwargs):
        # Initialize empty state -- no episode running yet
        self._world = None           # The hidden causal graph
        self._tracker = None         # InfoGainTracker for per-step rewards
        self._step_count = 0
        self._budget_remaining = 0
        self._done = True            # No episode until reset() is called
        self._history = []           # Log of all experiments
        ...

reset() -- Starting a New Episode

def reset(self, seed=None, episode_id=None, **kwargs):
    # 1. Read difficulty parameters
    noise_level = kwargs.get("noise_level", "medium")  # low/medium/high
    domain = kwargs.get("domain", None)                 # system_alpha/beta/gamma/delta

    # 2. Look up noise and budget from schedule tables
    sigma = NOISE_SCHEDULE[noise_level]    # low=0.05, medium=0.20, high=0.50
    budget = BUDGET_SCHEDULE[noise_level]  # low=12,   medium=10,   high=8
    n_vars = N_VARIABLES_SCHEDULE[noise_level]  # low=2, medium=3, high=4

    # 3. Generate a fresh hidden world (abstract variable names, 8+ rule types)
    self._world = generate_world(n_variables=n_vars, domain=domain, seed=seed)

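    # 4. Initialize tracking (clear anything left over from a previous episode)
    self._tracker = InfoGainTracker()
    self._budget_remaining = budget
    self._step_count = 0
    self._history = []
    self._noise_level = noise_level
    self._done = False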

    # 5. Return initial observation (variable names, budget, instructions)
    return HypLabObservation(
        system_message=f"New episode started. You have {n_vars} unknown variables...",
        available_variables=self._world.variables,
        budget_remaining=budget,
        done=False,
        reward=0.0,
    )

Key insight: reset() generates a new hidden world every time. The agent never carries knowledge between episodes. Each episode is an independent puzzle.

step() -- Processing an Action

def step(self, action: HypLabAction, **kwargs):
    if self._done:
        raise RuntimeError("Episode is done. Call reset().")

    self._step_count += 1

    if action.action_type == ActionType.EXPERIMENT:
        return self._handle_experiment(action)
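    elif action.action_type == ActionType.SUBMIT:
        return self._handle_submit(action)
    else:
        # Defensive: fail loudly instead of silently returning None
        raise ValueError(f"Unknown action_type: {action.action_type!r}")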

There are only two things the agent can do: run an experiment, or submit a hypothesis. This is a clean action space -- no ambiguity about what actions are valid.

_handle_experiment() -- Running an Experiment

This is the longest method. Here's what it does:

Validate the variable names (are they real variables in this world?)
Route to the right query method based on experiment type
Format the result as human-readable text (for the LLM to read)
Score the information gain via InfoGainTracker
Deduct budget
Check if budget is exhausted
Return observation with all the details

_handle_submit() -- Grading the Hypothesis

Mark episode as done
Call score_hypothesis() from the rubric
Format the rubric breakdown as text
Return observation with scores and revealed ground truth

Key insight: the ground truth is only revealed after submission. This prevents the agent from cheating.

state -- Episode Metadata

@property
def state(self) -> HypLabState:
    return HypLabState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        budget_remaining=self._budget_remaining,
        noise_level=self._noise_level,
        experiment_history=self._history,  # What experiments ran so far
        ...
    )

Critical rule: state must NEVER leak the hidden world. No rule types, no parameters, no ground truth. Only metadata the agent already knows.

Try the full loop yourself:

from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment

env = HypothesisLabEnvironment()

# Start a new episode
obs = env.reset(seed=42, noise_level="low", domain="system_alpha")
print("=== RESET ===")
print(obs.system_message)
print()

# Run an experiment
vars_ = obs.available_variables
action = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.INTERVENTION,
    control_variable=vars_[0],
    target_variable=vars_[1],
    control_value=5.0,
)
obs = env.step(action)
print("=== EXPERIMENT ===")
print(obs.system_message)
print(f"Info gain: {obs.info_gain_reward}")
print()

# Try a correlation sweep
action2 = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.CORRELATION,
    control_variable=vars_[0],
    control_range=[1.0, 10.0, 5.0],
    target_variable=vars_[1],
)
obs = env.step(action2)
print("=== CORRELATION ===")
print(obs.system_message)
print()

# Submit hypothesis
submit = HypLabAction(
    action_type=ActionType.SUBMIT,
    hypothesis_text=f"{vars_[1]} is linearly related to {vars_[0]} with slope ~2.0",
    hypothesis_equations=[f"{vars_[1]} = 2.0 * {vars_[0]} + 3.0"],
    confidence=0.75,
)
obs = env.step(submit)
print("=== SUBMIT ===")
print(obs.system_message)

Part 7: The Data Models

File: models.py

This file defines the language the agent and environment speak. Every piece of data that crosses the boundary must be one of these types.

Why Pydantic?

Pydantic gives us:

Validation -- if the agent sends control_value="hello" instead of a number, it gets a clear error
Serialization -- objects convert to/from JSON automatically for HTTP transport
Documentation -- every field has a type and a description
IDE support -- autocomplete and type checking

The Import Pattern

try:
    from openenv.core.env_server.types import Action, Observation, State
except ImportError:
    # Fallback for when openenv-core isn't installed
    from pydantic import BaseModel
    class Action(BaseModel): ...
    class Observation(BaseModel): ...
    class State(BaseModel): ...

This pattern lets the code work both:

In production (with openenv-core installed)
In development/testing (without it)

The Enums

class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"
    COUNTERFACTUAL = "counterfactual"
    PASSIVE = "passive"

class ActionType(str, Enum):
    EXPERIMENT = "experiment"
    SUBMIT = "submit"

class NoiseLevelTag(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

Using str, Enum means these serialize as simple strings in JSON: "intervention" instead of ExperimentType.INTERVENTION. This makes the API friendly for LLM agents that output raw JSON.
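This behaviour is easy to verify with the standard library alone. The snippet below redefines a two-member copy of ExperimentType so that it runs on its own.

```python
import json
from enum import Enum


class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"


# Because the enum subclasses str, json.dumps emits the plain string value.
payload = json.dumps({"experiment_type": ExperimentType.INTERVENTION})
print(payload)

# Going the other way, the raw string from an LLM's JSON maps back to a member.
parsed = ExperimentType(json.loads(payload)["experiment_type"])
print(parsed is ExperimentType.INTERVENTION)

# An unknown string is rejected instead of slipping through silently.
try:
    ExperimentType("guess")
except ValueError as err:
    print(f"rejected: {err}")
```

A plain `Enum` without the `str` mixin would make `json.dumps` raise a TypeError, which is why the mixin matters for an HTTP API.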

HypLabAction -- What the Agent Sends

The action model is polymorphic -- it handles two different use cases in one object:

# Use case 1: Run an experiment
HypLabAction(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)

# Use case 2: Submit a hypothesis
HypLabAction(
    action_type="submit",
    hypothesis_text="Beta = 2.0 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
)

The experiment fields are Optional so they can be None when submitting, and vice versa. This is a common pattern in RL environments where the action space has distinct modes.
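One consequence of the all-Optional design is that the type system alone can't tell a complete action from an incomplete one, so some cross-field check is useful. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model; `SketchAction` and its required-field rules are hypothetical, and the real HypLabAction may validate differently or not at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchAction:
    """Stand-in for HypLabAction using a dataclass instead of Pydantic."""
    action_type: str                               # "experiment" or "submit"
    # experiment-mode fields
    experiment_type: Optional[str] = None
    control_variable: Optional[str] = None
    control_value: Optional[float] = None
    target_variable: Optional[str] = None
    # submit-mode fields
    hypothesis_text: Optional[str] = None
    hypothesis_equations: Optional[List[str]] = None
    confidence: Optional[float] = None

    def __post_init__(self):
        # Cross-field check: each mode requires its own fields to be set.
        if self.action_type == "experiment":
            required = ("experiment_type", "target_variable")
        elif self.action_type == "submit":
            required = ("hypothesis_text", "confidence")
        else:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        missing = [name for name in required if getattr(self, name) is None]
        if missing:
            raise ValueError(f"{self.action_type} action is missing {missing}")


ok = SketchAction(action_type="experiment", experiment_type="passive",
                  target_variable="Beta")
print(ok.action_type, ok.target_variable)

try:
    SketchAction(action_type="submit", hypothesis_text="Beta = 2.0 * Alpha + 3.0")
except ValueError as err:
    print(f"rejected: {err}")
```

The alternative design is two separate action classes, one per mode. A single class keeps the JSON schema flat, which is simpler for an LLM to emit, at the cost of needing a check like this one.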

HypLabObservation -- What Comes Back

Observations are rich and multi-purpose:

Always present: system_message, available_variables, budget_remaining, done, reward
After experiments: result_value, noise_sigma, info_gain_reward, is_redundant
After submission: accuracy_score, total_episode_reward, ground_truth_revealed

The system_message field is crucial -- it's the human-readable text that an LLM agent reads (e.g. "Set Alpha=5.0, observed Beta=13.04"). The structured fields are for programmatic access.

HypLabState -- Episode Metadata

class HypLabState(State):
    budget_total: int = 0
    budget_remaining: int = 0
    noise_level: NoiseLevelTag = NoiseLevelTag.MEDIUM
    experiment_history: list[dict] = []
    cumulative_info_gain: float = 0.0
    redundant_experiment_count: int = 0

Notice what's NOT here: no rules, no default_values, no ground_truth. The state is safe to show to the agent without leaking the answer.

Part 8: The Server

File: server/app.py

This is the thinnest file in the project, and that's by design.

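from openenv.core.env_server.http_server import create_app

from models import HypLabAction, HypLabObservation
from server.hypothesis_lab_environment import HypothesisLabEnvironment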

app = create_app(
    HypothesisLabEnvironment,  # The environment class
    HypLabAction,              # What the agent sends
    HypLabObservation,         # What comes back
    env_name="hypothesis_lab",
    max_concurrent_envs=200,
)

create_app() does all the heavy lifting:

Creates FastAPI routes: /reset, /step, /state, /health, /schema
Handles session management (multiple agents playing at once)
Serializes/deserializes Pydantic models to/from JSON
Adds WebSocket support for persistent connections

You almost never need to touch this file. The magic is in create_app().

The HTTP Endpoints

Endpoint	Method	What it does
`/health`	GET	Returns `{"status": "ok"}` -- for Docker healthchecks
`/reset`	POST	Starts a new episode, returns initial observation
`/step`	POST	Sends an action, returns observation + reward
`/state`	GET	Returns current episode metadata
`/schema`	GET	Returns JSON schemas for Action/Observation

Running the Server

cd hypothesis_lab   # the project root shown in Part 3
uvicorn server.app:app --port 8000

Then in another terminal:

curl http://localhost:8000/health
# {"status": "ok"}

curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"noise_level": "low", "domain": "system_alpha", "seed": 42}'

Part 9: The Client

File: client.py

The client is the agent's friendly interface to the server. Instead of constructing raw HTTP requests, the agent gets nice typed methods.

class HypothesisLabEnv(EnvClient[HypLabAction, HypLabObservation, HypLabState]):

The EnvClient base class handles:

WebSocket connections (persistent, faster than HTTP polling)
Automatic reconnection
JSON serialization

Our client adds convenience methods:

await env.run_intervention("Alpha", 5.0, "Beta")
await env.run_correlation("Alpha", [1, 10, 5], "Beta")
await env.run_counterfactual("Alpha", 3.0, "Beta")
await env.run_passive("Beta")
await env.submit_hypothesis("Beta = 2.0 * Alpha + 3.0", confidence=0.85)

Each method constructs the right HypLabAction internally so the agent doesn't have to remember the field names.

The Three Abstract Methods

Every EnvClient subclass must implement:

def _step_payload(self, action):
    """Convert a HypLabAction into a JSON-ready dict."""
    return action.model_dump(exclude_none=True)

def _parse_result(self, payload):
    """Convert a JSON dict from the server into a StepResult."""
    obs = HypLabObservation(**payload)
    return StepResult(observation=obs, reward=..., done=...)

def _parse_state(self, payload):
    """Convert a JSON dict into a HypLabState."""
    return HypLabState(**payload)

Part 10: Tasks and Graders

Files: tasks/task_easy.py, task_medium.py, task_hard.py

The hackathon rules require a minimum of 3 tasks, each with a programmatic grader that returns a score between 0.0 and 1.0.

What is a Task?

A task is a configuration dict that says "run the environment with these settings":

TASK_EASY = {
    "id": "easy",
    "name": "Easy -- Single-Edge Discovery",
    "description": "Discover the causal relationship between two abstract variables...",
    "difficulty": "easy",
    "reset_kwargs": {
        "noise_level": "low",        # sigma = 0.05
        "domain": "system_alpha",    # abstract domain
        "seed": 42,                  # deterministic for reproducibility
    },
}

What is a Grader?

A grader takes the episode results and returns a normalized score:

def grade_easy(episode_result: dict) -> float:
    accuracy = episode_result.get("accuracy_score", 0.0)
    efficiency = episode_result.get("efficiency_bonus", 0.0)
    calibration = episode_result.get("calibration_score", 0.0)

    raw = (
        0.60 * min(accuracy, 1.0)               # 60% weight on accuracy
        + 0.20 * min(efficiency / 0.15, 1.0)     # 20% weight on efficiency
        + 0.20 * min(calibration / 0.20, 1.0)    # 20% weight on calibration
    )

    return round(max(0.0, min(1.0, raw)), 4)

Difficulty Progression

Setting	Easy	Medium	Hard
Variables	2	3	4
Noise (sigma)	0.05	0.20	0.50
Budget	12	10	8
Domain	system_alpha (fixed)	Random	Random
Key challenge	Single edge	Multiple edges + interactions	Complex graph + confounders + noise

The hard task is genuinely hard for frontier models:

4 variables means up to 6 possible edges to discover
Rules can be any of 8 types (not just linear!) plus interaction rules
High noise + hidden confounders make every observation unreliable
Only 8 experiments to figure it all out
Abstract variable names prevent exploiting pretrained knowledge

Try it yourself:

from tasks.task_easy import grade_easy

# Perfect episode
score = grade_easy({
    "accuracy_score": 1.0,
    "efficiency_bonus": 0.15,
    "calibration_score": 0.20,
})
print(f"Perfect score: {score}")  # 1.0

# Mediocre episode
score = grade_easy({
    "accuracy_score": 0.4,
    "efficiency_bonus": 0.0,
    "calibration_score": 0.05,
})
print(f"Mediocre score: {score}")  # ~0.29

# Zero effort
score = grade_easy({})
print(f"Zero score: {score}")  # 0.0

Part 11: The Baseline Agent

File: baseline_inference.py

This script demonstrates that the environment works end to end by running a real LLM agent against all three tasks.

The Flow

1. Create an OpenAI client (reads OPENAI_API_KEY from env)
2. For each of the 3 tasks:
   a. Create a fresh HypothesisLabEnvironment
   b. Call reset() with the task's settings
   c. Enter a loop (max 8 turns):
      - Send the observation to the LLM as a "user" message
      - Parse the LLM's response into a HypLabAction
      - Call step(action)
      - If done, break
   d. If not done after 8 turns, force a submit
   e. Grade the episode with the task's grader
3. Print all scores
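
The flow above can be sketched as a short loop. This is a minimal illustration, not the contents of baseline_inference.py: ToyEnv, StepResult, and never_submits are hypothetical stand-ins for the real environment, its observation, and the LLM call, and the reward values are made up. It shows the two parts that are easy to get wrong: breaking when the episode is done, and forcing a submit when the turn cap is reached.

```python
from dataclasses import dataclass
from typing import Callable

MAX_TURNS = 8  # turn cap from the flow above


@dataclass
class StepResult:
    message: str
    reward: float
    done: bool


class ToyEnv:
    """Hypothetical stand-in for HypothesisLabEnvironment (not the real class)."""

    def reset(self) -> StepResult:
        return StepResult("Episode started.", 0.0, False)

    def step(self, action: dict) -> StepResult:
        if action["action_type"] == "submit":
            return StepResult("Hypothesis graded.", 0.5, True)
        return StepResult("Experiment recorded.", 0.1, False)


def run_episode(env: ToyEnv, policy: Callable[[str], dict]) -> float:
    """Run one episode; force a submit if the policy never submits."""
    obs = env.reset()
    total = obs.reward
    for _ in range(MAX_TURNS):
        action = policy(obs.message)  # real script: LLM call + action parsing
        obs = env.step(action)
        total += obs.reward
        if obs.done:
            break
    if not obs.done:
        # Step 2d: out of turns, so submit whatever we have.
        obs = env.step({"action_type": "submit"})
        total += obs.reward
    return round(total, 4)


def never_submits(message: str) -> dict:
    """Hypothetical policy that only ever runs experiments."""
    return {"action_type": "experiment"}


print(run_episode(ToyEnv(), never_submits))  # 1.3 (8 experiments + forced submit)
```

Passing the policy in as a plain callable keeps the loop testable without an API key: swap in a scripted policy for tests and the LLM-backed one for real runs.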

The System Prompt

The system prompt teaches the LLM how to interact with the environment:

You are a scientific AI assistant trained to discover hidden causal rules.
...
Format your actions as JSON:
{"action_type": "experiment", "experiment_type": "intervention", ...}
...
Strategy tips:
- Run interventions first to discover which variables are causally connected
- Vary the control variable widely (e.g. 1, 5, 10) to detect nonlinearity
- Don't repeat the same experiment -- redundant experiments are penalised

The Action Parser

LLMs don't always produce perfect JSON. The parser handles multiple formats:

JSON in code blocks: ```json {...} ```
Raw JSON: {...}
Natural language: "I conclude that Beta = 2 * Alpha" (extracted via regex)
Timeout: if it's the last turn, force a submit with whatever text the LLM wrote
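
A tolerant parser along these lines might look like the sketch below. It is an assumption-laden illustration rather than the parser in baseline_inference.py: the function name parse_action, the regular expressions, and the shape of the returned dict are all hypothetical. It tries the four formats in the order listed above and falls through to the next one when a format does not match.

```python
import json
import re
from typing import Optional

# `{3} matches three backticks, i.e. the opening/closing of a code block.
CODE_BLOCK = re.compile(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", re.DOTALL)
RAW_JSON = re.compile(r"\{.*\}", re.DOTALL)
CONCLUSION = re.compile(r"I conclude that\s+(.+)", re.IGNORECASE)


def _try_json(text: str) -> Optional[dict]:
    """Return the decoded object if text is a JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def parse_action(reply: str, last_turn: bool = False) -> Optional[dict]:
    """Return an action dict, or None if nothing usable was found."""
    # 1. JSON inside a fenced code block.
    match = CODE_BLOCK.search(reply)
    if match:
        action = _try_json(match.group(1))
        if action is not None:
            return action
    # 2. Raw JSON anywhere in the reply.
    match = RAW_JSON.search(reply)
    if match:
        action = _try_json(match.group(0))
        if action is not None:
            return action
    # 3. A natural-language conclusion becomes a submit.
    match = CONCLUSION.search(reply)
    if match:
        return {"action_type": "submit", "hypothesis_text": match.group(1).strip()}
    # 4. Last turn: submit whatever text the model wrote.
    if last_turn:
        return {"action_type": "submit", "hypothesis_text": reply.strip()}
    return None


print(parse_action("I conclude that Beta = 2 * Alpha"))
```

Returning None (rather than raising) lets the caller decide whether to re-prompt the model or count the turn as wasted.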

Running It

export OPENAI_API_KEY=sk-...
python baseline_inference.py

Expected output:

============================================================
  Scientific Hypothesis Lab -- Baseline Inference
  Model: gpt-4o-mini
============================================================

--- Task: Easy -- Single-Edge Discovery ---
    Total episode reward: +0.6100
    Graded score:         0.6500

--- Task: Medium -- Multi-Edge Discovery ---
    Total episode reward: +0.3800
    Graded score:         0.4000

--- Task: Hard -- Complex Graph Under Noise ---
    Total episode reward: +0.2100
    Graded score:         0.2500

============================================================
  SUMMARY
============================================================
  easy    : 0.6500
  medium  : 0.4000
  hard    : 0.2500
  average : 0.4333

Part 12: Testing

File: tests/test_environment.py

39 tests organized into 5 test classes. Run them with:

pytest tests/ -v

Test Classes

Class	Tests	What it covers
TestCausalWorld	18	World generation, all 8 rule types, interactions, domains, seeds, abstract names
TestInfoGainTracker	4	Reward schedule, redundancy, triangulation
TestRubric	6	Accuracy scoring, calibration, efficiency, feedback
TestEnvironmentIntegration	6	Full episodes, budget exhaustion, errors, state leaks
TestGraders	5	Grader range [0,1], zero input, perfect input

Key Tests to Study

Seed reproducibility -- same seed produces same world:

world1 = generate_world(n_variables=3, domain="system_alpha", seed=99)
world2 = generate_world(n_variables=3, domain="system_alpha", seed=99)
assert world1.variables == world2.variables

Variable names are abstract -- no real-world names that give LLMs prior knowledge:

for seed in range(50):
    world = generate_world(n_variables=4, seed=seed)
    for v in world.variables:
        assert v.lower() not in {"temperature", "pressure", "price", ...}

State doesn't leak secrets:

st = env.state
state_str = str(st.model_dump())
assert "rule_type" not in state_str
assert "params" not in state_str

Diverse rule types over many seeds -- across 100 seeds we should see at least 5 distinct rule types:

types_seen = set()
for seed in range(100):
    world = generate_world(n_variables=3, seed=seed)
    for rule in world.rules:
        types_seen.add(rule.rule_type)
assert len(types_seen) >= 5

Grader always returns a score in [0, 1]:

score = grade_easy({"accuracy_score": 1.0, "efficiency_bonus": 0.15, ...})
assert 0.0 <= score <= 1.0
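
A single perfect input is a weak check of the range guarantee. The sketch below pushes random, deliberately out-of-range inputs through the grader instead. grade_easy is copied from Part 10 so the snippet runs without the project installed; the sampling ranges and the iteration count are arbitrary choices made for this illustration.

```python
import random


def grade_easy(episode_result: dict) -> float:
    # Copied from Part 10 so this snippet is self-contained.
    accuracy = episode_result.get("accuracy_score", 0.0)
    efficiency = episode_result.get("efficiency_bonus", 0.0)
    calibration = episode_result.get("calibration_score", 0.0)
    raw = (
        0.60 * min(accuracy, 1.0)
        + 0.20 * min(efficiency / 0.15, 1.0)
        + 0.20 * min(calibration / 0.20, 1.0)
    )
    return round(max(0.0, min(1.0, raw)), 4)


rng = random.Random(0)  # seeded so the check itself is reproducible
for _ in range(1000):
    result = {
        "accuracy_score": rng.uniform(-2.0, 2.0),      # includes invalid values
        "efficiency_bonus": rng.uniform(-1.0, 1.0),
        "calibration_score": rng.uniform(-1.0, 1.0),
    }
    score = grade_easy(result)
    assert 0.0 <= score <= 1.0
    assert grade_easy(result) == score  # same input, same score

print("grade_easy stayed within [0, 1] for 1000 random inputs")
```

The negative inputs matter: without the outer max(0.0, ...) clamp, a negative accuracy would drag the raw score below zero.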

Part 13: Deployment

Dockerfile

The Dockerfile uses a multi-stage build:

Stage 1 (builder):
  - Start from OpenEnv base image
  - Copy source code
  - Install uv (Python package manager)
  - Run uv sync to install dependencies
  - This creates a .venv with all packages

Stage 2 (runtime):
  - Start from a clean base image
  - Copy only the .venv and source code (not build tools)
  - Set PATH and PYTHONPATH
  - Run uvicorn to start the server

Step 1: Build the Docker Image

cd Lab-experiment
docker build -t hypothesis-lab .

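This takes 2-5 minutes the first time (it downloads the base image and installs dependencies). Subsequent builds are fast thanks to layer caching. With the legacy builder you should see Successfully tagged hypothesis-lab:latest at the end; with BuildKit (the default in recent Docker releases) the final lines instead report naming the image hypothesis-lab:latest.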

If the build fails, check:

pyproject.toml has build-backend = "setuptools.build_meta" (not the experimental setuptools.backends path)
.dockerignore excludes .venv/, __pycache__/, .git/

Step 2: Run the Container

docker run -p 8000:8000 hypothesis-lab

You should see uvicorn start up:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

To run in the background (detached mode):

docker run -d --name hyp-lab -p 8000:8000 hypothesis-lab

Step 3: Verify the Server is Running

Open a new terminal and run:

curl http://localhost:8000/health

Expected response:

{"status":"ok"}

Step 4: Check the API Schema

curl -s http://localhost:8000/schema | python3 -m json.tool

This returns the JSON Schema definitions for HypLabAction and HypLabObservation, useful for understanding what fields exist.

Step 5: Understand HTTP vs WebSocket

Critical concept: The OpenEnv server has two communication modes:

The HTTP /reset and /step are stateless: each request creates a brand-new environment instance and destroys it after responding. If you curl /reset then curl /step, the step hits a different environment that was never reset -- so it fails. Multi-step episodes require the WebSocket endpoint (/ws), which keeps one environment alive for the entire connection.
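
The difference can be reproduced without a server. The toy model below is only an illustration of the behavior described above, not OpenEnv's implementation: ToyEnv, http_request, and WebSocketSession are hypothetical names. The stateless handler builds a new environment for every request, while the session object keeps one alive.

```python
class ToyEnv:
    """Hypothetical minimal environment: step() only works after reset()."""

    def __init__(self) -> None:
        self.active = False

    def reset(self) -> str:
        self.active = True
        return "ok: episode started"

    def step(self) -> str:
        if not self.active:
            return "Error: No active episode. Call reset() first."
        return "ok: stepped"


def http_request(kind: str) -> str:
    """Stateless handling: every request gets a brand-new environment."""
    env = ToyEnv()
    return env.reset() if kind == "reset" else env.step()


class WebSocketSession:
    """Stateful handling: one environment lives for the whole connection."""

    def __init__(self) -> None:
        self.env = ToyEnv()

    def send(self, kind: str) -> str:
        return self.env.reset() if kind == "reset" else self.env.step()


print(http_request("reset"))  # ok: episode started
print(http_request("step"))   # Error: No active episode. Call reset() first.

session = WebSocketSession()
print(session.send("reset"))  # ok: episode started
print(session.send("step"))   # ok: stepped
```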

Endpoint	Type	Stateful?	Use case
`/health`	GET	No	Check if server is alive
`/schema`	GET	No	Inspect action/observation schemas
`/reset`	POST	No -- creates a fresh env, returns result, destroys env	One-shot inspection
`/step`	POST	No -- creates a fresh env (never reset!), tries to step, fails	Don't use for episodes
`/ws`	WebSocket	Yes -- persistent connection, one env for the whole episode	Use this for episodes

This is why a bare curl to /step cannot advance an episode -- the server-side environment has no world to step in. Rather than crashing or returning an empty response, our environment returns a clear error:

{"observation": {"system_message": "Error: No active episode. Call reset() first.", "done": true, "reward": -1.0}, ...}

Step 6: Run a Full Episode (Python script)

The proper way to interact is via WebSocket. The EnvClient class handles this automatically; the script below uses the raw websockets library instead so you can see the protocol itself. Save it as test_docker.py and run it while the container is running:

import asyncio
import json
import websockets

async def run_episode():
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:

        # 1. Reset
        await ws.send(json.dumps({
            "type": "reset",
            "data": {"noise_level": "low", "domain": "system_alpha", "seed": 42}
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Started ===")
        print(f"Variables: {obs['available_variables']}")
        print(f"Budget: {obs['budget_remaining']}")
        print()

        variables = obs["available_variables"]
        cause, effect = variables[0], variables[1]

        # 2. Intervention experiment
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "intervention",
                "control_variable": cause,
                "control_value": 5.0,
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Intervention] Set {cause}=5.0 -> {effect}={obs['result_value']}")
        print(f"  Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 3. Correlation sweep
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "correlation",
                "control_variable": cause,
                "control_range": [0.5, 20.0, 8],
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Correlation] Swept {cause} from 0.5 to 20.0:")
        if isinstance(obs["result_value"], list):
            for point in obs["result_value"]:
                print(f"  {cause}={point[0]:.1f} -> {effect}={point[1]:.4f}")
        print(f"  Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 4. Submit hypothesis
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "submit",
                "hypothesis_text": f"{effect} depends linearly on {cause}.",
                "hypothesis_equations": [f"{effect} = 2.0 * {cause} + 1.0"],
                "confidence": 0.6,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Finished ===")
        print(f"Accuracy:      {obs.get('accuracy_score')}")
        print(f"Precision:     {obs.get('precision_bonus')}")
        print(f"Calibration:   {obs.get('calibration_score')}")
        print(f"Efficiency:    {obs.get('efficiency_bonus')}")
        print(f"Contradiction: {obs.get('contradiction_penalty')}")
        print(f"TOTAL REWARD:  {obs.get('total_episode_reward')}")
        print()
        print(f"Ground truth:\n{obs.get('ground_truth_revealed')}")

asyncio.run(run_episode())

Run it:

pip install websockets    # one-time install
python test_docker.py

Example output (the numbers shown are illustrative; your values will differ):

=== Episode Started ===
Variables: ['Quant_A', 'Quant_E']
Budget: 12

[Intervention] Set Quant_A=5.0 -> Quant_E=3.4521
  Info gain: 0.12, Budget left: 11

[Correlation] Swept Quant_A from 0.5 to 20.0:
  Quant_A=0.5 -> Quant_E=7.8123
  Quant_A=3.3 -> Quant_E=4.2341
  ...
  Info gain: 0.10, Budget left: 10

=== Episode Finished ===
Accuracy:      0.35
Precision:     0.0
Calibration:   0.14
Efficiency:    0.15
Contradiction: 0.0
TOTAL REWARD:  0.86

Ground truth:
Domain: system_alpha
  Quant_E = 1.11 * exp(-0.16 * Quant_A)

Key insight from the WebSocket protocol:

Send messages as {"type": "reset", "data": {...}} and {"type": "step", "data": {...}}

The action fields go directly inside "data" (no extra "action" wrapper)

Responses come back as {"type": "observation", "data": {"observation": {...}, "reward": ..., "done": ...}}

The observation fields live at resp["data"]["observation"] -- note the double nesting
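
These framing rules are easy to wrap in three small helpers. The function names below are hypothetical (they are not part of OpenEnv or EnvClient), and the sample reply is hand-written to match the response shape described above rather than captured from a server.

```python
import json


def reset_message(**reset_kwargs) -> str:
    """Build a reset frame; kwargs become the "data" payload."""
    return json.dumps({"type": "reset", "data": reset_kwargs})


def step_message(**action_fields) -> str:
    """Build a step frame; action fields sit directly in "data"."""
    return json.dumps({"type": "step", "data": action_fields})


def unwrap_observation(raw: str) -> dict:
    """Observation fields are nested twice: data -> observation."""
    return json.loads(raw)["data"]["observation"]


# Hand-written reply in the documented shape (not real server output).
sample_reply = json.dumps({
    "type": "observation",
    "data": {"observation": {"budget_remaining": 11}, "reward": 0.12, "done": False},
})

print(reset_message(noise_level="low", domain="system_alpha", seed=42))
print(step_message(action_type="submit", confidence=0.6))
print(unwrap_observation(sample_reply)["budget_remaining"])  # 11
```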

Understanding the Observation Fields

On reset, most fields are null -- only setup information is populated:

Field	What it tells you
`system_message`	Human-readable summary -- the LLM agent reads this
`available_variables`	Variable names to use in experiments
`budget_remaining`	Number of experiment steps left
`result_value`	`null` on reset; float or `[[x,y],...]` list after experiments
`noise_sigma`	`null` on reset; shown per-experiment so you know measurement precision
`done`	`false` until you submit or budget runs out
`reward`	Reward for this step (0.0 on reset)
`accuracy_score` ... `ground_truth_revealed`	All `null` until you submit your hypothesis

After submit, the scoring fields light up:

Field	Meaning
`accuracy_score`	How close your hypothesis matches the true rules (0-1)
`precision_bonus`	Bonus for getting coefficients/parameters right
`calibration_score`	How well your confidence matches your actual accuracy
`efficiency_bonus`	Reward for using fewer budget steps
`contradiction_penalty`	Deducted if your hypothesis contradicts your own data
`total_episode_reward`	Sum of all info gain rewards + final rubric score
`ground_truth_revealed`	The actual hidden rules -- study this to improve!
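
As a worked example of how the fields combine, the snippet below adds up the numbers from the sample output in Step 6. It assumes the final rubric score is the plain sum of the five scoring fields; that assumption reproduces the sample total but is not confirmed by the rubric code shown here.

```python
# Info gain rewards from the two experiments in the Step 6 sample output.
info_gain_rewards = [0.12, 0.10]

# Scoring fields from the same sample output.
rubric = {
    "accuracy_score": 0.35,
    "precision_bonus": 0.0,
    "calibration_score": 0.14,
    "efficiency_bonus": 0.15,
    "contradiction_penalty": 0.0,
}

# Assumption: the final rubric score is the sum of its components.
final_rubric_score = sum(rubric.values())
total_episode_reward = sum(info_gain_rewards) + final_rubric_score

print(round(final_rubric_score, 2))    # 0.64
print(round(total_episode_reward, 2))  # 0.86
```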

Design note: Why don't we reveal the exact noise sigma upfront?

The system message says "Noise level: low" but does NOT say "sigma=0.05". In real science you have to estimate measurement uncertainty from repeated measurements. This forces the agent to run a few repeat experiments to gauge noise before trusting single data points. The qualitative label (low/medium/high) sets expectations without handing out a free number. The exact sigma IS shown per-experiment in the noise_sigma field -- that's fine because by then the agent has already spent a budget step.
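
Estimating noise from repeats is a small calculation. The sketch below simulates it with the standard library only: run_experiment is a hypothetical stand-in for an intervention on the example system Beta = 2.0 * Alpha + 3.0, and TRUE_SIGMA exists only to generate the fake data. An agent would compute the same statistics over the result_value fields it receives.

```python
import random
import statistics

TRUE_SIGMA = 0.05  # used only to simulate data; the agent does not know it upfront


def run_experiment(rng: random.Random, control_value: float) -> float:
    """Hypothetical noisy measurement of Beta = 2.0 * Alpha + 3.0."""
    return 2.0 * control_value + 3.0 + rng.gauss(0.0, TRUE_SIGMA)


rng = random.Random(7)
repeats = [run_experiment(rng, 5.0) for _ in range(5)]  # same setting, 5 times

estimated_mean = statistics.mean(repeats)
estimated_sigma = statistics.stdev(repeats)  # sample standard deviation

print(f"mean={estimated_mean:.3f}  estimated sigma={estimated_sigma:.3f}")
```

Every repeat costs a budget step, so there is a real trade-off: with a budget of 8, five repeats at one setting leave only three experiments for everything else.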

Error Handling

The environment returns error observations (not crashes) for bad actions:

Situation	Response	Reward
Step without reset	`"Error: No active episode. Call reset() first."`	`-1.0`, `done=true`
Step after episode ended	`"Error: Episode is already done."`	`0.0`, `done=true`
Unknown variable name	`"Error: Unknown control variable 'X'."`	`-0.05`, budget deducted
Unknown experiment type	`"Error: Unknown experiment type..."`	`-0.05`
Unknown action type	`"Error: Unknown action_type..."`	`-0.05`, budget deducted

The small negative reward (-0.05) for invalid actions teaches RL agents to produce valid requests without being so harsh that it dominates the reward signal.

Stopping the Container

# If running in foreground: Ctrl+C

# If running in background:
docker stop hyp-lab
docker rm hyp-lab

Troubleshooting

Problem	Fix
`port is already allocated`	Another process uses port 8000. Use `-p 8001:8000` and hit `localhost:8001` instead
`curl: (7) Failed to connect`	Container isn't running yet. Wait a few seconds for uvicorn to start
`{"detail":"Not Found"}`	You hit the wrong endpoint. Use `/health`, `/schema`, `/reset`, `/step`, `/state` over HTTP, or `/ws` for WebSocket episodes
Container exits immediately	Check logs: `docker logs hyp-lab`. Usually a missing dependency

Deploying to HF Spaces

openenv push --org your-org --token $HF_TOKEN

The README.md has Hugging Face Spaces metadata in its YAML frontmatter:

---
title: Scientific Hypothesis Lab
emoji: 🔬
sdk: docker
app_port: 8000
tags:
  - openenv
---

This tells HF Spaces to build the Docker image and expose port 8000.

Part 14: Hands-On Exercises

Now it's your turn. These exercises go from easy to hard.

Exercise 1: Explore a World (5 min)

from server.causal_world import generate_world

# Generate 3 different worlds and print their ground truth
for seed in [1, 2, 3]:
    world = generate_world(n_variables=3, domain="system_gamma", seed=seed)
    print(f"\n=== Seed {seed} ===")
    print(f"Variables: {world.variables}")
    print(f"Interactions: {len(world.interactions)}")
    print(f"Confounder sigma: {world.confounder_sigma}")
    print(world.ground_truth_summary())

Questions to answer:

How many rules does each world have? What types?
Do any worlds have interaction rules or confounders?
Are variable names abstract (no real-world physics terms)?

Exercise 2: Play a Full Episode (10 min)

from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment

env = HypothesisLabEnvironment()
obs = env.reset(seed=100, noise_level="medium", domain="system_beta")
print(obs.system_message)

# YOUR TURN: Run 3-4 experiments, then submit a hypothesis.
# Try to get the highest accuracy score you can.
# Hint: use CORRELATION to see the relationship shape,
#   then test at extreme values to distinguish linear from quadratic/saturating.

Exercise 3: Break the Rubric (10 min)

Try to get edge-case scores:

Get accuracy_score = 0.0 (submit empty hypothesis)
Get contradiction_penalty = -0.50 (claim "no causal relationship exists")
Get efficiency_bonus = 0.15 (submit early with high accuracy)
Get calibration_score = 0.20 (match your confidence to your accuracy perfectly)

Exercise 4: Add a New Rule Type (20 min)

The environment already has 8 rule types, but you can add more! Try adding a sinusoidal rule:

Formula: y = a * sin(k * x) + b
Add it to CausalRule.evaluate()
Add it to RULE_TYPES and _random_rule() with appropriate weights
Add keywords to _RULE_KEYWORDS in rubric.py
Test it with a hand-crafted world

Exercise 5: Add a New Variable Pool (10 min)

Add a new abstract variable pool to ABSTRACT_VAR_POOLS in causal_world.py:

Use creative abstract names (e.g., colour names: "Red", "Blue", "Green", "Amber", "Violet")
Make sure they carry no scientific meaning

Exercise 6: Write a Smarter Baseline Agent (30 min)

Modify baseline_inference.py to implement a better strategy:

First, run passive observations on all variables
Then run interventions between each pair to find which are connected
Use wide correlation sweeps (1 to 100) to check for curvature, saturation, or breakpoints
Test at x=0.5 and x=50 to distinguish linear from exponential/logarithmic
If the data suggests two parents, try holding one constant while varying the other
Submit with well-calibrated confidence

Part 15: Golden Rules for Building Environments

These are the principles that separate good environments from great ones.

Rule 1: The Agent Should Never See the Answer

The hidden world, ground truth rules, and correct parameters must NEVER appear in observations or state before the agent submits. This is the most common mistake beginners make.

Bad:

def reset(self):
    return Observation(hint=f"The slope is {self.world.rules[0].params['a']}")

Good:

def reset(self):
    return Observation(system_message="Run experiments to discover the hidden rules.")

Rule 2: Reward Shaping > Sparse Rewards

A reward function that only gives +1 at the end provides almost no learning signal: the agent cannot tell which of its experiments helped. It needs signal throughout the episode.

Bad:

def step(self, action):
    if action.type == "submit":
        return Observation(reward=1.0 if correct else 0.0, done=True)
    return Observation(reward=0.0)  # No signal during experiments!

Good:

def step(self, action):
    if action.type == "experiment":
        info_gain = self.tracker.record(action)
        return Observation(reward=info_gain)  # Signal at every step!
    elif action.type == "submit":
        return Observation(reward=self.rubric.score(action), done=True)  # Final score ends the episode

Rule 3: Deterministic Seeds for Reproducibility

Every random element must be controlled by a seed. If two runs with the same seed produce different results, scores are not comparable across runs and failures cannot be reproduced.

def generate_world(seed=42):
    py_rng = random.Random(seed)      # Controls structure
    np_rng = np.random.default_rng(seed)  # Controls noise
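
A self-contained version of the same idea is sketched below. toy_world is a hypothetical stand-in for generate_world(), and it uses the standard library's random.Random for both structure and noise (the fragment above uses NumPy for noise) so that it runs without extra dependencies.

```python
import random


def toy_world(seed: int = 42) -> dict:
    """Hypothetical stand-in for generate_world(): all randomness comes from seed."""
    structure_rng = random.Random(seed)  # controls structure (here: the slope)
    noise_rng = random.Random(seed)      # controls measurement noise
    slope = round(structure_rng.uniform(0.5, 3.0), 2)
    noise = [round(noise_rng.gauss(0.0, 0.05), 4) for _ in range(3)]
    return {"slope": slope, "noise": noise}


# Same seed, same world: this is the property graders rely on.
assert toy_world(99) == toy_world(99)
print(toy_world(99))
```

Note that neither generator touches the module-level random state, so other code calling random.random() cannot disturb the world.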

Rule 4: Observations Should Be LLM-Friendly

If your agent is an LLM, the observation needs a human-readable text field. Don't just return a dict of numbers.

Bad:

return Observation(result={"x": 5.0, "y": 13.04, "sigma": 0.05})

Good:

return Observation(
    system_message="[Step 1] Set Alpha=5.0, observed Beta=13.04 (sigma=0.05)",
    result_value=13.04,
    noise_sigma=0.05,
)

Rule 5: Validate All Agent Input

Never trust the agent. It will send garbage, typos, and adversarial inputs.

if cause not in world.variables:
    return self._error_obs(f"Unknown variable '{cause}'. Available: {world.variables}")

Rule 6: Clean Episode Boundaries

reset() must produce a completely clean state. No leftover data from previous episodes.

def reset(self):
    self._world = generate_world(...)  # Fresh world
    self._tracker = InfoGainTracker()  # Fresh tracker
    self._history = []                 # Fresh history
    self._done = False                 # Episode is active

Rule 7: Budget/Step Limits Prevent Infinite Episodes

Always have a mechanism to end the episode. Either a budget that runs out, or a maximum step count.
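
A budget check only needs a counter and a done flag. The sketch below is a hypothetical toy (BudgetedEnv is not the real environment class, and the observation is a plain dict); it shows the episode ending either on submit or when the budget reaches zero, and later steps being rejected.

```python
class BudgetedEnv:
    """Hypothetical toy environment that ends when the budget hits zero."""

    def __init__(self, budget: int = 8) -> None:
        self.budget = budget
        self.done = False

    def step(self, action_type: str) -> dict:
        if self.done:
            return {"message": "Error: Episode is already done.", "done": True}
        if action_type == "submit":
            self.done = True
            return {"message": "Hypothesis graded.", "done": True}
        self.budget -= 1
        if self.budget == 0:
            self.done = True  # budget exhausted: the episode ends regardless
        return {"message": f"Budget left: {self.budget}", "done": self.done}


env = BudgetedEnv(budget=3)
results = [env.step("experiment") for _ in range(4)]
print([r["done"] for r in results])  # [False, False, True, True]
print(results[-1]["message"])        # Error: Episode is already done.
```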

Rule 8: The Hard Task Must Be Actually Hard

If your hard task is easy for GPT-4, the judges will notice. Design it so that even frontier models score 0.2-0.4 on the hard task. Our hard task uses 4 variables, noise with sigma=0.50, hidden confounders, interaction rules, and a budget of only 8 experiments.

Rule 8.5: Don't Let LLMs Cheat with Prior Knowledge

If your environment uses real-world variable names (Temperature, Pressure, Price, Demand), LLM agents will use pretrained knowledge instead of reasoning from data. Use abstract names (Alpha, Beta, V1, V2) to force genuine discovery. Similarly, don't use only 3 rule types -- the agent will memorize the template set. Use enough variety that template-matching fails.

Rule 9: Graders Must Be Deterministic

Given the same episode_result dict, a grader must always return the same score. No randomness, no external API calls, no time-dependent logic.

Rule 10: State Metadata Only

The state property returns metadata, not secrets. It's for debugging, logging, and agent introspection -- never for leaking the answer.

Part 16: How to Build Your Own From Scratch

Here's the step-by-step recipe for creating a new OpenEnv environment.

Step 1: Choose Your Domain

Pick a real-world task humans actually do:

Email triage
Code review
Data cleaning
Scheduling
Customer support
Medical diagnosis
Financial analysis

Step 2: Define the Action Space

What can the agent do? Write it out in plain English first:

The agent can:
1. Read an email subject and preview
2. Assign a priority (high/medium/low)
3. Assign a label (bug/feature/question/spam)
4. Flag for human review

Then convert to a Pydantic model:

from typing import Optional

class EmailAction(Action):
    action_type: str  # "classify" or "flag"
    priority: Optional[str] = None
    label: Optional[str] = None
    flag_reason: Optional[str] = None

Step 3: Define the Observation Space

What does the agent see after each action?

class EmailObservation(Observation):
    system_message: str
    email_subject: str
    email_preview: str
    emails_remaining: int
    # ... (inherits done, reward from Observation)

Step 4: Build the Hidden World

What's the ground truth the agent is trying to discover/solve? This is your "puzzle generator."

Step 5: Build the Reward Function

Design rewards that teach the right behavior:

Correct classification: +1.0
Partially correct: +0.5
Wrong but not harmful: -0.1
Flagging spam as high priority: -0.5
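
The reward table above could be turned into a function along these lines. This is a sketch under assumptions: triage_reward and its parameters are hypothetical names, and "partially correct" is interpreted here as getting exactly one of label and priority right, which the list above does not spell out.

```python
def triage_reward(true_label: str, true_priority: str,
                  label: str, priority: str) -> float:
    """Hypothetical per-email reward using the values listed above."""
    if true_label == "spam" and priority == "high":
        return -0.5  # harmful: spam escalated to high priority
    label_ok = label == true_label
    priority_ok = priority == true_priority
    if label_ok and priority_ok:
        return 1.0   # correct classification
    if label_ok or priority_ok:
        return 0.5   # partially correct (assumed: exactly one field right)
    return -0.1      # wrong but not harmful


print(triage_reward("bug", "high", "bug", "high"))      # 1.0
print(triage_reward("bug", "high", "bug", "low"))       # 0.5
print(triage_reward("bug", "high", "question", "low"))  # -0.1
print(triage_reward("spam", "low", "spam", "high"))     # -0.5
```

Checking the harmful case first means it cannot be masked by a partially correct label.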

Step 6: Write the Environment Class

class EmailTriageEnvironment(Environment):
    def reset(self, **kwargs):
        # Generate a batch of emails
        # Return the first email as an observation
        ...

    def step(self, action):
        # Grade the agent's classification
        # Move to next email or end episode
        ...

    @property
    def state(self):
        # Return progress metadata
        ...

Step 7: Wire Up the Server

app = create_app(
    EmailTriageEnvironment,
    EmailAction,
    EmailObservation,
    env_name="email_triage",
)

Step 8: Define 3 Tasks

TASK_EASY = {"id": "easy", "reset_kwargs": {"n_emails": 5, "spam_ratio": 0.5}}
TASK_MEDIUM = {"id": "medium", "reset_kwargs": {"n_emails": 10, "spam_ratio": 0.2}}
TASK_HARD = {"id": "hard", "reset_kwargs": {"n_emails": 20, "spam_ratio": 0.05}}

Step 9: Write the Baseline

Use the OpenAI API to run a simple agent and produce baseline scores.

Step 10: Write Tests

Minimum tests:

reset() produces valid observation
step() with valid action works
step() with invalid action returns error
Episode ends when expected
State doesn't leak secrets
Graders return a score in [0, 1]
Seeds produce deterministic results

Step 11: Write the Dockerfile

Copy our Dockerfile template. Change the CMD to point to your server module.

Step 12: Write openenv.yaml

spec_version: 1
name: your_env_name
type: space
runtime: fastapi
app: server.app:app
port: 8000

Step 13: Write the README

Include HF Spaces frontmatter, environment description, action/observation docs, task descriptions, and baseline scores.

Congratulations

You've read through the entire Scientific Hypothesis Lab codebase and understand:

What RL environments are and how agents interact with them
The OpenEnv contract: reset/step/state, Action/Observation/State, openenv.yaml
How hidden worlds work: causal graphs with 8+ rule types, interaction rules, confounders, abstract variable names
Why abstract variable names matter: prevents LLMs from using pretrained knowledge as a shortcut
How reward functions are designed: info gain, accuracy (across all rule types + interactions), calibration, efficiency, contradiction
How the server works: create_app() wraps everything in HTTP endpoints
How clients connect: typed methods over WebSocket
How tasks and graders work: difficulty progression, deterministic scoring [0, 1]
How baseline agents work: LLM + system prompt + action parsing
How to test: 39 tests covering every component including all rule types
How to deploy: Docker + HF Spaces
The golden rules for building great environments (including anti-cheating via abstract naming)
How to build your own from scratch in 13 steps

You are now qualified to build, debug, explain, and teach RL environments. Go build something amazing.

The Complete Guide to Building RL Environments with OpenEnv
