# The Complete Guide to Building RL Environments with OpenEnv
**A follow-along tutorial using the Scientific Hypothesis Lab**
By the end of this tutorial you will be able to:
- Explain what an RL environment is and why it matters
- Read and understand every file in this project
- Build your own OpenEnv environment from scratch
- Design reward functions that actually train good agents
- Deploy your environment to Hugging Face Spaces
- Explain all of this to anyone who asks
---
## Table of Contents
1. [Part 1: The Big Picture](#part-1-the-big-picture)
2. [Part 2: The OpenEnv Contract](#part-2-the-openenv-contract)
3. [Part 3: Tour of Every File](#part-3-tour-of-every-file)
4. [Part 4: The Hidden World (causal_world.py)](#part-4-the-hidden-world)
5. [Part 5: The Reward Engine (rubric.py)](#part-5-the-reward-engine)
6. [Part 6: The Environment Core (hypothesis_lab_environment.py)](#part-6-the-environment-core)
7. [Part 7: The Data Models (models.py)](#part-7-the-data-models)
8. [Part 8: The Server (app.py)](#part-8-the-server)
9. [Part 9: The Client (client.py)](#part-9-the-client)
10. [Part 10: Tasks and Graders](#part-10-tasks-and-graders)
11. [Part 11: The Baseline Agent (baseline_inference.py)](#part-11-the-baseline-agent)
12. [Part 12: Testing](#part-12-testing)
13. [Part 13: Deployment](#part-13-deployment)
14. [Part 14: Hands-On Exercises](#part-14-hands-on-exercises)
15. [Part 15: Golden Rules for Building Environments](#part-15-golden-rules)
16. [Part 16: How to Build Your Own From Scratch](#part-16-build-your-own)
---
## Part 1: The Big Picture
### What is Reinforcement Learning?
Imagine teaching a dog a trick. You can't explain the trick in English. Instead, you:
1. Let the dog **try something** (an action)
2. **Show it the result** (an observation)
3. Give it a **treat or a scolding** (a reward)
4. Repeat
The dog learns by trial and error. That's reinforcement learning (RL).
In RL, there are two players:
```
┌─────────┐    action     ┌─────────────┐
│  AGENT  │ ────────────> │ ENVIRONMENT │
│  (dog)  │ <──────────── │   (world)   │
└─────────┘  observation  └─────────────┘
               + reward
```
- **Agent**: the AI that learns (an LLM, a neural network, etc.)
- **Environment**: the world the agent lives in (our code!)
### What is an "Environment" in code?
An environment is a Python class with three methods:
```python
class MyEnvironment:
    def reset(self):
        """Start a new episode. Return the first observation."""
        ...

    def step(self, action):
        """Agent does something. Return what happened + reward."""
        ...

    # Note: OpenEnv exposes state as a property rather than a call (see Part 2).
    def state(self):
        """Return metadata about the current episode."""
        ...
```
That's it. Those three methods are the entire interface between the agent and the world.
### What is OpenEnv?
OpenEnv is a **standard** for RL environments. Think of it like USB for hardware -- it doesn't matter what device you plug in, as long as it follows the USB spec. OpenEnv says:
- Your `reset()` must return an `Observation` object
- Your `step()` must accept an `Action` object and return an `Observation`
- Your `state` must return a `State` object
- These objects must be Pydantic models (typed, validated Python objects)
- You must have an `openenv.yaml` manifest file
- You must serve your environment over HTTP (FastAPI)
Why bother with a standard? Because it means **any agent** can talk to **any environment** without custom glue code.
### What does OUR environment do?
Our environment is called the **Scientific Hypothesis Lab**. Here's the idea:
> The agent is a scientist. Each episode, it faces a hidden causal system
> (like "Beta = 2.0 * Alpha + 3.0"). The variables are **abstract** --
> named things like Alpha, Beta, Gamma or V1, V2, V3 -- so the agent
> can't rely on pretrained knowledge of real-world physics. It must
> reason purely from experimental data.
Think of it like a detective game:
- The "crime" is hidden causal rules between variables
- The "clues" are noisy experimental results
- The "solution" is a written hypothesis
- The "score" is how close the hypothesis matches reality
This is a **real-world** task -- it models how actual scientists discover causal relationships. Using abstract variable names ensures the agent genuinely *discovers* rules rather than recalling them from training data.
---
## Part 2: The OpenEnv Contract
Before we look at code, let's understand the contract every OpenEnv environment must fulfill.
### The Three Methods
```
reset(**kwargs) -> Observation
    "Start fresh. Generate a new puzzle. Tell the agent what it sees."

step(action: Action) -> Observation
    "The agent did something. Process it. Tell the agent what happened."

state -> State  (property, not a method call)
    "Return metadata about the current episode. Never leak secrets."
```
### The Three Data Types
Every OpenEnv environment defines three Pydantic models that inherit from base types:
| Type | Base Class | Purpose | Who sees it |
|------|-----------|---------|-------------|
| **Action** | `openenv.core.Action` | What the agent sends | Agent -> Environment |
| **Observation** | `openenv.core.Observation` | What comes back | Environment -> Agent |
| **State** | `openenv.core.State` | Episode metadata | Anyone (debugging) |
The `Observation` base class always includes:
- `done: bool` -- is the episode over?
- `reward: float | None` -- how well did the agent do on this step?
The `State` base class always includes:
- `episode_id: str` -- unique ID for this episode
- `step_count: int` -- how many steps so far
### The Manifest (openenv.yaml)
Every environment needs a tiny YAML file:
```yaml
spec_version: 1 # Which version of the OpenEnv spec
name: hypothesis_lab # Machine-readable name
type: space # Deployed as an HF Space
runtime: fastapi # HTTP framework used
app: server.app:app # Python path to the ASGI app
port: 8000 # Port the server listens on
```
This is like a `package.json` for your environment -- it tells the OpenEnv tooling how to find and run your code.
### The Episode Lifecycle
Here's what one complete episode looks like:
```
1. Agent calls reset(noise_level="low", domain="system_alpha")
2. Environment generates a hidden world with random causal rules
3. Environment returns initial Observation (variable names, budget, instructions)
4. LOOP:
   a. Agent reads the observation
   b. Agent decides on an action (experiment or submit)
   c. Agent calls step(action)
   d. Environment processes the action
   e. Environment returns new Observation (results, reward)
   f. If observation.done == True, episode is over
5. Agent reads state (a property, not a call) to see final metadata
```
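The lifecycle above can be sketched end to end with a toy environment. Everything in this block is hypothetical: `GuessEnv`, its dict-based observations, and the bisection policy are stand-ins chosen to keep the example self-contained, whereas the real environment uses the Pydantic models described in Part 7.

```python
import random


class GuessEnv:
    """Toy environment: the agent must find a hidden integer in [1, 100]."""

    def reset(self, seed=None):
        self._rng = random.Random(seed)
        self._secret = self._rng.randint(1, 100)
        self._steps = 0
        self._done = False
        return {"message": "Guess a number between 1 and 100.", "done": False, "reward": 0.0}

    def step(self, guess):
        if self._done:
            raise RuntimeError("Episode is done. Call reset().")
        self._steps += 1
        if guess == self._secret:
            self._done = True
            return {"message": "correct", "done": True, "reward": 1.0}
        hint = "higher" if guess < self._secret else "lower"
        return {"message": hint, "done": False, "reward": -0.01}

    @property
    def state(self):
        # Metadata only: the secret is deliberately not exposed here.
        return {"step_count": self._steps, "done": self._done}


def run_episode(env, seed=0):
    """Drive one episode with a bisection policy; return total reward and final state."""
    obs = env.reset(seed=seed)
    low, high, total = 1, 100, 0.0
    while not obs["done"]:
        guess = (low + high) // 2
        obs = env.step(guess)
        total += obs["reward"]
        if obs["message"] == "higher":
            low = guess + 1
        elif obs["message"] == "lower":
            high = guess - 1
    return total, env.state


total, final_state = run_episode(GuessEnv(), seed=0)
print(total, final_state)
```

The loop never inspects the environment's internals; it only reads observations and `state`, which is the separation the OpenEnv contract is meant to enforce.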
---
## Part 3: Tour of Every File
Here is every file and what it does. Think of this as the map before we explore each room.
```
hypothesis_lab/
│
├── openenv.yaml                        # THE MANIFEST
│   "Hi, I'm an OpenEnv environment.    # Points the framework
│    Here's how to find my server."     # to server.app:app
│
├── models.py                           # THE LANGUAGE
│   "These are the words the agent      # HypLabAction
│    and environment use to talk."      # HypLabObservation
│                                       # HypLabState
│
├── server/                             # THE BRAIN
│   ├── app.py                          # HTTP server (thin wrapper)
│   ├── hypothesis_lab_environment.py   # Core game logic
│   ├── causal_world.py                 # Hidden puzzle generator
│   └── rubric.py                       # Scoring engine
│
├── tasks/                              # THE EXAM
│   ├── task_easy.py                    # Easy test + grader
│   ├── task_medium.py                  # Medium test + grader
│   └── task_hard.py                    # Hard test + grader
│
├── client.py                           # THE PHONE
│   "Typed Python client so agents      # Wraps HTTP calls
│    don't need to speak raw HTTP."     # into nice methods
│
├── baseline_inference.py               # THE DEMO AGENT
│   "Here's a simple GPT agent that     # Uses OpenAI API
│    can play the game. Not great,      # Produces reproducible
│    but proves the game works."        # scores on all 3 tasks
│
├── tests/                              # THE SAFETY NET
│   └── test_environment.py             # 39 tests covering
│                                       # every component
│
├── Dockerfile                          # THE SHIPPING BOX
│   "Packages everything into a         # Multi-stage build
│    container for deployment."         # OpenEnv base image
│
├── pyproject.toml                      # THE SHOPPING LIST
│   "What Python packages we need."     # Dependencies + metadata
│
└── README.md                           # THE COVER LETTER
    "What this environment is and       # HF Spaces frontmatter
     how to use it."                    # Action/observation docs
```
Now let's explore each room in detail.
---
## Part 4: The Hidden World
**File: `server/causal_world.py`**
This is the puzzle the agent must solve. Every episode generates a fresh hidden world.
### Core Concept: Causal Graphs
A causal graph is a set of variables connected by rules:
```
Alpha ──(quadratic)──> Beta ──(saturating)──> Gamma
 7.93     B = 0.5*A² + 1.2      G = 10*B / (3 + B)
```
The agent never sees this graph. It can only probe it through experiments.
### Why Abstract Variable Names?
An earlier version of this environment used real-world names like "Temperature", "Pressure", "Volume". This created a serious problem: LLM agents have *pretrained knowledge* about how those variables relate (PV=nRT, supply/demand curves, etc.). The agent would use that prior knowledge instead of reasoning from experimental data -- which defeats the entire purpose.
Now variables are named things like **Alpha, Beta, Gamma** or **V1, V2, V3** or **Quant_A, Quant_B, Quant_C**. The LLM has no prior about how "Alpha" relates to "Beta", so it must genuinely discover the relationship through experiments.
### The Building Blocks
**CausalRule** -- one edge in the graph:
```python
from dataclasses import dataclass

@dataclass
class CausalRule:
    cause: str         # "Alpha"
    effect: str        # "Beta"
    rule_type: str     # one of 8 types (see table below)
    params: dict       # {"a": 2.1, "b": 3.0}
    description: str   # "Beta = 2.1 * Alpha + 3.0"

    def evaluate(self, x: float) -> float:
        # Given x (the cause value), compute the effect value
        ...
```
There are **eight** single-parent rule types:
| Rule | Formula | What it looks like | Why it's tricky |
|------|---------|-------------------|-----------------|
| Linear | `y = a*x + b` | Straight line | Easy to identify |
| Threshold | `y = high if x > t else low` | Step function | Need to find the cutoff |
| Inverse | `y = a / x` | Hyperbola | Blows up near zero |
| Quadratic | `y = a*x² + b*x + c` | Parabola | Looks linear in narrow range |
| Exponential | `y = a * exp(k*x)` | Growth/decay curve | Looks linear locally |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns | Looks linear in mid-range |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau | Looks linear for small x |
| Piecewise-linear | Two slopes with a knot | Bent line | Looks linear on each side |
Many of these look similar with limited data. Quadratic, exponential, and saturating all resemble linear in a narrow range -- the agent must design experiments that *discriminate* between hypotheses (e.g., sampling at extremes to check for curvature).
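To make the "sample at the extremes" advice concrete, here is a minimal sketch that evaluates the formulas from the table and measures curvature with a second difference. The parameter key names (other than `a`, `b`, `v_max`, `k_m`, which appear elsewhere in this guide) and the `second_difference` helper are assumptions for illustration; piecewise-linear is omitted because the table gives no explicit formula for it.

```python
import math

# Formulas copied from the table; most parameter key names are assumptions.
RULES = {
    "linear":      lambda x, p: p["a"] * x + p["b"],
    "threshold":   lambda x, p: p["high"] if x > p["t"] else p["low"],
    "inverse":     lambda x, p: p["a"] / x,
    "quadratic":   lambda x, p: p["a"] * x**2 + p["b"] * x + p["c"],
    "exponential": lambda x, p: p["a"] * math.exp(p["k"] * x),
    "logarithmic": lambda x, p: p["a"] * math.log(x) + p["b"],
    "saturating":  lambda x, p: p["v_max"] * x / (p["k_m"] + x),
}


def second_difference(rule_type, params, x_lo, x_hi):
    """f(lo) - 2*f(mid) + f(hi): zero for a straight line, non-zero for a curve."""
    f = RULES[rule_type]
    x_mid = (x_lo + x_hi) / 2
    return f(x_lo, params) - 2 * f(x_mid, params) + f(x_hi, params)


sat = {"v_max": 10.0, "k_m": 3.0}
lin = {"a": 2.0, "b": 3.0}

narrow = second_difference("saturating", sat, 1.0, 1.2)    # barely curved
wide = second_difference("saturating", sat, 1.0, 20.0)     # clearly curved
straight = second_difference("linear", lin, 1.0, 20.0)     # a line never bends

print(f"narrow={narrow:.4f} wide={wide:.4f} linear={straight:.4f}")
```

Over the narrow range the saturating rule is almost indistinguishable from a line, especially once noise is added; over the wide range the curvature is large enough to rule out a linear hypothesis.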
**InteractionRule** -- a multi-parent edge where the effect depends on **two** causes:
```python
@dataclass
class InteractionRule:
    cause1: str             # "Alpha"
    cause2: str             # "Beta"
    effect: str             # "Gamma"
    interaction_type: str   # "additive", "multiplicative", "min", "max"
```
These are genuinely hard: the agent can't discover them by varying one variable at a time. It must realise that two parents jointly determine the effect.
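A small sketch shows why one-variable-at-a-time probing can mislead. The source names the four interaction types but not their formulas, so the definitions below (plain sum, product, `min`, `max`) are assumptions, as is the `sweep_one_parent` helper.

```python
# Assumed semantics: the guide names the four types but not their formulas.
INTERACTIONS = {
    "additive":       lambda x1, x2: x1 + x2,
    "multiplicative": lambda x1, x2: x1 * x2,
    "min":            min,
    "max":            max,
}


def sweep_one_parent(kind, x1_values, x2_fixed):
    """Vary cause1 while holding cause2 fixed, as a one-at-a-time experiment would."""
    f = INTERACTIONS[kind]
    return [f(x1, x2_fixed) for x1 in x1_values]


# With a "min" interaction and cause2 pinned low, cause1 appears to do nothing.
flat = sweep_one_parent("min", [5.0, 7.5, 10.0], x2_fixed=2.0)
# Raising cause2 reveals that cause1 matters after all.
revealed = sweep_one_parent("min", [5.0, 7.5, 10.0], x2_fixed=20.0)
print(flat, revealed)
```

An agent that only ran the first sweep would wrongly conclude that cause1 has no effect, which is why interaction rules call for experiments at more than one setting of the second parent.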
**Try it yourself** -- open a Python shell in the project directory:
```python
from server.causal_world import CausalRule
rule = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="linear", params={"a": 2.0, "b": 3.0},
    description="Beta = 2.0 * Alpha + 3.0"
)
print(rule.evaluate(0))     # 3.0  (y = 2*0 + 3)
print(rule.evaluate(5))     # 13.0 (y = 2*5 + 3)
print(rule.evaluate(10))    # 23.0 (y = 2*10 + 3)

# Try a saturating rule
sat = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="saturating", params={"v_max": 10.0, "k_m": 3.0},
    description="Beta = 10 * Alpha / (3 + Alpha)"
)
print(sat.evaluate(1))      # 2.5  (still growing)
print(sat.evaluate(10))     # 7.69 (approaching plateau)
print(sat.evaluate(1000))   # ~10  (saturated)
```
### CausalWorld -- the full hidden system
The `CausalWorld` holds all the variables, rules, interaction rules, and default values. It also tracks a **confounder_sigma** -- if > 0, a hidden variable injects correlated noise the agent can't explain.
It has four query methods -- one for each experiment type the agent can run:
```python
world.query_intervention(cause, value, effect, sigma)
# "Set Alpha to 5.0. What does Beta become?" (+ noise + confounder)
world.query_correlation(cause, [1, 10, 5], effect, sigma)
# "Sweep Alpha from 1 to 10 in 5 steps. Show me Beta at each."
world.query_counterfactual(cause, delta, effect, sigma)
# "If Alpha increases by +3.0, what happens to Beta?"
world.query_passive(target, sigma)
# "Just show me what Beta is right now, without changing anything."
```
Every result has **Gaussian noise** added. If sigma=0.05, the noise is tiny (easy mode). If sigma=0.50, the noise is huge (hard mode). On top of that, roughly 30% of worlds with three or more variables also carry hidden confounder noise.
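To get a feel for what the noise level does to an agent's estimates, here is a minimal sketch that simulates a correlation sweep over a linear rule and fits a line to it. It assumes the noise is additive with standard deviation `sigma` and that a range of `[1, 10, 5]` means start, stop, and number of points; `noisy_sweep` and `fit_line` are hypothetical helpers, not part of the project.

```python
import random


def noisy_sweep(slope, intercept, start, stop, steps, sigma, rng):
    """Return (x, y) pairs for a linear rule with additive Gaussian noise."""
    xs = [start + i * (stop - start) / (steps - 1) for i in range(steps)]
    return [(x, slope * x + intercept + rng.gauss(0.0, sigma)) for x in xs]


def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


rng = random.Random(0)
for sigma in (0.05, 0.50):
    a, b = fit_line(noisy_sweep(2.0, 3.0, 1.0, 10.0, 5, sigma, rng))
    print(f"sigma={sigma:.2f}: fitted slope={a:.3f}, intercept={b:.3f}")
```

With five points the slope is usually recovered well at low noise, while at high noise the estimate drifts noticeably, so the agent has to spend its limited budget more carefully.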
**Try it yourself:**
```python
from server.causal_world import generate_world
world = generate_world(n_variables=3, domain="system_alpha", seed=42)
print("Variables:", world.variables)
print("Ground truth:")
print(world.ground_truth_summary())
# Check for interactions and confounders
print(f"\nInteraction rules: {len(world.interactions)}")
print(f"Confounder sigma: {world.confounder_sigma}")
# Run an experiment
cause, effect = world.variables[0], world.variables[1]
result = world.query_intervention(cause, 5.0, effect, sigma=0.05)
print(f"\nSet {cause}=5.0, observed {effect}={result:.4f}")
```
### The generate_world() Function
This is the factory that builds a fresh puzzle:
1. Pick a domain (system_alpha/beta/gamma/delta) -- this only changes the context prompt
2. Pick an abstract variable pool (Greek letters, V1-V5, Quant_A-E, etc.)
3. Choose N variables and connect them with random rules (8 possible types)
4. Add extra random edges with 30% probability
5. Optionally replace some single-parent rules with multi-parent interaction rules (~40% chance when n >= 3)
6. Optionally add a hidden confounder (~30% chance when n >= 3)
7. Compute default values for all variables
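The steps above can be sketched roughly as follows. This is not the project's `generate_world()`: the chain topology, the forward-only extra edges, the `piecewise_linear` type name, the confounder strength of 0.1, and the dict return value are all assumptions made to keep the example short and self-contained.

```python
import random

RULE_TYPES = ["linear", "threshold", "inverse", "quadratic",
              "exponential", "logarithmic", "saturating", "piecewise_linear"]
VAR_POOLS = [
    ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    ["V1", "V2", "V3", "V4", "V5"],
]


def sketch_world(n_variables, seed=None):
    """Build a toy world description loosely following steps 2-6 above."""
    rng = random.Random(seed)
    variables = rng.choice(VAR_POOLS)[:n_variables]   # steps 2-3: pick names

    # Step 3: chain the variables so every non-root has at least one parent.
    edges = [(variables[i], variables[i + 1], rng.choice(RULE_TYPES))
             for i in range(n_variables - 1)]

    # Step 4: consider extra forward edges, each kept with 30% probability.
    for i in range(n_variables):
        for j in range(i + 2, n_variables):
            if rng.random() < 0.30:
                edges.append((variables[i], variables[j], rng.choice(RULE_TYPES)))

    # Steps 5-6: interactions and confounders only appear when n >= 3.
    has_interaction = n_variables >= 3 and rng.random() < 0.40
    confounder_sigma = 0.1 if (n_variables >= 3 and rng.random() < 0.30) else 0.0
    return {"variables": variables, "edges": edges,
            "has_interaction": has_interaction, "confounder_sigma": confounder_sigma}


print(sketch_world(3, seed=42))
```

Passing the same seed reproduces the same world, which is what makes graded tasks and regression tests repeatable.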
### Domains and Variable Pools
Domains provide different narrative prompts but use the same abstract variable names:
```python
DOMAIN_LABELS = {
    "system_alpha": {"context": "You are studying an unknown dynamical system..."},
    "system_beta": {"context": "You are investigating a black-box system..."},
    "system_gamma": {"context": "You are analysing an opaque process..."},
    "system_delta": {"context": "You are probing a simulated environment..."},
}

ABSTRACT_VAR_POOLS = [
    ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    ["Zeta", "Eta", "Theta", "Iota", "Kappa"],
    ["V1", "V2", "V3", "V4", "V5"],
    ["Rho", "Sigma", "Tau", "Upsilon", "Phi"],
    # ... more pools
]
```
Each episode randomly selects a pool, so the agent can't even memorise variable-name-to-position mappings across episodes.
---
## Part 5: The Reward Engine
**File: `server/rubric.py`**
The reward function is arguably the most important part of any RL environment. A bad reward function trains bad agents. Let's understand every piece.
### Two Kinds of Rewards
Our environment gives rewards at two different times:
**Per-step rewards** (during the episode):
- Every experiment gives information gain reward
- Redundant experiments get penalized
**End-of-episode rewards** (when the agent submits its hypothesis):
- Accuracy, precision, calibration, efficiency, contradiction checks
### Per-Step: InfoGainTracker
This tracks which variable pairs (edges) the agent has probed:
```python
tracker = InfoGainTracker()
# First time probing Alpha -> Beta: +0.20
reward, redundant = tracker.record_and_score("Alpha", "Beta", "intervention", 5.0)
# reward = 0.20, redundant = False
# Second time, different experiment type (triangulation!): +0.25
reward, redundant = tracker.record_and_score("Alpha", "Beta", "correlation", [1,10,5])
# reward = 0.25, redundant = False (BONUS for using different experiment type!)
# Third time: only +0.05
# Fourth time: -0.10 (PENALTY)
```
The reward schedule:
| Visit # | Same type | Different type | Purpose |
|---------|-----------|---------------|---------|
| 1st | +0.20 | +0.20 | Reward exploration |
| 2nd | +0.12 | +0.25 | Reward triangulation |
| 3rd | +0.05 | +0.05 | Diminishing returns |
| 4th+ | -0.10 | -0.10 | Punish redundancy |
**Why this design?** In real science, repeating the exact same experiment is wasteful. But using a *different* method to study the same relationship (triangulation) is valuable because it confirms findings. Our reward function teaches the agent this lesson.
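The schedule table can be re-implemented in a few lines, which makes the incentive structure easy to inspect. This is a sketch, not the project's `InfoGainTracker`: it drops the value argument, and it assumes that only the penalised fourth and later visits are flagged as redundant.

```python
from collections import defaultdict


class SketchInfoGainTracker:
    """Re-implements the schedule table; not the project's actual class."""

    def __init__(self):
        self._types_seen = defaultdict(list)   # (cause, effect) -> experiment types used
        self.cumulative_gain = 0.0
        self.redundant_count = 0

    def record_and_score(self, cause, effect, experiment_type):
        seen = self._types_seen[(cause, effect)]
        visit = len(seen) + 1
        if visit == 1:
            reward = 0.20
        elif visit == 2:
            # Triangulation bonus when the second probe uses a new method.
            reward = 0.25 if experiment_type not in seen else 0.12
        elif visit == 3:
            reward = 0.05
        else:
            reward = -0.10
        redundant = visit >= 4   # assumption: only penalised visits count as redundant
        seen.append(experiment_type)
        self.cumulative_gain += reward
        self.redundant_count += int(redundant)
        return reward, redundant


tracker = SketchInfoGainTracker()
for kind in ["intervention", "correlation", "intervention", "intervention"]:
    print(kind, tracker.record_and_score("Alpha", "Beta", kind))
```

Because rewards are keyed on the (cause, effect) pair, probing a new edge always pays more than a fourth look at an old one.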
**Try it yourself:**
```python
from server.rubric import InfoGainTracker
tracker = InfoGainTracker()
for i in range(5):
    reward, redundant = tracker.record_and_score("A", "B", "intervention", 1.0)
    print(f"Visit {i+1}: reward={reward:+.2f}, redundant={redundant}")
print(f"\nCumulative info gain: {tracker.cumulative_gain:.2f}")
print(f"Redundant experiments: {tracker.redundant_count}")
```
### End-of-Episode: score_hypothesis()
When the agent submits, five scoring components fire:
#### 1. Accuracy Score (0.0 - 1.0)
How much of the ground truth did the agent discover?
For **single-parent rules**, the scorer checks:
- Did the hypothesis mention both the cause and effect variable names? (+0.4 per rule)
- Did it identify the relationship type (linear, quadratic, saturating, etc.)? (+0.3 per rule)
- Did it include the correct numerical parameters? (+0.3 per rule)
For **interaction rules**, the scorer checks:
- Did the hypothesis mention the effect and at least one cause? (+0.3)
- Did it mention both causes? (+0.2 additional)
- Did it identify the interaction type (additive, multiplicative, etc.)? (+0.5)
Example: if the ground truth is `Beta = 2.0 * Alpha + 3.0` and the agent writes "Beta increases linearly with Alpha at a slope of 2.0", it scores high on all three checks.
Each of the 8 rule types has its own set of keywords the scorer recognises (e.g. "saturating", "plateau", "asymptote" for saturating rules; "quadratic", "squared", "parabola" for quadratic).
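As a rough illustration of the single-parent checks, the sketch below scores one rule against a free-text hypothesis. The 0.4 / 0.3 / 0.3 weights and the saturating and quadratic keywords come from this section; the `linear` keyword list, the numeric tolerance, and the proportional credit for partially matched parameters are assumptions, and the real `score_hypothesis()` may differ.

```python
import re

# Keyword lists for two rule types are taken from the text above.
TYPE_KEYWORDS = {
    "saturating": ["saturating", "plateau", "asymptote"],
    "quadratic": ["quadratic", "squared", "parabola"],
    "linear": ["linear"],   # assumption: the guide does not list these
}


def score_single_rule(hypothesis, cause, effect, rule_type, params, tol=0.15):
    """0.4 for naming both variables, 0.3 for the type, up to 0.3 for parameters."""
    text = hypothesis.lower()
    score = 0.0
    if cause.lower() in text and effect.lower() in text:
        score += 0.4
    if any(keyword in text for keyword in TYPE_KEYWORDS[rule_type]):
        score += 0.3
    # Assumed matching rule: a parameter counts if a number in the text is within tol of it.
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    hits = sum(any(abs(n - value) <= tol for n in numbers) for value in params.values())
    score += 0.3 * hits / len(params)
    return score


truth = {"a": 2.0, "b": 3.0}
print(score_single_rule("Beta increases linearly with Alpha at a slope of 2.0",
                        "Alpha", "Beta", "linear", truth))
print(score_single_rule("Beta = 2.0 * Alpha + 3.0. Linear relationship.",
                        "Alpha", "Beta", "linear", truth))
```

Under these assumptions the first hypothesis loses some parameter credit for omitting the intercept, while the second earns the full score for the rule.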
#### 2. Precision Bonus (+0.10)
Does the hypothesis contain actual numbers? "Alpha affects Beta" scores 0. "Beta = 2.0 * Alpha + 3.0" scores +0.10. This rewards agents that make **falsifiable, quantitative claims** instead of vague hand-waving.
#### 3. Calibration Score (0.0 - 0.20)
When the agent submits, it also reports a confidence level (0.0 to 1.0). Calibration measures how well that confidence matches the actual accuracy:
```
calibration = 0.20 * (1 - |confidence - accuracy| / 0.5)
```
If the agent says confidence=0.9 but accuracy=0.2, that's overconfident and scores low. If confidence=0.3 and accuracy=0.2, that's well-calibrated and scores high. This teaches agents to **know what they don't know**.
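The formula is easy to check by hand or in code. The sketch below adds one assumption: the result is clamped to the stated 0.0 to 0.20 range, since the raw formula goes negative once confidence and accuracy differ by more than 0.5.

```python
def calibration_score(confidence, accuracy, max_score=0.20, tolerance=0.5):
    """Linear penalty on |confidence - accuracy|, clamped to [0, max_score]."""
    raw = max_score * (1.0 - abs(confidence - accuracy) / tolerance)
    return max(0.0, min(max_score, raw))


print(calibration_score(0.3, 0.2))   # small gap, most of the 0.20 is kept
print(calibration_score(0.9, 0.2))   # gap above 0.5, clamped to zero
```

For the well-calibrated case the gap is 0.1, so the score is 0.20 * (1 - 0.1 / 0.5) = 0.16.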
#### 4. Efficiency Bonus (+0.15)
If the agent submits early (30%+ budget remaining) with decent accuracy (60%+), it gets a bonus. This rewards agents that don't waste time running unnecessary experiments.
#### 5. Contradiction Penalty (-0.50)
If the hypothesis contradicts the experimental setup (e.g., claiming "all variables are independent" or "no causal relationship exists"), it gets a harsh penalty. This teaches agents not to give up without trying.
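The three remaining components are simple threshold checks, sketched here under stated assumptions. The bonus and penalty values and the two contradiction phrases come from this section; the regular expression used to detect numbers and the exact comparison operators are guesses at reasonable behaviour rather than the project's code.

```python
import re

CONTRADICTION_PHRASES = ("all variables are independent", "no causal relationship exists")


def precision_bonus(hypothesis):
    """+0.10 if the text has a standalone number (digits inside names like V1 are ignored)."""
    return 0.10 if re.search(r"(?<!\w)\d+(?:\.\d+)?", hypothesis) else 0.0


def efficiency_bonus(accuracy, budget_remaining, budget_total):
    """+0.15 for submitting with at least 30% budget left and at least 60% accuracy."""
    early = budget_remaining / budget_total >= 0.30
    return 0.15 if early and accuracy >= 0.60 else 0.0


def contradiction_penalty(hypothesis):
    """-0.50 if the text denies any causal structure."""
    text = hypothesis.lower()
    return -0.50 if any(phrase in text for phrase in CONTRADICTION_PHRASES) else 0.0


print(precision_bonus("Beta = 2.0 * Alpha + 3.0"))
print(efficiency_bonus(accuracy=0.7, budget_remaining=4, budget_total=10))
print(contradiction_penalty("All variables are independent."))
```

The lookbehind in `precision_bonus` matters for this environment in particular, because variable pools such as V1 to V5 would otherwise earn the bonus without any quantitative claim.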
**Try it yourself:**
```python
import numpy as np
from server.causal_world import CausalWorld, CausalRule
from server.rubric import score_hypothesis
rule = CausalRule("Alpha", "Beta", "linear",
                  {"a": 2.0, "b": 3.0},
                  "Beta = 2.0 * Alpha + 3.0")
world = CausalWorld(
    domain="system_alpha",
    variables=["Alpha", "Beta"],
    units={"Alpha": "units", "Beta": "units"},
    rules=[rule],
    default_values={"Alpha": 5.0, "Beta": 13.0},
    rng=np.random.default_rng(0),
)

# Good hypothesis
result = score_hypothesis(
    "Beta = 2.0 * Alpha + 3.0. Linear relationship.",
    ["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
    world=world,
    budget_remaining=4,
    budget_total=10,
)
print(f"Accuracy:      {result.accuracy_score:.2f}")
print(f"Precision:     {result.precision_bonus:.2f}")
print(f"Calibration:   {result.calibration_score:.2f}")
print(f"Efficiency:    {result.efficiency_bonus:.2f}")
print(f"Contradiction: {result.contradiction_penalty:.2f}")
print(f"TOTAL:         {result.total:.2f}")
print(f"\nFeedback: {result.feedback}")
```
---
## Part 6: The Environment Core
**File: `server/hypothesis_lab_environment.py`**
This is the central nervous system. It ties together the hidden world, the rubric, and the data models.
### The Class Structure
```python
class HypothesisLabEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS = True  # Multiple agents can play at once

    def __init__(self, **kwargs):
        # Initialize empty state -- no episode running yet
        self._world = None           # The hidden causal graph
        self._tracker = None         # InfoGainTracker for per-step rewards
        self._step_count = 0
        self._budget_remaining = 0
        self._done = True            # No episode until reset() is called
        self._history = []           # Log of all experiments
        ...
```
### reset() -- Starting a New Episode
```python
def reset(self, seed=None, episode_id=None, **kwargs):
    # 1. Read difficulty parameters
    noise_level = kwargs.get("noise_level", "medium")  # low/medium/high
    domain = kwargs.get("domain", None)                # system_alpha/beta/gamma/delta

    # 2. Look up noise and budget from schedule tables
    sigma = NOISE_SCHEDULE[noise_level]          # low=0.05, medium=0.20, high=0.50
    budget = BUDGET_SCHEDULE[noise_level]        # low=12, medium=10, high=8
    n_vars = N_VARIABLES_SCHEDULE[noise_level]   # low=2, medium=3, high=4

    # 3. Generate a fresh hidden world (abstract variable names, 8+ rule types)
    self._world = generate_world(n_variables=n_vars, domain=domain, seed=seed)

    # 4. Initialize tracking
    self._tracker = InfoGainTracker()
    self._budget_remaining = budget
    self._done = False

    # 5. Return initial observation (variable names, budget, instructions)
    return HypLabObservation(
        system_message="New episode started. You have 3 unknown variables...",
        available_variables=self._world.variables,
        budget_remaining=budget,
        done=False,
        reward=0.0,
    )
```
**Key insight:** `reset()` generates a *new* hidden world every time. The agent never carries knowledge between episodes. Each episode is an independent puzzle.
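The three schedule tables referenced in `reset()` can be written out from the values in its comments. The dict layout and the `difficulty_for` helper below are assumptions for illustration; only the numbers come from the source.

```python
NOISE_SCHEDULE = {"low": 0.05, "medium": 0.20, "high": 0.50}
BUDGET_SCHEDULE = {"low": 12, "medium": 10, "high": 8}
N_VARIABLES_SCHEDULE = {"low": 2, "medium": 3, "high": 4}


def difficulty_for(noise_level="medium"):
    """Return (sigma, budget, n_variables), rejecting unknown levels explicitly."""
    if noise_level not in NOISE_SCHEDULE:
        raise ValueError(
            f"noise_level must be one of {sorted(NOISE_SCHEDULE)}, got {noise_level!r}"
        )
    return (NOISE_SCHEDULE[noise_level],
            BUDGET_SCHEDULE[noise_level],
            N_VARIABLES_SCHEDULE[noise_level])


for level in ("low", "medium", "high"):
    print(level, difficulty_for(level))
```

Difficulty rises on three axes at once: harder levels add noise, add variables, and shrink the experiment budget.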
### step() -- Processing an Action
```python
def step(self, action: HypLabAction, **kwargs):
    if self._done:
        raise RuntimeError("Episode is done. Call reset().")

    self._step_count += 1

    if action.action_type == ActionType.EXPERIMENT:
        return self._handle_experiment(action)
    elif action.action_type == ActionType.SUBMIT:
        return self._handle_submit(action)
```
There are only two things the agent can do: run an experiment, or submit a hypothesis. This is a **clean action space** -- no ambiguity about what actions are valid.
### _handle_experiment() -- Running an Experiment
This is the longest method. Here's what it does:
1. **Validate** the variable names (are they real variables in this world?)
2. **Route** to the right query method based on experiment type
3. **Format** the result as human-readable text (for the LLM to read)
4. **Score** the information gain via InfoGainTracker
5. **Deduct** budget
6. **Check** if budget is exhausted
7. **Return** observation with all the details
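The seven steps can be sketched with plain dicts in place of the real models. This is a simplified stand-in, not the project's `_handle_experiment()`: the `session` dict, the flat 0.20 reward, and the choice to return an error observation without charging budget for unknown variables are all assumptions.

```python
def handle_experiment(session, action):
    """Validate, route, score, and charge one experiment against a toy session dict."""
    cause, effect = action["control_variable"], action["target_variable"]

    # 1. Validate: unknown names produce an error observation, not a crash.
    unknown = [v for v in (cause, effect) if v not in session["variables"]]
    if unknown:
        return {"system_message": f"Unknown variable(s): {unknown}",
                "reward": 0.0, "done": False,
                "budget_remaining": session["budget_remaining"]}

    # 2. Route to the query function registered for this experiment type.
    query = session["queries"][action["experiment_type"]]
    result = query(action)

    # 3-4. Format for the LLM and score the probe (flat 0.20 here for brevity).
    message = f"Set {cause}={action['control_value']}, observed {effect}={result:.2f}"
    reward = 0.20

    # 5-6. Deduct budget and end the episode when it runs out.
    session["budget_remaining"] -= 1
    done = session["budget_remaining"] <= 0

    # 7. Return everything the agent needs for its next decision.
    return {"system_message": message, "reward": reward, "done": done,
            "budget_remaining": session["budget_remaining"]}


session = {
    "variables": ["Alpha", "Beta"],
    "budget_remaining": 2,
    # Hypothetical noiseless query implementing Beta = 2.0 * Alpha + 3.0.
    "queries": {"intervention": lambda a: 2.0 * a["control_value"] + 3.0},
}
action = {"experiment_type": "intervention", "control_variable": "Alpha",
          "target_variable": "Beta", "control_value": 5.0}
print(handle_experiment(session, action))
```

Keeping the experiment types in a lookup table means a new query type only needs a new entry, not another branch in the handler.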
### _handle_submit() -- Grading the Hypothesis
1. Mark episode as done
2. Call `score_hypothesis()` from the rubric
3. Format the rubric breakdown as text
4. Return observation with scores and revealed ground truth
**Key insight:** the ground truth is only revealed **after** submission. This prevents the agent from cheating.
### state -- Episode Metadata
```python
@property
def state(self) -> HypLabState:
    return HypLabState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        budget_remaining=self._budget_remaining,
        noise_level=self._noise_level,
        experiment_history=self._history,  # What experiments ran so far
        ...
    )
```
**Critical rule:** `state` must NEVER leak the hidden world. No rule types, no parameters, no ground truth. Only metadata the agent already knows.
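One way to guard this rule is a small check that walks a serialised state and flags suspicious keys. The sketch below is a hypothetical helper, not part of the project; the list of secret key names is an assumption based on the fields mentioned in this guide.

```python
SECRET_KEYS = {"rules", "rule_type", "params", "ground_truth", "confounder_sigma"}


def find_leaks(payload, secret_keys=SECRET_KEYS, path="state"):
    """Recursively list paths in a JSON-like payload whose key names look secret."""
    leaks = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}"
            if key in secret_keys:
                leaks.append(here)
            leaks.extend(find_leaks(value, secret_keys, here))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            leaks.extend(find_leaks(item, secret_keys, f"{path}[{index}]"))
    return leaks


safe = {"episode_id": "e1", "step_count": 2,
        "experiment_history": [{"cause": "Alpha", "effect": "Beta"}]}
leaky = {"episode_id": "e1",
         "experiment_history": [{"cause": "Alpha", "params": {"a": 2.0}}]}
print(find_leaks(safe))
print(find_leaks(leaky))
```

A check like this fits naturally into a unit test that dumps the state to a dict after a few steps and asserts that the list of leaks is empty.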
**Try the full loop yourself:**
```python
from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment
env = HypothesisLabEnvironment()
# Start a new episode
obs = env.reset(seed=42, noise_level="low", domain="system_alpha")
print("=== RESET ===")
print(obs.system_message)
print()
# Run an experiment
vars_ = obs.available_variables
action = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.INTERVENTION,
    control_variable=vars_[0],
    target_variable=vars_[1],
    control_value=5.0,
)
obs = env.step(action)
print("=== EXPERIMENT ===")
print(obs.system_message)
print(f"Info gain: {obs.info_gain_reward}")
print()
# Try a correlation sweep
action2 = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.CORRELATION,
    control_variable=vars_[0],
    control_range=[1.0, 10.0, 5.0],
    target_variable=vars_[1],
)
obs = env.step(action2)
print("=== CORRELATION ===")
print(obs.system_message)
print()
# Submit hypothesis
submit = HypLabAction(
    action_type=ActionType.SUBMIT,
    hypothesis_text=f"{vars_[1]} is linearly related to {vars_[0]} with slope ~2.0",
    hypothesis_equations=[f"{vars_[1]} = 2.0 * {vars_[0]} + 3.0"],
    confidence=0.75,
)
obs = env.step(submit)
print("=== SUBMIT ===")
print(obs.system_message)
```
---
## Part 7: The Data Models
**File: `models.py`**
This file defines the *language* the agent and environment speak. Every piece of data that crosses the boundary must be one of these types.
### Why Pydantic?
Pydantic gives us:
1. **Validation** -- if the agent sends `control_value="hello"` instead of a number, it gets a clear error
2. **Serialization** -- objects convert to/from JSON automatically for HTTP transport
3. **Documentation** -- every field has a type and a description
4. **IDE support** -- autocomplete and type checking
### The Import Pattern
```python
try:
    from openenv.core.env_server.types import Action, Observation, State
except ImportError:
    # Fallback for when openenv-core isn't installed
    from pydantic import BaseModel

    class Action(BaseModel): ...
    class Observation(BaseModel): ...
    class State(BaseModel): ...
```
This pattern lets the code work both:
- In production (with openenv-core installed)
- In development/testing (without it)
### The Enums
```python
from enum import Enum


class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"
    COUNTERFACTUAL = "counterfactual"
    PASSIVE = "passive"


class ActionType(str, Enum):
    EXPERIMENT = "experiment"
    SUBMIT = "submit"


class NoiseLevelTag(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
```
Using `str, Enum` means these serialize as simple strings in JSON: `"intervention"` instead of `ExperimentType.INTERVENTION`. This makes the API friendly for LLM agents that output raw JSON.
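A small standard-library check of this behaviour, using a trimmed copy of the enum (two members only, for brevity):

```python
import json
from enum import Enum


class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"


# Serializing: the member is written as its plain string value.
payload = json.dumps({"experiment_type": ExperimentType.INTERVENTION})
print(payload)  # {"experiment_type": "intervention"}

# Parsing: the raw string from an LLM maps back to the enum member.
parsed = ExperimentType(json.loads(payload)["experiment_type"])
print(parsed is ExperimentType.INTERVENTION)  # True
```

An unknown string such as `ExperimentType("guess")` raises `ValueError`, which is a convenient place to reject malformed agent output.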
### HypLabAction -- What the Agent Sends
The action model is **polymorphic** -- it handles two different use cases in one object:
```python
# Use case 1: Run an experiment
HypLabAction(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)
# Use case 2: Submit a hypothesis
HypLabAction(
    action_type="submit",
    hypothesis_text="Beta = 2.0 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
)
```
The experiment fields are `Optional` so they can be `None` when submitting, and vice versa. This is a common pattern in RL environments where the action space has distinct modes.
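Because every mode-specific field is optional, nothing in the type itself stops an agent from sending an experiment with no target. A per-mode check can close that gap. The sketch below works on plain dicts and is only an illustration: `missing_fields` and the `REQUIRED` table are hypothetical, and the real model may require a different set of fields.

```python
from typing import Optional

# Assumed required fields per mode; the real model may enforce a different set.
REQUIRED = {
    "experiment": ("experiment_type", "target_variable"),
    "submit": ("hypothesis_text",),
}


def missing_fields(action: dict) -> Optional[list]:
    """Return the missing field names, or None if action_type is unknown."""
    required = REQUIRED.get(action.get("action_type"))
    if required is None:
        return None
    return [name for name in required if action.get(name) is None]


print(missing_fields({"action_type": "submit"}))  # ['hypothesis_text']
```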
### HypLabObservation -- What Comes Back
Observations are rich and multi-purpose:
- **Always present**: `system_message`, `available_variables`, `budget_remaining`, `done`, `reward`
- **After experiments**: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`
- **After submission**: `accuracy_score`, `total_episode_reward`, `ground_truth_revealed`
The `system_message` field is crucial -- it's the human-readable text that an LLM agent reads (e.g. "Set Alpha=5.0, observed Beta=13.04"). The structured fields are for programmatic access.
### HypLabState -- Episode Metadata
```python
class HypLabState(State):
    budget_total: int = 0
    budget_remaining: int = 0
    noise_level: NoiseLevelTag = NoiseLevelTag.MEDIUM
    experiment_history: list[dict] = []
    cumulative_info_gain: float = 0.0
    redundant_experiment_count: int = 0
```
Notice what's NOT here: no `rules`, no `default_values`, no `ground_truth`. The state is safe to show to the agent without leaking the answer.
---
## Part 8: The Server
**File: `server/app.py`**
This is the thinnest file in the project, and that's by design.
```python
from openenv.core.env_server.http_server import create_app
app = create_app(
    HypothesisLabEnvironment,  # The environment class
    HypLabAction,              # What the agent sends
    HypLabObservation,         # What comes back
    env_name="hypothesis_lab",
    max_concurrent_envs=200,
)
```
`create_app()` does all the heavy lifting:
- Creates FastAPI routes: `/reset`, `/step`, `/state`, `/health`, `/schema`
- Handles session management (multiple agents playing at once)
- Serializes/deserializes Pydantic models to/from JSON
- Adds WebSocket support for persistent connections
You almost never need to touch this file. The magic is in `create_app()`.
### The HTTP Endpoints
| Endpoint | Method | What it does |
|----------|--------|-------------|
| `/health` | GET | Returns `{"status": "ok"}` -- for Docker healthchecks |
| `/reset` | POST | Starts a new episode, returns initial observation |
| `/step` | POST | Sends an action, returns observation + reward |
| `/state` | GET | Returns current episode metadata |
| `/schema` | GET | Returns JSON schemas for Action/Observation |
### Running the Server
```bash
cd "files 2"
uvicorn server.app:app --port 8000
```
Then in another terminal:
```bash
curl http://localhost:8000/health
# {"status": "ok"}
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"noise_level": "low", "domain": "system_alpha", "seed": 42}'
```
---
## Part 9: The Client
**File: `client.py`**
The client is the agent's friendly interface to the server. Instead of constructing raw HTTP requests, the agent gets nice typed methods.
```python
class HypothesisLabEnv(EnvClient[HypLabAction, HypLabObservation, HypLabState]):
    ...  # convenience and abstract methods, described below
```
The `EnvClient` base class handles:
- WebSocket connections (persistent, faster than HTTP polling)
- Automatic reconnection
- JSON serialization
Our client adds convenience methods:
```python
await env.run_intervention("Alpha", 5.0, "Beta")
await env.run_correlation("Alpha", [1, 10, 5], "Beta")
await env.run_counterfactual("Alpha", 3.0, "Beta")
await env.run_passive("Beta")
await env.submit_hypothesis("Beta = 2.0 * Alpha + 3.0", confidence=0.85)
```
Each method constructs the right `HypLabAction` internally so the agent doesn't have to remember the field names.
### The Three Abstract Methods
Every `EnvClient` subclass must implement:
```python
def _step_payload(self, action):
    """Convert a HypLabAction into a JSON-ready dict."""
    return action.model_dump(exclude_none=True)

def _parse_result(self, payload):
    """Convert a JSON dict from the server into a StepResult."""
    obs = HypLabObservation(**payload)
    return StepResult(observation=obs, reward=..., done=...)

def _parse_state(self, payload):
    """Convert a JSON dict into a HypLabState."""
    return HypLabState(**payload)
```
---
## Part 10: Tasks and Graders
**Files: `tasks/task_easy.py`, `task_medium.py`, `task_hard.py`**
The hackathon rules require **minimum 3 tasks** with **programmatic graders** that return scores between 0.0 and 1.0.
### What is a Task?
A task is a configuration dict that says "run the environment with these settings":
```python
TASK_EASY = {
    "id": "easy",
    "name": "Easy -- Single-Edge Discovery",
    "description": "Discover the causal relationship between two abstract variables...",
    "difficulty": "easy",
    "reset_kwargs": {
        "noise_level": "low",      # sigma = 0.05
        "domain": "system_alpha",  # abstract domain
        "seed": 42,                # deterministic for reproducibility
    },
}
```
### What is a Grader?
A grader takes the episode results and returns a normalized score:
```python
def grade_easy(episode_result: dict) -> float:
    accuracy = episode_result.get("accuracy_score", 0.0)
    efficiency = episode_result.get("efficiency_bonus", 0.0)
    calibration = episode_result.get("calibration_score", 0.0)
    raw = (
        0.60 * min(accuracy, 1.0)               # 60% weight on accuracy
        + 0.20 * min(efficiency / 0.15, 1.0)    # 20% weight on efficiency
        + 0.20 * min(calibration / 0.20, 1.0)   # 20% weight on calibration
    )
    return round(max(0.0, min(1.0, raw)), 4)
```
### Difficulty Progression
| | Easy | Medium | Hard |
|---|---|---|---|
| Variables | 2 | 3 | 4 |
| Noise (sigma) | 0.05 | 0.20 | 0.50 |
| Budget | 12 | 10 | 8 |
| Domain | system_alpha (fixed) | Random | Random |
| Key challenge | Single edge | Multiple edges + interactions | Complex graph + confounders + noise |
The hard task is genuinely hard for frontier models:
- 4 variables means up to 6 possible edges to discover
- Rules can be any of 8 types (not just linear!) plus interaction rules
- High noise + hidden confounders make every observation unreliable
- Only 8 experiments to figure it all out
- Abstract variable names prevent exploiting pretrained knowledge
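The figure of 6 edges is the number of unordered variable pairs, assuming at most one edge per pair. A quick count with placeholder variable names:

```python
from itertools import combinations

variables = ["V1", "V2", "V3", "V4"]  # placeholder names, not from the project
pairs = list(combinations(variables, 2))
print(len(pairs))  # 6
```

With a budget of 8 experiments, that leaves little more than one experiment per candidate edge, before accounting for noise or rule shape.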
**Try it yourself:**
```python
from tasks.task_easy import grade_easy
# Perfect episode
score = grade_easy({
    "accuracy_score": 1.0,
    "efficiency_bonus": 0.15,
    "calibration_score": 0.20,
})
print(f"Perfect score: {score}") # 1.0
# Mediocre episode
score = grade_easy({
    "accuracy_score": 0.4,
    "efficiency_bonus": 0.0,
    "calibration_score": 0.05,
})
print(f"Mediocre score: {score}") # ~0.29
# Zero effort
score = grade_easy({})
print(f"Zero score: {score}") # 0.0
```
---
## Part 11: The Baseline Agent
**File: `baseline_inference.py`**
This script proves the environment works by running a real LLM agent against all three tasks.
### The Flow
```
1. Create an OpenAI client (reads OPENAI_API_KEY from env)
2. For each of the 3 tasks:
   a. Create a fresh HypothesisLabEnvironment
   b. Call reset() with the task's settings
   c. Enter a loop (max 8 turns):
      - Send the observation to the LLM as a "user" message
      - Parse the LLM's response into a HypLabAction
      - Call step(action)
      - If done, break
   d. If not done after 8 turns, force a submit
   e. Grade the episode with the task's grader
3. Print all scores
```
### The System Prompt
The system prompt teaches the LLM how to interact with the environment:
```
You are a scientific AI assistant trained to discover hidden causal rules.
...
Format your actions as JSON:
{"action_type": "experiment", "experiment_type": "intervention", ...}
...
Strategy tips:
- Run interventions first to discover which variables are causally connected
- Vary the control variable widely (e.g. 1, 5, 10) to detect nonlinearity
- Don't repeat the same experiment -- redundant experiments are penalised
```
### The Action Parser
LLMs don't always produce perfect JSON. The parser handles multiple formats:
1. **JSON in code blocks**: `` ```json {...} ``` ``
2. **Raw JSON**: `{...}`
3. **Natural language**: "I conclude that Beta = 2 * Alpha" (extracted via regex)
4. **Timeout**: if it's the last turn, force a submit with whatever text the LLM wrote
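The fallback chain above could look roughly like the sketch below. It is an assumption-laden illustration, not the parser in `baseline_inference.py`: `parse_action`, the regular expressions, and the "I conclude" trigger phrase are all hypothetical choices.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built here to avoid nesting code fences
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
BRACE_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_action(text: str, last_turn: bool = False) -> Optional[dict]:
    """Try fenced JSON, then raw JSON, then fall back to a text submit."""
    candidates = FENCE_RE.findall(text)       # 1. JSON inside code blocks
    raw = BRACE_RE.search(text)               # 2. raw JSON anywhere in the text
    if raw:
        candidates.append(raw.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # 3. Natural-language conclusion, or 4. out of turns: submit the raw text.
    if last_turn or re.search(r"\bI conclude\b", text, re.IGNORECASE):
        return {"action_type": "submit", "hypothesis_text": text.strip()}
    return None  # caller decides how to handle an unparseable reply
```

Returning `None` rather than raising keeps the decision about retries or penalties in the calling loop.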
### Running It
```bash
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```
Expected output:
```
============================================================
Scientific Hypothesis Lab -- Baseline Inference
Model: gpt-4o-mini
============================================================
--- Task: Easy -- Single-Edge Discovery ---
Total episode reward: +0.6100
Graded score: 0.6500
--- Task: Medium -- Multi-Edge Discovery ---
Total episode reward: +0.3800
Graded score: 0.4000
--- Task: Hard -- Complex Graph Under Noise ---
Total episode reward: +0.2100
Graded score: 0.2500
============================================================
SUMMARY
============================================================
easy : 0.6500
medium : 0.4000
hard : 0.2500
average : 0.4333
```
---
## Part 12: Testing
**File: `tests/test_environment.py`**
39 tests organized into 5 test classes. Run them with:
```bash
pytest tests/ -v
```
### Test Classes
| Class | Tests | What it covers |
|-------|-------|----------------|
| TestCausalWorld | 18 | World generation, all 8 rule types, interactions, domains, seeds, abstract names |
| TestInfoGainTracker | 4 | Reward schedule, redundancy, triangulation |
| TestRubric | 6 | Accuracy scoring, calibration, efficiency, feedback |
| TestEnvironmentIntegration | 6 | Full episodes, budget exhaustion, errors, state leaks |
| TestGraders | 5 | Grader range [0,1], zero input, perfect input |
### Key Tests to Study
**Seed reproducibility** -- same seed produces same world:
```python
world1 = generate_world(n_variables=3, domain="system_alpha", seed=99)
world2 = generate_world(n_variables=3, domain="system_alpha", seed=99)
assert world1.variables == world2.variables
```
**Variable names are abstract** -- no real-world names that give LLMs prior knowledge:
```python
for seed in range(50):
    world = generate_world(n_variables=4, seed=seed)
    for v in world.variables:
        assert v.lower() not in {"temperature", "pressure", "price", ...}
```
**State doesn't leak secrets**:
```python
st = env.state
state_str = str(st.model_dump())
assert "rule_type" not in state_str
assert "params" not in state_str
```
**Diverse rule types over many seeds** -- across 100 seeds the generator must produce at least 5 distinct rule types:
```python
types_seen = set()
for seed in range(100):
    world = generate_world(n_variables=3, seed=seed)
    for rule in world.rules:
        types_seen.add(rule.rule_type)
assert len(types_seen) >= 5
```
**Grader always returns [0, 1]**:
```python
score = grade_easy({"accuracy_score": 1.0, "efficiency_bonus": 0.15, ...})
assert 0.0 <= score <= 1.0
```
---
## Part 13: Deployment
### Dockerfile
The Dockerfile uses a multi-stage build:
```
Stage 1 (builder):
  - Start from OpenEnv base image
  - Copy source code
  - Install uv (Python package manager)
  - Run uv sync to install dependencies
  - This creates a .venv with all packages

Stage 2 (runtime):
  - Start from a clean base image
  - Copy only the .venv and source code (not build tools)
  - Set PATH and PYTHONPATH
  - Run uvicorn to start the server
```
### Step 1: Build the Docker Image
```bash
cd Lab-experiment
docker build -t hypothesis-lab .
```
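This takes 2-5 minutes the first time (downloads base image + installs dependencies). Subsequent builds are fast thanks to layer caching. On success, the final output names the image `hypothesis-lab:latest` (the classic builder prints `Successfully tagged hypothesis-lab:latest`; newer builders word this line differently).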
If the build fails, check:
- `pyproject.toml` has `build-backend = "setuptools.build_meta"` (not the experimental `setuptools.backends` path)
- `.dockerignore` excludes `.venv/`, `__pycache__/`, `.git/`
### Step 2: Run the Container
```bash
docker run -p 8000:8000 hypothesis-lab
```
You should see uvicorn start up:
```
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
To run in the background (detached mode):
```bash
docker run -d --name hyp-lab -p 8000:8000 hypothesis-lab
```
### Step 3: Verify the Server is Running
Open a **new terminal** and run:
```bash
curl http://localhost:8000/health
```
Expected response:
```json
{"status":"ok"}
```
### Step 4: Check the API Schema
```bash
curl -s http://localhost:8000/schema | python3 -m json.tool
```
This returns the JSON Schema definitions for `HypLabAction` and `HypLabObservation`, useful for understanding what fields exist.
### Step 5: Understand HTTP vs WebSocket
> **Critical concept:** The OpenEnv server has two communication modes:
>
> | Endpoint | Type | Stateful? | Use case |
> |---|---|---|---|
> | `/health` | GET | No | Check if server is alive |
> | `/schema` | GET | No | Inspect action/observation schemas |
> | `/reset` | POST | **No** -- creates a fresh env, returns result, destroys env | One-shot inspection |
> | `/step` | POST | **No** -- creates a fresh env (never reset!), tries to step, fails | **Don't use for episodes** |
> | `/ws` | WebSocket | **Yes** -- persistent connection, one env for the whole episode | **Use this for episodes** |
>
> The HTTP `/reset` and `/step` are **stateless**: each request creates a brand-new
> environment instance and destroys it after responding. If you `curl /reset` then
> `curl /step`, the step hits a *different* environment that was never reset -- so
> it fails. Multi-step episodes require the **WebSocket** endpoint (`/ws`), which
> keeps one environment alive for the entire connection.
This is why a standalone `curl` to `/step` cannot advance an episode: the
server-side environment has no world to step in. The environment returns a
clear error instead of crashing:
```json
{"observation": {"system_message": "Error: No active episode. Call reset() first.", "done": true, "reward": -1.0}, ...}
```
### Step 6: Run a Full Episode (Python script)
The proper way to interact is via WebSocket. The `EnvClient` class handles
this automatically. Save this as `test_docker.py` and run it while the
container is running:
```python
import asyncio
import json

import websockets


async def run_episode():
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:
        # 1. Reset
        await ws.send(json.dumps({
            "type": "reset",
            "data": {"noise_level": "low", "domain": "system_alpha", "seed": 42}
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Started ===")
        print(f"Variables: {obs['available_variables']}")
        print(f"Budget: {obs['budget_remaining']}")
        print()
        variables = obs["available_variables"]
        cause, effect = variables[0], variables[1]

        # 2. Intervention experiment
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "intervention",
                "control_variable": cause,
                "control_value": 5.0,
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Intervention] Set {cause}=5.0 -> {effect}={obs['result_value']}")
        print(f" Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 3. Correlation sweep
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "correlation",
                "control_variable": cause,
                "control_range": [0.5, 20.0, 8],
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Correlation] Swept {cause} from 0.5 to 20.0:")
        if isinstance(obs["result_value"], list):
            for point in obs["result_value"]:
                print(f" {cause}={point[0]:.1f} -> {effect}={point[1]:.4f}")
        print(f" Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 4. Submit hypothesis
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "submit",
                "hypothesis_text": f"{effect} depends linearly on {cause}.",
                "hypothesis_equations": [f"{effect} = 2.0 * {cause} + 1.0"],
                "confidence": 0.6,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Finished ===")
        print(f"Accuracy: {obs.get('accuracy_score')}")
        print(f"Precision: {obs.get('precision_bonus')}")
        print(f"Calibration: {obs.get('calibration_score')}")
        print(f"Efficiency: {obs.get('efficiency_bonus')}")
        print(f"Contradiction: {obs.get('contradiction_penalty')}")
        print(f"TOTAL REWARD: {obs.get('total_episode_reward')}")
        print()
        print(f"Ground truth:\n{obs.get('ground_truth_revealed')}")


asyncio.run(run_episode())
```
Run it:
```bash
pip install websockets # one-time install
python test_docker.py
```
Expected output (the shape is what matters; the numbers shown are illustrative):
```
=== Episode Started ===
Variables: ['Quant_A', 'Quant_E']
Budget: 12
[Intervention] Set Quant_A=5.0 -> Quant_E=3.4521
Info gain: 0.12, Budget left: 11
[Correlation] Swept Quant_A from 0.5 to 20.0:
Quant_A=0.5 -> Quant_E=7.8123
Quant_A=3.3 -> Quant_E=4.2341
...
Info gain: 0.10, Budget left: 10
=== Episode Finished ===
Accuracy: 0.35
Precision: 0.0
Calibration: 0.14
Efficiency: 0.15
Contradiction: 0.0
TOTAL REWARD: 0.86
Ground truth:
Domain: system_alpha
Quant_E = 1.11 * exp(-0.16 * Quant_A)
```
> **Key insight from the WebSocket protocol:**
>
> - Send messages as `{"type": "reset", "data": {...}}` and `{"type": "step", "data": {...}}`
> - The action fields go directly inside `"data"` (no extra `"action"` wrapper)
> - Responses come back as `{"type": "observation", "data": {"observation": {...}, "reward": ..., "done": ...}}`
> - The observation fields live at `resp["data"]["observation"]` -- note the double nesting
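The message shapes listed above can be wrapped in two small helpers so the nesting is handled in one place. This is a sketch with hypothetical function names (`make_message`, `unpack_response`); it only encodes the envelope format described in this guide.

```python
import json


def make_message(kind: str, data: dict) -> str:
    """Serialize a reset or step message; action fields sit directly in data."""
    if kind not in ("reset", "step"):
        raise ValueError(f"unsupported message type: {kind!r}")
    return json.dumps({"type": kind, "data": data})


def unpack_response(raw: str) -> tuple:
    """Return (observation, reward, done) from a server response string."""
    body = json.loads(raw)["data"]
    return body["observation"], body.get("reward"), body.get("done")
```

With these, a step becomes `await ws.send(make_message("step", {...}))` followed by `obs, reward, done = unpack_response(await ws.recv())`.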
### Understanding the Observation Fields
On reset, most fields are `null` -- only setup information is populated:
| Field | What it tells you |
|---|---|
| `system_message` | Human-readable summary -- the LLM agent reads this |
| `available_variables` | Variable names to use in experiments |
| `budget_remaining` | Number of experiment steps left |
| `result_value` | `null` on reset; float or `[[x,y],...]` list after experiments |
| `noise_sigma` | `null` on reset; shown per-experiment so you know measurement precision |
| `done` | `false` until you submit or budget runs out |
| `reward` | Reward for this step (0.0 on reset) |
| `accuracy_score` ... `ground_truth_revealed` | All `null` until you submit your hypothesis |
After submit, the scoring fields light up:
| Field | Meaning |
|---|---|
| `accuracy_score` | How close your hypothesis matches the true rules (0-1) |
| `precision_bonus` | Bonus for getting coefficients/parameters right |
| `calibration_score` | How well your confidence matches your actual accuracy |
| `efficiency_bonus` | Reward for using fewer budget steps |
| `contradiction_penalty` | Deducted if your hypothesis contradicts your own data |
| `total_episode_reward` | Sum of all info gain rewards + final rubric score |
| `ground_truth_revealed` | The actual hidden rules -- study this to improve! |
> **Design note: Why don't we reveal the exact noise sigma upfront?**
>
> The system message says "Noise level: low" but does NOT say "sigma=0.05".
> In real science you have to estimate measurement uncertainty from repeated
> measurements. This forces the agent to run a few repeat experiments to
> gauge noise before trusting single data points. The qualitative label
> (low/medium/high) sets expectations without handing out a free number.
> The exact sigma IS shown per-experiment in the `noise_sigma` field --
> that's fine because by then the agent has already spent a budget step.
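Estimating noise from repeats amounts to taking the sample standard deviation of several readings of the same experiment. The sketch below simulates that with the standard library; `measure`, `TRUE_VALUE`, and the choice of five repeats are assumptions for illustration, and the simulated sigma of 0.05 mirrors the "low" setting.

```python
import random
import statistics

rng = random.Random(0)  # seeded so the sketch is reproducible
TRUE_VALUE = 13.0       # hypothetical noiseless reading
TRUE_SIGMA = 0.05       # assumed "low" noise; hidden from the agent


def measure() -> float:
    """Simulate one noisy repeat of the same experiment."""
    return TRUE_VALUE + rng.gauss(0.0, TRUE_SIGMA)


repeats = [measure() for _ in range(5)]
mean = statistics.mean(repeats)
sigma_hat = statistics.stdev(repeats)  # sample standard deviation
print(f"mean={mean:.3f}, estimated sigma={sigma_hat:.3f}")
```

Five repeats give only a rough estimate, so an agent has to trade estimation quality against the budget those repeats consume.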
### Error Handling
The environment returns error observations (not crashes) for bad actions:
| Situation | Response | Reward |
|---|---|---|
| Step without reset | `"Error: No active episode. Call reset() first."` | `-1.0`, `done=true` |
| Step after episode ended | `"Error: Episode is already done."` | `0.0`, `done=true` |
| Unknown variable name | `"Error: Unknown control variable 'X'."` | `-0.05`, budget deducted |
| Unknown experiment type | `"Error: Unknown experiment type..."` | `-0.05` |
| Unknown action type | `"Error: Unknown action_type..."` | `-0.05`, budget deducted |
The small negative reward (`-0.05`) for invalid actions teaches RL agents to
produce valid requests without being so harsh that it dominates the reward signal.
### Stopping the Container
```bash
# If running in foreground: Ctrl+C
# If running in background:
docker stop hyp-lab
docker rm hyp-lab
```
### Troubleshooting
| Problem | Fix |
|---|---|
| `port is already allocated` | Another process uses port 8000. Use `-p 8001:8000` and hit `localhost:8001` instead |
| `curl: (7) Failed to connect` | Container isn't running yet. Wait a few seconds for uvicorn to start |
| `{"detail":"Not Found"}` | You hit the wrong endpoint. Use `/health`, `/schema`, `/reset`, `/step`, `/state`, or the `/ws` WebSocket |
| Container exits immediately | Check logs: `docker logs hyp-lab`. Usually a missing dependency |
### Deploying to HF Spaces
```bash
openenv push --org your-org --token $HF_TOKEN
```
The README.md has Hugging Face Spaces metadata in its YAML frontmatter:
```yaml
---
title: Scientific Hypothesis Lab
emoji: 🔬
sdk: docker
app_port: 8000
tags:
- openenv
---
```
This tells HF Spaces to build the Docker image and expose port 8000.
---
## Part 14: Hands-On Exercises
Now it's your turn. These exercises go from easy to hard.
### Exercise 1: Explore a World (5 min)
```python
from server.causal_world import generate_world
# Generate 3 different worlds and print their ground truth
for seed in [1, 2, 3]:
    world = generate_world(n_variables=3, domain="system_gamma", seed=seed)
    print(f"\n=== Seed {seed} ===")
    print(f"Variables: {world.variables}")
    print(f"Interactions: {len(world.interactions)}")
    print(f"Confounder sigma: {world.confounder_sigma}")
    print(world.ground_truth_summary())
```
Questions to answer:
- How many rules does each world have? What types?
- Do any worlds have interaction rules or confounders?
- Are variable names abstract (no real-world physics terms)?
### Exercise 2: Play a Full Episode (10 min)
```python
from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment
env = HypothesisLabEnvironment()
obs = env.reset(seed=100, noise_level="medium", domain="system_beta")
print(obs.system_message)
# YOUR TURN: Run 3-4 experiments, then submit a hypothesis.
# Try to get the highest accuracy score you can.
# Hint: use CORRELATION to see the relationship shape,
# then test at extreme values to distinguish linear from quadratic/saturating.
```
### Exercise 3: Break the Rubric (10 min)
Try to get edge-case scores:
- Get accuracy_score = 0.0 (submit empty hypothesis)
- Get contradiction_penalty = -0.50 (claim "no causal relationship exists")
- Get efficiency_bonus = 0.15 (submit early with high accuracy)
- Get calibration_score = 0.20 (match your confidence to your accuracy perfectly)
### Exercise 4: Add a New Rule Type (20 min)
The environment already has 8 rule types, but you can add more! Try adding a **sinusoidal** rule:
- Formula: `y = a * sin(k * x) + b`
- Add it to `CausalRule.evaluate()`
- Add it to `RULE_TYPES` and `_random_rule()` with appropriate weights
- Add keywords to `_RULE_KEYWORDS` in `rubric.py`
- Test it with a hand-crafted world
### Exercise 5: Add a New Variable Pool (10 min)
Add a new abstract variable pool to `ABSTRACT_VAR_POOLS` in `causal_world.py`:
- Use creative abstract names (e.g., colour names: "Red", "Blue", "Green", "Amber", "Violet")
- Make sure they carry no scientific meaning
### Exercise 6: Write a Smarter Baseline Agent (30 min)
Modify `baseline_inference.py` to implement a better strategy:
1. First, run passive observations on all variables
2. Then run interventions between each pair to find which are connected
3. Use wide correlation sweeps (1 to 100) to check for curvature, saturation, or breakpoints
4. Test at x=0.5 and x=50 to distinguish linear from exponential/logarithmic
5. If the data suggests two parents, try holding one constant while varying the other
6. Submit with well-calibrated confidence
---
## Part 15: Golden Rules for Building Environments
These are the principles that separate good environments from great ones.
### Rule 1: The Agent Should Never See the Answer
The hidden world, ground truth rules, and correct parameters must NEVER appear in observations or state before the agent submits. This is the most common mistake beginners make.
**Bad:**
```python
def reset(self):
    return Observation(hint=f"The slope is {self.world.rules[0].params['a']}")
```
**Good:**
```python
def reset(self):
    return Observation(system_message="Run experiments to discover the hidden rules.")
```
### Rule 2: Reward Shaping > Sparse Rewards
A reward function that only gives +1 at the end teaches nothing. The agent needs signal throughout the episode.
**Bad:**
```python
def step(self, action):
    if action.type == "submit":
        return Observation(reward=1.0 if correct else 0.0, done=True)
    return Observation(reward=0.0)  # No signal during experiments!
```
**Good:**
```python
def step(self, action):
    if action.type == "experiment":
        info_gain = self.tracker.record(action)
        return Observation(reward=info_gain)  # Signal at every step!
    elif action.type == "submit":
        return Observation(reward=self.rubric.score(action))
```
### Rule 3: Deterministic Seeds for Reproducibility
Every random element must be controlled by a seed. If two runs with the same seed produce different results, your graders are broken.
```python
import random

import numpy as np


def generate_world(seed=42):
    py_rng = random.Random(seed)          # Controls structure
    np_rng = np.random.default_rng(seed)  # Controls noise
```
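A quick way to check this property is to call the generator twice with the same seed and compare the results. The toy `sample_world` below is a hypothetical stand-in for the real generator and uses only the standard library.

```python
import random


def sample_world(seed: int) -> list:
    """Toy generator: draws three slopes from a locally seeded RNG."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return [round(rng.uniform(-3.0, 3.0), 3) for _ in range(3)]


first = sample_world(99)
second = sample_world(99)
assert first == second  # same seed, same world
```

Using a local `random.Random(seed)` instead of the module-level functions keeps the result independent of whatever other code has drawn from the global generator.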
### Rule 4: Observations Should Be LLM-Friendly
If your agent is an LLM, the observation needs a human-readable text field. Don't just return a dict of numbers.
**Bad:**
```python
return Observation(result={"x": 5.0, "y": 13.04, "sigma": 0.05})
```
**Good:**
```python
return Observation(
    system_message="[Step 1] Set Alpha=5.0, observed Beta=13.04 (sigma=0.05)",
    result_value=13.04,
    noise_sigma=0.05,
)
```
### Rule 5: Validate All Agent Input
Never trust the agent. It will send garbage, typos, and adversarial inputs.
```python
if cause not in world.variables:
    return self._error_obs(f"Unknown variable '{cause}'. Available: {world.variables}")
```
### Rule 6: Clean Episode Boundaries
`reset()` must produce a completely clean state. No leftover data from previous episodes.
```python
def reset(self):
    self._world = generate_world(...)   # Fresh world
    self._tracker = InfoGainTracker()   # Fresh tracker
    self._history = []                  # Fresh history
    self._done = False                  # Episode is active
```
### Rule 7: Budget/Step Limits Prevent Infinite Episodes
Always have a mechanism to end the episode. Either a budget that runs out, or a maximum step count.
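A budget counter is the simplest such mechanism. The class below is a minimal sketch with a hypothetical name (`Budget`), not the environment's actual bookkeeping:

```python
class Budget:
    """Counts down once per step; the episode ends when it reaches zero."""

    def __init__(self, total: int) -> None:
        if total <= 0:
            raise ValueError("total must be positive")
        self.remaining = total

    def spend(self) -> bool:
        """Consume one step and return True if the episode must now end."""
        if self.remaining <= 0:
            raise RuntimeError("Budget exhausted. Call reset().")
        self.remaining -= 1
        return self.remaining == 0


budget = Budget(2)
print(budget.spend())  # False
print(budget.spend())  # True
```

Because `spend()` reports exhaustion on the step that uses the last unit, the environment can set `done=True` in the same observation.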
### Rule 8: The Hard Task Must Be Actually Hard
If your hard task is easy for GPT-4, the judges will notice. Design it so that even frontier models score 0.2-0.4 on the hard task. Our hard task uses 4 variables, sigma=0.50 noise, hidden confounders, interaction rules, and a budget of only 8 experiments.
### Rule 8.5: Don't Let LLMs Cheat with Prior Knowledge
If your environment uses real-world variable names (Temperature, Pressure, Price, Demand), LLM agents will use pretrained knowledge instead of reasoning from data. Use abstract names (Alpha, Beta, V1, V2) to force genuine discovery. Similarly, don't use only 3 rule types -- the agent will memorize the template set. Use enough variety that template-matching fails.
### Rule 9: Graders Must Be Deterministic
Given the same `episode_result` dict, a grader must always return the same score. No randomness, no external API calls, no time-dependent logic.
### Rule 10: State Metadata Only
The `state` property returns metadata, not secrets. It's for debugging, logging, and agent introspection -- never for leaking the answer.
---
## Part 16: How to Build Your Own From Scratch
Here's the step-by-step recipe for creating a new OpenEnv environment.
### Step 1: Choose Your Domain
Pick a real-world task humans actually do:
- Email triage
- Code review
- Data cleaning
- Scheduling
- Customer support
- Medical diagnosis
- Financial analysis
### Step 2: Define the Action Space
What can the agent do? Write it out in plain English first:
```
The agent can:
  1. Read an email subject and preview
  2. Assign a priority (high/medium/low)
  3. Assign a label (bug/feature/question/spam)
  4. Flag for human review
```
Then convert to a Pydantic model:
```python
from typing import Optional


class EmailAction(Action):
    action_type: str  # "classify" or "flag"
    priority: Optional[str] = None
    label: Optional[str] = None
    flag_reason: Optional[str] = None
```
### Step 3: Define the Observation Space
What does the agent see after each action?
```python
class EmailObservation(Observation):
    system_message: str
    email_subject: str
    email_preview: str
    emails_remaining: int
    # ... (inherits done, reward from Observation)
```
### Step 4: Build the Hidden World
What's the ground truth the agent is trying to discover/solve? This is your "puzzle generator."
### Step 5: Build the Reward Function
Design rewards that teach the right behavior:
- Correct classification: +1.0
- Partially correct: +0.5
- Wrong but not harmful: -0.1
- Flagging spam as high priority: -0.5
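The reward table above could be encoded as a small pure function. The sketch below is one possible reading, with assumptions flagged: `email_reward` is a hypothetical name, "partially correct" is taken to mean exactly one of priority and label matches, and the harmful case is checked first so it overrides any partial credit.

```python
def email_reward(pred_priority: str, pred_label: str,
                 true_priority: str, true_label: str) -> float:
    """Score one classification using the reward values listed above."""
    # Harmful mistake: spam escalated to high priority.
    if true_label == "spam" and pred_priority == "high":
        return -0.5
    # Count how many of the two fields match (assumed meaning of "partial").
    hits = (pred_priority == true_priority) + (pred_label == true_label)
    return {2: 1.0, 1: 0.5, 0: -0.1}[hits]


print(email_reward("high", "bug", "high", "bug"))   # 1.0
print(email_reward("high", "spam", "low", "spam"))  # -0.5
```

Keeping the function free of randomness and external state also satisfies the determinism requirement for graders.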
### Step 6: Write the Environment Class
```python
class EmailTriageEnvironment(Environment):
    def reset(self, **kwargs):
        # Generate a batch of emails
        # Return the first email as an observation
        ...

    def step(self, action):
        # Grade the agent's classification
        # Move to next email or end episode
        ...

    @property
    def state(self):
        # Return progress metadata
        ...
```
### Step 7: Wire Up the Server
```python
app = create_app(
    EmailTriageEnvironment,
    EmailAction,
    EmailObservation,
    env_name="email_triage",
)
```
### Step 8: Define 3 Tasks
```python
TASK_EASY = {"id": "easy", "reset_kwargs": {"n_emails": 5, "spam_ratio": 0.5}}
TASK_MEDIUM = {"id": "medium", "reset_kwargs": {"n_emails": 10, "spam_ratio": 0.2}}
TASK_HARD = {"id": "hard", "reset_kwargs": {"n_emails": 20, "spam_ratio": 0.05}}
```
### Step 9: Write the Baseline
Use the OpenAI API to run a simple agent and produce baseline scores.
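The API call itself depends on your client setup, so it is left out here. The part worth sketching is action parsing: turning the model's text reply into fields for `EmailAction`. This example assumes the system prompt asks the model to answer with a JSON object containing `priority` and `label`; `parse_action` is a hypothetical helper, and falling back to a flag on bad output is one possible design, not the only one.

```python
import json

VALID_PRIORITIES = ("high", "medium", "low")
VALID_LABELS = ("bug", "feature", "question", "spam")


def parse_action(reply: str) -> dict:
    """Turn a model reply into EmailAction fields, flagging anything unparseable."""
    fallback = {"action_type": "flag", "flag_reason": "unparseable model reply"}
    # Models often wrap JSON in extra prose, so cut out the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    if data.get("priority") in VALID_PRIORITIES and data.get("label") in VALID_LABELS:
        return {
            "action_type": "classify",
            "priority": data["priority"],
            "label": data["label"],
        }
    return fallback


print(parse_action('Sure: {"priority": "high", "label": "bug"}'))
```

A baseline that never crashes on malformed replies gives you scores you can compare across runs.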
### Step 10: Write Tests
Minimum tests:
- reset() produces valid observation
- step() with valid action works
- step() with invalid action returns error
- Episode ends when expected
- State doesn't leak secrets
- Graders return scores in [0, 1]
- Seeds produce deterministic results
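Several of these checks can be sketched as plain pytest-style functions. To keep the example self-contained, it tests `ToyEnv`, a made-up stand-in that returns dict observations; your real tests would target your own environment class and its typed observations.

```python
import random

LABELS = ("bug", "spam")


class ToyEnv:
    """Hypothetical stand-in with the reset/step/state shape used in this guide."""

    def __init__(self):
        self._answers = []
        self._index = 0

    def reset(self, seed=0, n_emails=3):
        rng = random.Random(seed)
        self._answers = [rng.choice(LABELS) for _ in range(n_emails)]
        self._index = 0
        return {"emails_remaining": n_emails, "done": False, "reward": 0.0}

    def step(self, label):
        remaining = len(self._answers) - self._index
        if remaining == 0 or label not in LABELS:
            # Invalid input does not advance the episode.
            return {"error": "invalid action", "emails_remaining": remaining,
                    "done": remaining == 0, "reward": 0.0}
        reward = 1.0 if label == self._answers[self._index] else -0.1
        self._index += 1
        remaining -= 1
        return {"emails_remaining": remaining, "done": remaining == 0, "reward": reward}

    @property
    def state(self):
        return {"step_count": self._index}  # metadata only, no answers


def test_reset_returns_valid_observation():
    obs = ToyEnv().reset(seed=1, n_emails=3)
    assert obs["emails_remaining"] == 3 and obs["done"] is False


def test_invalid_action_returns_error():
    env = ToyEnv()
    env.reset(seed=1)
    assert "error" in env.step("not-a-label")
    assert env.state["step_count"] == 0


def test_episode_ends_when_expected():
    env = ToyEnv()
    env.reset(seed=1, n_emails=2)
    env.step("bug")
    assert env.step("bug")["done"] is True


def test_state_does_not_leak_answers():
    env = ToyEnv()
    env.reset(seed=1)
    assert set(env.state) == {"step_count"}


def test_same_seed_same_rewards():
    def rewards(seed):
        env = ToyEnv()
        env.reset(seed=seed, n_emails=5)
        return [env.step("spam")["reward"] for _ in range(5)]

    assert rewards(7) == rewards(7)
```

Save this in a file named like `test_env.py` and run `pytest` to execute it; grader tests follow the same pattern.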
### Step 11: Write the Dockerfile
Copy our Dockerfile template. Change the CMD to point to your server module.
### Step 12: Write openenv.yaml
```yaml
spec_version: 1
name: your_env_name
type: space
runtime: fastapi
app: server.app:app
port: 8000
```
### Step 13: Write the README
Include HF Spaces frontmatter, environment description, action/observation docs, task descriptions, and baseline scores.
---
## Congratulations
You've read through the entire Scientific Hypothesis Lab codebase and understand:
- **What RL environments are** and how agents interact with them
- **The OpenEnv contract**: reset/step/state, Action/Observation/State, openenv.yaml
- **How hidden worlds work**: causal graphs with 8+ rule types, interaction rules, confounders, abstract variable names
- **Why abstract variable names matter**: prevents LLMs from using pretrained knowledge as a shortcut
- **How reward functions are designed**: info gain, accuracy (across all rule types + interactions), calibration, efficiency, contradiction
- **How the server works**: create_app() wraps everything in HTTP endpoints
- **How clients connect**: typed methods over WebSocket
- **How tasks and graders work**: difficulty progression, deterministic scoring [0, 1]
- **How baseline agents work**: LLM + system prompt + action parsing
- **How to test**: 39 tests covering every component including all rule types
- **How to deploy**: Docker + HF Spaces
- **The golden rules** for building great environments (including anti-cheating via abstract naming)
- **How to build your own** from scratch in 13 steps
You are now qualified to build, debug, explain, and teach RL environments. Go build something amazing.