# The Complete Guide to Building RL Environments with OpenEnv

**A follow-along tutorial using the Scientific Hypothesis Lab**

By the end of this tutorial you will be able to:
- Explain what an RL environment is and why it matters
- Read and understand every file in this project
- Build your own OpenEnv environment from scratch
- Design reward functions that actually train good agents
- Deploy your environment to Hugging Face Spaces
- Explain all of this to anyone who asks

---

## Table of Contents

1. [Part 1: The Big Picture](#part-1-the-big-picture)
2. [Part 2: The OpenEnv Contract](#part-2-the-openenv-contract)
3. [Part 3: Tour of Every File](#part-3-tour-of-every-file)
4. [Part 4: The Hidden World (causal_world.py)](#part-4-the-hidden-world)
5. [Part 5: The Reward Engine (rubric.py)](#part-5-the-reward-engine)
6. [Part 6: The Environment Core (hypothesis_lab_environment.py)](#part-6-the-environment-core)
7. [Part 7: The Data Models (models.py)](#part-7-the-data-models)
8. [Part 8: The Server (app.py)](#part-8-the-server)
9. [Part 9: The Client (client.py)](#part-9-the-client)
10. [Part 10: Tasks and Graders](#part-10-tasks-and-graders)
11. [Part 11: The Baseline Agent (baseline_inference.py)](#part-11-the-baseline-agent)
12. [Part 12: Testing](#part-12-testing)
13. [Part 13: Deployment](#part-13-deployment)
14. [Part 14: Hands-On Exercises](#part-14-hands-on-exercises)
15. [Part 15: Golden Rules for Building Environments](#part-15-golden-rules)
16. [Part 16: How to Build Your Own From Scratch](#part-16-build-your-own)

---

## Part 1: The Big Picture

### What is Reinforcement Learning?

Imagine teaching a dog a trick. You can't explain the trick in English. Instead, you:

1. Let the dog **try something** (an action)
2. **Show it the result** (an observation)
3. Give it a **treat or a scolding** (a reward)
4. Repeat

The dog learns by trial and error. That's reinforcement learning (RL).

In RL, there are two players:

```
┌─────────┐   action    ┌─────────────┐
│  AGENT  │ ──────────> │ ENVIRONMENT │
│  (dog)  │ <────────── │  (world)    │
└─────────┘ observation  └─────────────┘
              + reward
```

- **Agent**: the AI that learns (an LLM, a neural network, etc.)
- **Environment**: the world the agent lives in (our code!)

### What is an "Environment" in code?

An environment is a Python class with three methods:

```python
class MyEnvironment:
    def reset(self):
        """Start a new episode. Return the first observation."""
        ...

    def step(self, action):
        """Agent does something. Return what happened + reward."""
        ...

    def state(self):
        """Return metadata about the current episode."""
        ...
```

That's it. Those three methods are the entire interface between the agent and the world.
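
To make the interface concrete, here is a minimal sketch of a toy environment with this shape. The `GuessNumberEnv` class, its dictionary observations, and its reward values are hypothetical and exist only for illustration; they are not part of OpenEnv or of this project.

```python
import random


class GuessNumberEnv:
    """Toy environment: the agent must guess a hidden integer from 1 to 10."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._target = None
        self._steps = 0
        self._done = True  # no episode until reset() is called

    def reset(self):
        """Start a new episode and return the first observation."""
        self._target = self._rng.randint(1, 10)
        self._steps = 0
        self._done = False
        return {"message": "Guess a number from 1 to 10", "reward": None, "done": False}

    def step(self, action):
        """Process one guess and return what happened plus a reward."""
        if self._done:
            raise RuntimeError("Call reset() before step()")
        self._steps += 1
        if action == self._target:
            self._done = True
            return {"message": "correct", "reward": 1.0, "done": True}
        hint = "higher" if action < self._target else "lower"
        return {"message": hint, "reward": -0.1, "done": False}

    def state(self):
        """Return episode metadata without revealing the hidden target."""
        return {"step_count": self._steps, "done": self._done}


env = GuessNumberEnv(seed=0)
obs = env.reset()
low, high = 1, 10
while not obs["done"]:
    guess = (low + high) // 2  # a binary-search "agent"
    obs = env.step(guess)
    if obs["message"] == "higher":
        low = guess + 1
    elif obs["message"] == "lower":
        high = guess - 1
print(env.state())
```

Note that `state()` reports the step count but never the hidden target: the same "never leak secrets" idea applies to the real environment.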

### What is OpenEnv?

OpenEnv is a **standard** for RL environments. Think of it like USB for hardware -- it doesn't matter what device you plug in, as long as it follows the USB spec. OpenEnv says:

- Your `reset()` must return an `Observation` object
- Your `step()` must accept an `Action` object and return an `Observation`
- Your `state` must return a `State` object
- These objects must be Pydantic models (typed, validated Python objects)
- You must have an `openenv.yaml` manifest file
- You must serve your environment over HTTP (FastAPI)

Why bother with a standard? Because it means **any agent** can talk to **any environment** without custom glue code.

### What does OUR environment do?

Our environment is called the **Scientific Hypothesis Lab**. Here's the idea:

> The agent is a scientist. Each episode, it faces a hidden causal system
> (like "Beta = 2.0 * Alpha + 3.0"). The variables are **abstract** --
> named things like Alpha, Beta, Gamma or V1, V2, V3 -- so the agent
> can't rely on pretrained knowledge of real-world physics. It must
> reason purely from experimental data.

Think of it like a detective game:
- The "crime" is hidden causal rules between variables
- The "clues" are noisy experimental results
- The "solution" is a written hypothesis
- The "score" is how close the hypothesis matches reality

This is a **real-world** task -- it models how actual scientists discover causal relationships. Using abstract variable names ensures the agent genuinely *discovers* rules rather than recalling them from training data.

---

## Part 2: The OpenEnv Contract

Before we look at code, let's understand the contract every OpenEnv environment must fulfill.

### The Three Methods

```
reset(**kwargs) -> Observation
    "Start fresh. Generate a new puzzle. Tell the agent what it sees."

step(action: Action) -> Observation
    "The agent did something. Process it. Tell the agent what happened."

state -> State  (property, not a method call)
    "Return metadata about the current episode. Never leak secrets."
```

### The Three Data Types

Every OpenEnv environment defines three Pydantic models that inherit from base types:

| Type | Base Class | Purpose | Who sees it |
|------|-----------|---------|-------------|
| **Action** | `openenv.core.Action` | What the agent sends | Agent -> Environment |
| **Observation** | `openenv.core.Observation` | What comes back | Environment -> Agent |
| **State** | `openenv.core.State` | Episode metadata | Anyone (debugging) |

The `Observation` base class always includes:
- `done: bool` -- is the episode over?
- `reward: float | None` -- how well did the agent do on this step?

The `State` base class always includes:
- `episode_id: str` -- unique ID for this episode
- `step_count: int` -- how many steps so far

### The Manifest (openenv.yaml)

Every environment needs a tiny YAML file:

```yaml
spec_version: 1           # Which version of the OpenEnv spec
name: hypothesis_lab      # Machine-readable name
type: space               # Deployed as an HF Space
runtime: fastapi           # HTTP framework used
app: server.app:app        # Python path to the ASGI app
port: 8000                 # Port the server listens on
```

This is like a `package.json` for your environment -- it tells the OpenEnv tooling how to find and run your code.

### The Episode Lifecycle

Here's what one complete episode looks like:

```
1. Agent calls reset(noise_level="low", domain="system_alpha")
2. Environment generates a hidden world with random causal rules
3. Environment returns initial Observation (variable names, budget, instructions)

4. LOOP:
   a. Agent reads the observation
   b. Agent decides on an action (experiment or submit)
   c. Agent calls step(action)
   d. Environment processes the action
   e. Environment returns new Observation (results, reward)
   f. If observation.done == True, episode is over

5. Agent reads the state property to see final metadata
```
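
The lifecycle above can be sketched as a small driver loop. Everything here is an assumption made for illustration: `StubLab` is a hypothetical stand-in for the real environment, observations are plain dictionaries rather than typed models, and the reward values are made up.

```python
class StubLab:
    """Hypothetical stand-in for an environment with an experiment budget."""

    def __init__(self, budget=3):
        self.budget_total = budget
        self.budget_remaining = 0
        self.step_count = 0

    def reset(self, **kwargs):
        self.budget_remaining = self.budget_total
        self.step_count = 0
        return {"budget": self.budget_remaining, "reward": None, "done": False}

    def step(self, action):
        self.step_count += 1
        if action["kind"] == "submit" or self.budget_remaining == 0:
            # Submitting (or running out of budget) ends the episode.
            return {"budget": self.budget_remaining, "reward": 1.0, "done": True}
        self.budget_remaining -= 1
        return {"budget": self.budget_remaining, "reward": 0.2, "done": False}


def simple_policy(observation):
    """Experiment while budget remains, then submit."""
    if observation["budget"] > 0:
        return {"kind": "experiment"}
    return {"kind": "submit"}


def run_episode(env, policy, max_steps=100):
    """Steps 1-4 of the lifecycle: reset, then loop until done."""
    observation = env.reset(noise_level="low")
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)        # 4a-4b: read and decide
        observation = env.step(action)      # 4c-4e: act and observe
        total_reward += observation["reward"] or 0.0
        if observation["done"]:             # 4f: stop when done
            break
    return total_reward, env.step_count


total, steps = run_episode(StubLab(budget=3), simple_policy)
print(f"steps={steps}, total reward={total:.2f}")
```

The `max_steps` guard is a defensive choice: a driver loop should terminate even if an environment never sets `done`.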

---

## Part 3: Tour of Every File

Here is every file and what it does. Think of this as the map before we explore each room.

```
hypothesis_lab/

├── openenv.yaml                          # THE MANIFEST
│   "Hi, I'm an OpenEnv environment.      #   Points the framework
│    Here's how to find my server."        #   to server.app:app

├── models.py                             # THE LANGUAGE
│   "These are the words the agent        #   HypLabAction
│    and environment use to talk."         #   HypLabObservation
│                                         #   HypLabState

├── server/                               # THE BRAIN
│   ├── app.py                            #   HTTP server (thin wrapper)
│   ├── hypothesis_lab_environment.py     #   Core game logic
│   ├── causal_world.py                   #   Hidden puzzle generator
│   └── rubric.py                         #   Scoring engine

├── tasks/                                # THE EXAM
│   ├── task_easy.py                      #   Easy test + grader
│   ├── task_medium.py                    #   Medium test + grader
│   └── task_hard.py                      #   Hard test + grader

├── client.py                             # THE PHONE
│   "Typed Python client so agents        #   Wraps HTTP calls
│    don't need to speak raw HTTP."        #   into nice methods

├── baseline_inference.py                 # THE DEMO AGENT
│   "Here's a simple GPT agent that       #   Uses OpenAI API
│    can play the game. Not great,         #   Produces reproducible
│    but proves the game works."           #   scores on all 3 tasks

├── tests/                                # THE SAFETY NET
│   └── test_environment.py               #   39 tests covering
│                                         #   every component

├── Dockerfile                            # THE SHIPPING BOX
│   "Packages everything into a           #   Multi-stage build
│    container for deployment."            #   OpenEnv base image

├── pyproject.toml                        # THE SHOPPING LIST
│   "What Python packages we need."       #   Dependencies + metadata

└── README.md                             # THE COVER LETTER
    "What this environment is and         #   HF Spaces frontmatter
     how to use it."                      #   Action/observation docs
```

Now let's explore each room in detail.

---

## Part 4: The Hidden World

**File: `server/causal_world.py`**

This is the puzzle the agent must solve. Every episode generates a fresh hidden world.

### Core Concept: Causal Graphs

A causal graph is a set of variables connected by rules:

```
Alpha ──(quadratic)──> Beta ──(saturating)──> Gamma
 7.93      B = 0.5*A² + 1.2       G = 10*B / (3 + B)
```

The agent never sees this graph. It can only probe it through experiments.

### Why Abstract Variable Names?

An earlier version of this environment used real-world names like "Temperature", "Pressure", "Volume". This created a serious problem: LLM agents have *pretrained knowledge* about how those variables relate (PV=nRT, supply/demand curves, etc.). The agent would use that prior knowledge instead of reasoning from experimental data -- which defeats the entire purpose.

Now variables are named things like **Alpha, Beta, Gamma** or **V1, V2, V3** or **Quant_A, Quant_B, Quant_C**. The LLM has no prior about how "Alpha" relates to "Beta", so it must genuinely discover the relationship through experiments.

### The Building Blocks

**CausalRule** -- one edge in the graph:

```python
@dataclass
class CausalRule:
    cause: str          # "Alpha"
    effect: str         # "Beta"
    rule_type: str      # one of 8 types (see table below)
    params: dict        # {"a": 2.1, "b": 3.0}
    description: str    # "Beta = 2.1 * Alpha + 3.0"

    def evaluate(self, x: float) -> float:
        # Given x (the cause value), compute the effect value
```

There are **eight** single-parent rule types:

| Rule | Formula | What it looks like | Why it's tricky |
|------|---------|-------------------|-----------------|
| Linear | `y = a*x + b` | Straight line | Easy to identify |
| Threshold | `y = high if x > t else low` | Step function | Need to find the cutoff |
| Inverse | `y = a / x` | Hyperbola | Blows up near zero |
| Quadratic | `y = a*x² + b*x + c` | Parabola | Looks linear in narrow range |
| Exponential | `y = a * exp(k*x)` | Growth/decay curve | Looks linear locally |
| Logarithmic | `y = a * ln(x) + b` | Diminishing returns | Looks linear in mid-range |
| Saturating | `y = Vmax * x / (Km + x)` | Plateau | Looks linear for small x |
| Piecewise-linear | Two slopes with a knot | Bent line | Looks linear on each side |

Many of these look similar with limited data. Quadratic, exponential, and saturating all resemble linear in a narrow range -- the agent must design experiments that *discriminate* between hypotheses (e.g., sampling at extremes to check for curvature).
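
A short sketch shows why narrow sweeps are misleading. It implements a few of the formulas from the table and measures how far a rule strays from the straight line through the endpoints of a sweep. The `RULES` dictionary and the `max_deviation_from_line` helper are hypothetical names for this illustration, not part of `causal_world.py`.

```python
import math

# Formulas follow the table above; parameter names are illustrative.
RULES = {
    "linear":      lambda x, p: p["a"] * x + p["b"],
    "quadratic":   lambda x, p: p["a"] * x**2 + p["b"] * x + p["c"],
    "exponential": lambda x, p: p["a"] * math.exp(p["k"] * x),
    "logarithmic": lambda x, p: p["a"] * math.log(x) + p["b"],
    "saturating":  lambda x, p: p["v_max"] * x / (p["k_m"] + x),
}


def max_deviation_from_line(rule, params, lo, hi, n=21):
    """Largest gap between the rule and the straight line through its endpoints."""
    f = RULES[rule]
    y_lo, y_hi = f(lo, params), f(hi, params)
    worst = 0.0
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        line = y_lo + (y_hi - y_lo) * (x - lo) / (hi - lo)
        worst = max(worst, abs(f(x, params) - line))
    return worst


sat = {"v_max": 10.0, "k_m": 3.0}
narrow = max_deviation_from_line("saturating", sat, 1.0, 1.5)
wide = max_deviation_from_line("saturating", sat, 1.0, 50.0)
print(f"narrow sweep deviation: {narrow:.3f}")
print(f"wide sweep deviation:   {wide:.3f}")
```

On the narrow sweep the curvature is smaller than even low-noise measurement error, so the rule is indistinguishable from a line; on the wide sweep the plateau is obvious.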

**InteractionRule** -- a multi-parent edge where the effect depends on **two** causes:

```python
@dataclass
class InteractionRule:
    cause1: str         # "Alpha"
    cause2: str         # "Beta"
    effect: str         # "Gamma"
    interaction_type: str  # "additive", "multiplicative", "min", "max"
```

These are genuinely hard: the agent can't discover them by varying one variable at a time. It must realise that two parents jointly determine the effect.
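
One way to see the difficulty is to measure the effect of changing one cause while the other is held at two different levels. The sketch below assumes the simplest possible semantics for each interaction type (a plain sum, product, minimum, or maximum of the two parent values); the real `InteractionRule` may apply coefficients that are not shown here.

```python
# Assumed semantics: each interaction type combines the two parent values directly.
COMBINE = {
    "additive":       lambda a, b: a + b,
    "multiplicative": lambda a, b: a * b,
    "min":            min,
    "max":            max,
}


def effect(interaction_type, cause1_value, cause2_value):
    """Effect value for a two-parent rule under the assumed semantics."""
    return COMBINE[interaction_type](cause1_value, cause2_value)


# Raise cause1 from 1.0 to 2.0 while cause2 is held at 1.0, then at 5.0.
for kind in COMBINE:
    change_low = effect(kind, 2.0, 1.0) - effect(kind, 1.0, 1.0)
    change_high = effect(kind, 2.0, 5.0) - effect(kind, 1.0, 5.0)
    print(f"{kind:>14}: cause2=1 -> {change_low:+.1f}, cause2=5 -> {change_high:+.1f}")
```

Only the additive case gives the same change at both levels. For the other three, a single one-variable sweep produces an answer that depends on where the second parent happened to sit, which is why the agent has to vary both.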

**Try it yourself** -- open a Python shell in the project directory:

```python
from server.causal_world import CausalRule

rule = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="linear", params={"a": 2.0, "b": 3.0},
    description="Beta = 2.0 * Alpha + 3.0"
)

print(rule.evaluate(0))   # 3.0  (y = 2*0 + 3)
print(rule.evaluate(5))   # 13.0 (y = 2*5 + 3)
print(rule.evaluate(10))  # 23.0 (y = 2*10 + 3)

# Try a saturating rule
sat = CausalRule(
    cause="Alpha", effect="Beta",
    rule_type="saturating", params={"v_max": 10.0, "k_m": 3.0},
    description="Beta = 10 * Alpha / (3 + Alpha)"
)
print(sat.evaluate(1))    # 2.5  (still growing)
print(sat.evaluate(10))   # 7.69 (approaching plateau)
print(sat.evaluate(1000)) # ~10  (saturated)
```

### CausalWorld -- the full hidden system

The `CausalWorld` holds all the variables, rules, interaction rules, and default values. It also tracks a **confounder_sigma** -- if > 0, a hidden variable injects correlated noise the agent can't explain.

It has four query methods -- one for each experiment type the agent can run:

```python
world.query_intervention(cause, value, effect, sigma)
# "Set Alpha to 5.0. What does Beta become?" (+ noise + confounder)

world.query_correlation(cause, [1, 10, 5], effect, sigma)
# "Sweep Alpha from 1 to 10 in 5 steps. Show me Beta at each."

world.query_counterfactual(cause, delta, effect, sigma)
# "If Alpha increases by +3.0, what happens to Beta?"

world.query_passive(target, sigma)
# "Just show me what Beta is right now, without changing anything."
```

Every result has **Gaussian noise** added. If sigma=0.05, the noise is tiny (easy mode). If sigma=0.50, the noise is huge (hard mode). On top of that, ~27% of worlds also have hidden confounder noise.
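
The effect of sigma is easy to see in isolation. This sketch uses only the standard library and a hypothetical `noisy_reading` helper; it does not call the project's query methods, and it ignores the confounder.

```python
import random
import statistics


def noisy_reading(true_value, sigma, rng):
    """One observation: the true value plus zero-mean Gaussian noise."""
    return true_value + rng.gauss(0.0, sigma)


rng = random.Random(0)
true_beta = 13.0  # e.g. Beta = 2.0 * 5.0 + 3.0

for sigma in (0.05, 0.50):
    single = noisy_reading(true_beta, sigma, rng)
    repeats = [noisy_reading(true_beta, sigma, rng) for _ in range(25)]
    mean = statistics.fmean(repeats)
    print(f"sigma={sigma:.2f}: single={single:.3f}, mean of 25={mean:.3f}")
```

Averaging repeated readings shrinks the noise, but every repeat costs budget, so at high sigma the agent has to trade precision against coverage.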

**Try it yourself:**

```python
from server.causal_world import generate_world

world = generate_world(n_variables=3, domain="system_alpha", seed=42)
print("Variables:", world.variables)
print("Ground truth:")
print(world.ground_truth_summary())

# Check for interactions and confounders
print(f"\nInteraction rules: {len(world.interactions)}")
print(f"Confounder sigma: {world.confounder_sigma}")

# Run an experiment
cause, effect = world.variables[0], world.variables[1]
result = world.query_intervention(cause, 5.0, effect, sigma=0.05)
print(f"\nSet {cause}=5.0, observed {effect}={result:.4f}")
```

### The generate_world() Function

This is the factory that builds a fresh puzzle:

1. Pick a domain (system_alpha/beta/gamma/delta) -- this only changes the context prompt
2. Pick an abstract variable pool (Greek letters, V1-V5, Quant_A-E, etc.)
3. Choose N variables and connect them with random rules (8 possible types)
4. Add extra random edges with 30% probability
5. Optionally replace some single-parent rules with multi-parent interaction rules (~40% chance when n >= 3)
6. Optionally add a hidden confounder (~30% chance when n >= 3)
7. Compute default values for all variables
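
Steps 2 to 4 can be sketched in a few lines. This is a simplified illustration, not the project's `generate_world()`: the `sketch_world` function is hypothetical, it assumes the variables are connected in a chain, and it skips parameters, interaction rules, confounders, and default values.

```python
import random

RULE_TYPES = ["linear", "threshold", "inverse", "quadratic",
              "exponential", "logarithmic", "saturating", "piecewise_linear"]
VAR_POOLS = [["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
             ["V1", "V2", "V3", "V4", "V5"]]


def sketch_world(n_variables=3, seed=None, extra_edge_prob=0.30):
    """Return (variables, edges); each edge is (cause, effect, rule_type)."""
    rng = random.Random(seed)
    variables = rng.choice(VAR_POOLS)[:n_variables]

    # Assumed layout: chain the variables so every one is connected.
    edges = [(variables[i], variables[i + 1], rng.choice(RULE_TYPES))
             for i in range(n_variables - 1)]

    # Extra edges only point forward in the ordering, so the graph stays acyclic.
    for i in range(n_variables):
        for j in range(i + 2, n_variables):
            if rng.random() < extra_edge_prob:
                edges.append((variables[i], variables[j], rng.choice(RULE_TYPES)))
    return variables, edges


variables, edges = sketch_world(n_variables=4, seed=42)
for cause, effect, rule_type in edges:
    print(f"{cause} -> {effect} ({rule_type})")
```

Passing a seed makes the puzzle reproducible, which matters for tests and for comparing agents on identical worlds.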

### Domains and Variable Pools

Domains provide different narrative prompts but use the same abstract variable names:

```python
DOMAIN_LABELS = {
    "system_alpha": {"context": "You are studying an unknown dynamical system..."},
    "system_beta":  {"context": "You are investigating a black-box system..."},
    "system_gamma": {"context": "You are analysing an opaque process..."},
    "system_delta": {"context": "You are probing a simulated environment..."},
}

ABSTRACT_VAR_POOLS = [
    ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
    ["Zeta", "Eta", "Theta", "Iota", "Kappa"],
    ["V1", "V2", "V3", "V4", "V5"],
    ["Rho", "Sigma", "Tau", "Upsilon", "Phi"],
    # ... more pools
]
```

Each episode randomly selects a pool, so the agent can't even memorise variable-name-to-position mappings across episodes.

---

## Part 5: The Reward Engine

**File: `server/rubric.py`**

The reward function is arguably the most important part of any RL environment. A bad reward function trains bad agents. Let's understand every piece.

### Two Kinds of Rewards

Our environment gives rewards at two different times:

**Per-step rewards** (during the episode):
- Every experiment earns an information-gain reward
- Redundant experiments get penalized

**End-of-episode rewards** (when the agent submits its hypothesis):
- Accuracy, precision, calibration, efficiency, contradiction checks

### Per-Step: InfoGainTracker

This tracks which variable pairs (edges) the agent has probed:

```python
tracker = InfoGainTracker()

# First time probing Alpha -> Beta: +0.20
reward, redundant = tracker.record_and_score("Alpha", "Beta", "intervention", 5.0)
# reward = 0.20, redundant = False

# Second time, different experiment type (triangulation!): +0.25
reward, redundant = tracker.record_and_score("Alpha", "Beta", "correlation", [1,10,5])
# reward = 0.25, redundant = False  (BONUS for using different experiment type!)

# Third time: only +0.05
# Fourth time: -0.10 (PENALTY)
```

The reward schedule:

| Visit # | Same type | Different type | Purpose |
|---------|-----------|---------------|---------|
| 1st | +0.20 | +0.20 | Reward exploration |
| 2nd | +0.12 | +0.25 | Reward triangulation |
| 3rd | +0.05 | +0.05 | Diminishing returns |
| 4th+ | -0.10 | -0.10 | Punish redundancy |

**Why this design?** In real science, repeating the exact same experiment is wasteful. But using a *different* method to study the same relationship (triangulation) is valuable because it confirms findings. Our reward function teaches the agent this lesson.
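
The schedule in the table can be written down in a few lines. The class below is a stand-alone sketch for illustration only: `ScheduleSketch` is a hypothetical name, and the real `InfoGainTracker` has a different interface and tracks more (such as the redundancy flag and cumulative gain).

```python
from collections import defaultdict


class ScheduleSketch:
    """Stand-alone re-implementation of the reward table, for illustration only."""

    def __init__(self):
        # (cause, effect) -> list of experiment types already used on that edge
        self._types_seen = defaultdict(list)

    def score(self, cause, effect, experiment_type):
        seen = self._types_seen[(cause, effect)]
        visit = len(seen) + 1
        new_type = experiment_type not in seen
        seen.append(experiment_type)

        if visit == 1:
            return 0.20                        # first look at this edge
        if visit == 2:
            return 0.25 if new_type else 0.12  # triangulation beats repetition
        if visit == 3:
            return 0.05                        # diminishing returns
        return -0.10                           # fourth visit onward: penalty


sketch = ScheduleSketch()
plan = ["intervention", "correlation", "counterfactual", "passive"]
rewards = [sketch.score("Alpha", "Beta", kind) for kind in plan]
print(rewards)
```

Because the counter is keyed by the variable pair, moving on to a different edge resets the schedule to the full first-visit reward.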

**Try it yourself:**

```python
from server.rubric import InfoGainTracker

tracker = InfoGainTracker()
for i in range(5):
    reward, redundant = tracker.record_and_score("A", "B", "intervention", 1.0)
    print(f"Visit {i+1}: reward={reward:+.2f}, redundant={redundant}")

print(f"\nCumulative info gain: {tracker.cumulative_gain:.2f}")
print(f"Redundant experiments: {tracker.redundant_count}")
```

### End-of-Episode: score_hypothesis()

When the agent submits, five scoring components fire:

#### 1. Accuracy Score (0.0 - 1.0)

How much of the ground truth did the agent discover?

For **single-parent rules**, the scorer checks:
- Did the hypothesis mention both the cause and effect variable names?  (+0.4 per rule)
- Did it identify the relationship type (linear, quadratic, saturating, etc.)?  (+0.3 per rule)
- Did it include the correct numerical parameters?  (+0.3 per rule)

For **interaction rules**, the scorer checks:
- Did the hypothesis mention the effect and at least one cause?  (+0.3)
- Did it mention both causes?  (+0.2 additional)
- Did it identify the interaction type (additive, multiplicative, etc.)?  (+0.5)

Example: if the ground truth is `Beta = 2.0 * Alpha + 3.0` and the agent writes "Beta increases linearly with Alpha at a slope of 2.0", it scores high on all three checks.

Each of the 8 rule types has its own set of keywords the scorer recognises (e.g. "saturating", "plateau", "asymptote" for saturating rules; "quadratic", "squared", "parabola" for quadratic).
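
As a rough sketch of how keyword-based scoring of one single-parent rule could work, consider the function below. It is not the code in `rubric.py`: the `score_single_rule` name, the keyword list for linear rules, and the parameter-matching rule (a number within 10% of the true value, with proportional credit) are all assumptions.

```python
import re

# Saturating and quadratic keywords come from the text above; the linear list is assumed.
TYPE_KEYWORDS = {
    "linear": ["linear", "slope", "proportional"],
    "quadratic": ["quadratic", "squared", "parabola"],
    "saturating": ["saturating", "plateau", "asymptote"],
}


def score_single_rule(hypothesis, cause, effect, rule_type, params):
    """Score one single-parent rule on the 0.4 / 0.3 / 0.3 split."""
    text = hypothesis.lower()
    score = 0.0

    if cause.lower() in text and effect.lower() in text:
        score += 0.4  # both variables named

    if any(word in text for word in TYPE_KEYWORDS.get(rule_type, [])):
        score += 0.3  # relationship type named

    # Assumed matching rule: a parameter counts if some number in the text
    # is within 10% of it; credit is proportional to the share matched.
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]
    matched = sum(
        any(abs(n - value) <= 0.1 * abs(value) for n in numbers)
        for value in params.values()
    )
    if params:
        score += 0.3 * matched / len(params)
    return score


full = score_single_rule("Beta = 2.0 * Alpha + 3.0, a linear relationship.",
                         "Alpha", "Beta", "linear", {"a": 2.0, "b": 3.0})
vague = score_single_rule("Alpha affects Beta somehow.",
                          "Alpha", "Beta", "linear", {"a": 2.0, "b": 3.0})
print(f"full={full:.2f}, vague={vague:.2f}")
```

Keyword matching is cheap and deterministic, but it only checks that the right words and numbers appear, not that the sentence as a whole is correct.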

#### 2. Precision Bonus (+0.10)

Does the hypothesis contain actual numbers? "Alpha affects Beta" scores 0. "Beta = 2.0 * Alpha + 3.0" scores +0.10. This rewards agents that make **falsifiable, quantitative claims** instead of vague hand-waving.
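
How the rubric detects numbers is not shown here, so the following is only one plausible approach. The `NUMBER` pattern and `precision_bonus` function are hypothetical; the pattern deliberately ignores digits that are part of a variable name such as V1.

```python
import re

# A number that is not glued to a preceding letter, digit, underscore, or dot,
# so the "1" in a variable name like V1 does not count.
NUMBER = re.compile(r"(?<![A-Za-z_\d.])\d+(?:\.\d+)?")


def precision_bonus(hypothesis):
    """Return 0.10 if the hypothesis contains a standalone number, else 0.0."""
    return 0.10 if NUMBER.search(hypothesis) else 0.0


print(precision_bonus("Alpha affects Beta"))        # vague claim
print(precision_bonus("Beta = 2.0 * Alpha + 3.0"))  # quantitative claim
print(precision_bonus("V1 affects V2"))             # digits only inside names
```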

#### 3. Calibration Score (0.0 - 0.20)

When the agent submits, it also reports a confidence level (0.0 to 1.0). Calibration measures how well that confidence matches the actual accuracy:

```
calibration = 0.20 * (1 - |confidence - accuracy| / 0.5)
```

If the agent says confidence=0.9 but accuracy=0.2, that's overconfident and scores low. If confidence=0.3 and accuracy=0.2, that's well-calibrated and scores high. This teaches agents to **know what they don't know**.
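
The two examples can be checked with a direct transcription of the formula. The clamp at zero is an assumption, inferred from the stated 0.0 - 0.20 range: without it, a gap larger than 0.5 would give a negative score.

```python
def calibration_score(confidence, accuracy, weight=0.20, tolerance=0.5):
    """Formula from above, clamped at zero (the clamp is an assumption)."""
    gap = abs(confidence - accuracy)
    return max(0.0, weight * (1.0 - gap / tolerance))


print(f"{calibration_score(0.9, 0.2):.2f}")  # overconfident
print(f"{calibration_score(0.3, 0.2):.2f}")  # close to the actual accuracy
print(f"{calibration_score(0.2, 0.2):.2f}")  # perfectly calibrated
```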

#### 4. Efficiency Bonus (+0.15)

If the agent submits early (30%+ budget remaining) with decent accuracy (60%+), it gets a bonus. This rewards agents that don't waste time running unnecessary experiments.

#### 5. Contradiction Penalty (-0.50)

If the hypothesis contradicts the experimental setup (e.g., claiming "all variables are independent" or "no causal relationship exists"), it gets a harsh penalty. This teaches agents not to give up without trying.
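
The last two components are simple threshold checks. In this sketch the function names are hypothetical, the thresholds are the ones stated above, and the phrase list contains only the two examples quoted in the text; the real rubric may check more.

```python
# Phrases quoted in the text above; the real rubric may check more.
CONTRADICTION_PHRASES = [
    "all variables are independent",
    "no causal relationship exists",
]


def efficiency_bonus(budget_remaining, budget_total, accuracy):
    """0.15 when at least 30% of the budget is left and accuracy is at least 0.6."""
    if budget_total <= 0:
        return 0.0
    early = budget_remaining / budget_total >= 0.30
    return 0.15 if early and accuracy >= 0.60 else 0.0


def contradiction_penalty(hypothesis):
    """-0.50 when the hypothesis denies that any causal structure exists."""
    text = hypothesis.lower()
    return -0.50 if any(p in text for p in CONTRADICTION_PHRASES) else 0.0


print(efficiency_bonus(4, 10, accuracy=0.9))  # 40% of budget left, accurate
print(efficiency_bonus(1, 10, accuracy=0.9))  # accurate but late
print(contradiction_penalty("All variables are independent."))
```

Tying the efficiency bonus to an accuracy threshold matters: without it, an agent could collect the bonus by submitting a guess immediately.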

**Try it yourself:**

```python
import numpy as np
from server.causal_world import CausalWorld, CausalRule
from server.rubric import score_hypothesis

rule = CausalRule("Alpha", "Beta", "linear",
                  {"a": 2.0, "b": 3.0},
                  "Beta = 2.0 * Alpha + 3.0")

world = CausalWorld(
    domain="system_alpha",
    variables=["Alpha", "Beta"],
    units={"Alpha": "units", "Beta": "units"},
    rules=[rule],
    default_values={"Alpha": 5.0, "Beta": 13.0},
    rng=np.random.default_rng(0),
)

# Good hypothesis
result = score_hypothesis(
    "Beta = 2.0 * Alpha + 3.0. Linear relationship.",
    ["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
    world=world,
    budget_remaining=4,
    budget_total=10,
)
print(f"Accuracy:     {result.accuracy_score:.2f}")
print(f"Precision:    {result.precision_bonus:.2f}")
print(f"Calibration:  {result.calibration_score:.2f}")
print(f"Efficiency:   {result.efficiency_bonus:.2f}")
print(f"Contradiction:{result.contradiction_penalty:.2f}")
print(f"TOTAL:        {result.total:.2f}")
print(f"\nFeedback: {result.feedback}")
```

---

## Part 6: The Environment Core

**File: `server/hypothesis_lab_environment.py`**

This is the central nervous system. It ties together the hidden world, the rubric, and the data models.

### The Class Structure

```python
class HypothesisLabEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS = True  # Multiple agents can play at once

    def __init__(self, **kwargs):
        # Initialize empty state -- no episode running yet
        self._world = None           # The hidden causal graph
        self._tracker = None         # InfoGainTracker for per-step rewards
        self._step_count = 0
        self._budget_remaining = 0
        self._done = True            # No episode until reset() is called
        self._history = []           # Log of all experiments
        ...
```

### reset() -- Starting a New Episode

```python
def reset(self, seed=None, episode_id=None, **kwargs):
    # 1. Read difficulty parameters
    noise_level = kwargs.get("noise_level", "medium")  # low/medium/high
    domain = kwargs.get("domain", None)                 # system_alpha/beta/gamma/delta

    # 2. Look up noise and budget from schedule tables
    sigma = NOISE_SCHEDULE[noise_level]    # low=0.05, medium=0.20, high=0.50
    budget = BUDGET_SCHEDULE[noise_level]  # low=12,   medium=10,   high=8
    n_vars = N_VARIABLES_SCHEDULE[noise_level]  # low=2, medium=3, high=4

    # 3. Generate a fresh hidden world (abstract variable names, 8+ rule types)
    self._world = generate_world(n_variables=n_vars, domain=domain, seed=seed)

    # 4. Initialize tracking
    self._tracker = InfoGainTracker()
    self._budget_remaining = budget
    self._done = False

    # 5. Return initial observation (variable names, budget, instructions)
    return HypLabObservation(
        system_message="New episode started. You have 3 unknown variables...",
        available_variables=self._world.variables,
        budget_remaining=budget,
        done=False,
        reward=0.0,
    )
```

**Key insight:** `reset()` generates a *fresh* hidden world on every call (you get the same world again only if you reuse the same seed). Nothing carries over inside the environment between episodes, so each episode is an independent puzzle.

### step() -- Processing an Action

```python
def step(self, action: HypLabAction, **kwargs):
    if self._done:
        raise RuntimeError("Episode is done. Call reset().")

    self._step_count += 1

    if action.action_type == ActionType.EXPERIMENT:
        return self._handle_experiment(action)
    elif action.action_type == ActionType.SUBMIT:
        return self._handle_submit(action)
```

There are only two things the agent can do: run an experiment, or submit a hypothesis. This is a **clean action space** -- no ambiguity about what actions are valid.

### _handle_experiment() -- Running an Experiment

This is the longest method. Here's what it does:

1. **Validate** the variable names (are they real variables in this world?)
2. **Route** to the right query method based on experiment type
3. **Format** the result as human-readable text (for the LLM to read)
4. **Score** the information gain via InfoGainTracker
5. **Deduct** budget
6. **Check** if budget is exhausted
7. **Return** observation with all the details

### _handle_submit() -- Grading the Hypothesis

1. Mark episode as done
2. Call `score_hypothesis()` from the rubric
3. Format the rubric breakdown as text
4. Return observation with scores and revealed ground truth

**Key insight:** the ground truth is only revealed **after** submission. This prevents the agent from cheating.

### state -- Episode Metadata

```python
@property
def state(self) -> HypLabState:
    return HypLabState(
        episode_id=self._episode_id,
        step_count=self._step_count,
        budget_remaining=self._budget_remaining,
        noise_level=self._noise_level,
        experiment_history=self._history,  # What experiments ran so far
        ...
    )
```

**Critical rule:** `state` must NEVER leak the hidden world. No rule types, no parameters, no ground truth. Only metadata the agent already knows.

**Try the full loop yourself:**

```python
from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment

env = HypothesisLabEnvironment()

# Start a new episode
obs = env.reset(seed=42, noise_level="low", domain="system_alpha")
print("=== RESET ===")
print(obs.system_message)
print()

# Run an experiment
vars_ = obs.available_variables
action = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.INTERVENTION,
    control_variable=vars_[0],
    target_variable=vars_[1],
    control_value=5.0,
)
obs = env.step(action)
print("=== EXPERIMENT ===")
print(obs.system_message)
print(f"Info gain: {obs.info_gain_reward}")
print()

# Try a correlation sweep
action2 = HypLabAction(
    action_type=ActionType.EXPERIMENT,
    experiment_type=ExperimentType.CORRELATION,
    control_variable=vars_[0],
    control_range=[1.0, 10.0, 5.0],
    target_variable=vars_[1],
)
obs = env.step(action2)
print("=== CORRELATION ===")
print(obs.system_message)
print()

# Submit hypothesis
submit = HypLabAction(
    action_type=ActionType.SUBMIT,
    hypothesis_text=f"{vars_[1]} is linearly related to {vars_[0]} with slope ~2.0",
    hypothesis_equations=[f"{vars_[1]} = 2.0 * {vars_[0]} + 3.0"],
    confidence=0.75,
)
obs = env.step(submit)
print("=== SUBMIT ===")
print(obs.system_message)
```

---

## Part 7: The Data Models

**File: `models.py`**

This file defines the *language* the agent and environment speak. Every piece of data that crosses the boundary must be one of these types.

### Why Pydantic?

Pydantic gives us:
1. **Validation** -- if the agent sends `control_value="hello"` instead of a number, it gets a clear error
2. **Serialization** -- objects convert to/from JSON automatically for HTTP transport
3. **Documentation** -- every field has a type and a description
4. **IDE support** -- autocomplete and type checking

### The Import Pattern

```python
try:
    from openenv.core.env_server.types import Action, Observation, State
except ImportError:
    # Fallback for when openenv-core isn't installed
    from pydantic import BaseModel
    class Action(BaseModel): ...
    class Observation(BaseModel): ...
    class State(BaseModel): ...
```

This pattern lets the code work both:
- In production (with openenv-core installed)
- In development/testing (without it)

### The Enums

```python
class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"
    COUNTERFACTUAL = "counterfactual"
    PASSIVE = "passive"

class ActionType(str, Enum):
    EXPERIMENT = "experiment"
    SUBMIT = "submit"

class NoiseLevelTag(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
```

Using `str, Enum` means these serialize as simple strings in JSON: `"intervention"` instead of `ExperimentType.INTERVENTION`. This makes the API friendly for LLM agents that output raw JSON.
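
You can see this behaviour with nothing but the standard library. The snippet below redefines a two-member copy of `ExperimentType` so it runs on its own; it is a demonstration of the `str, Enum` pattern, not the project's `models.py`.

```python
import json
from enum import Enum


class ExperimentType(str, Enum):
    INTERVENTION = "intervention"
    CORRELATION = "correlation"


# Because the enum subclasses str, the standard json module writes the value.
payload = json.dumps({"experiment_type": ExperimentType.INTERVENTION})
print(payload)

# Parsing goes the other way: look the member up by its string value.
parsed = ExperimentType(json.loads(payload)["experiment_type"])
print(parsed is ExperimentType.INTERVENTION)
```

A plain `Enum` without the `str` mixin would make `json.dumps` raise a `TypeError`, which is why the mixin is worth the extra word.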

### HypLabAction -- What the Agent Sends

The action model is **polymorphic** -- it handles two different use cases in one object:

```python
# Use case 1: Run an experiment
HypLabAction(
    action_type="experiment",
    experiment_type="intervention",
    control_variable="Alpha",
    control_value=5.0,
    target_variable="Beta",
)

# Use case 2: Submit a hypothesis
HypLabAction(
    action_type="submit",
    hypothesis_text="Beta = 2.0 * Alpha + 3.0",
    hypothesis_equations=["Beta = 2.0 * Alpha + 3.0"],
    confidence=0.85,
)
```

The experiment fields are `Optional` so they can be `None` when submitting, and vice versa. This is a common pattern in RL environments where the action space has distinct modes.
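
One consequence of all-optional fields is that type validation alone cannot tell you whether an action is complete for its mode. A small mode-aware check can fill that gap. The sketch below works on plain dicts and is hypothetical: the helper `missing_fields` and the choice of which fields count as required per mode are assumptions, not taken from `models.py`.

```python
# Assumed minimum fields per mode (an illustration, not the project's rules).
EXPERIMENT_FIELDS = ("experiment_type", "target_variable")
SUBMIT_FIELDS = ("hypothesis_text", "confidence")


def missing_fields(action: dict) -> list:
    """Return the fields an action dict still needs for its declared mode."""
    mode = action.get("action_type")
    if mode == "experiment":
        required = EXPERIMENT_FIELDS
    elif mode == "submit":
        required = SUBMIT_FIELDS
    else:
        return ["action_type"]
    return [name for name in required if action.get(name) is None]


print(missing_fields({"action_type": "submit", "hypothesis_text": "Beta = 2.0 * Alpha + 3.0"}))
```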

### HypLabObservation -- What Comes Back

Observations are rich and multi-purpose:

- **Always present**: `system_message`, `available_variables`, `budget_remaining`, `done`, `reward`
- **After experiments**: `result_value`, `noise_sigma`, `info_gain_reward`, `is_redundant`
- **After submission**: `accuracy_score`, `total_episode_reward`, `ground_truth_revealed`

The `system_message` field is crucial -- it's the human-readable text that an LLM agent reads (e.g. "Set Alpha=5.0, observed Beta=13.04"). The structured fields are for programmatic access.

### HypLabState -- Episode Metadata

```python
class HypLabState(State):
    budget_total: int = 0
    budget_remaining: int = 0
    noise_level: NoiseLevelTag = NoiseLevelTag.MEDIUM
    experiment_history: list[dict] = []
    cumulative_info_gain: float = 0.0
    redundant_experiment_count: int = 0
```

Notice what's NOT here: no `rules`, no `default_values`, no `ground_truth`. The state is safe to show to the agent without leaking the answer.

---

## Part 8: The Server

**File: `server/app.py`**

This is the thinnest file in the project, and that's by design.

```python
from openenv.core.env_server.http_server import create_app

app = create_app(
    HypothesisLabEnvironment,  # The environment class
    HypLabAction,              # What the agent sends
    HypLabObservation,         # What comes back
    env_name="hypothesis_lab",
    max_concurrent_envs=200,
)
```

`create_app()` does all the heavy lifting:
- Creates FastAPI routes: `/reset`, `/step`, `/state`, `/health`, `/schema`
- Handles session management (multiple agents playing at once)
- Serializes/deserializes Pydantic models to/from JSON
- Adds WebSocket support for persistent connections

You almost never need to touch this file. The magic is in `create_app()`.

### The HTTP Endpoints

| Endpoint | Method | What it does |
|----------|--------|-------------|
| `/health` | GET | Returns `{"status": "ok"}` -- for Docker healthchecks |
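| `/reset` | POST | Starts a new episode, returns initial observation (over plain HTTP this is a stateless one-shot; see Part 13, Step 5) |
| `/step` | POST | Sends an action, returns observation + reward (needs a live episode, so use the `/ws` WebSocket for multi-step runs) |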
| `/state` | GET | Returns current episode metadata |
| `/schema` | GET | Returns JSON schemas for Action/Observation |

### Running the Server

```bash
cd "files 2"
uvicorn server.app:app --port 8000
```

Then in another terminal:

```bash
curl http://localhost:8000/health
# {"status": "ok"}

curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"noise_level": "low", "domain": "system_alpha", "seed": 42}'
```

---

## Part 9: The Client

**File: `client.py`**

The client is the agent's friendly interface to the server. Instead of constructing raw HTTP requests, the agent gets nice typed methods.

```python
class HypothesisLabEnv(EnvClient[HypLabAction, HypLabObservation, HypLabState]):
```

The `EnvClient` base class handles:
- WebSocket connections (persistent, faster than HTTP polling)
- Automatic reconnection
- JSON serialization

Our client adds convenience methods:

```python
await env.run_intervention("Alpha", 5.0, "Beta")
await env.run_correlation("Alpha", [1, 10, 5], "Beta")
await env.run_counterfactual("Alpha", 3.0, "Beta")
await env.run_passive("Beta")
await env.submit_hypothesis("Beta = 2.0 * Alpha + 3.0", confidence=0.85)
```

Each method constructs the right `HypLabAction` internally so the agent doesn't have to remember the field names.

### The Three Abstract Methods

Every `EnvClient` subclass must implement:

```python
def _step_payload(self, action):
    """Convert a HypLabAction into a JSON-ready dict."""
    return action.model_dump(exclude_none=True)

def _parse_result(self, payload):
    """Convert a JSON dict from the server into a StepResult."""
    obs = HypLabObservation(**payload)
    return StepResult(observation=obs, reward=..., done=...)

def _parse_state(self, payload):
    """Convert a JSON dict into a HypLabState."""
    return HypLabState(**payload)
```

---

## Part 10: Tasks and Graders

**Files: `tasks/task_easy.py`, `tasks/task_medium.py`, `tasks/task_hard.py`**

The hackathon rules require **minimum 3 tasks** with **programmatic graders** that return scores between 0.0 and 1.0.

### What is a Task?

A task is a configuration dict that says "run the environment with these settings":

```python
TASK_EASY = {
    "id": "easy",
    "name": "Easy -- Single-Edge Discovery",
    "description": "Discover the causal relationship between two abstract variables...",
    "difficulty": "easy",
    "reset_kwargs": {
        "noise_level": "low",        # sigma = 0.05
        "domain": "system_alpha",    # abstract domain
        "seed": 42,                  # deterministic for reproducibility
    },
}
```

### What is a Grader?

A grader takes the episode results and returns a normalized score:

```python
def grade_easy(episode_result: dict) -> float:
    accuracy = episode_result.get("accuracy_score", 0.0)
    efficiency = episode_result.get("efficiency_bonus", 0.0)
    calibration = episode_result.get("calibration_score", 0.0)

    raw = (
        0.60 * min(accuracy, 1.0)               # 60% weight on accuracy
        + 0.20 * min(efficiency / 0.15, 1.0)     # 20% weight on efficiency
        + 0.20 * min(calibration / 0.20, 1.0)    # 20% weight on calibration
    )

    return round(max(0.0, min(1.0, raw)), 4)
```

### Difficulty Progression

| | Easy | Medium | Hard |
|---|---|---|---|
| Variables | 2 | 3 | 4 |
| Noise (sigma) | 0.05 | 0.20 | 0.50 |
| Budget | 12 | 10 | 8 |
| Domain | system_alpha (fixed) | Random | Random |
| Key challenge | Single edge | Multiple edges + interactions | Complex graph + confounders + noise |

The hard task is genuinely hard for frontier models:
- 4 variables means up to 6 possible edges to discover
- Rules can be any of 8 types (not just linear!) plus interaction rules
- High noise + hidden confounders make every observation unreliable
- Only 8 experiments to figure it all out
- Abstract variable names prevent exploiting pretrained knowledge

**Try it yourself:**

```python
from tasks.task_easy import grade_easy

# Perfect episode
score = grade_easy({
    "accuracy_score": 1.0,
    "efficiency_bonus": 0.15,
    "calibration_score": 0.20,
})
print(f"Perfect score: {score}")  # 1.0

# Mediocre episode
score = grade_easy({
    "accuracy_score": 0.4,
    "efficiency_bonus": 0.0,
    "calibration_score": 0.05,
})
print(f"Mediocre score: {score}")  # ~0.29

# Zero effort
score = grade_easy({})
print(f"Zero score: {score}")  # 0.0
```

---

## Part 11: The Baseline Agent

**File: `baseline_inference.py`**

This script proves the environment works by running a real LLM agent against all three tasks.

### The Flow

```
1. Create an OpenAI client (reads OPENAI_API_KEY from env)
2. For each of the 3 tasks:
   a. Create a fresh HypothesisLabEnvironment
   b. Call reset() with the task's settings
   c. Enter a loop (max 8 turns):
      - Send the observation to the LLM as a "user" message
      - Parse the LLM's response into a HypLabAction
      - Call step(action)
      - If done, break
   d. If not done after 8 turns, force a submit
   e. Grade the episode with the task's grader
3. Print all scores
```

### The System Prompt

The system prompt teaches the LLM how to interact with the environment:

```
You are a scientific AI assistant trained to discover hidden causal rules.
...
Format your actions as JSON:
{"action_type": "experiment", "experiment_type": "intervention", ...}
...
Strategy tips:
- Run interventions first to discover which variables are causally connected
- Vary the control variable widely (e.g. 1, 5, 10) to detect nonlinearity
- Don't repeat the same experiment -- redundant experiments are penalised
```

### The Action Parser

LLMs don't always produce perfect JSON. The parser handles multiple formats:

1. **JSON in code blocks**: `` ```json {...} ``` ``
2. **Raw JSON**: `{...}`
3. **Natural language**: "I conclude that Beta = 2 * Alpha" (extracted via regex)
4. **Timeout**: if it's the last turn, force a submit with whatever text the LLM wrote
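
The four fallbacks above can be sketched as one function that tries each format in order. This is a simplified stand-in for the parser in `baseline_inference.py`: the name `parse_action`, the regular expressions, and the "I conclude that" trigger phrase are assumptions chosen for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the sketch fence-safe
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)
RAW_JSON = re.compile(r"\{.*\}", re.DOTALL)


def parse_action(reply: str, last_turn: bool = False):
    """Turn an LLM reply into an action dict, or None if nothing usable is found."""
    # 1. and 2. Try fenced JSON first, then the widest {...} span in the reply.
    for pattern in (FENCED_JSON, RAW_JSON):
        match = pattern.search(reply)
        if match:
            try:
                candidate = match.group(1) if pattern.groups else match.group(0)
                action = json.loads(candidate)
            except json.JSONDecodeError:
                continue
            if isinstance(action, dict) and "action_type" in action:
                return action

    # 3. Natural-language conclusion, e.g. "I conclude that Beta = 2 * Alpha".
    concluded = re.search(r"\bI conclude that\s+(.+)", reply, re.IGNORECASE)
    if concluded:
        return {"action_type": "submit", "hypothesis_text": concluded.group(1).strip()}

    # 4. Out of turns: submit whatever text the model produced.
    if last_turn:
        return {"action_type": "submit", "hypothesis_text": reply.strip()}
    return None


print(parse_action("After three experiments, I conclude that Beta = 2 * Alpha"))
```

Returning `None` when nothing matches lets the caller decide whether to re-prompt the model or spend the turn, instead of guessing an action on its behalf.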

### Running It

```bash
export OPENAI_API_KEY=sk-...
python baseline_inference.py
```

Example output (scores depend on the model and vary from run to run):

```
============================================================
  Scientific Hypothesis Lab -- Baseline Inference
  Model: gpt-4o-mini
============================================================

--- Task: Easy -- Single-Edge Discovery ---
    Total episode reward: +0.6100
    Graded score:         0.6500

--- Task: Medium -- Multi-Edge Discovery ---
    Total episode reward: +0.3800
    Graded score:         0.4000

--- Task: Hard -- Complex Graph Under Noise ---
    Total episode reward: +0.2100
    Graded score:         0.2500

============================================================
  SUMMARY
============================================================
  easy    : 0.6500
  medium  : 0.4000
  hard    : 0.2500
  average : 0.4333
```

---

## Part 12: Testing

**File: `tests/test_environment.py`**

39 tests organized into 5 test classes. Run them with:

```bash
pytest tests/ -v
```

### Test Classes

| Class | Tests | What it covers |
|-------|-------|----------------|
| TestCausalWorld | 18 | World generation, all 8 rule types, interactions, domains, seeds, abstract names |
| TestInfoGainTracker | 4 | Reward schedule, redundancy, triangulation |
| TestRubric | 6 | Accuracy scoring, calibration, efficiency, feedback |
| TestEnvironmentIntegration | 6 | Full episodes, budget exhaustion, errors, state leaks |
| TestGraders | 5 | Grader range [0,1], zero input, perfect input |

### Key Tests to Study

**Seed reproducibility** -- same seed produces same world:
```python
world1 = generate_world(n_variables=3, domain="system_alpha", seed=99)
world2 = generate_world(n_variables=3, domain="system_alpha", seed=99)
assert world1.variables == world2.variables
```

**Variable names are abstract** -- no real-world names that give LLMs prior knowledge:
```python
for seed in range(50):
    world = generate_world(n_variables=4, seed=seed)
    for v in world.variables:
        assert v.lower() not in {"temperature", "pressure", "price", ...}
```

**State doesn't leak secrets**:
```python
st = env.state
state_str = str(st.model_dump())
assert "rule_type" not in state_str
assert "params" not in state_str
```

**Diverse rule types over many seeds** -- at least 5 distinct rule types show up across 100 seeds:
```python
types_seen = set()
for seed in range(100):
    world = generate_world(n_variables=3, seed=seed)
    for rule in world.rules:
        types_seen.add(rule.rule_type)
assert len(types_seen) >= 5
```

**Grader always returns [0, 1]**:
```python
score = grade_easy({"accuracy_score": 1.0, "efficiency_bonus": 0.15, ...})
assert 0.0 <= score <= 1.0
```

---

## Part 13: Deployment

### Dockerfile

The Dockerfile uses a multi-stage build:

```
Stage 1 (builder):
  - Start from OpenEnv base image
  - Copy source code
  - Install uv (Python package manager)
  - Run uv sync to install dependencies
  - This creates a .venv with all packages

Stage 2 (runtime):
  - Start from a clean base image
  - Copy only the .venv and source code (not build tools)
  - Set PATH and PYTHONPATH
  - Run uvicorn to start the server
```

### Step 1: Build the Docker Image

```bash
cd Lab-experiment
docker build -t hypothesis-lab .
```

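This takes 2-5 minutes the first time (downloads base image + installs dependencies). Subsequent builds are fast thanks to layer caching. With the legacy builder you should see `Successfully tagged hypothesis-lab:latest` at the end; BuildKit, the default in recent Docker releases, ends with a `naming to docker.io/library/hypothesis-lab:latest` line instead.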

If the build fails, check:
- `pyproject.toml` has `build-backend = "setuptools.build_meta"` (not the experimental `setuptools.backends` path)
- `.dockerignore` excludes `.venv/`, `__pycache__/`, `.git/`

### Step 2: Run the Container

```bash
docker run -p 8000:8000 hypothesis-lab
```

You should see uvicorn start up:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

To run in the background (detached mode):

```bash
docker run -d --name hyp-lab -p 8000:8000 hypothesis-lab
```

### Step 3: Verify the Server is Running

Open a **new terminal** and run:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status":"ok"}
```

### Step 4: Check the API Schema

```bash
curl -s http://localhost:8000/schema | python3 -m json.tool
```

This returns the JSON Schema definitions for `HypLabAction` and `HypLabObservation`, useful for understanding what fields exist.

### Step 5: Understand HTTP vs WebSocket

> **Critical concept:** The OpenEnv server has two communication modes:
>
> | Endpoint | Type | Stateful? | Use case |
> |---|---|---|---|
> | `/health` | GET | No | Check if server is alive |
> | `/schema` | GET | No | Inspect action/observation schemas |
> | `/reset` | POST | **No** -- creates a fresh env, returns result, destroys env | One-shot inspection |
> | `/step` | POST | **No** -- creates a fresh env (never reset!), tries to step, fails | **Don't use for episodes** |
> | `/ws` | WebSocket | **Yes** -- persistent connection, one env for the whole episode | **Use this for episodes** |
>
> The HTTP `/reset` and `/step` are **stateless**: each request creates a brand-new
> environment instance and destroys it after responding. If you `curl /reset` then
> `curl /step`, the step hits a *different* environment that was never reset -- so
> it fails. Multi-step episodes require the **WebSocket** endpoint (`/ws`), which
> keeps one environment alive for the entire connection.

This is why a `curl` to `/step` used to return an empty response: the server-side
environment had no world to step in. Our environment now returns a clear error
instead of crashing:

```json
{"observation": {"system_message": "Error: No active episode. Call reset() first.", "done": true, "reward": -1.0}, ...}
```

### Step 6: Run a Full Episode (Python script)

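The proper way to interact is via WebSocket. The `EnvClient` class handles
this automatically, but the script below talks to `/ws` directly with the
`websockets` library so you can see the protocol. Save it as `test_docker.py`
and run it while the container is running: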

```python
import asyncio
import json
import websockets

async def run_episode():
    uri = "ws://localhost:8000/ws"
    async with websockets.connect(uri) as ws:

        # 1. Reset
        await ws.send(json.dumps({
            "type": "reset",
            "data": {"noise_level": "low", "domain": "system_alpha", "seed": 42}
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Started ===")
        print(f"Variables: {obs['available_variables']}")
        print(f"Budget: {obs['budget_remaining']}")
        print()

        variables = obs["available_variables"]
        cause, effect = variables[0], variables[1]

        # 2. Intervention experiment
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "intervention",
                "control_variable": cause,
                "control_value": 5.0,
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Intervention] Set {cause}=5.0 -> {effect}={obs['result_value']}")
        print(f"  Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 3. Correlation sweep
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "experiment",
                "experiment_type": "correlation",
                "control_variable": cause,
                "control_range": [0.5, 20.0, 8],
                "target_variable": effect,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(f"[Correlation] Swept {cause} from 0.5 to 20.0:")
        if isinstance(obs["result_value"], list):
            for point in obs["result_value"]:
                print(f"  {cause}={point[0]:.1f} -> {effect}={point[1]:.4f}")
        print(f"  Info gain: {obs['info_gain_reward']}, Budget left: {obs['budget_remaining']}")
        print()

        # 4. Submit hypothesis
        await ws.send(json.dumps({
            "type": "step",
            "data": {
                "action_type": "submit",
                "hypothesis_text": f"{effect} depends linearly on {cause}.",
                "hypothesis_equations": [f"{effect} = 2.0 * {cause} + 1.0"],
                "confidence": 0.6,
            }
        }))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print("=== Episode Finished ===")
        print(f"Accuracy:      {obs.get('accuracy_score')}")
        print(f"Precision:     {obs.get('precision_bonus')}")
        print(f"Calibration:   {obs.get('calibration_score')}")
        print(f"Efficiency:    {obs.get('efficiency_bonus')}")
        print(f"Contradiction: {obs.get('contradiction_penalty')}")
        print(f"TOTAL REWARD:  {obs.get('total_episode_reward')}")
        print()
        print(f"Ground truth:\n{obs.get('ground_truth_revealed')}")

asyncio.run(run_episode())
```

Run it:

```bash
pip install websockets    # one-time install
python test_docker.py
```

Expected output (illustrative; your exact numbers may differ):

```
=== Episode Started ===
Variables: ['Quant_A', 'Quant_E']
Budget: 12

[Intervention] Set Quant_A=5.0 -> Quant_E=3.4521
  Info gain: 0.12, Budget left: 11

[Correlation] Swept Quant_A from 0.5 to 20.0:
  Quant_A=0.5 -> Quant_E=7.8123
  Quant_A=3.3 -> Quant_E=4.2341
  ...
  Info gain: 0.10, Budget left: 10

=== Episode Finished ===
Accuracy:      0.35
Precision:     0.0
Calibration:   0.14
Efficiency:    0.15
Contradiction: 0.0
TOTAL REWARD:  0.86

Ground truth:
Domain: system_alpha
  Quant_E = 1.11 * exp(-0.16 * Quant_A)
```

> **Key insight from the WebSocket protocol:**
>
> - Send messages as `{"type": "reset", "data": {...}}` and `{"type": "step", "data": {...}}`
> - The action fields go directly inside `"data"` (no extra `"action"` wrapper)
> - Responses come back as `{"type": "observation", "data": {"observation": {...}, "reward": ..., "done": ...}}`
> - The observation fields live at `resp["data"]["observation"]` -- note the double nesting
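
The envelope format described above is easy to wrap in two helpers so the nesting only has to be written once. This is a minimal sketch: `make_message` and `unwrap` are hypothetical names, not part of the `EnvClient` API.

```python
import json


def make_message(kind: str, data: dict) -> str:
    """Build a client message; action fields sit directly inside "data"."""
    if kind not in ("reset", "step"):
        raise ValueError(f"unsupported message type: {kind}")
    return json.dumps({"type": kind, "data": data})


def unwrap(raw: str):
    """Return (observation, reward, done) from a server response string."""
    body = json.loads(raw)["data"]
    return body["observation"], body.get("reward"), body.get("done")


print(make_message("reset", {"noise_level": "low", "seed": 42}))
```

With these in place, each step in a script like `test_docker.py` reduces to `await ws.send(make_message("step", {...}))` followed by `unwrap(await ws.recv())`.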

### Understanding the Observation Fields

On reset, most fields are `null` -- only setup information is populated:

| Field | What it tells you |
|---|---|
| `system_message` | Human-readable summary -- the LLM agent reads this |
| `available_variables` | Variable names to use in experiments |
| `budget_remaining` | Number of experiment steps left |
| `result_value` | `null` on reset; float or `[[x,y],...]` list after experiments |
| `noise_sigma` | `null` on reset; shown per-experiment so you know measurement precision |
| `done` | `false` until you submit or budget runs out |
| `reward` | Reward for this step (0.0 on reset) |
| `accuracy_score` ... `ground_truth_revealed` | All `null` until you submit your hypothesis |

After submit, the scoring fields light up:

| Field | Meaning |
|---|---|
| `accuracy_score` | How close your hypothesis matches the true rules (0-1) |
| `precision_bonus` | Bonus for stating concrete numbers instead of a purely qualitative claim (+0.10) |
| `calibration_score` | How well your confidence matches your actual accuracy (0-0.20) |
| `efficiency_bonus` | Bonus for submitting with 30%+ of the budget left and 60%+ accuracy (+0.15) |
| `contradiction_penalty` | Deducted if your hypothesis contradicts the experimental setup, e.g. claiming no causal relationship exists (-0.50) |
| `total_episode_reward` | Sum of all info gain rewards + final rubric score |
| `ground_truth_revealed` | The actual hidden rules -- study this to improve! |
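
As a quick sanity check, the numbers in the sample output from Step 6 add up as the `total_episode_reward` row describes. The sketch assumes the final rubric score is the plain sum of the five components, which matches that sample but is not confirmed from the rubric source.

```python
info_gain_rewards = [0.12, 0.10]   # one entry per experiment step
rubric = {
    "accuracy_score": 0.35,
    "precision_bonus": 0.0,
    "calibration_score": 0.14,
    "efficiency_bonus": 0.15,
    "contradiction_penalty": 0.0,
}

# Total = per-step info gain + final rubric components.
total = sum(info_gain_rewards) + sum(rubric.values())
print(round(total, 2))
```

This prints 0.86, the `TOTAL REWARD` shown in the sample run.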

> **Design note: Why don't we reveal the exact noise sigma upfront?**
>
> The system message says "Noise level: low" but does NOT say "sigma=0.05".
> In real science you have to estimate measurement uncertainty from repeated
> measurements. This forces the agent to run a few repeat experiments to
> gauge noise before trusting single data points. The qualitative label
> (low/medium/high) sets expectations without handing out a free number.
> The exact sigma IS shown per-experiment in the `noise_sigma` field --
> that's fine because by then the agent has already spent a budget step.
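
Estimating noise from repeats is just a sample standard deviation. The sketch below simulates it without the environment: `estimate_sigma` is a hypothetical helper, the lambda stands in for repeating one intervention, and additive Gaussian noise is an assumption about how the world perturbs readings.

```python
import random
import statistics


def estimate_sigma(measure, repeats=5):
    """Estimate measurement noise as the sample standard deviation of repeats."""
    samples = [measure() for _ in range(repeats)]
    return statistics.stdev(samples)


# Stand-in for repeating one intervention: true value 13.0, Gaussian noise 0.05.
rng = random.Random(0)
sigma_hat = estimate_sigma(lambda: rng.gauss(13.0, 0.05), repeats=200)
print(f"estimated sigma: {sigma_hat:.3f}")
```

The demo uses 200 repeats only so the printed estimate is stable. Inside an episode every repeat costs a budget step, and the baseline's system prompt warns that redundant experiments are penalised, so a handful of repeats is the realistic ceiling.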

### Error Handling

The environment returns error observations (not crashes) for bad actions:

| Situation | Response | Reward |
|---|---|---|
| Step without reset | `"Error: No active episode. Call reset() first."` | `-1.0`, `done=true` |
| Step after episode ended | `"Error: Episode is already done."` | `0.0`, `done=true` |
| Unknown variable name | `"Error: Unknown control variable 'X'."` | `-0.05`, budget deducted |
| Unknown experiment type | `"Error: Unknown experiment type..."` | `-0.05` |
| Unknown action type | `"Error: Unknown action_type..."` | `-0.05`, budget deducted |

The small negative reward (`-0.05`) for invalid actions teaches RL agents to
produce valid requests without being so harsh that it dominates the reward signal.
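
The "return an error observation instead of raising" pattern looks roughly like this. It is a simplified sketch on plain dicts: `validate_experiment` is a hypothetical helper, and leaving `done` as `False` for a bad variable name is an assumption (the table only specifies `done=true` for the first two rows).

```python
def validate_experiment(action: dict, variables: list):
    """Return None if the action is valid, else an error-observation dict."""
    for role in ("control_variable", "target_variable"):
        name = action.get(role)
        if name is not None and name not in variables:
            label = role.split("_")[0]  # "control" or "target"
            return {
                "system_message": f"Error: Unknown {label} variable '{name}'.",
                "reward": -0.05,
                "done": False,
            }
    return None


print(validate_experiment(
    {"control_variable": "X", "target_variable": "Beta"},
    ["Alpha", "Beta"],
))
```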

### Stopping the Container

```bash
# If running in foreground: Ctrl+C

# If running in background:
docker stop hyp-lab
docker rm hyp-lab
```

### Troubleshooting

| Problem | Fix |
|---|---|
| `port is already allocated` | Another process uses port 8000. Use `-p 8001:8000` and hit `localhost:8001` instead |
| `curl: (7) Failed to connect` | Container isn't running yet. Wait a few seconds for uvicorn to start |
| `{"detail":"Not Found"}` | You hit the wrong endpoint. Use `/health`, `/reset`, `/step`, `/state`, or `/schema` over HTTP, and `/ws` for WebSocket |
| Container exits immediately | Check logs: `docker logs hyp-lab`. Usually a missing dependency |

### Deploying to HF Spaces

```bash
openenv push --org your-org --token $HF_TOKEN
```

The README.md has Hugging Face Spaces metadata in its YAML frontmatter:

```yaml
---
title: Scientific Hypothesis Lab
emoji: 🔬
sdk: docker
app_port: 8000
tags:
  - openenv
---
```

This tells HF Spaces to build the Docker image and expose port 8000.

---

## Part 14: Hands-On Exercises

Now it's your turn. These exercises go from easy to hard.

### Exercise 1: Explore a World (5 min)

```python
from server.causal_world import generate_world

# Generate 3 different worlds and print their ground truth
for seed in [1, 2, 3]:
    world = generate_world(n_variables=3, domain="system_gamma", seed=seed)
    print(f"\n=== Seed {seed} ===")
    print(f"Variables: {world.variables}")
    print(f"Interactions: {len(world.interactions)}")
    print(f"Confounder sigma: {world.confounder_sigma}")
    print(world.ground_truth_summary())
```

Questions to answer:
- How many rules does each world have? What types?
- Do any worlds have interaction rules or confounders?
- Are variable names abstract (no real-world physics terms)?

### Exercise 2: Play a Full Episode (10 min)

```python
from models import ActionType, ExperimentType, HypLabAction
from server.hypothesis_lab_environment import HypothesisLabEnvironment

env = HypothesisLabEnvironment()
obs = env.reset(seed=100, noise_level="medium", domain="system_beta")
print(obs.system_message)

# YOUR TURN: Run 3-4 experiments, then submit a hypothesis.
# Try to get the highest accuracy score you can.
# Hint: use CORRELATION to see the relationship shape,
#   then test at extreme values to distinguish linear from quadratic/saturating.
```

### Exercise 3: Break the Rubric (10 min)

Try to produce each of these edge-case scores:
- Get `accuracy_score = 0.0` (submit an empty hypothesis)
- Get `contradiction_penalty = -0.50` (claim "no causal relationship exists")
- Get `efficiency_bonus = 0.15` (submit early with high accuracy)
- Get `calibration_score = 0.20` (match your confidence to your accuracy perfectly)

### Exercise 4: Add a New Rule Type (20 min)

The environment already has 8 rule types, but you can add more! Try adding a **sinusoidal** rule:
- Formula: `y = a * sin(k * x) + b`
- Add it to `CausalRule.evaluate()`
- Add it to `RULE_TYPES` and `_random_rule()` with appropriate weights
- Add keywords to `_RULE_KEYWORDS` in `rubric.py`
- Test it with a hand-crafted world
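
As a starting point, the formula itself can be sketched as a standalone function. The name `evaluate_sinusoidal` and the `params` keys `a`, `k`, and `b` are assumptions made for illustration; the real `CausalRule.evaluate()` signature may differ, so treat this as a sketch of the math rather than a drop-in patch.

```python
import math


def evaluate_sinusoidal(x: float, params: dict) -> float:
    """Return y = a * sin(k * x) + b for a hypothetical sinusoidal rule."""
    a = params["a"]  # amplitude
    k = params["k"]  # angular frequency
    b = params["b"]  # vertical offset
    return a * math.sin(k * x) + b


# Hand-crafted check: at x = 0 the sine term vanishes, so y equals b.
params = {"a": 2.0, "k": 0.5, "b": 1.0}
print(evaluate_sinusoidal(0.0, params))      # 1.0
print(evaluate_sinusoidal(math.pi, params))  # 3.0, because sin(pi / 2) = 1
```

Keep in mind that a narrow sweep over a sinusoid can look almost linear, so the rubric keywords and your test world should cover at least one full period.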

### Exercise 5: Add a New Variable Pool (10 min)

Add a new abstract variable pool to `ABSTRACT_VAR_POOLS` in `causal_world.py`:
- Use creative abstract names (e.g., colour names: "Red", "Blue", "Green", "Amber", "Violet")
- Make sure they carry no scientific meaning

### Exercise 6: Write a Smarter Baseline Agent (30 min)

Modify `baseline_inference.py` to implement a better strategy:
1. First, run passive observations on all variables
2. Then run interventions between each pair to find which are connected
3. Use wide correlation sweeps (1 to 100) to check for curvature, saturation, or breakpoints
4. Test at x=0.5 and x=50 to distinguish linear from exponential/logarithmic
5. If the data suggests two parents, try holding one constant while varying the other
6. Submit with well-calibrated confidence
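
One way to approach steps 3 and 4 is to compare the slope at the low end of a sweep with the slope at the high end. The helper below is a hypothetical sketch, not part of `baseline_inference.py`: the function name, the labels, and the 20% tolerance are all assumptions, and it ignores noise, so in a noisy world you would want to average repeated measurements first.

```python
import math


def classify_shape(xs, ys, tolerance=0.2):
    """Label a sweep as 'linear', 'accelerating', or 'flattening'.

    Compares the average slope over the lower half of the sweep with the
    average slope over the upper half. Assumes xs is sorted ascending,
    has distinct values, and contains at least three points.
    """
    mid = len(xs) // 2
    low_slope = (ys[mid] - ys[0]) / (xs[mid] - xs[0])
    high_slope = (ys[-1] - ys[mid]) / (xs[-1] - xs[mid])
    scale = max(abs(low_slope), abs(high_slope), 1e-12)
    change = (abs(high_slope) - abs(low_slope)) / scale
    if abs(change) <= tolerance:
        return "linear"
    return "accelerating" if change > 0 else "flattening"


xs = [1, 25, 50, 75, 100]
print(classify_shape(xs, [2 * x + 3 for x in xs]))    # linear
print(classify_shape(xs, [x ** 2 for x in xs]))       # accelerating
print(classify_shape(xs, [math.log(x) for x in xs]))  # flattening
```

An "accelerating" result points toward quadratic or exponential rules, while "flattening" points toward logarithmic or saturating ones; a second probe at an extreme value can then separate the candidates.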

---

## Part 15: Golden Rules for Building Environments

These are the principles that separate good environments from great ones.

### Rule 1: The Agent Should Never See the Answer

The hidden world, ground truth rules, and correct parameters must NEVER appear in observations or state before the agent submits. This is the most common mistake beginners make.

**Bad:**
```python
def reset(self):
    return Observation(hint=f"The slope is {self.world.rules[0].params['a']}")
```

**Good:**
```python
def reset(self):
    return Observation(system_message="Run experiments to discover the hidden rules.")
```
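
A cheap way to guard against this mistake is a test that serializes every observation and searches it for ground-truth values. The sketch below uses plain dicts and a hypothetical `find_leaks` helper; it assumes your observation can be dumped to JSON and that the secret values are distinctive enough (for example, floats with several digits) that a substring match is meaningful.

```python
import json


def find_leaks(observation: dict, secrets: dict) -> list:
    """Return the names of secret values that appear in the serialized observation."""
    text = json.dumps(observation)
    return [name for name, value in secrets.items() if str(value) in text]


secrets = {"slope": 2.37, "intercept": -1.84}
bad = {"hint": "The slope is 2.37"}
good = {"system_message": "Run experiments to discover the hidden rules."}

print(find_leaks(bad, secrets))   # ['slope']
print(find_leaks(good, secrets))  # []
```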

### Rule 2: Reward Shaping > Sparse Rewards

A reward function that only gives +1 at the end of the episode gives the agent almost nothing to learn from. The agent needs signal throughout the episode.

**Bad:**
```python
def step(self, action):
    if action.type == "submit":
        return Observation(reward=1.0 if correct else 0.0, done=True)
    return Observation(reward=0.0)  # No signal during experiments!
```

**Good:**
```python
def step(self, action):
    if action.type == "experiment":
        info_gain = self.tracker.record(action)
        return Observation(reward=info_gain)  # Signal at every step!
    elif action.type == "submit":
        return Observation(reward=self.rubric.score(action), done=True)  # Final score ends the episode
```

### Rule 3: Deterministic Seeds for Reproducibility

Every random element must be controlled by a seed. If two runs with the same seed produce different results, your graders are broken.

```python
import random
import numpy as np

def generate_world(seed=42):
    py_rng = random.Random(seed)          # Controls structure
    np_rng = np.random.default_rng(seed)  # Controls noise
```
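
You can turn this rule into a test with very little code. The generator below is a toy stand-in (`make_toy_world` and its fields are hypothetical, not the real `generate_world`), but the same two assertions should hold for any seeded generator.

```python
import random


def make_toy_world(seed: int) -> dict:
    """Hypothetical stand-in for a world generator; all randomness flows from `seed`."""
    rng = random.Random(seed)
    return {
        "slope": round(rng.uniform(-5.0, 5.0), 3),
        "intercept": round(rng.uniform(-10.0, 10.0), 3),
        "rule_type": rng.choice(["linear", "quadratic", "saturating"]),
    }


# Same seed -> identical world.
assert make_toy_world(42) == make_toy_world(42)
# Different seeds -> different worlds (for these two seeds).
assert make_toy_world(42) != make_toy_world(43)
```

Note that the generator never touches the module-level `random` functions. A private `random.Random(seed)` instance keeps other code from disturbing the sequence.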

### Rule 4: Observations Should Be LLM-Friendly

If your agent is an LLM, the observation needs a human-readable text field. Don't just return a dict of numbers.

**Bad:**
```python
return Observation(result={"x": 5.0, "y": 13.04, "sigma": 0.05})
```

**Good:**
```python
return Observation(
    system_message="[Step 1] Set Alpha=5.0, observed Beta=13.04 (sigma=0.05)",
    result_value=13.04,
    noise_sigma=0.05,
)
```

### Rule 5: Validate All Agent Input

Never trust the agent. It will send garbage, typos, and adversarial inputs.

```python
if cause not in world.variables:
    return self._error_obs(f"Unknown variable '{cause}'. Available: {world.variables}")
```

### Rule 6: Clean Episode Boundaries

`reset()` must produce a completely clean state. No leftover data from previous episodes.

```python
def reset(self):
    self._world = generate_world(...)  # Fresh world
    self._tracker = InfoGainTracker()  # Fresh tracker
    self._history = []                 # Fresh history
    self._done = False                 # Episode is active
```

### Rule 7: Budget/Step Limits Prevent Infinite Episodes

Always have a mechanism to end the episode. Either a budget that runs out, or a maximum step count.
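
A minimal sketch of both mechanisms combined is shown below. `BudgetedEpisode` is a hypothetical helper written for this example, not a class from the codebase; in a real environment the same counters would live on the environment and feed the `done` flag of each observation.

```python
class BudgetedEpisode:
    """Hypothetical helper that ends an episode when the budget or step cap is hit."""

    def __init__(self, budget: int, max_steps: int):
        self.budget = budget
        self.max_steps = max_steps
        self.steps = 0

    def spend(self, cost: int = 1) -> bool:
        """Charge one action and return True if the episode is now over."""
        self.steps += 1
        self.budget -= cost
        return self.done

    @property
    def done(self) -> bool:
        return self.budget <= 0 or self.steps >= self.max_steps


episode = BudgetedEpisode(budget=3, max_steps=10)
print([episode.spend() for _ in range(3)])  # [False, False, True]
```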

### Rule 8: The Hard Task Must Be Actually Hard

If your hard task is easy for GPT-4, the judges will notice. Design it so that even frontier models score only 0.2-0.4 on it. Our hard task uses 4 variables, noise with sigma=0.50, hidden confounders, interaction rules, and a budget of only 8 experiments.

### Rule 8.5: Don't Let LLMs Cheat with Prior Knowledge

If your environment uses real-world variable names (Temperature, Pressure, Price, Demand), LLM agents will use pretrained knowledge instead of reasoning from data. Use abstract names (Alpha, Beta, V1, V2) to force genuine discovery. Similarly, don't use only 3 rule types -- the agent will memorize the template set. Use enough variety that template-matching fails.
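
If you are starting from a domain that already has meaningful names, one option is to translate them at the environment boundary. The `anonymize` helper and its default name pool below are assumptions made for illustration; the agent should only ever see the values of the returned mapping.

```python
def anonymize(variables, pool=("Alpha", "Beta", "Gamma", "Delta", "Epsilon")):
    """Map real-world variable names to abstract ones, in order."""
    if len(variables) > len(pool):
        raise ValueError("Not enough abstract names for the given variables")
    return dict(zip(variables, pool))


mapping = anonymize(["Temperature", "Pressure", "Demand"])
print(mapping)  # {'Temperature': 'Alpha', 'Pressure': 'Beta', 'Demand': 'Gamma'}
```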

### Rule 9: Graders Must Be Deterministic

Given the same `episode_result` dict, a grader must always return the same score. No randomness, no external API calls, no time-dependent logic.
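
In practice this means a grader should be a pure function of its input. The sketch below assumes `episode_result` carries `accuracy_score` and `contradiction_penalty` keys; those key names and the way they are combined are illustrative assumptions, not the project's actual grader.

```python
def grade(episode_result: dict) -> float:
    """Pure function of its input: no randomness, clock reads, or network calls."""
    accuracy = float(episode_result.get("accuracy_score", 0.0))
    penalty = float(episode_result.get("contradiction_penalty", 0.0))  # zero or negative
    # Clamp so the grader always returns a score in [0, 1].
    return max(0.0, min(1.0, accuracy + penalty))


result = {"accuracy_score": 0.75, "contradiction_penalty": -0.5}
assert grade(result) == grade(result)  # Same input, same score
print(grade(result))  # 0.25
```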

### Rule 10: State Is Metadata Only

The `state` property returns metadata, not secrets. It's for debugging, logging, and agent introspection -- never for leaking the answer.
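
As a rough illustration, a metadata-only state might look like the dataclass below. The class name and fields are hypothetical, and a standard-library dataclass is used here only to keep the example self-contained; the point is that nothing from the hidden world appears in it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PublicState:
    """Hypothetical metadata-only state: progress counters, nothing from the hidden world."""
    episode_id: str
    step_count: int
    budget_remaining: int
    done: bool


state = PublicState(episode_id="ep-001", step_count=3, budget_remaining=5, done=False)
print(asdict(state))
```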

---

## Part 16: How to Build Your Own From Scratch

Here's the step-by-step recipe for creating a new OpenEnv environment.

### Step 1: Choose Your Domain

Pick a real-world task humans actually do:
- Email triage
- Code review
- Data cleaning
- Scheduling
- Customer support
- Medical diagnosis
- Financial analysis

### Step 2: Define the Action Space

What can the agent do? Write it out in plain English first:

```
The agent can:
1. Read an email subject and preview
2. Assign a priority (high/medium/low)
3. Assign a label (bug/feature/question/spam)
4. Flag for human review
```

Then convert to a Pydantic model:

```python
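from typing import Optional

class EmailAction(Action):  # Action is the OpenEnv base class
    action_type: str  # "classify" or "flag"
    priority: Optional[str] = None      # "high", "medium", or "low"
    label: Optional[str] = None         # "bug", "feature", "question", or "spam"
    flag_reason: Optional[str] = None   # Only used when action_type is "flag"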
```

### Step 3: Define the Observation Space

What does the agent see after each action?

```python
class EmailObservation(Observation):
    system_message: str
    email_subject: str
    email_preview: str
    emails_remaining: int
    # ... (inherits done, reward from Observation)
```

### Step 4: Build the Hidden World

What's the ground truth the agent is trying to discover/solve? This is your "puzzle generator."
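
For the email triage example, a puzzle generator could be as small as the sketch below. Everything in it (`HiddenEmail`, `_TEMPLATES`, `generate_inbox`) is hypothetical and exists only to show the shape: a seeded generator that produces emails together with ground-truth answers the agent never sees.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class HiddenEmail:
    subject: str
    preview: str
    true_priority: str  # Ground truth, never shown to the agent
    true_label: str     # Ground truth, never shown to the agent


# Hypothetical templates: (subject, preview, priority, label)
_TEMPLATES = [
    ("App crashes on login", "Users report a crash after the update.", "high", "bug"),
    ("Please add dark mode", "It would be nice to have a dark theme.", "low", "feature"),
    ("How do I export data?", "I cannot find the export button.", "medium", "question"),
    ("You won a prize!!!", "Click here to claim your reward.", "low", "spam"),
]


def generate_inbox(n_emails: int, seed: int) -> list:
    """Build a reproducible inbox; all randomness comes from `seed`."""
    rng = random.Random(seed)
    return [HiddenEmail(*rng.choice(_TEMPLATES)) for _ in range(n_emails)]


inbox = generate_inbox(n_emails=5, seed=7)
assert inbox == generate_inbox(n_emails=5, seed=7)  # Same seed, same inbox
print(inbox[0].subject)
```

A real generator would vary the wording far more than four fixed templates, otherwise the agent can memorize the set, which is the problem Rule 8.5 warns about.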

### Step 5: Build the Reward Function

Design rewards that teach the right behavior:
- Correct classification: +1.0
- Partially correct: +0.5
- Wrong but not harmful: -0.1
- Flagging spam as high priority: -0.5
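
The list above translates fairly directly into code. In this sketch, "partially correct" is assumed to mean that exactly one of priority and label matches, and the harmful case is checked first so it cannot earn partial credit; both choices are assumptions you should adapt to your own task.

```python
def email_reward(priority: str, label: str, true_priority: str, true_label: str) -> float:
    """Score one classification using the reward table above."""
    # Harmful mistake: spam escalated to high priority.
    if true_label == "spam" and priority == "high":
        return -0.5
    matches = (priority == true_priority) + (label == true_label)
    if matches == 2:
        return 1.0   # Fully correct
    if matches == 1:
        return 0.5   # Partially correct
    return -0.1      # Wrong but not harmful


print(email_reward("high", "bug", "high", "bug"))         # 1.0
print(email_reward("low", "bug", "high", "bug"))          # 0.5
print(email_reward("medium", "feature", "high", "bug"))   # -0.1
print(email_reward("high", "spam", "low", "spam"))        # -0.5
```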

### Step 6: Write the Environment Class

```python
class EmailTriageEnvironment(Environment):
    def reset(self, **kwargs):
        # Generate a batch of emails
        # Return the first email as an observation
        raise NotImplementedError

    def step(self, action):
        # Grade the agent's classification
        # Move to next email or end episode
        raise NotImplementedError

    @property
    def state(self):
        # Return progress metadata
        raise NotImplementedError
```

### Step 7: Wire Up the Server

```python
app = create_app(
    EmailTriageEnvironment,
    EmailAction,
    EmailObservation,
    env_name="email_triage",
)
```

### Step 8: Define 3 Tasks

```python
TASK_EASY = {"id": "easy", "reset_kwargs": {"n_emails": 5, "spam_ratio": 0.5}}
TASK_MEDIUM = {"id": "medium", "reset_kwargs": {"n_emails": 10, "spam_ratio": 0.2}}
TASK_HARD = {"id": "hard", "reset_kwargs": {"n_emails": 20, "spam_ratio": 0.05}}
```

### Step 9: Write the Baseline

Use the OpenAI API to run a simple agent and produce baseline scores.
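
Whatever client you use to call the model, the fragile part is usually turning the reply into an action. The sketch below covers only that parsing step and makes no API calls. `parse_action` is a hypothetical helper, and it assumes you prompted the model to answer with a single JSON object whose keys match the `EmailAction` fields.

```python
import json
import re


def parse_action(reply: str, fallback: dict) -> dict:
    """Parse the outermost {...} span of an LLM reply as JSON, or return `fallback`."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    return parsed if isinstance(parsed, dict) else fallback


fallback = {"action_type": "flag", "flag_reason": "unparseable reply"}
reply = 'Sure! My action is {"action_type": "classify", "priority": "high", "label": "bug"}'

print(parse_action(reply, fallback))
# {'action_type': 'classify', 'priority': 'high', 'label': 'bug'}
print(parse_action("I am not sure what to do.", fallback))
# {'action_type': 'flag', 'flag_reason': 'unparseable reply'}
```

Falling back to a safe default action keeps a malformed reply from crashing the baseline run, so every episode still produces a score.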

### Step 10: Write Tests

Minimum tests:
- `reset()` produces a valid observation
- `step()` with a valid action works
- `step()` with an invalid action returns an error
- The episode ends when expected
- State doesn't leak secrets
- Graders return a score in [0, 1]
- The same seed produces the same results

### Step 11: Write the Dockerfile

Copy our Dockerfile template. Change the CMD to point to your server module.

### Step 12: Write openenv.yaml

```yaml
spec_version: 1
name: your_env_name
type: space
runtime: fastapi
app: server.app:app
port: 8000
```

### Step 13: Write the README

Include HF Spaces frontmatter, environment description, action/observation docs, task descriptions, and baseline scores.

---

## Congratulations

You've read through the entire Scientific Hypothesis Lab codebase and understand:

- **What RL environments are** and how agents interact with them
- **The OpenEnv contract**: reset/step/state, Action/Observation/State, openenv.yaml
- **How hidden worlds work**: causal graphs with 8+ rule types, interaction rules, confounders, abstract variable names
- **Why abstract variable names matter**: prevents LLMs from using pretrained knowledge as a shortcut
- **How reward functions are designed**: info gain, accuracy (across all rule types + interactions), calibration, efficiency, contradiction
- **How the server works**: create_app() wraps everything in HTTP endpoints
- **How clients connect**: typed methods over WebSocket
- **How tasks and graders work**: difficulty progression, deterministic scoring in [0, 1]
- **How baseline agents work**: LLM + system prompt + action parsing
- **How to test**: 39 tests covering every component including all rule types
- **How to deploy**: Docker + HF Spaces
- **The golden rules** for building great environments (including anti-cheating via abstract naming)
- **How to build your own** from scratch in 13 steps

You are now qualified to build, debug, explain, and teach RL environments. Go build something amazing.