M ShreeRaj committed
Commit f6eef24 · unverified · 1 Parent(s): a726b9d

Revise blog content for clarity and updates

Updated various sections to enhance clarity and coherence, including the addition of new insights on worker personalization and reward shaping.

Files changed (1)
  1. blog.md +15 -19
blog.md CHANGED
@@ -2,7 +2,6 @@
2
 
3
  *A build log from the OpenEnv Hackathon | Cognitive Load Manager*
4
 
5
- ---
6
 
7
  The agent started inserting breaks before workers hit the burnout threshold, not after.
8
 
@@ -10,7 +9,6 @@ We didn't program this. It emerged.
10
 
11
  That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
12
 
13
- ---
14
 
15
  There's something that always bugged me about productivity tools.
16
 
@@ -18,7 +16,6 @@ They're really good at telling you *what* to do. Deadlines, priorities, due date
18
 
19
  That gap is exactly what we decided to build for.
20
 
21
- ---
22
 
23
  ## 🎥 Watch First
24
 
@@ -28,7 +25,6 @@ That gap is exactly what we decided to build for.
28
  | **Full dashboard demo (Google Drive)** | 👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
29
  | **Training notebook (Colab, re-runnable)** | 👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
30
 
31
- ---
32
 
33
  ## The Problem We're Solving
34
 
@@ -38,7 +34,6 @@ We wanted to build an environment where an AI could learn to account for all of
38
 
39
  That's the Cognitive Load Manager.
40
 
41
- ---
42
 
43
  ## What We Built
44
 
@@ -60,7 +55,6 @@ What makes the environment harder (and more realistic) is what we layered on top
60
 
61
  This maps to **Theme 1 (Multi-Agent Interactions)**: three worker agents with independent states, a manager that has to model their condition under partial observability, and emergent cooperation between the scheduling decisions and the workers' capacity. It also sits in **Theme 3.1 (World Modeling / Professional Tasks)** because the manager is doing real orchestration: updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's step/reset interface.
62
 
63
- ---
64
 
65
  ## How the Environment Works
66
 
@@ -73,6 +67,12 @@ The environment follows the standard OpenEnv interface:
73
 
74
  The reward function is where we spent the most time. Early versions just rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. That's not what we wanted.
75
 
 
 
 
 
 
 
76
  We rebuilt it around five scored dimensions with explicit weights:
77
 
78
  ```
@@ -90,7 +90,7 @@ score = completion×0.6 + deadline×0.22 + energy×0.1 + dep×0.05 + interrupt×
90
 
91
  Getting the weights right took a few rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that it started refusing to assign tasks at all. We landed on a balance where the agent learns to *anticipate* stress buildup rather than react to it, which is what you actually want from a good manager.
92
 
93
- ---
94
 
95
  ## Training
96
 
@@ -109,7 +109,6 @@ The training loop:
109
 
110
  We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750–1000.
111
 
112
- ---
113
 
114
  ## Results
115
 
@@ -137,12 +136,14 @@ Per-action reward breakdown after training:
137
 
138
  **Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).
139
 
 
 
140
  What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these was an explicit rule: both came from costs in the reward function that the agent discovered on its own.
141
 
142
  See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
143
  👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
144
 
145
- ---
146
 
147
  ## Live Environment on Hugging Face
148
 
@@ -151,27 +152,23 @@ The environment is deployed as a Hugging Face Space, fully runnable, no local
151
  For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
152
  👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
153
 
154
- ---
155
 
156
  ## Where This Goes
157
 
158
  We built this as a hackathon project, but the problem it's solving is real and underserved.
159
 
160
- Near-term: developer-facing APIs that let teams plug human-aware scheduling into tools they already use (Slack, Linear, Notion). Not replacing them. Adding a layer that understands worker state.
161
-
162
- Longer out: the same environment architecture adapts to other high-stakes domains. An adaptive learning system that knows when a student is cognitively overloaded, not just academically behind. A clinical scheduling tool that models doctor fatigue before it leads to errors.
163
 
164
- The environment is the foundation. What you train on it is what changes.
165
 
166
- ---
167
 
168
  ## What We'd Do Differently
169
 
170
- Honest reflection: reward shaping took way longer than it should have. We went through three versions before finding something that produced the behavior we actually wanted. If we were starting over, we'd prototype the reward function with a simple heuristic agent first, validating that the signal makes sense before involving the LLM at all.
171
 
172
- We'd also add worker personalization. Right now all three workers share the same fatigue model. Real people have different capacities, different stress tolerances, different recovery patterns. Per-worker profiles that the manager has to individually learn would make this significantly more powerful, and more honest about what human-aware AI actually needs to do.
173
 
174
- ---
175
 
176
  ## All Links
177
 
@@ -182,6 +179,5 @@ We'd also add worker personalization. Right now all three workers share the same
182
  | 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
183
  | 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
184
 
185
- ---
186
 
187
  *Built for the OpenEnv Hackathon, April 2026.*
 
2
 
3
  *A build log from the OpenEnv Hackathon | Cognitive Load Manager*
4
 
 
5
 
6
  The agent started inserting breaks before workers hit the burnout threshold, not after.
7
 
 
9
 
10
  That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
11
 
 
12
 
13
  There's something that always bugged me about productivity tools.
14
 
 
16
 
17
  That gap is exactly what we decided to build for.
18
 
 
19
 
20
  ## 🎥 Watch First
21
 
 
25
  | **Full dashboard demo (Google Drive)** | 👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
26
  | **Training notebook (Colab, re-runnable)** | 👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
27
 
 
28
 
29
  ## The Problem We're Solving
30
 
 
34
 
35
  That's the Cognitive Load Manager.
36
 
 
37
 
38
  ## What We Built
39
 
 
55
 
56
  This maps to **Theme 1 (Multi-Agent Interactions)**: three worker agents with independent states, a manager that has to model their condition under partial observability, and emergent cooperation between the scheduling decisions and the workers' capacity. It also sits in **Theme 3.1 (World Modeling / Professional Tasks)** because the manager is doing real orchestration: updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's step/reset interface.
57
 
 
58
 
59
  ## How the Environment Works
60
 
 
67
 
68
  The reward function is where we spent the most time. Early versions just rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. That's not what we wanted.
69
 
70
+ Version 1 hit a mean reward of 0.12, but the agent learned a degenerate strategy: assign every task to worker 1 and ignore workers 2 and 3 entirely. Efficient on paper. Catastrophic for the worker.
71
+
72
+ Version 2 overcorrected. We cranked up the energy penalty and the agent stopped assigning tasks almost completely; mean reward dropped to 0.08 because avoiding stress was easier than managing it. That's not a good manager either. That's just avoidance.
73
+
74
+ Version 3 is what's in the repo. The energy penalty is present but not dominant. The agent can't ignore it, but it also can't hide from work. That tension is what forces it to actually learn scheduling.
75
+
76
  We rebuilt it around five scored dimensions with explicit weights:
77
 
78
  ```
 
90
 
91
  Getting the weights right took a few rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that it started refusing to assign tasks at all. We landed on a balance where the agent learns to *anticipate* stress buildup rather than react to it, which is what you actually want from a good manager.
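
A minimal sketch of the weighted score in Python, assuming each component is already normalized to the range 0 to 1. The four weights below are the ones quoted in the post; the `interrupt` weight is cut off in this excerpt, so it is left as a required argument rather than guessed. `KNOWN_WEIGHTS` and `weighted_score` are hypothetical names, not the repo's API.

```python
# Weights quoted in the post for four of the five scored dimensions.
KNOWN_WEIGHTS = {
    "completion": 0.6,
    "deadline": 0.22,
    "energy": 0.1,
    "dep": 0.05,
}


def weighted_score(components: dict, interrupt_weight: float) -> float:
    """Combine the five component scores into one scalar.

    `interrupt_weight` must be supplied by the caller because its value
    is truncated in the source. Components are assumed to lie in [0, 1].
    """
    weights = {**KNOWN_WEIGHTS, "interrupt": interrupt_weight}
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)
```

With every other component held fixed, moving `energy` from 1.0 to 0.0 changes the score by exactly its weight, which makes it easy to see how much room the agent has to trade worker energy for task completion.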
92
 
93
+
94
 
95
  ## Training
96
 
 
109
 
110
  We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750–1000.
111
 
 
112
 
113
  ## Results
114
 
 
136
 
137
  **Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).
138
 
139
+ Averaged across 10 episodes on hard difficulty, the trained agent scores **0.31** versus the untrained baseline of **0.18**: a consistent lift, not a one-off result.
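
For scale, here is the arithmetic behind that comparison, using only the two averages quoted above. `relative_lift` is a hypothetical helper, not something from the repo.

```python
def relative_lift(trained: float, baseline: float) -> float:
    """Return the improvement of `trained` over `baseline` as a fraction of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (trained - baseline) / baseline


trained_mean = 0.31   # trained agent, 10 episodes, hard difficulty
baseline_mean = 0.18  # untrained baseline, same setting

absolute = trained_mean - baseline_mean
relative = relative_lift(trained_mean, baseline_mean)
print(f"absolute lift: {absolute:.2f}, relative lift: {relative:.0%}")
# prints: absolute lift: 0.13, relative lift: 72%
```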
140
+
141
  What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these was an explicit rule: both came from costs in the reward function that the agent discovered on its own.
142
 
143
  See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
144
  👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
145
 
146
+
147
 
148
  ## Live Environment on Hugging Face
149
 
 
152
  For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
153
  👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
154
 
155
+
156
 
157
  ## Where This Goes
158
 
159
  We built this as a hackathon project, but the problem it's solving is real and underserved.
160
 
161
+ The immediate extension is worker personalization. Right now all three workers share the same fatigue model. Real people don't: they have different energy curves, different stress tolerances, different recovery patterns. Per-worker profiles that the manager has to learn individually would make this significantly more powerful and more honest about what human-aware AI actually needs to do.
 
 
162
 
163
+ Beyond that: the same environment architecture plugs directly into real enterprise workflows. The observations map naturally to Jira tickets, Slack status, and calendar load. The reward function maps naturally to sprint velocity and team health metrics. The environment is the hard part. The integration is just an API call.
164
 
 
165
 
166
  ## What We'd Do Differently
167
 
168
+ Reward shaping took way longer than it should have: three versions, two weeks of debugging degenerate strategies. If we were starting over, we'd prototype the reward function with a simple heuristic agent first. Validate the signal makes sense before involving the LLM at all. The heuristic surfaces reward exploitation fast, cheaply, and without burning GPU credits.
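
As a rough illustration of what such a heuristic agent might look like, here is a sketch in Python. Everything in it is an assumption made for the example: the per-worker energy list, the `"break"` and `"assign"` action names, and the 0.3 threshold are hypothetical and do not reflect the environment's actual observation or action format.

```python
def heuristic_action(energies: list, break_threshold: float = 0.3) -> tuple:
    """Pick a (kind, worker_index) action from per-worker energy levels.

    Hypothetical rule: rest the most tired worker if they are below the
    threshold, otherwise give the next task to the most rested worker.
    """
    if not energies:
        raise ValueError("need at least one worker")
    indices = range(len(energies))
    most_tired = min(indices, key=lambda i: energies[i])
    most_rested = max(indices, key=lambda i: energies[i])
    if energies[most_tired] < break_threshold:
        return ("break", most_tired)
    return ("assign", most_rested)
```

Running a fixed rule like this against a candidate reward function costs almost nothing. If the rule scores worse than "assign everything to one worker" or "never assign anything", the reward is likely exploitable before any LLM training starts.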
169
 
170
+ The other thing: we'd add curriculum learning from the start. Right now the agent trains on medium difficulty. Starting on easy and progressively scaling to expert would give it a much cleaner learning signal in the early steps rather than flailing through hard scenarios with no prior context.
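
One simple way to express that curriculum is a step-based schedule, sketched below in Python. The level names and the 1000-step run length come from the post; splitting the run into equal-length stages is an assumption made for illustration, and `difficulty_for_step` is a hypothetical helper.

```python
LEVELS = ("easy", "medium", "hard", "expert")


def difficulty_for_step(step: int, total_steps: int = 1000, levels: tuple = LEVELS) -> str:
    """Map a training step to a difficulty level using equal-length stages."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Integer arithmetic keeps stage boundaries exact (no float rounding).
    stage = step * len(levels) // total_steps
    return levels[stage]
```

With the defaults, steps 0 to 249 train on easy, 250 to 499 on medium, 500 to 749 on hard, and 750 to 999 on expert. A reward-gated schedule that advances only once mean reward clears a threshold would be a reasonable alternative.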
171
 
 
172
 
173
  ## All Links
174
 
 
179
  | 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
180
  | 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
181
 
 
182
 
183
  *Built for the OpenEnv Hackathon, April 2026.*