M ShreeRaj committed on
Revise blog content for clarity and updates
Updated various sections to enhance clarity and coherence, including the addition of new insights on worker personalization and reward shaping.
blog.md
CHANGED
|
@@ -2,7 +2,6 @@
|
|
| 2 |
|
| 3 |
*A build log from the OpenEnv Hackathon | Cognitive Load Manager*
|
| 4 |
|
| 5 |
-
---
|
| 6 |
|
| 7 |
The agent started inserting breaks before workers hit the burnout threshold, not after.
|
| 8 |
|
|
@@ -10,7 +9,6 @@ We didn't program this. It emerged.
|
|
| 10 |
|
| 11 |
That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
|
| 12 |
|
| 13 |
-
---
|
| 14 |
|
| 15 |
There's something that always bugged me about productivity tools.
|
| 16 |
|
|
@@ -18,7 +16,6 @@ They're really good at telling you *what* to do. Deadlines, priorities, due date
|
|
| 18 |
|
| 19 |
That gap is exactly what we decided to build for.
|
| 20 |
|
| 21 |
-
---
|
| 22 |
|
| 23 |
## 🎥 Watch First
|
| 24 |
|
|
@@ -28,7 +25,6 @@ That gap is exactly what we decided to build for.
|
|
| 28 |
| **Full dashboard demo (Google Drive)** | [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
|
| 29 |
| **Training notebook (Colab, re-runnable)** | [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
|
| 30 |
|
| 31 |
-
---
|
| 32 |
|
| 33 |
## The Problem We're Solving
|
| 34 |
|
|
@@ -38,7 +34,6 @@ We wanted to build an environment where an AI could learn to account for all of
|
|
| 38 |
|
| 39 |
That's the Cognitive Load Manager.
|
| 40 |
|
| 41 |
-
---
|
| 42 |
|
| 43 |
## What We Built
|
| 44 |
|
|
@@ -60,7 +55,6 @@ What makes the environment harder (and more realistic) is what we layered on top
|
|
| 60 |
|
| 61 |
This maps to **Theme 1 (Multi-Agent Interactions)**: three worker agents with independent states, a manager that has to model their condition under partial observability, and emergent cooperation between the scheduling decisions and the workers' capacity. It also sits in **Theme 3.1 (World Modeling / Professional Tasks)** because the manager is doing real orchestration: updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's step/reset interface.
|
| 62 |
|
| 63 |
-
---
|
| 64 |
|
| 65 |
## How the Environment Works
|
| 66 |
|
|
@@ -73,6 +67,12 @@ The environment follows the standard OpenEnv interface:
|
|
| 73 |
|
| 74 |
The reward function is where we spent the most time. Early versions just rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. That's not what we wanted.
|
| 75 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
We rebuilt it around five scored dimensions with explicit weights:
|
| 77 |
|
| 78 |
```
|
|
@@ -90,7 +90,7 @@ score = completion×0.6 + deadline×0.22 + energy×0.1 + dep×0.05 + interrupt×
|
|
| 90 |
|
| 91 |
Getting the weights right took a few rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that it started refusing to assign tasks at all. We landed on a balance where the agent learns to *anticipate* stress buildup rather than react to it, which is what you actually want from a good manager.
|
| 92 |
|
| 93 |
-
|
| 94 |
|
| 95 |
## Training
|
| 96 |
|
|
@@ -109,7 +109,6 @@ The training loop:
|
|
| 109 |
|
| 110 |
We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750 to 1000.
|
| 111 |
|
| 112 |
-
---
|
| 113 |
|
| 114 |
## Results
|
| 115 |
|
|
@@ -137,12 +136,14 @@ Per-action reward breakdown after training:
|
|
| 137 |
|
| 138 |
**Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).
|
| 139 |
|
|
|
|
|
|
|
| 140 |
What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these was an explicit rule; both are just costs in the reward function that the agent discovered on its own.
|
| 141 |
|
| 142 |
See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
|
| 143 |
[https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
|
| 144 |
|
| 145 |
-
|
| 146 |
|
| 147 |
## Live Environment on Hugging Face
|
| 148 |
|
|
@@ -151,27 +152,23 @@ The environment is deployed as a Hugging Face Space, fully runnable, no local
|
|
| 151 |
For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
|
| 152 |
[https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
|
| 153 |
|
| 154 |
-
|
| 155 |
|
| 156 |
## Where This Goes
|
| 157 |
|
| 158 |
We built this as a hackathon project, but the problem it's solving is real and underserved.
|
| 159 |
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
Longer out: the same environment architecture adapts to other high-stakes domains. An adaptive learning system that knows when a student is cognitively overloaded, not just academically behind. A clinical scheduling tool that models doctor fatigue before it leads to errors.
|
| 163 |
|
| 164 |
-
The environment is the
|
| 165 |
|
| 166 |
-
---
|
| 167 |
|
| 168 |
## What We'd Do Differently
|
| 169 |
|
| 170 |
-
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
---
|
| 175 |
|
| 176 |
## All Links
|
| 177 |
|
|
@@ -182,6 +179,5 @@ We'd also add worker personalization. Right now all three workers share the same
|
|
| 182 |
| 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
|
| 183 |
| 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
|
| 184 |
|
| 185 |
-
---
|
| 186 |
|
| 187 |
*Built for the OpenEnv Hackathon, April 2026.*
|
|
|
|
| 2 |
|
| 3 |
*A build log from the OpenEnv Hackathon | Cognitive Load Manager*
|
| 4 |
|
|
|
|
| 5 |
|
| 6 |
The agent started inserting breaks before workers hit the burnout threshold, not after.
|
| 7 |
|
|
|
|
| 9 |
|
| 10 |
That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
|
| 11 |
|
|
|
|
| 12 |
|
| 13 |
There's something that always bugged me about productivity tools.
|
| 14 |
|
|
|
|
| 16 |
|
| 17 |
That gap is exactly what we decided to build for.
|
| 18 |
|
|
|
|
| 19 |
|
| 20 |
## 🎥 Watch First
|
| 21 |
|
|
|
|
| 25 |
| **Full dashboard demo (Google Drive)** | [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
|
| 26 |
| **Training notebook (Colab, re-runnable)** | [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
|
| 27 |
|
|
|
|
| 28 |
|
| 29 |
## The Problem We're Solving
|
| 30 |
|
|
|
|
| 34 |
|
| 35 |
That's the Cognitive Load Manager.
|
| 36 |
|
|
|
|
| 37 |
|
| 38 |
## What We Built
|
| 39 |
|
|
|
|
| 55 |
|
| 56 |
This maps to **Theme 1 (Multi-Agent Interactions)**: three worker agents with independent states, a manager that has to model their condition under partial observability, and emergent cooperation between the scheduling decisions and the workers' capacity. It also sits in **Theme 3.1 (World Modeling / Professional Tasks)** because the manager is doing real orchestration: updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's step/reset interface.
|
| 57 |
|
|
|
|
| 58 |
|
| 59 |
## How the Environment Works
|
| 60 |
|
|
|
|
| 67 |
|
| 68 |
The reward function is where we spent the most time. Early versions just rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. That's not what we wanted.
|
| 69 |
|
| 70 |
+
Version 1 hit a mean reward of 0.12 but the agent learned a degenerate strategy: assign every task to worker 1, ignore workers 2 and 3 entirely. Efficient on paper. Catastrophic for the worker.
|
| 71 |
+
|
| 72 |
+
Version 2 overcorrected. We cranked up the energy penalty and the agent stopped assigning tasks almost completely: mean reward dropped to 0.08 because avoiding stress was easier than managing it. That's not a good manager either. That's just avoidance.
|
| 73 |
+
|
| 74 |
+
Version 3 is what's in the repo. The energy penalty is present but not dominant. The agent can't ignore it, but it also can't hide from work. That tension is what forces it to actually learn scheduling.
|
| 75 |
+
|
| 76 |
We rebuilt it around five scored dimensions with explicit weights:
|
| 77 |
|
| 78 |
```
|
|
|
|
| 90 |
|
| 91 |
Getting the weights right took a few rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that it started refusing to assign tasks at all. We landed on a balance where the agent learns to *anticipate* stress buildup rather than react to it, which is what you actually want from a good manager.
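
To make the weighting concrete, here is a minimal sketch of a five-dimension weighted score. The four weights (0.6, 0.22, 0.1, 0.05) are the ones quoted in this post; the interrupt weight is cut off in the text available here, so it is a required argument rather than a guessed constant. The function name, the dictionary keys, and the assumption that each component lies in [0, 1] are illustrative, not taken from the repo.

```python
# Weights quoted in the post. The interrupt weight is truncated in the source,
# so callers must pass it explicitly instead of relying on a guessed default.
KNOWN_WEIGHTS = {
    "completion": 0.6,
    "deadline": 0.22,
    "energy": 0.1,
    "dep": 0.05,
}


def weighted_score(components: dict[str, float], interrupt_weight: float) -> float:
    """Combine per-dimension scores (assumed to lie in [0, 1]) into one scalar."""
    weights = {**KNOWN_WEIGHTS, "interrupt": interrupt_weight}
    missing = weights.keys() - components.keys()
    if missing:
        raise KeyError(f"missing score components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)
```

Keeping the weights in one table makes each reward revision a small, reviewable change, which helps when the tuning takes several rounds.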
|
| 92 |
|
| 93 |
+
|
| 94 |
|
| 95 |
## Training
|
| 96 |
|
|
|
|
| 109 |
|
| 110 |
We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750 to 1000.
|
| 111 |
|
|
|
|
| 112 |
|
| 113 |
## Results
|
| 114 |
|
|
|
|
| 136 |
|
| 137 |
**Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).
|
| 138 |
|
| 139 |
+
Averaged across 10 episodes on hard difficulty, the trained agent scores **0.31** versus the untrained baseline of **0.18**, a consistent lift rather than a one-off result.
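
For scale, the two reported means work out to an absolute lift of about 0.13 and a relative lift of about 72%. The helper below is only a small arithmetic sketch; its name and signature are illustrative and it uses nothing beyond the two averages quoted above.

```python
def lift(trained: float, baseline: float) -> tuple[float, float]:
    """Return (absolute, relative) improvement of trained over baseline."""
    absolute = trained - baseline
    relative = absolute / baseline
    return absolute, relative


absolute, relative = lift(trained=0.31, baseline=0.18)
print(f"absolute lift: {absolute:.2f}, relative lift: {relative:.0%}")
```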
|
| 140 |
+
|
| 141 |
What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these was an explicit rule; both are just costs in the reward function that the agent discovered on its own.
|
| 142 |
|
| 143 |
See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
|
| 144 |
[https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
|
| 145 |
|
| 146 |
+
|
| 147 |
|
| 148 |
## Live Environment on Hugging Face
|
| 149 |
|
|
|
|
| 152 |
For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
|
| 153 |
[https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
|
| 154 |
|
| 155 |
+
|
| 156 |
|
| 157 |
## Where This Goes
|
| 158 |
|
| 159 |
We built this as a hackathon project, but the problem it's solving is real and underserved.
|
| 160 |
|
| 161 |
+
The immediate extension is worker personalization. Right now all three workers share the same fatigue model. Real people don't. They have different energy curves, different stress tolerances, and different recovery patterns. Per-worker profiles that the manager has to learn individually would make this significantly more powerful and more honest about what human-aware AI actually needs to do.
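
One way per-worker profiles could look, as a rough sketch only. Everything here is hypothetical: the `WorkerProfile` fields, the profile names, and the numeric values are invented for illustration and do not come from the repo, where all workers currently share one fatigue model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerProfile:
    """Hypothetical per-worker fatigue parameters (not taken from the repo)."""
    energy_drain: float      # energy lost per step spent working
    recovery_rate: float     # energy regained per step spent on a break
    stress_tolerance: float  # stress level above which the worker burns out


def step_energy(energy: float, profile: WorkerProfile, working: bool) -> float:
    """Advance one worker's energy by one step, clamped to [0, 1]."""
    delta = -profile.energy_drain if working else profile.recovery_rate
    return min(1.0, max(0.0, energy + delta))


def burned_out(stress: float, profile: WorkerProfile) -> bool:
    """A worker burns out once stress exceeds their personal tolerance."""
    return stress > profile.stress_tolerance


# Illustrative profiles: one drains and recovers fast, one does both slowly.
PROFILES = {
    "sprinter": WorkerProfile(energy_drain=0.20, recovery_rate=0.30, stress_tolerance=0.6),
    "marathoner": WorkerProfile(energy_drain=0.05, recovery_rate=0.10, stress_tolerance=0.8),
}
```

With profiles like these, the same schedule affects each worker differently, so the manager would have to infer who needs a break from observed behavior rather than from one shared rule.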
|
|
|
|
|
|
|
| 162 |
|
| 163 |
+
Beyond that: the same environment architecture plugs directly into real enterprise workflows. The observations map naturally to Jira tickets, Slack status, and calendar load. The reward function maps naturally to sprint velocity and team health metrics. The environment is the hard part. The integration is just an API call.
|
| 164 |
|
|
|
|
| 165 |
|
| 166 |
## What We'd Do Differently
|
| 167 |
|
| 168 |
+
Reward shaping took way longer than it should have: three versions, two weeks of debugging degenerate strategies. If we were starting over, we'd prototype the reward function with a simple heuristic agent first. Validate that the signal makes sense before involving the LLM at all. The heuristic surfaces reward exploitation fast, cheaply, and without burning GPU credits.
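
A heuristic probe of that kind can be very small. The sketch below is an assumption-heavy toy, not the project's environment: the fatigue numbers (0.15 drain per assignment, 0.05 recovery when idle), the function names, and the two fixed policies are all made up for illustration. It checks whether a candidate reward prefers spreading work over piling everything onto one worker.

```python
from typing import Callable

# A candidate reward maps (tasks_done, lowest_energy_seen) to a scalar.
RewardFn = Callable[[int, float], float]


def run_policy(assign: Callable[[int], int], steps: int = 12, workers: int = 3) -> tuple[int, float]:
    """Roll out a fixed assignment rule in a toy fatigue model.

    The assigned worker loses 0.15 energy (floored at 0) and always finishes
    the task; every other worker recovers 0.05 (capped at 1).
    Returns (tasks completed, lowest energy any worker reached).
    """
    energy = [1.0] * workers
    done, lowest = 0, 1.0
    for t in range(steps):
        chosen = assign(t)
        for w in range(workers):
            if w == chosen:
                energy[w] = max(0.0, energy[w] - 0.15)
                done += 1
            else:
                energy[w] = min(1.0, energy[w] + 0.05)
        lowest = min(lowest, min(energy))
    return done, lowest


def penalizes_overload(reward: RewardFn, steps: int = 12) -> bool:
    """True if round-robin scheduling scores strictly higher than always using worker 0."""
    always_first = run_policy(lambda t: 0, steps)
    round_robin = run_policy(lambda t: t % 3, steps)
    return reward(*round_robin) > reward(*always_first)


# A completion-only reward cannot tell the two policies apart.
print(penalizes_overload(lambda done, lowest: done / 12))
# Adding an energy term makes the balanced policy win.
print(penalizes_overload(lambda done, lowest: 0.6 * done / 12 + 0.1 * lowest))
```

If a reward fails a check like this, a learned agent has no reason to avoid the one-worker strategy either, and that shows up in seconds on a CPU.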
|
| 169 |
|
| 170 |
+
The other thing: we'd add curriculum learning from the start. Right now the agent trains on medium difficulty. Starting on easy and progressively scaling to expert would give it a much cleaner learning signal in the early steps, rather than leaving it to flail through harder scenarios with no prior context.
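
A curriculum can start as a plain step-to-difficulty schedule. This sketch assumes equal-length stages and uses the four difficulty names mentioned in this post (easy, medium, hard, expert); the function name and the equal split are illustrative choices, not something the repo implements.

```python
LEVELS = ("easy", "medium", "hard", "expert")


def difficulty_for_step(step: int, total_steps: int, levels: tuple[str, ...] = LEVELS) -> str:
    """Split training into equal-length stages, one per difficulty level."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(levels) // total_steps
    return levels[stage]
```

For a 1000-step run this yields 250 steps per level. A schedule gated on a reward threshold instead of on step count would be a natural next variant.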
|
| 171 |
|
|
|
|
| 172 |
|
| 173 |
## All Links
|
| 174 |
|
|
|
|
| 179 |
| 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
|
| 180 |
| 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
|
| 181 |
|
|
|
|
| 182 |
|
| 183 |
*Built for the OpenEnv Hackathon, April 2026.*
|