soumi guria committed on
Commit · 6ec90aa
1
Parent(s): 7186ef8
Updated blog.md
blog.md
ADDED
@@ -0,0 +1,176 @@
# We Trained an AI to Think Like a Good Manager, Not Just a Task Scheduler

*A build log from the OpenEnv Hackathon | Cognitive Load Manager*

---

There's something that always bugged me about productivity tools.

They're really good at telling you *what* to do. Deadlines, priorities, due dates: all of it. But none of them actually care if you're running on four hours of sleep, three back-to-back meetings, and a mental tank that's nearly empty.

That gap is exactly what we decided to build for.

---

## 🎥 Watch First

| | |
|---|---|
| **2-min project walkthrough (Loom)** | [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
| **Full dashboard demo (Google Drive)** | [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
| **Training notebook (Colab, re-runnable)** | [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |

---

## The Problem We're Solving

Most AI systems treat humans like stateless machines. You give them a task, they complete it, move on. But anyone who's actually worked in a high-pressure environment knows that's not how it works. Performance is nonlinear. Fatigue compounds. Stress from one task bleeds into the next. Context switching has a real cognitive cost, and that cost adds up fast.

We wanted to build an environment where an AI could learn to account for all of that. Not just "what's the most efficient order of tasks" but "what's the most *sustainable* order, given the human doing the work."

That's the Cognitive Load Manager.

---

## What We Built

The Cognitive Load Manager is a **multi-agent reinforcement learning environment** built on top of **OpenEnv** (latest release). It simulates a real knowledge-work day, with all the messiness that comes with it.

Here's the setup:

- **Three worker agents**, each carrying internal state: energy level, stress level, current task load, and fatigue accumulation
- **One manager agent** (the AI we're training) that observes the full workspace and makes decisions every step
- **A task pool** with deadlines, dependencies between tasks, and varying complexity

The manager's job is to decide: who gets assigned what, when to delay a task, and when to give someone a break. Every decision has downstream consequences. Overload a worker and their stress spikes, their quality drops. Under-assign and you miss deadlines. The agent has to learn to walk that line.
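
To make the trade-off concrete, here is a minimal sketch of how a worker's internal state could react to assignments and breaks. It is an illustration only, not the project's code: the `WorkerState` fields mirror the four state variables listed above, but `MAX_COMFORTABLE_LOAD`, the update constants, and the `work_quality` formula are all assumptions picked for readability.

```python
from dataclasses import dataclass, replace

MAX_COMFORTABLE_LOAD = 2  # hypothetical: tasks a worker can hold before overload


@dataclass(frozen=True)
class WorkerState:
    energy: float = 1.0   # 1.0 = fully rested
    stress: float = 0.0   # 0.0 = calm
    task_load: int = 0    # tasks currently assigned
    fatigue: float = 0.0  # accumulates over the session


def clamp(x: float) -> float:
    """Keep a value inside [0, 1]."""
    return max(0.0, min(1.0, x))


def assign_task(w: WorkerState, complexity: float) -> WorkerState:
    """Assigning work costs energy; stress jumps once the load exceeds the comfortable limit."""
    load = w.task_load + 1
    overload = max(0, load - MAX_COMFORTABLE_LOAD)
    return replace(
        w,
        task_load=load,
        energy=clamp(w.energy - 0.1 * complexity),
        stress=clamp(w.stress + 0.05 + 0.2 * overload),
        fatigue=clamp(w.fatigue + 0.05 * complexity),
    )


def give_break(w: WorkerState) -> WorkerState:
    """A break restores some energy and relieves some stress."""
    return replace(w, energy=clamp(w.energy + 0.2), stress=clamp(w.stress - 0.15))


def work_quality(w: WorkerState) -> float:
    """Assumed quality model: high energy and low stress give better output."""
    return clamp(w.energy * (1.0 - w.stress))


fresh = WorkerState()
overloaded = fresh
for _ in range(4):
    overloaded = assign_task(overloaded, complexity=1.0)
```

In this toy model, piling four tasks on one worker lowers `work_quality` well below that of a fresh worker, and a single `give_break` call recovers part of it. That is the kind of tension the manager has to balance.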

What makes the environment harder (and more realistic) is what we layered on top:

- **Context-switching penalties**: moving between unrelated tasks isn't free, and the environment models that cost
- **Fatigue accumulation**: workers get less effective as the session goes on, and the decline is nonlinear
- **Mid-episode rule changes**: deadlines shift, new tasks drop in, priorities change. In our dashboard you can see this live: a "Schema Drift" alert fires mid-episode ("URGENT: Production server down, all code reviews now critical") and the agent has to adapt its decisions in real time. It can't just replay a fixed plan
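
The three mechanics above can be sketched in a few lines. This is a hedged illustration, not the environment's actual implementation: the quadratic decay, the flat `penalty` value, the task dictionary schema, and every function name here are assumptions.

```python
def effectiveness(steps_worked: int, decay: float = 0.005) -> float:
    """Assumed nonlinear fatigue: each extra step costs more than the previous one."""
    return max(0.0, 1.0 - decay * steps_worked ** 2)


def context_switch_cost(previous, current, penalty: float = 0.1) -> float:
    """Flat cost when a worker moves to a different task category."""
    if previous is None or previous == current:
        return 0.0
    return penalty


def step_progress(base: float, steps_worked: int, previous, current) -> float:
    """Progress made this step after fatigue and switching costs are applied."""
    return max(0.0, base * effectiveness(steps_worked) - context_switch_cost(previous, current))


def apply_rule_change(tasks, category: str, new_priority: str):
    """Mid-episode drift: return a copy of the queue with one category re-prioritized."""
    return [
        {**t, "priority": new_priority} if t["category"] == category else dict(t)
        for t in tasks
    ]


queue = [
    {"name": "email", "category": "email", "priority": "critical"},
    {"name": "code_review", "category": "code_review", "priority": "high"},
]
# Mirrors the "all code reviews now critical" alert from the dashboard.
drifted = apply_rule_change(queue, "code_review", "critical")
```

Because `apply_rule_change` returns a new queue rather than mutating the old one, a plan computed before the alert no longer matches the state the agent sees afterwards, which is why a fixed plan cannot simply be replayed.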

This maps to **Theme 1 (Multi-Agent Interactions)**: three worker agents with independent states, a manager that has to model their condition under partial observability, and emergent cooperation between the scheduling decisions and the workers' capacity. It also sits in **Theme 3.1 (World Modeling / Professional Tasks)** because the manager is doing real orchestration: updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's step/reset interface.

---

## How the Environment Works

The environment follows the standard OpenEnv interface:

- `env.reset()` initializes a fresh workday: randomized task loads, worker states, deadline distributions
- `env.step(action)` takes the manager's decision and returns the next observation, reward, and done flag
- **Observations** include: per-worker energy and stress readings, task queue state, time remaining, dependency graph
- **Actions** include: assign task to a worker, focus a worker on current task, delay a task, or give a worker a break
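
The reset/step contract described above can be shown with a toy stand-in. `ToyWorkdayEnv` below is hypothetical and does not use the OpenEnv package or its real classes; the observation keys, the `(kind, worker)` action tuple, and the energy numbers are assumptions that only illustrate the shape of the loop.

```python
import random


class ToyWorkdayEnv:
    """Toy stand-in for a workday environment; not the OpenEnv API."""

    ACTIONS = ("assign", "focus", "delay", "break")

    def __init__(self, n_workers=3, horizon=8, seed=0):
        self.n_workers = n_workers
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        """Start a fresh workday with randomized worker energy and task count."""
        self.t = 0
        self.energy = [round(self.rng.uniform(0.6, 1.0), 2) for _ in range(self.n_workers)]
        self.tasks_remaining = self.rng.randint(3, 5)
        return self._observation()

    def _observation(self):
        return {
            "energy": list(self.energy),
            "tasks_remaining": self.tasks_remaining,
            "time_remaining": self.horizon - self.t,
        }

    def step(self, action):
        """Apply one manager decision and return (observation, reward, done)."""
        kind, worker = action
        if kind not in self.ACTIONS:
            raise ValueError(f"unknown action: {kind}")
        reward = 0.0
        if kind == "break":
            self.energy[worker] = min(1.0, self.energy[worker] + 0.2)
        elif kind in ("assign", "focus"):
            # Work only pays off if the worker has enough energy left.
            if self.tasks_remaining > 0 and self.energy[worker] >= 0.3:
                self.tasks_remaining -= 1
                reward = 1.0
            self.energy[worker] = max(0.0, self.energy[worker] - 0.25)
        # "delay" changes nothing except that time passes.
        self.t += 1
        done = self.tasks_remaining == 0 or self.t >= self.horizon
        return self._observation(), reward, done


def run_episode(env):
    """Roll out a simple rule: rest the best worker if tired, otherwise assign work."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        worker = max(range(len(obs["energy"])), key=obs["energy"].__getitem__)
        kind = "break" if obs["energy"][worker] < 0.3 else "assign"
        obs, reward, done = env.step((kind, worker))
        total += reward
    return total, obs


total_reward, final_obs = run_episode(ToyWorkdayEnv(seed=0))
```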

The reward function is where we spent the most time. Early versions just rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. That's not what we wanted.

We rebuilt it around five scored dimensions with explicit weights:

```
score = completion×0.6 + deadline×0.22 + energy×0.1 + dep×0.05 + interrupt×0.03
∈ (0.01, 0.99)
```

| Dimension | Weight | What it measures |
|---|---|---|
| Task Completion | ×0.60 | Fraction of tasks fully completed, weighted by priority |
| Deadline Adherence | ×0.22 | Bonus for finishing before deadline; penalty for missing it |
| Energy Efficiency | ×0.10 | Penalizes high worker fatigue and stress spikes |
| Dependency Bonus | ×0.05 | Reward for respecting task dependency order |
| Interruption Bonus | ×0.03 | Reward for minimizing context-switching interruptions |
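
As a quick sanity check on the formula, the weighted sum can be written out directly. This is a sketch under two assumptions that the post does not spell out: each component score lies in [0, 1], and the `(0.01, 0.99)` range is enforced by clamping. The names `WEIGHTS` and `composite_score` are illustrative.

```python
# Weights taken from the table above; they sum to 1.0.
WEIGHTS = {
    "completion": 0.60,
    "deadline": 0.22,
    "energy": 0.10,
    "dependency": 0.05,
    "interruption": 0.03,
}


def composite_score(components: dict, lo: float = 0.01, hi: float = 0.99) -> float:
    """Weighted sum of the five component scores, clamped to [lo, hi] (assumed)."""
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return max(lo, min(hi, raw))


# A "grind" outcome: everything finished on time, workers fully drained.
grind = composite_score(
    {"completion": 1.0, "deadline": 1.0, "energy": 0.0, "dependency": 1.0, "interruption": 1.0}
)
```

Under these assumptions the grind outcome still scores 0.90, since the energy term carries only a tenth of the total weight. How strongly the agent feels worker fatigue therefore also depends on how fatigue feeds back into completion and deadline scores, not on the energy weight alone.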

Getting the weights right took a few rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that it started refusing to assign tasks at all. We landed on a balance where the agent learns to *anticipate* stress buildup rather than react to it, which is what you actually want from a good manager.

---

## Training

We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.

The full training notebook is here (one click, all dependencies handled, re-runnable end to end):

[https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)

The training loop:

1. The model (manager agent) receives an observation from the environment
2. It generates an action, structured as a decision over the available action space
3. The action executes in the environment, and a reward is returned
4. GRPO updates the model based on relative reward signal across a batch of rollouts
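
Step 4 is the part that distinguishes GRPO: each rollout is scored relative to the other rollouts sampled for the same observation, rather than against a learned value baseline. Below is a minimal sketch of that normalization in plain Python. It is not TRL's internal code; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-4):
    """Normalize each rollout's reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division stable when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled manager decisions for the same observation, with their rewards.
group_rewards = [0.10, 0.30, 0.20, 0.20]
advantages = group_relative_advantages(group_rewards)
```

Rollouts that beat the group average get a positive advantage and are reinforced, while below-average ones are pushed down. If every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal.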

We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750 to 1000.

---

## Results

The numbers came out better than we expected.

**Before vs After GRPO**, measured during 1000-step fine-tuning on the CLM environment:

| | Before | After | Lift |
|---|---|---|---|
| Mean Reward | 0.101 | 0.265 | **+163%** |
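
For readers who want to reproduce the lift figure, it is a plain relative change. The helper below is illustrative (the name `relative_lift` is not from the project) and uses the rounded values shown in the table.

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100


lift = relative_lift(0.101, 0.265)
print(f"{lift:.1f}%")  # prints 162.4%
```

The rounded table values give roughly 162%, so the reported +163% was presumably computed from the unrounded mean rewards.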

Per-action reward breakdown after training:

| Action | Reward (After) | What changed |
|---|---|---|
| Focus | 0.249 | Highest: agent learned to protect deep work blocks |
| Work | Improved significantly | Better task-worker matching |
| Break | 0.040 | Positive: agent learned breaks aren't wasted time |
| Delay | 0.019 | Low but selective: used strategically, not as default |

**Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).

What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these was an explicit rule, just costs in the reward function that the agent discovered on its own.

See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:

[https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)

---

## Live Environment on Hugging Face

The environment is deployed as a Hugging Face Space: fully runnable, no local setup required. Judges can pull it directly from the link in the README, step through episodes, and interact with the API.

For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:

[https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)

---

## Where This Goes

We built this as a hackathon project, but the problem it's solving is real and underserved.

Near-term: developer-facing APIs that let teams plug human-aware scheduling into tools they already use: Slack, Linear, Notion. Not replacing them. Adding a layer that understands worker state.

Longer out: the same environment architecture adapts to other high-stakes domains. An adaptive learning system that knows when a student is cognitively overloaded, not just academically behind. A clinical scheduling tool that models doctor fatigue before it leads to errors.

The environment is the foundation. What you train on it is what changes.

---

## What We'd Do Differently

Honest reflection: reward shaping took way longer than it should have. We went through three versions before finding something that produced the behavior we actually wanted. If we were starting over, we'd prototype the reward function with a simple heuristic agent first, to validate that the signal makes sense before involving the LLM at all.
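
A heuristic baseline of that kind could be as small as the sketch below. It is hypothetical: the observation schema (`workers` with `energy` and `busy`, `tasks` with `name` and `deadline`), the `(kind, worker, task)` action tuple, and the `break_below` threshold are all assumptions chosen for illustration, not the environment's real format.

```python
def heuristic_manager(obs: dict, break_below: float = 0.3):
    """Rule-based manager: rest tired workers first, then assign the most urgent task."""
    workers = obs["workers"]

    # 1. If anyone is running low on energy, rest the most depleted worker.
    tired = [i for i, w in enumerate(workers) if w["energy"] < break_below]
    if tired:
        return ("break", min(tired, key=lambda i: workers[i]["energy"]), None)

    # 2. Otherwise give the earliest-deadline task to the freshest idle worker.
    idle = [i for i, w in enumerate(workers) if not w["busy"]]
    pending = sorted(obs["tasks"], key=lambda t: t["deadline"])
    if idle and pending:
        best = max(idle, key=lambda i: workers[i]["energy"])
        return ("assign", best, pending[0]["name"])

    # 3. Nothing to hand out: keep the freshest worker focused on current work.
    freshest = max(range(len(workers)), key=lambda i: workers[i]["energy"])
    return ("focus", freshest, None)


example_obs = {
    "workers": [
        {"energy": 0.9, "busy": True},
        {"energy": 0.5, "busy": False},
        {"energy": 0.7, "busy": False},
    ],
    "tasks": [
        {"name": "code_review", "deadline": 5},
        {"name": "email", "deadline": 2},
    ],
}
action = heuristic_manager(example_obs)
```

Running a policy like this next to a deliberately bad one (for example, one that never gives breaks) and checking that the reward ranks them in the expected order is a cheap way to catch a misaligned reward before spending any training compute.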

We'd also add worker personalization. Right now all three workers share the same fatigue model. Real people have different capacities, different stress tolerances, different recovery patterns. Per-worker profiles that the manager has to individually learn would make this significantly more powerful, and more honest about what human-aware AI actually needs to do.

---

## All Links

| Resource | Link |
|---|---|
| 🤗 HF Space (live environment) | Linked in README |
| Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
| 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
| 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |

---

*Built for the OpenEnv Hackathon, April 2026.*