Commit 44963dd by soumi guria (1 parent: 77f89a1)
updated readme

README.md CHANGED
@@ -6,32 +6,73 @@ colorTo: red
sdk: docker
app_file: server/app.py
pinned: false
tags: [openenv, rl, scheduling, agent-eval, productivity, multi-agent, grpo, reinforcement-learning]
---

# 🧠 Cognitive Load Manager (CLM)

**A Multi-Agent OpenEnv RL Environment (OpenEnv Hackathon, April 2026)**

---

## 🎥 See It Running First

| Resource | Link |
|---|---|
| **2-min project walkthrough (Loom)** | [Watch on Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
| **Full dashboard demo (Google Drive)** | [Watch Demo](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
| **Training notebook (Colab, re-runnable)** | [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |

---

## The Problem

Productivity tools are good at one thing: telling you *what* to do. Deadlines, priorities, urgency tags: all mapped out. None of them accounts for whether you're running on four hours of sleep, mid-recovery from three back-to-back meetings, or operating at 40% capacity because the last task drained you.

That gap is real. Performance isn't linear. Fatigue compounds across a workday. Stress from one task bleeds into the next. Context switching has a measurable cognitive cost that most schedulers treat as zero.

The Cognitive Load Manager is built around that gap. It's a simulation environment where an AI agent learns to schedule work the way a *good manager* would: not just efficiently but sustainably, with real awareness of the people doing the work.

---

## What We Built

CLM is a **multi-agent reinforcement learning environment** built on the OpenEnv interface. It simulates a knowledge-work day: tasks of different types, deadlines with real consequences, worker states that shift throughout the episode, and mid-session surprises that force the agent to adapt.

The setup:

- **Three worker agents**, each carrying independent internal state: energy level, stress level, current task load, and fatigue that accumulates non-linearly across the session
- **One manager agent** (the AI being trained) that observes the full workspace state and makes a scheduling decision at every step
- **A task pool** with deadlines, dependency chains, and varying complexity levels (email, code review, reports, meetings, calls)

The manager has to decide who gets what, when to push, when to delay, and when a worker genuinely needs a break. Every call has downstream consequences. Burn a worker out and their output quality drops, stress spikes, and you lose throughput precisely when you need it. Under-assign and deadlines slip. The agent has to find, and hold, the line between the two.

What makes the environment harder than a standard scheduling problem:

- **Context-switching penalties**: moving between unrelated tasks is not free. Every switch carries a cost, and the agent learns to protect focus blocks.
- **Non-linear fatigue accumulation**: workers don't degrade evenly. The drop accelerates as the session progresses.
- **Mid-episode rule changes**: deadlines shift, urgent tasks are injected mid-episode, priorities flip. In the live dashboard you can watch a "Schema Drift" alert fire mid-run (*"URGENT: Production server down - all code reviews now critical"*) and see the agent recalibrate in real time. There's no fixed plan to replay; the agent has to actually adapt.
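
To make the non-linear fatigue idea concrete, here is a minimal sketch of a fatigue curve whose per-step cost grows as the session progresses. The function names, the quadratic shape, and the constants `base` and `accel` are illustrative assumptions, not the environment's actual model.

```python
def fatigue_increment(step: int, horizon: int, base: float = 0.02, accel: float = 3.0) -> float:
    """Fatigue added at `step` of a `horizon`-step session (hypothetical model)."""
    fraction = step / horizon  # 0.0 at the start, approaching 1.0 near the end
    # Quadratic growth: late steps cost more than early ones.
    return base * (1.0 + accel * fraction ** 2)


def accumulated_fatigue(steps_worked: int, horizon: int) -> float:
    """Total fatigue after `steps_worked` consecutive work steps, capped at 1.0."""
    total = sum(fatigue_increment(s, horizon) for s in range(steps_worked))
    return min(1.0, total)


# Compare how much fatigue the first and second halves of a session add.
first_half = accumulated_fatigue(10, 20)
second_half = accumulated_fatigue(20, 20) - first_half
print(first_half, second_half)
```

Under this assumed curve the second half of the session adds more than twice the fatigue of the first half, which is the kind of pressure that rewards scheduling breaks early.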

This maps to **Theme 1 (Multi-Agent Interactions)**: three workers with independent states, a manager operating under partial observability, and emergent coordination between scheduling decisions and worker capacity. It also sits squarely in **Theme 3.1 (World Modeling / Professional Tasks)**: the manager does genuine orchestration, updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's standard step/reset interface.

---

## 🎯 Why This Environment Matters

To our knowledge, no existing RL environment models knowledge-work cognitive load in a principled, agent-evaluable way. CLM fills that gap:

- **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems
- **Useful for evaluating LLM planning ability**, especially multi-step planning under resource constraints and changing conditions
- **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit
- **Dense reward signal** across the full trajectory, not just terminal rewards

---

## 🕹️ Actions

@@ -86,6 +127,7 @@ Action format:
- `focus_mode`: whether the agent is currently in deep-work state

## Tasks & Baseline Scores

| Level | Tasks | Deadlines | Dependencies | Interruptions | Baseline Score |

@@ -109,43 +151,86 @@ score = weighted_completion × 0.60

```
score = weighted_completion × 0.60
      ...
      + interruption_bonus × 0.03
```

| Dimension | Weight | What it measures |
|---|---|---|
| Task Completion | ×0.60 | Fraction of tasks fully completed, weighted by priority |
| Deadline Adherence | ×0.22 | Bonus for finishing before deadline; penalty for missing it |
| Energy Efficiency | ×0.10 | Penalizes high worker fatigue and stress spikes |
| Dependency Bonus | ×0.05 | Reward for respecting task dependency order |
| Interruption Bonus | ×0.03 | Reward for minimizing context-switching interruptions |

Score is always in **(0.01, 0.99)**, never exactly 0 or 1.
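
The weighted sum in the table can be sketched as below. The weights come from the table; the key names other than `weighted_completion` and `interruption_bonus` are illustrative, and enforcing the (0.01, 0.99) range by clamping is an assumption about how the bound is applied.

```python
# Weights copied from the scoring table. Only weighted_completion and
# interruption_bonus are names that appear in the README; the rest are assumed.
WEIGHTS = {
    "weighted_completion": 0.60,
    "deadline_adherence": 0.22,
    "energy_efficiency": 0.10,
    "dependency_bonus": 0.05,
    "interruption_bonus": 0.03,
}


def final_score(components: dict[str, float]) -> float:
    """Weighted sum of per-dimension values in [0, 1], kept inside [0.01, 0.99]."""
    raw = sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())
    # Assumption: the documented bounds are enforced by clamping.
    return min(0.99, max(0.01, raw))


perfect = {name: 1.0 for name in WEIGHTS}
print(final_score(perfect))  # hits the upper bound
print(final_score({}))       # hits the lower bound
```

Because the five weights sum to 1.0, a run that maxes every dimension would reach 1.0 before clamping, which is why an upper bound is needed to keep the score strictly below 1.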

Getting the weights right took several rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that the agent started refusing to assign tasks altogether. The final balance produces an agent that *anticipates* stress buildup rather than reacting to it after the fact, which is the behavior you actually want.

## Reward Shaping Details

Step rewards provide **dense signal** across the full trajectory:

| Event | Reward |
|-------|--------|
| Task progress (normal) | +0.10 × progress_delta × priority_weight |
| Milestone 25% | +0.04 × priority_weight |
| Milestone 50% | +0.07 × priority_weight |
| Milestone 75% | +0.09 × priority_weight |
| Task complete 100% | +0.18 × priority_weight |
| Context switch | −0.07 |
| Work on blocked task | −0.15 |
| Interruption arrives | −0.05 |
| Episode: burnout | −1.0 |
| Episode: all done (on time) | +1.0 |
| Episode: all done (late) | +0.5 |

Early versions of the reward function only rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. Three full rebuilds later, the current structure produces measurably better behavior.
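
As a rough illustration of how the reward table composes within one step, here is a minimal sketch. The `StepEvent` fields and both function names are hypothetical. The sketch also assumes a milestone bonus is paid once, on the step where progress crosses its threshold, and that several bonuses can stack in one step.

```python
from dataclasses import dataclass

# (progress threshold, bonus) pairs taken from the reward table.
MILESTONES = ((0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18))


@dataclass
class StepEvent:
    """What happened during one environment step (hypothetical structure)."""
    progress_before: float
    progress_after: float
    priority_weight: float = 1.0
    context_switch: bool = False
    worked_on_blocked: bool = False
    interruption_arrived: bool = False


def step_reward(ev: StepEvent) -> float:
    """Dense per-step reward built from the table's progress, milestone, and penalty rows."""
    delta = max(0.0, ev.progress_after - ev.progress_before)
    reward = 0.10 * delta * ev.priority_weight
    for threshold, bonus in MILESTONES:
        # Pay the bonus only on the step that crosses the threshold.
        if ev.progress_before < threshold <= ev.progress_after:
            reward += bonus * ev.priority_weight
    if ev.context_switch:
        reward -= 0.07
    if ev.worked_on_blocked:
        reward -= 0.15
    if ev.interruption_arrived:
        reward -= 0.05
    return reward


def terminal_reward(burnout: bool, all_done: bool, on_time: bool) -> float:
    """Episode-end reward from the last three rows of the table."""
    if burnout:
        return -1.0
    if all_done:
        return 1.0 if on_time else 0.5
    return 0.0


print(step_reward(StepEvent(0.2, 0.3)))                       # progress plus the 25% milestone
print(step_reward(StepEvent(0.5, 0.5, context_switch=True)))  # pure switching penalty
```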

---

## Training

We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.

The full training notebook opens in one click, handles its own dependencies, and is re-runnable end to end:

[Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)

The training loop:

1. The model (manager agent) receives an observation from the environment
2. It generates an action, structured as a decision over the available action space
3. The action executes in the environment; a reward is returned
4. GRPO updates the model based on the relative reward signal across a batch of rollouts
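
The "relative reward signal" in step 4 can be illustrated with the group-relative advantage that GRPO is built around: each rollout's reward is compared with the mean of its group and scaled by the group's spread. This is a minimal standalone sketch, not the TRL implementation. The function name is hypothetical, and using the population standard deviation is a simplifying assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against the other rollouts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    # eps keeps the division defined when every rollout earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled for the same observation, scored by the environment.
advantages = group_relative_advantages([0.05, 0.10, 0.30, 0.15])
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. No separate value network is needed for the baseline.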

We ran the primary training run for 1000 steps. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, then stabilizing at a higher plateau through steps 750 to 1000.

---

## Results

**Before vs. after GRPO**, measured during 1000-step fine-tuning on the CLM environment:

| Metric | Before | After | Lift |
|---|---|---|---|
| Mean Reward | 0.101 | 0.265 | **+163%** |
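
The lift column is a relative improvement, (after − before) / before. A quick check using the rounded means from the table gives roughly +162%, so the reported +163% was presumably computed from unrounded values. That is an inference from the arithmetic, not something the source states. The helper name below is illustrative.

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage improvement of `after` over `before`."""
    return (after - before) / before * 100.0


lift = relative_lift(0.101, 0.265)
print(f"{lift:.1f}%")  # prints "162.4%" for the rounded table values
```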

Per-action reward breakdown after training:

| Action | Reward (After) | What changed |
|---|---|---|
| Focus | 0.249 | Highest; the agent learned to protect deep-work blocks |
| Work | Improved significantly | Better task-worker matching |
| Break | 0.040 | Positive; the agent learned breaks aren't wasted time |
| Delay | 0.019 | Low but selective; used strategically, not as a default |

**Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).

What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless deadline pressure forced it. Neither of those was an explicit rule; both emerged from costs in the reward function that the agent discovered on its own.

See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:

[Full dashboard demo](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)

---

## 🏗️ Architecture

@@ -173,24 +258,35 @@ graph TD

```
graph TD
    ...
    API -->|OpenEnv spec| OE[openenv validate]
```

---

## Setup

### Docker (for HF Space / production)
```bash
docker build -t clm-env .
docker run -p 7860:7860 clm-env
```

### Local development
```bash
pip install -r requirements.txt
uvicorn server.app:app --port 7860 --reload
```

### Run inference baseline
```bash
export HF_TOKEN="hf_your_token_here"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py
```

### Optional: React Dashboard
```bash
cd frontend && npm install && npm run dev
# Visit http://localhost:5173
```

## ⚙️ Environment Variables

@@ -200,3 +296,38 @@ Step rewards provide **dense signal** across the full trajectory:
| `API_BASE_URL` | LLM API endpoint (e.g. `https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Model identifier (default: `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | Hugging Face API token |
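
A script consuming these variables might read them as in the sketch below. `InferenceConfig` and `load_config` are hypothetical names, not taken from `inference.py`. Falling back to the example URL for `API_BASE_URL` is an assumption, since the table only documents a default for `MODEL_NAME`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str


def load_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Build the config from environment variables, failing early if the token is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("HF_TOKEN is not set")
    return InferenceConfig(
        # Assumed fallback: the table gives this URL only as an example.
        api_base_url=env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        # Documented default from the table.
        model_name=env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        hf_token=token,
    )


# Passing a plain dict keeps the example independent of the real environment.
config = load_config({"HF_TOKEN": "hf_your_token_here"})
print(config.model_name)
```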

---

## Where This Goes

This started as a hackathon project. The problem it's solving isn't going away.

Near-term: developer-facing APIs that let teams plug human-aware scheduling into tools they already use, such as Slack, Linear, and Notion. The goal is not to replace them but to add a layer that actually understands worker state.

Longer out: the same environment architecture adapts to other domains where human capacity matters. An adaptive learning system that knows when a student is cognitively overloaded, not just academically behind. A clinical scheduling tool that models physician fatigue before it compounds into errors.

The environment is the foundation. What you train on it is what changes.

---

## 🪞 Honest Reflection

Reward shaping took far longer than it should have. We went through three complete versions before finding something that produced the behavior we actually wanted. If we were starting over, we'd prototype the reward function with a simple heuristic agent first, to validate that the signal makes sense before involving the LLM at all.
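
A heuristic agent of the kind described could be as small as the sketch below. The action dictionary shape, the task fields, and the 0.3 / 0.7 thresholds are all hypothetical. The README's own action format is not shown in this diff, so this illustrates the idea rather than the project's interface.

```python
from typing import Any


def heuristic_action(energy: float, stress: float, tasks: list[dict[str, Any]]) -> dict[str, Any]:
    """Rule-based baseline: rest when depleted, otherwise work the most urgent ready task."""
    # Assumed thresholds: rest when energy is low or stress is high.
    if energy < 0.3 or stress > 0.7:
        return {"type": "break"}
    ready = [t for t in tasks if not t["blocked"] and t["progress"] < 1.0]
    if not ready:
        return {"type": "delay"}
    # Highest priority first; among equals, the earliest deadline wins.
    best = max(ready, key=lambda t: (t["priority"], -t["deadline"]))
    return {"type": "work", "task_id": best["id"]}


tasks = [
    {"id": "email", "priority": 3, "deadline": 5, "blocked": False, "progress": 0.0},
    {"id": "code_review", "priority": 2, "deadline": 3, "blocked": False, "progress": 0.0},
    {"id": "report", "priority": 3, "deadline": 2, "blocked": True, "progress": 0.0},
]
print(heuristic_action(0.8, 0.2, tasks))
print(heuristic_action(0.1, 0.2, tasks))
```

Running a baseline like this through the environment shows whether the reward separates sensible schedules from reckless ones before any model training is involved.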

We'd also add worker personalization. Right now all three workers share the same fatigue model. Real people have different capacities, different stress tolerances, different recovery curves. Per-worker profiles that the manager has to learn individually would make this significantly more powerful, and more honest about what human-aware AI actually needs to do.

---

## All Links

| Resource | Link |
|---|---|
| 🤗 HF Space (live environment) | Linked above (this Space) |
| Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
| 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
| 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |

---

*Built for the OpenEnv Hackathon, April 2026.*