soumi guria committed on
Commit 44963dd · 1 Parent(s): 77f89a1

updated readme

Files changed (1)
  1. README.md +184 -53
README.md CHANGED
@@ -6,32 +6,73 @@ colorTo: red
  sdk: docker
  app_file: server/app.py
  pinned: false
- tags: [openenv, rl, scheduling, agent-eval, productivity]
  ---

- # 🧠 Cognitive Load Manager (CLM)

- **An OpenEnv RL Simulation for the Meta PyTorch Hackathon**

  [![OpenEnv](https://img.shields.io/badge/Powered_by-OpenEnv-brightgreen?style=for-the-badge)](#)
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)

- CLM is a **real-world productivity simulation** where an AI agent plays the role of a human knowledge worker's task scheduler. It must manage heterogeneous work items like emails, meetings, code reviews, reports, and calls each with different cognitive demands, deadlines, priorities, and dependencies, while keeping the worker's energy and stress within safe bounds.

- *This is not a toy game.* CLM models how humans actually experience workload: stress accumulates when deadlines approach, fatigue reduces efficiency, context-switching has a cognitive cost, and deep focus yields better output at the expense of higher energy.

- ## 🎯 Why This Environment Matters

- Modern knowledge workers face **cognitive load management** as one of their most critical daily challenges, yet no RL environment has modelled this domain in a principled, agent-evaluatable way. CLM fills this gap:

- - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems.
- - **Useful for evaluating LLM planning ability** especially multi-step planning under resource constraints.
- - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit.

  ## 🕹️ Actions

@@ -86,6 +127,7 @@ Action format:
  - `focus_mode`: whether the agent is currently in deep-work state


  ## 📋 Tasks & Baseline Scores

  | Level | Tasks | Deadlines | Dependencies | Interruptions | Baseline Score |
@@ -109,43 +151,86 @@ score = weighted_completion × 0.60
  + interruption_bonus × 0.03
  ```

- - **weighted_completion**: sum of (progress × priority_weight) / total_weight
- - **deadline_adherence**: fraction of deadline tasks completed on time
- - **energy_efficiency**: bonus for finishing with energy > 0.10
- - **dependency_bonus**: reward for correctly sequencing dependent tasks
- - **interruption_bonus**: reward for handling injected urgent tasks

  Score is always in **(0.01, 0.99)**, never exactly 0 or 1.


- ## 🚀 Setup

- ### Docker (for HF Space / production)
- ```bash
- docker build -t clm-env .
- docker run -p 7860:7860 clm-env
- ```

- ### Local development
- ```bash
- pip install -r requirements.txt
- uvicorn server.app:app --port 7860 --reload
- ```

- ### Run inference baseline
- ```bash
- export HF_TOKEN="hf_your_token_here"
- export API_BASE_URL="https://router.huggingface.co/v1"
- export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
- python inference.py
- ```

- ### Optional: React Dashboard
- ```bash
- cd frontend && npm install && npm run dev
- # Visit http://localhost:5173
- ```


  ## 🏛️ Architecture

@@ -173,24 +258,35 @@ graph TD
  API -->|OpenEnv spec| OE[openenv validate]
  ```


- ## 📊 Reward Shaping Details

- Step rewards provide **dense signal** across the full trajectory:

- | Event | Reward |
- |-------|--------|
- | Task progress (normal) | +0.10 × progress_delta × priority_weight |
- | Milestone 25% | +0.04 × priority_weight |
- | Milestone 50% | +0.07 × priority_weight |
- | Milestone 75% | +0.09 × priority_weight |
- | Task complete 100% | +0.18 × priority_weight |
- | Context switch | −0.07 |
- | Work on blocked task | −0.15 |
- | Interruption arrives | −0.05 |
- | Episode: burnout | −1.0 |
- | Episode: all done (on time) | +1.0 |
- | Episode: all done (late) | +0.5 |


  ## ⚙️ Environment Variables
@@ -200,3 +296,38 @@ Step rewards provide **dense signal** across the full trajectory:
  | `API_BASE_URL` | LLM API endpoint (e.g. `https://router.huggingface.co/v1`) |
  | `MODEL_NAME` | Model identifier (default: `Qwen/Qwen2.5-72B-Instruct`) |
  | `HF_TOKEN` | Hugging Face API token |

  sdk: docker
  app_file: server/app.py
  pinned: false
+ tags: [openenv, rl, scheduling, agent-eval, productivity, multi-agent, grpo, reinforcement-learning]
  ---

+ # 🧠 Cognitive Load Manager (CLM)

+ **A Multi-Agent OpenEnv RL Environment (OpenEnv Hackathon, April 2026)**

  [![OpenEnv](https://img.shields.io/badge/Powered_by-OpenEnv-brightgreen?style=for-the-badge)](#)
  [![Python 3.11](https://img.shields.io/badge/Python-3.11-blue?style=for-the-badge&logo=python)](#)
  [![React Dashboard](https://img.shields.io/badge/React-Live_Dashboard-blue?style=for-the-badge&logo=react)](#)
+ [![GRPO Training](https://img.shields.io/badge/Trained_with-GRPO%20%2B%20TRL-orange?style=for-the-badge)](#)
+ [![Qwen 1.5B](https://img.shields.io/badge/Model-Qwen_1.5B-purple?style=for-the-badge)](#)

+ ---

+ ## 🎥 See It Running First

+ | | |
+ |---|---|
+ | **2-min project walkthrough (Loom)** | 👉 [Watch on Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
+ | **Full dashboard demo (Google Drive)** | 👉 [Watch Demo](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
+ | **Training notebook (Colab, re-runnable)** | 👉 [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |

+ ---

+ ## The Problem
+
+ Productivity tools are good at one thing: telling you *what* to do. Deadlines, priorities, urgency tags: all mapped out. What none of them do is care whether you're running on four hours of sleep, mid-recovery from three back-to-back meetings, or operating at 40% capacity because the last task drained you.
+
+ That gap is real. Performance isn't linear. Fatigue compounds across a workday. Stress from one task bleeds into the next. Context switching has a measurable cognitive cost that most schedulers treat as zero.
+
+ The Cognitive Load Manager is built around that gap. It's a simulation environment where an AI agent learns to schedule work the way a *good manager* would: not just efficiently but sustainably, with actual awareness of the humans doing the work.
+
+ ---
+
+ ## What We Built

+ CLM is a **multi-agent reinforcement learning environment** built on the OpenEnv interface. It simulates a real knowledge-work day: tasks of different types, deadlines with real consequences, worker states that shift throughout the episode, and mid-session surprises that force the agent to adapt.

+ The setup:
+
+ - **Three worker agents**, each carrying independent internal state: energy level, stress level, current task load, and fatigue that builds non-linearly across the session
+ - **One manager agent** (the AI being trained) that observes the full workspace state and makes scheduling decisions every step
+ - **A task pool** with deadlines, dependency chains, and varying complexity levels (email, code review, reports, meetings, calls)
+
+ The manager has to decide who gets what, when to push, when to delay, and when a worker genuinely needs a break. Every call has downstream consequences. Burn a worker out and their output quality drops, stress spikes, and you lose throughput precisely when you need it. Under-assign and deadlines slip. The agent has to find, and maintain, the line between the two.
+
+ What makes the environment harder than a standard scheduling problem:
+
+ - **Context-switching penalties**: moving between unrelated tasks isn't free. Every switch costs something, and the agent learns to protect focus blocks.
+ - **Non-linear fatigue accumulation**: workers don't degrade evenly. The drop accelerates as the session progresses.
+ - **Mid-episode rule changes**: deadlines shift, urgent tasks are injected mid-episode, priorities flip. In the live dashboard you can watch a "Schema Drift" alert fire mid-run (*"URGENT: Production server down, all code reviews now critical"*) and see the agent recalibrate in real time. There's no fixed plan to replay; the agent has to actually adapt.
+
+ This maps to **Theme 1 (Multi-Agent Interactions)**: three workers with independent states, a manager operating under partial observability, and emergent coordination between scheduling decisions and worker capacity. It also sits squarely in **Theme 3.1 (World Modeling / Professional Tasks)**: the manager is doing genuine orchestration, updating beliefs about worker state, sequencing task workflows, and handling dynamic interruptions through OpenEnv's standard step/reset interface.
+
+ ---
+
+ ## 🎯 Why This Environment Matters

+ No existing RL environment has modeled knowledge-work cognitive load in a principled, agent-evaluatable way. CLM fills that gap:

+ - **Useful for training agents** that assist with personal productivity tools, calendar management, and task triage systems
+ - **Useful for evaluating LLM planning ability**, especially multi-step planning under resource constraints and changing conditions
+ - **Realistic dynamics**: energy, stress, fatigue, and task dependencies create emergent difficulty that pure search algorithms cannot exploit
+ - **Dense reward signal** across the full trajectory, not just terminal rewards
+
+ ---

  ## 🕹️ Actions

 
  - `focus_mode`: whether the agent is currently in deep-work state


+
  ## 📋 Tasks & Baseline Scores

  | Level | Tasks | Deadlines | Dependencies | Interruptions | Baseline Score |
 
  + interruption_bonus × 0.03
  ```

+ | Dimension | Weight | What it measures |
+ |---|---|---|
+ | Task Completion | ×0.60 | Fraction of tasks fully completed, weighted by priority |
+ | Deadline Adherence | ×0.22 | Bonus for finishing before deadline; penalty for missing it |
+ | Energy Efficiency | ×0.10 | Penalizes high worker fatigue and stress spikes |
+ | Dependency Bonus | ×0.05 | Reward for respecting task dependency order |
+ | Interruption Bonus | ×0.03 | Reward for minimizing context-switching interruptions |

  Score is always in **(0.01, 0.99)**, never exactly 0 or 1.
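
The weighted sum and the bounded range above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual grader: the component names follow the formula in this README, each component is assumed to lie in [0, 1], and the clamp to 0.01 and 0.99 is an assumption about how the score is kept away from exactly 0 and 1.

```python
# Weights taken from the scoring table; they sum to 1.00.
WEIGHTS = {
    "weighted_completion": 0.60,
    "deadline_adherence": 0.22,
    "energy_efficiency": 0.10,
    "dependency_bonus": 0.05,
    "interruption_bonus": 0.03,
}


def episode_score(components: dict) -> float:
    """Weighted sum of per-dimension values, clamped so it never hits 0 or 1.

    `components` maps each name in WEIGHTS to a value assumed to be in [0, 1].
    The clamp bounds are an assumption based on the stated score range.
    """
    raw = sum(weight * components[name] for name, weight in WEIGHTS.items())
    return min(0.99, max(0.01, raw))


if __name__ == "__main__":
    # Hypothetical episode: half the weighted work done, nothing else earned.
    example = {name: 0.0 for name in WEIGHTS}
    example["weighted_completion"] = 0.5
    print(episode_score(example))
```

Because the weights sum to 1, a perfect episode would score 1.0 before clamping, which is why some form of bounding is needed to keep the result strictly inside the stated range.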

+ Getting the weights right took several rounds. The energy penalty needed to be strong enough that the agent couldn't ignore it, but not so dominant that the agent started refusing to assign tasks altogether. The final balance produces an agent that *anticipates* stress buildup rather than reacting to it after the fact, which is the behavior you actually want.


+ ## 📊 Reward Shaping Details

+ Step rewards provide **dense signal** across the full trajectory:

+ | Event | Reward |
+ |-------|--------|
+ | Task progress (normal) | +0.10 × progress_delta × priority_weight |
+ | Milestone 25% | +0.04 × priority_weight |
+ | Milestone 50% | +0.07 × priority_weight |
+ | Milestone 75% | +0.09 × priority_weight |
+ | Task complete 100% | +0.18 × priority_weight |
+ | Context switch | −0.07 |
+ | Work on blocked task | −0.15 |
+ | Interruption arrives | −0.05 |
+ | Episode: burnout | −1.0 |
+ | Episode: all done (on time) | +1.0 |
+ | Episode: all done (late) | +0.5 |

+ Early versions of the reward function only rewarded task completion, and the agent learned to grind workers into the ground to hit numbers. Three full rebuilds later, the current structure produces measurably better behavior.
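
One way to read the per-step rows of the reward table is the sketch below. The coefficients come from the table; everything else is an assumption made for illustration: the function name and arguments are hypothetical, milestone bonuses are assumed to be paid once when progress crosses a threshold, and they are assumed to stack with the progress term. The episode-level terminal rewards (burnout, all done) are left out.

```python
# (threshold, bonus) pairs from the table; 1.00 is the "task complete" row.
MILESTONES = ((0.25, 0.04), (0.50, 0.07), (0.75, 0.09), (1.00, 0.18))


def step_reward(before, after, priority_weight, *,
                switched=False, blocked=False, interrupted=False):
    """Hypothetical per-step reward for one task whose progress moves
    from `before` to `after` (both in [0, 1])."""
    reward = 0.0
    if blocked:
        # Working on a blocked task earns only the penalty.
        reward -= 0.15
    else:
        reward += 0.10 * (after - before) * priority_weight
        for threshold, bonus in MILESTONES:
            # Assumption: a milestone pays out only on the step that crosses it.
            if before < threshold <= after:
                reward += bonus * priority_weight
    if switched:
        reward -= 0.07
    if interrupted:
        reward -= 0.05
    return reward


if __name__ == "__main__":
    # Crossing the 25% milestone on a task with priority_weight 1.0.
    print(round(step_reward(0.20, 0.30, 1.0), 4))
```

Under these assumptions a context switch costs more than a typical progress step earns, which is consistent with the agent learning to protect focus blocks.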
+
+ ---
+
+ ## 🤖 Training
+
+ We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.
+
+ The full training notebook opens in one click, installs its own dependencies, and is re-runnable end to end:
+
+ 👉 [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)
+
+ The training loop:
+
+ 1. The model (manager agent) receives an observation from the environment
+ 2. It generates an action, structured as a decision over the available action space
+ 3. The action executes in the environment; a reward is returned
+ 4. GRPO updates the model based on the relative reward signal across a batch of rollouts
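
Step 4 of the loop above relies on GRPO's group-relative signal: each rollout's reward is compared with the other rollouts sampled for the same observation. The sketch below shows only that normalization step in plain Python. It is not the TRL implementation; the function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative rewards for four rollouts from the same observation.
    for advantage in group_relative_advantages([0.05, 0.10, 0.25, 0.40]):
        print(round(advantage, 3))
```

Because the signal is relative, a group in which every rollout earns the same reward contributes no gradient, whatever the absolute reward level.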
+
+ We ran for 1000 steps in the primary training run. The mean reward curve shows the agent moving from near-random behavior in the early steps to a clear upward trend by step 250, stabilizing at a higher plateau through steps 750 to 1000.
+
+ ---

+ ## 📈 Results
+
+ **Before vs. after GRPO**, measured during 1000-step fine-tuning on the CLM environment:
+
+ | | Before | After | Lift |
+ |---|---|---|---|
+ | Mean Reward | 0.101 | 0.265 | **+163%** |
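
The lift in the table is a relative change, which can be checked with a short calculation. With the rounded means shown in the table the result is about 162.4%; the reported +163% presumably comes from the unrounded values, which is an assumption since only the rounded figures appear here.

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Rounded mean rewards from the table above.
    print(f"{relative_lift(0.101, 0.265):.1f}%")  # prints "162.4%"
```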
+
+ Per-action reward breakdown after training:
+
+ | Action | Reward (After) | What changed |
+ |---|---|---|
+ | Focus | 0.249 | Highest; agent learned to protect deep-work blocks |
+ | Work | Improved significantly | Better task-worker matching |
+ | Break | 0.040 | Positive; agent learned breaks aren't wasted time |
+ | Delay | 0.019 | Low but selective; used strategically, not as a default |
+
+ **Episode #1** completed with a final score of **0.3393** across 11 steps on a medium-difficulty workload. The cumulative reward curve shows the agent managing energy and stress while handling a live schema drift event mid-episode. Task queue at close: email (critical, 100% complete), code_review_em2 (normal, 0%), code_review (high, 4%).
+
+ What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless deadline pressure forced it. Neither was an explicit rule, just costs in the reward function that the agent discovered independently.
+
+ See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
+
+ 👉 [Full dashboard demo](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
+
+ ---

  ## 🏛️ Architecture

 
  API -->|OpenEnv spec| OE[openenv validate]
  ```

+ ---

+ ## 🚀 Setup

+ ### Docker (for HF Space / production)
+ ```bash
+ docker build -t clm-env .
+ docker run -p 7860:7860 clm-env
+ ```

+ ### Local development
+ ```bash
+ pip install -r requirements.txt
+ uvicorn server.app:app --port 7860 --reload
+ ```
+
+ ### Run inference baseline
+ ```bash
+ export HF_TOKEN="hf_your_token_here"
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
+ python inference.py
+ ```
+
+ ### Optional: React Dashboard
+ ```bash
+ cd frontend && npm install && npm run dev
+ # Visit http://localhost:5173
+ ```


  ## ⚙️ Environment Variables
 
  | `API_BASE_URL` | LLM API endpoint (e.g. `https://router.huggingface.co/v1`) |
  | `MODEL_NAME` | Model identifier (default: `Qwen/Qwen2.5-72B-Instruct`) |
  | `HF_TOKEN` | Hugging Face API token |
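
A script that consumes these variables might read them as in the sketch below. This is not the code in `inference.py`: the function name and returned keys are hypothetical, and falling back to the example endpoint for `API_BASE_URL` is an assumption, since the table only documents a default for `MODEL_NAME`.

```python
import os


def load_inference_config(env=None):
    """Read the three documented variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError("HF_TOKEN is not set")
    return {
        # Assumption: use the example endpoint from the table as a fallback.
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        # Documented default model identifier.
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
    }


if __name__ == "__main__":
    config = load_inference_config({"HF_TOKEN": "hf_your_token_here"})
    print(config["model_name"])
```

Passing the mapping as an argument keeps the function easy to test without touching the real process environment.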
+
+ ---
+
+ ## 🔭 Where This Goes
+
+ This started as a hackathon project. The problem it's solving isn't going away.
+
+ Near-term: developer-facing APIs that let teams plug human-aware scheduling into tools they already use (Slack, Linear, Notion). Not replacing them. Adding a layer that actually understands worker state.
+
+ Longer out: the same environment architecture adapts to other domains where human capacity matters. An adaptive learning system that knows when a student is cognitively overloaded, not just academically behind. A clinical scheduling tool that models physician fatigue before it compounds into errors.
+
+ The environment is the foundation. What you train on it is what changes.
+
+ ---
+
+ ## 🪞 Honest Reflection
+
+ Reward shaping took way longer than it should have. We went through three complete versions before finding something that produced the behavior we actually wanted. If we were starting over, we'd prototype the reward function with a simple heuristic agent first, to validate that the signal makes sense before involving the LLM at all.
+
+ We'd also add worker personalization. Right now all three workers share the same fatigue model. Real people have different capacities, different stress tolerances, different recovery curves. Per-worker profiles that the manager has to learn individually would make this significantly more powerful, and more honest about what human-aware AI actually needs to do.
+
+ ---
+
+ ## 🔗 All Links
+
+ | Resource | Link |
+ |---|---|
+ | 🤗 HF Space (live environment) | Linked above (this Space) |
+ | 📓 Training Notebook (Colab) | [Open in Colab](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing) |
+ | 🎥 Dashboard Demo (full video) | [Google Drive](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing) |
+ | 🎬 Project Walkthrough (Loom) | [Loom](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2) |
+
+ ---
+
+ *Built for the OpenEnv Hackathon, April 2026.*