M ShreeRaj committed on
Commit a726b9d · unverified · 1 Parent(s): 44963dd

Revise blog.md with new insights and links

Updated blog content to reflect observations about the agent's behavior and improved training notebook link.

Files changed (1)
  1. blog.md +15 -4
blog.md CHANGED
@@ -4,6 +4,14 @@
4
 
5
  ---
6
 
 
 
 
 
 
 
 
 
7
  There's something that always bugged me about productivity tools.
8
 
9
  They're really good at telling you *what* to do. Deadlines, priorities, due dates: all of it. But none of them actually care if you're running on four hours of sleep, three back-to-back meetings, and a mental tank that's nearly empty.
@@ -88,7 +96,7 @@ Getting the weights right took a few rounds. The energy penalty needed to be str
88
 
89
  We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.
90
 
91
- The full training notebook is here (one click, all dependencies handled, re-runnable end to end):
92
 
93
  👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)
94
 
@@ -113,6 +121,11 @@ The numbers came out better than we expected.
113
  |---|---|---|---|
114
  | Mean Reward | 0.101 | 0.265 | **+163%** |
115
 
 
 
 
 
 
116
  Per-action reward breakdown after training:
117
 
118
  | Action | Reward (After) | What changed |
@@ -127,7 +140,6 @@ Per-action reward breakdown after training:
127
  What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these were explicit rules, just costs in the reward function that the agent discovered on its own.
128
 
129
  See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
130
-
131
  👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
132
 
133
  ---
@@ -137,7 +149,6 @@ See the full episode replay, reward/step graphs, energy and stress curves, and t
137
  The environment is deployed as a Hugging Face Space: fully runnable, no local setup required. Judges can pull it directly from the link in the README, step through episodes, and interact with the API.
138
 
139
  For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
140
-
141
  👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
142
 
143
  ---
@@ -173,4 +184,4 @@ We'd also add worker personalization. Right now all three workers share the same
173
 
174
  ---
175
 
176
- *Built for the OpenEnv Hackathon, April 2026.*
 
4
 
5
  ---
6
 
7
+ The agent started inserting breaks before workers hit the burnout threshold, not after.
8
+
9
+ We didn't program this. It emerged.
10
+
11
+ That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
12
+
13
+ ---
14
+
15
  There's something that always bugged me about productivity tools.
16
 
17
  They're really good at telling you *what* to do. Deadlines, priorities, due dates: all of it. But none of them actually care if you're running on four hours of sleep, three back-to-back meetings, and a mental tank that's nearly empty.
 
96
 
97
  We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.
98
 
99
+ The full training notebook is here (one click, all dependencies handled, re-runnable end to end against the live HF Space at `anonymousdevil-cognitive-load-manager.hf.space`):
100
 
101
  👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)
102
 
 
121
  |---|---|---|---|
122
  | Mean Reward | 0.101 | 0.265 | **+163%** |
123
 
124
+ For context: a random baseline agent scores approximately 0.05. The untrained Qwen 1.5B baseline scores 0.101. Our trained agent at 0.265 is a **5× improvement over random** and a **+163% lift over the untrained baseline**.
125
+
126
+ ![Reward Curve](reward_curve.png)
127
+ *Mean reward per training step: agent improves from 0.101 to 0.265 over 1000 steps. Shaded band shows min/max range per step.*
128
+
129
  Per-action reward breakdown after training:
130
 
131
  | Action | Reward (After) | What changed |
 
140
  What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these were explicit rules, just costs in the reward function that the agent discovered on its own.
141
 
142
  See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
 
143
  👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
144
 
145
  ---
 
149
  The environment is deployed as a Hugging Face Space: fully runnable, no local setup required. Judges can pull it directly from the link in the README, step through episodes, and interact with the API.
150
 
151
  For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
 
152
  👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
153
 
154
  ---
 
184
 
185
  ---
186
 
187
+ *Built for the OpenEnv Hackathon, April 2026.*
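
The reward comparison added in this commit (random baseline of about 0.05, untrained baseline 0.101, trained agent 0.265) can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names `relative_lift` and `fold_improvement` are hypothetical and not taken from the training notebook, and the inputs are the rounded figures quoted in the post. From those rounded values the lift works out to roughly +162%, so the reported +163% presumably comes from unrounded means (an assumption, since the raw numbers are not shown in the diff).

```python
def relative_lift(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (1.0 means +100%)."""
    return (after - before) / before


def fold_improvement(baseline: float, score: float) -> float:
    """How many times larger `score` is than `baseline`."""
    return score / baseline


# Rounded figures quoted in the blog post; the random baseline is
# described there only as "approximately 0.05".
RANDOM_BASELINE = 0.05
UNTRAINED = 0.101
TRAINED = 0.265

lift = relative_lift(UNTRAINED, TRAINED)
fold = fold_improvement(RANDOM_BASELINE, TRAINED)

print(f"lift over untrained baseline: {lift:+.1%}")  # +162.4%
print(f"improvement over random: {fold:.1f}x")       # 5.3x
```

Keeping the two ratios as separate helpers makes it explicit that "5×" is a ratio of scores while "+163%" is a relative change, which are easy to conflate when reading the table.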