Update blog.md

#1
by Shree2604 - opened
Files changed (1)
  1. blog.md +14 -3
blog.md CHANGED
@@ -4,6 +4,14 @@
 
 ---
 
+The agent started inserting breaks before workers hit the burnout threshold, not after.
+
+We didn't program this. It emerged.
+
+That one observation — watching the model figure out something we never explicitly told it — is what this whole build is about.
+
+---
+
 There's something that always bugged me about productivity tools.
 
 They're really good at telling you *what* to do. Deadlines, priorities, due dates — all of it. But none of them actually care if you're running on four hours of sleep, three back-to-back meetings, and a mental tank that's nearly empty.
@@ -88,7 +96,7 @@ Getting the weights right took a few rounds. The energy penalty needed to be str
 
 We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.
 
-The full training notebook is here — one click, all dependencies handled, re-runnable end to end:
+The full training notebook is here — one click, all dependencies handled, re-runnable end to end against the live HF Space at `anonymousdevil-cognitive-load-manager.hf.space`:
 
 👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)
 
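Under the hood, that setup reduces to surprisingly little TRL code. Here's a minimal sketch, not the notebook verbatim: the prompts, the `team_reward` stand-in, and the exact Qwen checkpoint name are illustrative assumptions.

```python
# Minimal GRPO fine-tuning sketch with Hugging Face TRL.
# `team_reward` is a hypothetical stand-in; the actual notebook
# scores completions against the live environment on the Space.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def team_reward(completions, **kwargs):
    # Placeholder scoring: favor completions that schedule a break.
    # In the real setup the reward comes back from the environment.
    return [1.0 if "break" in c.lower() else 0.0 for c in completions]

dataset = Dataset.from_dict({
    "prompt": [
        "Worker A: energy 0.3, stress 0.8, task due in 2h. Next action?",
        "Worker B: energy 0.9, stress 0.2, no deadline today. Next action?",
    ]
})

config = GRPOConfig(
    output_dir="qwen-grpo-planner",
    num_generations=8,         # completions sampled per prompt (the GRPO "group")
    max_completion_length=64,
    learning_rate=1e-6,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed checkpoint; the post just says "Qwen 1.5B"
    reward_funcs=team_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

GRPO's trick is that it needs no separate value model: it samples a group of completions per prompt and normalizes rewards within the group, which is what makes a 1.5B model trainable in a single Colab session.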
@@ -113,6 +121,11 @@ The numbers came out better than we expected.
 |---|---|---|---|
 | Mean Reward | 0.101 | 0.265 | **+163%** |
 
+For context: a random baseline agent scores approximately 0.05. The untrained Qwen 1.5B baseline scores 0.101. Our trained agent at 0.265 is a **5× improvement over random** and a **+163% lift over the untrained baseline**.
+
+![Reward Curve](reward_curve.png)
+*Mean reward per training step — agent improves from 0.101 to 0.265 over 1000 steps. Shaded band shows min/max range per step.*
+
 Per-action reward breakdown after training:
 
 | Action | Reward (After) | What changed |
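The arithmetic behind those multiples is easy to verify from the table values (which are rounded):

```python
# Quick sanity check on the headline lifts (table values are rounded).
random_baseline = 0.05   # approximate random-policy mean reward
untrained = 0.101        # Qwen 1.5B before training
trained = 0.265          # after GRPO

print((trained - untrained) / untrained)  # ~1.62, i.e. roughly +162-163% depending on rounding
print(trained / random_baseline)          # 5.3, i.e. the ~5x over random
```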
@@ -127,7 +140,6 @@ Per-action reward breakdown after training:
 What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless the deadline pressure forced it. Neither of these were explicit rules — just costs in the reward function that the agent discovered on its own.
 
 See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
-
 👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)
 
 ---
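To make "costs in the reward function" concrete, here's a toy sketch of the kind of shaping at play. The field names and weights are made up for illustration; the tuned values live in the environment code.

```python
# Toy sketch of the reward shaping, NOT the environment's actual code.
# Field names and weights are illustrative assumptions.
def step_reward(worker, action, task_progress_delta):
    reward = task_progress_delta          # base signal: tasks moving forward

    # Energy penalty: pushing a tired worker gets expensive fast.
    if action == "assign_task":
        reward -= 0.5 * max(0.0, 0.4 - worker["energy"])

    # Burnout cliff: crossing the stress threshold dwarfs any task gain,
    # which is why the agent learned to insert breaks *before* it.
    if worker["stress"] >= 0.9:
        reward -= 1.0

    # Context-switch cost: reassigning mid-focus only pays off under
    # real deadline pressure.
    if action == "switch_task" and worker["in_focus"]:
        reward -= 0.2

    # Breaks trade a small immediate cost for recovered energy later.
    if action == "insert_break":
        reward -= 0.05

    return reward
```

Nothing in there says "take a break early" or "don't interrupt focus"; those behaviors fall out of the relative sizes of the penalties, which is exactly what the episode replays show.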
@@ -137,7 +149,6 @@ See the full episode replay, reward/step graphs, energy and stress curves, and t
 The environment is deployed as a Hugging Face Space — fully runnable, no local setup required. Judges can pull it directly from the link in the README, step through episodes, and interact with the API.
 
 For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
-
 👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)
 
 ---
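If you'd rather drive the environment over HTTP than through the dashboard, a rough sketch of a reset/step loop against the Space could look like the following. The route names and action schema here are assumptions for illustration; the README documents the real API.

```python
# Hedged sketch of driving the environment over HTTP.
# The /reset and /step routes and the action schema are assumptions;
# check the README for the routes the Space actually exposes.
import requests

BASE = "https://anonymousdevil-cognitive-load-manager.hf.space"

state = requests.post(f"{BASE}/reset", timeout=30).json()
for _ in range(10):
    action = {"action": "insert_break", "worker_id": 0}  # hypothetical schema
    state = requests.post(f"{BASE}/step", json=action, timeout=30).json()
    print(state.get("reward"), state.get("done"))
    if state.get("done"):
        break
```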
 