Update blog.md
#1
by Shree2604 - opened

blog.md CHANGED

@@ -4,6 +4,14 @@

---

+The agent started inserting breaks before workers hit the burnout threshold, not after.
+
+We didn't program this. It emerged.
+
+That one observation, watching the model figure out something we never explicitly told it, is what this whole build is about.
+
+---
+
There's something that always bugged me about productivity tools.

They're really good at telling you *what* to do. Deadlines, priorities, due dates - all of it. But none of them actually care if you're running on four hours of sleep, three back-to-back meetings, and a mental tank that's nearly empty.
@@ -88,7 +96,7 @@ Getting the weights right took a few rounds. The energy penalty needed to be str

We trained using **Hugging Face TRL with GRPO-based reinforcement learning** on a **Qwen 1.5B** base model.

-The full training notebook is here - one click, all dependencies handled, re-runnable end to end:
+The full training notebook is here - one click, all dependencies handled, re-runnable end to end against the live HF Space at `anonymousdevil-cognitive-load-manager.hf.space`:

👉 [https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing](https://colab.research.google.com/drive/1_OoW4iH1acCni0H9POCcX2pp-6bOorzo?usp=sharing)
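For readers who want the shape of that setup without opening the notebook, here is a minimal sketch of a GRPO run with TRL. Everything in it is an illustrative stand-in, not the notebook's actual code: the exact Qwen checkpoint name is assumed (the post only says "Qwen 1.5B"), the prompts are toy serializations of environment state, and the real reward is wired to the environment's scoring rather than a keyword check.

```python
# Minimal GRPO training sketch with TRL - illustrative, not the notebook's code.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompts standing in for serialized environment states.
train_dataset = Dataset.from_dict(
    {"prompt": ["Worker A: energy 0.3, stress 0.8. Choose the next action."] * 64}
)

def reward_fn(completions, **kwargs):
    # Placeholder reward: favor completions that schedule a break
    # for an already-drained worker.
    return [1.0 if "break" in c.lower() else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed checkpoint
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="grpo-cognitive-load", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```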
@@ -113,6 +121,11 @@ The numbers came out better than we expected.
|---|---|---|---|
| Mean Reward | 0.101 | 0.265 | **+163%** |

+For context: a random baseline agent scores approximately 0.05. The untrained Qwen 1.5B baseline scores 0.101. Our trained agent at 0.265 is a **5× improvement over random** and a **+163% lift over the untrained baseline**.
+
+📈
+*Mean reward per training step - agent improves from 0.101 to 0.265 over 1000 steps. Shaded band shows min/max range per step.*
+
Per-action reward breakdown after training:

| Action | Reward (After) | What changed |
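The headline numbers follow directly from the raw scores. A quick check with the rounded values from the table:

```python
# Sanity check on the reported lifts, using the rounded scores above.
random_score = 0.05   # approximate random-agent baseline
untrained = 0.101     # untrained Qwen 1.5B
trained = 0.265       # after GRPO training

print(f"Lift over untrained: {(trained - untrained) / untrained:+.0%}")
# -> +162% with these rounded inputs; the post's +163% comes from unrounded scores
print(f"Multiple of random: {trained / random_score:.1f}x")
# -> 5.3x, i.e. roughly the 5x claimed
```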
@@ -127,7 +140,6 @@ Per-action reward breakdown after training:
What we didn't program but observed: the agent started inserting breaks *before* workers hit the burnout threshold, not after. It also stopped switching workers away from tasks they were mid-focus on unless deadline pressure forced it. Neither of these was an explicit rule - both were just costs in the reward function that the agent discovered on its own.

See the full episode replay, reward/step graphs, energy and stress curves, and task progress live in the dashboard demo:
-
👉 [https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing](https://drive.google.com/file/d/149dz_1rIlXv-eR1fwYaxRJ-cV0mQNevJ/view?usp=sharing)

---
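To make that concrete, here is a hedged sketch of the kind of shaping that produces both behaviors. The field names, threshold, and weights are illustrative guesses; the post doesn't publish the environment's actual reward code.

```python
from dataclasses import dataclass

@dataclass
class State:
    task_progress_delta: float  # progress made this step
    worker_energy: float        # 0.0 (drained) .. 1.0 (fresh)
    worker_in_focus: bool       # worker is mid-focus on a task

BURNOUT_THRESHOLD = 0.3  # illustrative value

def step_reward(state: State, action: str) -> float:
    reward = state.task_progress_delta  # base signal: tasks moving forward

    # Energy cost: pushing a drained worker bleeds reward, so scheduling a
    # break *before* the threshold is crossed becomes the higher-value move.
    if action != "break" and state.worker_energy < BURNOUT_THRESHOLD:
        reward -= 0.5 * (BURNOUT_THRESHOLD - state.worker_energy)

    # Context-switch cost: reassigning a mid-focus worker is penalized, so
    # switches only happen when deadline pressure outweighs this cost.
    if action == "reassign" and state.worker_in_focus:
        reward -= 0.2

    return reward

# A tired, focused worker: reassigning scores worse than taking a break.
print(f"{step_reward(State(0.1, 0.2, True), 'reassign'):.2f}")  # -0.15
print(f"{step_reward(State(0.0, 0.2, True), 'break'):.2f}")     # 0.00
```

Nothing in this sketch says "insert breaks early" or "don't interrupt focus"; those policies are just the reward-maximizing response to the two penalty terms.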
@@ -137,7 +149,6 @@ See the full episode replay, reward/step graphs, energy and stress curves, and t
The environment is deployed as a Hugging Face Space - fully runnable, no local setup required. Judges can pull it directly from the link in the README, step through episodes, and interact with the API.

For a quick walkthrough of what the environment does and what we trained, the Loom covers it in under two minutes:
-
👉 [https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2](https://www.loom.com/share/7c7293efa0ba459ba2de243b0b5aacb2)

---
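If you'd rather drive the Space from a script than the UI, a client can be as small as the sketch below. One heavy caveat: the base URL comes from the post, but the `/reset` and `/step` routes are my assumption of a typical RL-environment HTTP API, so check the README for the real endpoints.

```python
# Hypothetical client for the Space's HTTP API. The base URL is from the
# post; the /reset and /step routes are assumptions, not documented here.
import requests

BASE = "https://anonymousdevil-cognitive-load-manager.hf.space"

state = requests.post(f"{BASE}/reset").json()  # assumed: start a fresh episode
for _ in range(5):
    step = requests.post(f"{BASE}/step", json={"action": "break"}).json()  # assumed route
    print(step.get("reward"), step.get("done"))
    if step.get("done"):
        break
```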