# Teaching a 0.5B Model to Be an Executive Assistant

*An OpenEnv Hackathon submission: built in 36 hours, trained in 90 minutes.*

---

## The setup

Every executive's inbox is the same problem on repeat. A meeting request comes in. There's a calendar conflict you have to spot. You write a polite reply, propose a time that actually works, and don't double-book anyone. It's not hard, just *fiddly*, and LLMs are surprisingly bad at it because the task fuses three separate skills: structured calendar reasoning, professional tone, and conflict resolution.

I built **ExecAssist**, an OpenEnv environment that simulates exactly that loop, and trained a small model on it with GRPO. Evaluation scores roughly tripled on all three tasks.

## The environment

ExecAssist gives an agent a realistic snapshot of an executive's morning: incoming emails (with sender, subject, body, priority), the existing calendar, working hours, and contact info. The agent has to produce a single JSON action containing a written email reply, a calendar action (`book` / `propose_alternatives` / `reschedule` / `decline`), and, if booking, the meeting details.
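
To make the action format concrete, here is a minimal sketch of how such an action could be modeled and serialized. The field names (`email_reply`, `calendar_action`, `meeting`, `title`, `start`, `end`) are assumptions for illustration; the environment's real schema is whatever its `/docs` page specifies. Only the four calendar action names come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# The four calendar actions named in the post.
VALID_CALENDAR_ACTIONS = {"book", "propose_alternatives", "reschedule", "decline"}


@dataclass
class MeetingDetails:
    # Hypothetical fields: the real schema may differ.
    title: str
    start: str  # ISO 8601 timestamp, e.g. "2026-04-20T10:00"
    end: str


@dataclass
class Action:
    # Hypothetical field names for the three parts of an action.
    email_reply: str
    calendar_action: str
    meeting: Optional[MeetingDetails] = None

    def validate(self) -> None:
        """Reject unknown calendar actions and bookings without details."""
        if self.calendar_action not in VALID_CALENDAR_ACTIONS:
            raise ValueError(f"unknown calendar action: {self.calendar_action!r}")
        if self.calendar_action == "book" and self.meeting is None:
            raise ValueError("a 'book' action needs meeting details")

    def to_json(self) -> str:
        """Validate, then serialize to a single JSON object."""
        self.validate()
        return json.dumps(asdict(self))


action = Action(
    email_reply="Hi Sam, Monday at 10:00 works. I have sent an invite.",
    calendar_action="book",
    meeting=MeetingDetails("Project sync", "2026-04-20T10:00", "2026-04-20T10:30"),
)
print(action.to_json())
```

Validating before serializing mirrors the kind of check a grader would likely apply: a booking with no meeting details is malformed regardless of how good the email reads.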

There are three tasks of escalating difficulty:

- **Easy**: single email, clear availability. *Just don't mess up the basics.*
- **Medium**: the requested time conflicts with an existing meeting. *Spot it, propose alternatives.*
- **Hard**: three emails, multi-party coordination, priority conflicts. *Actually plan.*
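
The medium task boils down to an interval-overlap check followed by a search for free slots. The sketch below is one simple way to do that, not the environment's own logic: the calendar representation (a list of `(start, end)` datetime pairs), the 30-minute step, and the function names are all assumptions.

```python
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b) -> bool:
    """True if two half-open intervals [start, end) share any time."""
    return start_a < end_b and start_b < end_a


def propose_alternatives(calendar, work_start, work_end, duration,
                         step=timedelta(minutes=30), limit=3):
    """Scan working hours in fixed steps and collect the first free slots."""
    slots = []
    candidate = work_start
    while candidate + duration <= work_end and len(slots) < limit:
        candidate_end = candidate + duration
        if not any(overlaps(candidate, candidate_end, s, e) for s, e in calendar):
            slots.append((candidate, candidate_end))
        candidate += step
    return slots


day = datetime(2026, 4, 20)
existing = [(day.replace(hour=10), day.replace(hour=11))]  # one booked meeting

# A request for 10:30 to 11:00 collides with the 10:00 to 11:00 meeting.
requested = (day.replace(hour=10, minute=30), day.replace(hour=11))
print(overlaps(*requested, *existing[0]))

for start, end in propose_alternatives(
    existing, day.replace(hour=9), day.replace(hour=17), timedelta(minutes=30)
):
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))
```

Treating intervals as half-open means a meeting ending at 10:00 and another starting at 10:00 do not count as a conflict, which matches how back-to-back meetings are normally scheduled.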

Reward is a weighted blend of three independent graders (email quality, scheduling correctness, conflict resolution) plus four anti-reward-hacking penalties: short emails, missing meeting details, generic templated replies, and overly long responses all get docked. Multiple independent reward functions make the signal hard to game, which matters because GRPO will absolutely find any shortcut you leave open.
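
A reward of this shape can be sketched in a few lines. Everything numeric here is an assumption made for illustration: the weights, the flat 0.1 penalty, the word-count thresholds, and the list of "generic" phrases are not taken from ExecAssist, and the grader scores are passed in rather than computed.

```python
# Hypothetical weights for the three graders; the post does not state them.
WEIGHTS = {"email_quality": 0.4, "scheduling": 0.4, "conflict_resolution": 0.2}

# Hypothetical penalty size and thresholds.
PENALTY = 0.1
MIN_WORDS = 20
MAX_WORDS = 300
GENERIC_PHRASES = ("thank you for your email", "i hope this finds you well")


def penalties(action: dict) -> float:
    """Sum the four anti-reward-hacking penalties that apply to an action."""
    reply = action.get("email_reply", "")
    word_count = len(reply.split())
    total = 0.0
    if word_count < MIN_WORDS:  # short email
        total += PENALTY
    if word_count > MAX_WORDS:  # overly long response
        total += PENALTY
    if action.get("calendar_action") == "book" and not action.get("meeting"):
        total += PENALTY  # missing meeting details
    if any(phrase in reply.lower() for phrase in GENERIC_PHRASES):
        total += PENALTY  # generic templated reply
    return total


def total_reward(scores: dict, action: dict) -> float:
    """Weighted blend of grader scores minus penalties, clamped to [0, 1]."""
    blended = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return max(0.0, min(1.0, blended - penalties(action)))


perfect_scores = {"email_quality": 1.0, "scheduling": 1.0, "conflict_resolution": 1.0}
lazy_action = {"email_reply": "ok", "calendar_action": "book"}
print(total_reward(perfect_scores, lazy_action))
```

The point of the structure is that a policy cannot maximize one grader in isolation: a two-word reply that happens to book the right slot still loses reward to the length and missing-details penalties.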

## The training run

I trained `Qwen2.5-0.5B-Instruct`, a tiny 500M-parameter model, using TRL's `GRPOTrainer` for 3 epochs over 90 collected scenarios on a free Colab T4. Total training time: **about 90 minutes.**

Three config choices mattered a lot. The first run (1 epoch, lr=5e-6, no KL term) **collapsed**: the model found a single safe response that scored 0.2 every time and refused to explore further. Classic GRPO failure mode. Adding `beta=0.1` (KL penalty against the base model), dropping the learning rate to `1e-6`, and bumping `num_generations` to 8 fixed it cleanly.
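
One way to see why that collapse is sticky: GRPO scores each completion relative to the other completions sampled for the same prompt, so a group whose rewards are all identical produces zero advantage and therefore no learning signal. The sketch below shows only that normalization step in plain Python. It is a simplification, not TRL's implementation: it uses the population standard deviation and an assumed `eps`, and the "diverse" rewards are made up.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of sampled completions."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Collapsed policy: all 8 generations return the same safe 0.2 response.
collapsed = group_advantages([0.2] * 8)

# Healthy exploration: the 8 generations earn different rewards (made-up values).
diverse = group_advantages([0.0, 0.2, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0])

print([round(a, 3) for a in collapsed])
print([round(a, 3) for a in diverse])
```

With every advantage at zero, the policy gradient vanishes and the model has no reason to leave the safe response. Sampling more generations per prompt makes it more likely that at least one completion differs, and a gentler learning rate with a KL term makes it harder to fall into the collapsed state in the first place.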

## The result

Mean reward across 10 evaluation samples per task:

| Task   | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
|--------|---------------------------|----------------|-------------|
| Easy   | 0.345                     | **0.995**      | **+188%**   |
| Medium | 0.227                     | **0.745**      | **+228%**   |
| Hard   | 0.249                     | **0.737**      | **+196%**   |
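
The improvement column is the relative change from baseline to trained score. The snippet below recomputes it from the two score columns in the table; the helper name is just for illustration.

```python
# (baseline, trained) mean rewards copied from the table above.
RESULTS = {
    "easy": (0.345, 0.995),
    "medium": (0.227, 0.745),
    "hard": (0.249, 0.737),
}


def relative_improvement(baseline: float, trained: float) -> float:
    """Percentage change from baseline to trained."""
    return (trained - baseline) / baseline * 100


for task, (baseline, trained) in RESULTS.items():
    print(f"{task}: {relative_improvement(baseline, trained):+.0f}%")
# easy: +188%
# medium: +228%
# hard: +196%
```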

The headline isn't just the means; it's the **variance collapse**. Baseline scores on the easy task ranged from 0.0 to 0.65 (the model rolled the dice). Trained scores: nine out of ten samples hit exactly 1.0. That's the model learning *how the task works*, not just getting lucky.

The training curve tells the same story. Reward oscillates between 0.1 and 0.4 in the first 50 steps, climbs steadily through the middle, and stabilizes between 0.6 and 0.9 in the final third. First quartile mean: 0.390. Last quartile mean: **0.648**. A 66% lift *during* training, on top of the much larger gap between the trained model and the untrained one at evaluation time.
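
For anyone reproducing the quartile comparison from their own run, a sketch like the following would do it. The per-step reward log is not included in the post, so the list passed to `quartile_means` below is made-up sample data; only the two reported means (0.390 and 0.648) are real.

```python
from statistics import fmean


def quartile_means(rewards):
    """Mean reward over the first and last quarter of the logged steps."""
    quarter = max(1, len(rewards) // 4)
    return fmean(rewards[:quarter]), fmean(rewards[-quarter:])


def lift(first: float, last: float) -> float:
    """Relative change from the first-quartile mean to the last, in percent."""
    return (last - first) / first * 100


# Made-up reward log, only to show the call shape.
sample_log = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
first_q, last_q = quartile_means(sample_log)
print(f"sample log: {first_q:.2f} -> {last_q:.2f}")

# The means reported in the post.
print(f"reported lift: {lift(0.390, 0.648):.0f}%")  # reported lift: 66%
```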

![Training curve](training_curve.png)

## Why this is interesting beyond the numbers

There's a nice secondary result hiding in here. As a sanity check, I'd previously run the same three tasks against a frontier free-tier model (Nemotron 120B via OpenRouter, with no task-specific training) using the original LLM-judge reward path. It scored an average of **0.337** across the three tasks. After 90 minutes of GRPO, a model **240× smaller** is hitting **0.83 average** on the same environment.
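
Both headline figures follow from numbers already given. The check below assumes the nominal parameter counts implied by the model names (120B and 0.5B) and averages the three trained scores from the results table.

```python
from statistics import fmean

# Trained mean rewards for easy, medium, hard, from the results table.
trained_scores = [0.995, 0.745, 0.737]

# Nominal parameter counts implied by the model names (an assumption).
frontier_params = 120e9
small_params = 0.5e9

print(round(fmean(trained_scores), 2))  # 0.83
print(frontier_params / small_params)   # 240.0
```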

That's the point of training-environment design. A well-shaped reward signal lets a tiny model beat a frontier model on a specific task, in under two hours, on free hardware.

## Try it yourself

- **Live environment:** [`devanshudon-exec-assist.hf.space/docs`](https://devanshudon-exec-assist.hf.space/docs), where you can interact with the API directly via Swagger
- **Tasks:** `POST /reset?task=easy|medium|hard` then `POST /step` with your action
- **Baseline:** `python inference.py` reproduces the untrained scores
- **Training:** the Colab notebook is in the repo; set runtime to T4, run all, ~50 minutes
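
The two endpoints listed above can be driven from a short standard-library client. This is a sketch under assumptions: the `/reset` and `/step` paths and the `task` query parameter come from the list above, but the JSON body sent to `/step` (field names and nesting) is a guess, so check the Swagger page for the real schema before sending anything. The helpers build requests without sending them; `send` performs the network call.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://devanshudon-exec-assist.hf.space"
TASKS = {"easy", "medium", "hard"}


def reset_request(task: str) -> Request:
    """Build the POST /reset?task=... request for one of the three tasks."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {sorted(TASKS)}")
    return Request(f"{BASE_URL}/reset?{urlencode({'task': task})}",
                   data=b"", method="POST")


def step_request(action: dict) -> Request:
    """Build the POST /step request. The body shape is an assumption."""
    body = json.dumps(action).encode("utf-8")
    return Request(f"{BASE_URL}/step", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


def send(request: Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the JSON response (needs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical action payload; field names are illustrative only.
    action = {
        "email_reply": "Hi Sam, that slot is taken. Could we do 11:00 instead?",
        "calendar_action": "propose_alternatives",
    }
    print(reset_request("medium").full_url)
    print(step_request(action).get_method())
    # To actually run an episode:
    # observation = send(reset_request("medium"))
    # result = send(step_request(action))
```

Keeping request construction separate from sending makes it easy to inspect exactly what will go over the wire, which helps when the Space is asleep and the first call times out.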

The environment is genuinely hard in interesting ways. Feel free to break it.

---

*Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks.*