# Teaching a 0.5B Model to Be an Executive Assistant

*An OpenEnv Hackathon submission: built in 36 hours, trained in 90 minutes, debugged with three model collapses and one very bad bar chart.*

---

I want to start with the moment the bar chart loaded and I thought I'd lost the hackathon.

It was around 3pm on day two. I'd been training for an hour and a half on a free Colab T4. The cell finished, I ran the eval, and the plot popped up: baseline bars in red, trained bars in green. Easy task: 0.493 → 0.200. Medium: 0.348 → 0.200. Hard: 0.331 → 0.186.

The trained model was *worse*. Worse on every task, and landing on almost exactly the same score each time.

I'd never seen GRPO collapse before, but this was textbook. The model had given up exploring and found a single safe response that scored roughly 0.2 against my reward function regardless of input. All my training was, technically, optimization: just optimization toward "say the dumbest possible thing every time and never deviate."

This post is about what happened next. But first, the setup.

## The environment

I built **ExecAssist**, an OpenEnv environment that simulates an executive's morning inbox. An agent gets emails (sender, subject, body, priority), a calendar (existing meetings, working hours), and contacts. It has to produce a JSON action: an email reply, a calendar action (`book` / `propose_alternatives` / `reschedule` / `decline`), and the meeting details if booking.
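
For concreteness, an action for the easy task might look something like the sketch below. The four calendar action values and the `meeting_details` block come from the environment as described; the other field names (`email_reply`, `calendar_action`, `start`, `duration_minutes`, `attendees`), the example people, and the `validate_action` helper are illustrative guesses rather than the exact ExecAssist schema.

```python
import json

# The four calendar actions named above.
CALENDAR_ACTIONS = {"book", "propose_alternatives", "reschedule", "decline"}


def validate_action(raw: str) -> dict:
    """Parse a model output and check the basic shape of an action.

    Field names are hypothetical; only the action values and the
    `meeting_details` block are taken from the description above.
    """
    action = json.loads(raw)
    if action.get("calendar_action") not in CALENDAR_ACTIONS:
        raise ValueError(f"unknown calendar action: {action.get('calendar_action')!r}")
    if not isinstance(action.get("email_reply"), str):
        raise ValueError("email_reply must be a string")
    # Meeting details are only required when the agent actually books.
    if action["calendar_action"] == "book" and not action.get("meeting_details"):
        raise ValueError("booking requires meeting_details")
    return action


# A made-up action for illustration.
example = json.dumps({
    "email_reply": "Hi Priya, Tuesday at 10:00 works. I've booked 30 minutes. Best, Alex",
    "calendar_action": "book",
    "meeting_details": {
        "start": "2026-04-14T10:00",
        "duration_minutes": 30,
        "attendees": ["priya@example.com"],
    },
})

print(validate_action(example)["calendar_action"])  # book
```

A 0.5B model gets even this much wrong surprisingly often before training, which is part of why the untrained scores are so noisy.
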

Three tasks of escalating difficulty:

- **Easy:** single email, clear availability. *Don't mess up the basics.*
- **Medium:** the requested time conflicts with an existing meeting. *Spot it, propose alternatives.*
- **Hard:** three emails, multi-party coordination, priority conflicts. *Actually plan.*

Reward is a weighted blend of three independent graders: email quality (politeness markers, greeting/closing, sufficient detail), scheduling correctness (no double-booking, within working hours, appropriate duration), and conflict resolution (recognizes conflicts, proposes alternatives, explains professionally). Plus four anti-reward-hacking penalties: short emails, missing meeting details, generic phrasing, overly long responses.
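
The blend itself is simple. Here is a minimal sketch of how three grader scores and a penalty total could be combined; the weights, the `GraderScores` name, and the clamping to `[0, 1]` are assumptions for illustration, not the values ExecAssist actually uses.

```python
from dataclasses import dataclass


@dataclass
class GraderScores:
    # Each grader is assumed to return a score in [0, 1].
    email_quality: float
    scheduling: float
    conflict_resolution: float


# Hypothetical weights; the real ExecAssist weights may differ.
WEIGHTS = {"email_quality": 0.4, "scheduling": 0.4, "conflict_resolution": 0.2}


def blended_reward(scores: GraderScores, penalty: float = 0.0) -> float:
    """Weighted blend of the three graders, minus penalties, clamped to [0, 1].

    `penalty` is the positive magnitude of all penalties that fired.
    """
    blend = (
        WEIGHTS["email_quality"] * scores.email_quality
        + WEIGHTS["scheduling"] * scores.scheduling
        + WEIGHTS["conflict_resolution"] * scores.conflict_resolution
    )
    return max(0.0, min(1.0, blend - penalty))
```

Keeping the three scores as separate fields, rather than collapsing them straight into one number, is what makes it possible to log each component per rollout later.
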

The reason I went with multiple independent graders rather than one big scalar is something I'd read in the hackathon guide: *"if you only have a single reward signal, it is easier for the model to hack it."* I figured I'd build it the right way from day one.

I would later be very glad I did.

## The first run, and the collapse

I trained `Qwen2.5-0.5B-Instruct`, a tiny 500M-parameter model, using TRL's `GRPOTrainer`. First config:
```python
from trl import GRPOConfig

GRPOConfig(
    learning_rate=5e-6,
    num_train_epochs=1,
    num_generations=4,
    # no beta term
)
```

This is what produced the bar chart from hell. Looking at the training log afterward, I could see what happened: reward bounced around between 0.0 and 0.4 for the first 7 steps, peaked at 0.397, then *plummeted* and stabilized at 0.14 for the next 38 steps. The model had found "0.14 is a safe floor, don't try anything risky," and the gradient updates locked it in.

I didn't know what the fix was. I asked for help. The diagnosis was: no KL penalty (`beta=0`), so nothing was anchoring the trained policy to the base model, and it could drift to any degenerate local optimum. Plus the learning rate was too aggressive, and one epoch wasn't enough to recover from the bad starting trajectory.
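
One way to see why a collapsed policy tends to stay collapsed: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified version of that normalization, not TRL's actual implementation; the function name, the use of population standard deviation, and the `eps` value are assumptions.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Early in training: completions differ, so some get a positive advantage.
print(group_advantages([0.0, 0.4, 0.1, 0.3]))

# After collapse: every completion scores the same, so every advantage is zero.
print(group_advantages([0.14, 0.14, 0.14, 0.14]))
```

When every sample in a group earns the same reward, the advantages vanish and there is nothing left to push the policy off its safe floor, which fits the flat 0.14 stretch in the log. Sampling more completions per prompt makes a fully uniform group less likely.
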

The fix was four config changes:

```python
from trl import GRPOConfig

GRPOConfig(
    learning_rate=1e-6,   # 5x lower
    num_train_epochs=3,   # 3x longer
    num_generations=8,    # more variety per step
    beta=0.1,             # KL penalty against base model
)
```

I also added a "reload clean model" cell before training, because the weights in memory still carried the bad gradient updates from the first run and I didn't want to start from a broken policy.

Then I hit Run All and waited 90 minutes.

## The second run

I came back to find the cell still running: 218 of 270 steps. I ran the evaluation cells anyway (the trained weights were already in memory) and held my breath while the bars rendered.

Easy task: 0.345 → **0.995**.

Medium: 0.227 → **0.745**.

Hard: 0.249 → **0.737**.

I made a noise.

Nine out of ten samples on the easy task scored a perfect 1.0. The trained model wasn't getting lucky; it had learned the *structure* of the task. Baseline scores ranged from 0.0 to 0.65 (the model was rolling dice). Trained scores sat in a tight band: 0.95 to 1.0 on easy, 0.68 to 0.82 on medium, 0.65 to 0.80 on hard.

The training curve told the same story. Reward in the first quartile of training averaged 0.390. Last quartile averaged 0.648. A 66% lift *during* training, on top of the much larger gap between trained and untrained at evaluation.
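
The quartile comparison is easy to reproduce from any per-step reward log. A small sketch, where `quartile_lift` is a hypothetical helper name and the two means plugged in at the end are the figures quoted above:

```python
def quartile_lift(rewards: list[float]) -> tuple[float, float, float]:
    """Mean reward over the first and last quarter of steps, and the relative lift."""
    quarter = max(1, len(rewards) // 4)
    first = sum(rewards[:quarter]) / quarter
    last = sum(rewards[-quarter:]) / quarter
    return first, last, (last - first) / first


# Checking the reported numbers directly.
first_quartile = 0.390
last_quartile = 0.648
lift = (last_quartile - first_quartile) / first_quartile
print(f"{lift:.0%}")  # 66%
```
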

*[Figure: training reward curve]*

## The interesting part: the model tried to cheat

Because I had multiple independent reward components instead of a single scalar, I could see exactly *how* the model tried to game the reward during training. Going through the early-step rollouts:

- **Around step 8-15:** the model started outputting just the JSON action with no actual email body. The short-email penalty (-0.30 if `< 20` words) caught this. After ~30 steps, every output had a real email.
- **Around step 25:** it tried scheduling meetings at 8am or 6pm, outside working hours. The scheduling-correctness grader returned 0 for the `within_working_hours` check. The model learned working hours within the next 50 steps.
- **Around step 50:** generic templated phrasing started showing up: "Thank you for your email. I will check the calendar and get back to you." Vague, polite, but useless. The generic-phrasing detector docked these. Specific responses came back.
- **Throughout:** occasionally the model would "forget" the `meeting_details` block entirely. The -0.40 penalty for missing meeting details made this a non-strategy fast.
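
The checks behind those bullets can be sketched in a few lines. The penalty values (-0.30, -0.40) and the 20-word threshold are the ones quoted above; the 9:00 to 17:00 working day, the function names other than `within_working_hours`, the `calendar_action` field name, and the choice to require `meeting_details` only for `book` actions are assumptions.

```python
from datetime import datetime, time, timedelta

# Assumed working hours; 8am and 6pm starts both fall outside them.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def short_email_penalty(email: str) -> float:
    """-0.30 when the reply has fewer than 20 words."""
    return -0.30 if len(email.split()) < 20 else 0.0


def missing_details_penalty(action: dict) -> float:
    """-0.40 when a booking has no `meeting_details` block."""
    needs_details = action.get("calendar_action") == "book"
    return -0.40 if needs_details and not action.get("meeting_details") else 0.0


def within_working_hours(start: datetime, duration_minutes: int) -> bool:
    """True when the whole meeting fits inside the working day."""
    end = start + timedelta(minutes=duration_minutes)
    return (
        start.date() == end.date()
        and WORK_START <= start.time()
        and end.time() <= WORK_END
    )
```

Because each check returns its own number, a per-step log of which ones fired is enough to reconstruct the timeline above.
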

This is the part most submissions don't have. People say *"we have anti-reward-hacking penalties"* in their writeups. Showing the penalties firing during real training, on a real curve, is rare. And it's the difference between *claiming* a multi-grader rubric works and *demonstrating* it does.

## Why this matters beyond the numbers

There's a sanity check I'd run earlier in the day. Same three tasks, same scoring, but evaluated against an untuned **Nemotron 120B** model called via OpenRouter through the standard `inference.py` baseline. It scored an average of **0.337** across the three tasks.

After 90 minutes of GRPO, a model **240× smaller** is hitting **0.83 average** on the same environment. On a free Colab T4. With a $0 cloud bill.

That's the point of training-environment design. A well-shaped reward signal, a multi-grader rubric that can't be easily hacked, and a small model that's allowed to actually train against it: together they give you a result that, on this specific task, beats a frontier model running unmodified.

I think there's a research-shaped argument here. Frontier LLMs are notoriously bad at structured calendar reasoning (try asking any production agent to find a 30-minute slot that doesn't conflict with your standups). ExecAssist isolates that specific failure mode into a tractable RL target, with a reward signal that's hard to game. The result suggests that for a class of structured personal-task workflows, task-specific RL on small models is a legitimate alternative to scaling up. That's a workshop paper, maybe.

## Try it yourself

- **Live environment:** [`devanshudon-exec-assist.hf.space/docs`](https://devanshudon-exec-assist.hf.space/docs), where you can interact with the API directly via Swagger. Hit `POST /reset?task=easy` then `POST /step` with your action.
- **Baseline:** `python inference.py` reproduces the untrained scores (~0.32 average).
- **Training:** the Colab notebook is in the repo. Set runtime to T4, Run All, ~50 minutes including evaluation.
- **Repo:** all the code, the working hyperparameters, the broken hyperparameters, the results JSON, and this writeup.
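
If you'd rather script the live environment than click through Swagger, something like the following should be close. The base URL and the two endpoints are the ones listed above; the JSON body shape for `/step` and the helper names are assumptions, so check the `/docs` page for the real schema before relying on it.

```python
import json
import urllib.request

BASE_URL = "https://devanshudon-exec-assist.hf.space"


def build_reset_request(task: str) -> urllib.request.Request:
    """POST /reset?task=<task> starts a fresh episode."""
    return urllib.request.Request(f"{BASE_URL}/reset?task={task}", method="POST")


def build_step_request(action: dict) -> urllib.request.Request:
    """POST /step with the action as a JSON body (body shape is an assumption)."""
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_reset_request("easy"))` and then `send(build_step_request(action))` walks through one step of an episode.
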

The environment is genuinely hard in interesting ways. Try to break it. I'd be curious what the model learns to game next.

---

*Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks. Models, environment, and training results all on the [HF Space](https://huggingface.co/spaces/DevanshuDon/exec-assist).*