# Teaching a 0.5B Model to Be an Executive Assistant

*An OpenEnv Hackathon submission — built in 36 hours, trained in 90 minutes, debugged with three model collapses and one very bad bar chart.*

---

I want to start with the moment the bar chart loaded and I thought I'd lost the hackathon.

It was around 3pm on day two. I'd been training for an hour and a half on a free Colab T4. The cell finished, I ran the eval, and the plot popped up: baseline bars in red, trained bars in green. Easy task: 0.493 → 0.200. Medium: 0.348 → 0.200. Hard: 0.331 → 0.186.

The trained model was *worse*. Worse in almost exactly the same way on every task: 0.200, 0.200, 0.186, no matter where the baseline had started.

I'd never seen GRPO collapse before, but this was textbook. The model had given up exploring and found a single safe response that scored about 0.2 against my reward function regardless of input. All my training was, technically, optimization. It was just optimization toward "say the dumbest possible thing every time and never deviate."

This post is about what happened next. But first, the setup.

## The environment

I built **ExecAssist**, an OpenEnv environment that simulates an executive's morning inbox. An agent gets emails (sender, subject, body, priority), a calendar (existing meetings, working hours), and contacts. It has to produce a JSON action: an email reply, a calendar action (`book` / `propose_alternatives` / `reschedule` / `decline`), and the meeting details if booking.
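To make the action format concrete, here is a minimal sketch of what such a JSON action could look like in Python. The field names `email_reply` and `calendar_action`, and the fields inside `MeetingDetails`, are assumptions for illustration and not the environment's confirmed schema; only the four calendar actions and the `meeting_details` name come from the description above.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

# The four calendar actions named in the environment description.
CALENDAR_ACTIONS = {"book", "propose_alternatives", "reschedule", "decline"}


@dataclass
class MeetingDetails:
    # Hypothetical fields; the real schema may differ.
    title: str
    start: str              # ISO 8601 timestamp, e.g. "2026-04-20T10:00"
    duration_minutes: int


@dataclass
class Action:
    email_reply: str        # hypothetical field name
    calendar_action: str    # hypothetical field name
    meeting_details: Optional[MeetingDetails] = None

    def validate(self) -> list:
        """Return a list of human-readable problems (empty if the action looks sane)."""
        problems = []
        if self.calendar_action not in CALENDAR_ACTIONS:
            problems.append(f"unknown calendar_action: {self.calendar_action}")
        if self.calendar_action == "book" and self.meeting_details is None:
            problems.append("booking requires meeting_details")
        return problems


action = Action(
    email_reply=(
        "Hi Dana, thanks for reaching out. Monday at 10am works well, "
        "so I have booked 30 minutes for the budget sync. Best regards, Alex"
    ),
    calendar_action="book",
    meeting_details=MeetingDetails("Budget sync", "2026-04-20T10:00", 30),
)
payload = json.dumps(asdict(action))  # the JSON string an agent would emit
```

Validating the structure separately from grading it keeps malformed output distinguishable from a well-formed but poor decision.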

Three tasks of escalating difficulty:

- **Easy** — single email, clear availability. *Don't mess up the basics.*
- **Medium** — the requested time conflicts with an existing meeting. *Spot it, propose alternatives.*
- **Hard** — three emails, multi-party coordination, priority conflicts. *Actually plan.*

Reward is a weighted blend of three independent graders — email quality (politeness markers, greeting/closing, sufficient detail), scheduling correctness (no double-booking, within working hours, appropriate duration), and conflict resolution (recognizes conflicts, proposes alternatives, explains professionally). Plus four anti-reward-hacking penalties: short emails, missing meeting details, generic phrasing, overly long responses.
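A weighted blend with additive penalties can be sketched in a few lines. The weights below are placeholders (the real ones live in the repo), and clamping the result to `[0, 1]` is an assumption on my part; the point is only that each grader stays a separate, inspectable number until the very last step.

```python
# Hypothetical weights for illustration; not the values ExecAssist actually uses.
WEIGHTS = {
    "email_quality": 0.3,
    "scheduling": 0.4,
    "conflict_resolution": 0.3,
}


def blended_reward(grader_scores, penalties):
    """Combine per-grader scores (each in [0, 1]) with additive penalties.

    `penalties` is a list of non-positive numbers, one per penalty that fired.
    The final clamp to [0, 1] is an assumption, not confirmed behaviour.
    """
    base = sum(WEIGHTS[name] * grader_scores[name] for name in WEIGHTS)
    total = base + sum(penalties)
    return max(0.0, min(1.0, total))


# A response that schedules correctly but writes a weak, too-short email.
example = blended_reward(
    {"email_quality": 0.2, "scheduling": 1.0, "conflict_resolution": 1.0},
    penalties=[-0.30],
)
```

Keeping the per-grader scores around (and logging them) is what later makes it possible to see which component a degenerate policy is exploiting.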

The reason I went with multiple independent graders rather than one big scalar is something I'd read in the hackathon guide: *"if you only have a single reward signal, it is easier for the model to hack it."* I figured I'd build it the right way from day one.

I would later be very glad I did.

## The first run, and the collapse

I trained `Qwen2.5-0.5B-Instruct` — a tiny 500M-parameter model — using TRL's `GRPOTrainer`. First config:

```python
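from trl import GRPOConfig

first_run_config = GRPOConfig(
    learning_rate=5e-6,
    num_train_epochs=1,
    num_generations=4,   # completions sampled per prompt
    # no beta term (no KL penalty in this run)
)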
```

This is what produced the bar chart from hell. Looking at the training log afterward, I could see what happened: reward bounced around between 0.0 and 0.4 for the first 7 steps, peaked at 0.397, then *plummeted* and stabilized at 0.14 for the next 38 steps. The model had found "0.14 is a safe floor, don't try anything risky," and the gradient updates locked it in.
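A flat-lined reward like this is easy to catch automatically. The sketch below scans a reward log for the first window where the values barely move. The log is synthetic, shaped to match the numbers above (seven noisy steps peaking at 0.397, then 38 steps at 0.14), and the window size and threshold are arbitrary choices, not anything from the notebook.

```python
from statistics import pstdev


def find_plateau(rewards, window=10, max_std=0.01):
    """Return the first index where `window` consecutive rewards barely vary.

    Returns None if no such window exists. A plateau at a low reward is a
    hint (not proof) that the policy has collapsed to one safe output.
    """
    for start in range(len(rewards) - window + 1):
        chunk = rewards[start:start + window]
        if pstdev(chunk) <= max_std:
            return start
    return None


# Synthetic log: 7 noisy steps, then a flat 0.14 for 38 steps.
log = [0.10, 0.31, 0.00, 0.22, 0.35, 0.18, 0.397] + [0.14] * 38

plateau_start = find_plateau(log)
```

Running a check like this every few steps would have flagged the collapse around step 17 (once a full window of flat rewards had accumulated) instead of after the full run.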

I didn't know what the fix was. I asked for help. The diagnosis was: no KL penalty (`beta=0`), so nothing was anchoring the trained policy to the base model — it could drift to any degenerate local optimum. Plus the learning rate was too aggressive, and one epoch wasn't enough to recover from the bad starting trajectory.

The fix addressed all three, plus a larger group size:

```python
from trl import GRPOConfig

working_config = GRPOConfig(
    learning_rate=1e-6,           # 5x lower than the first run
    num_train_epochs=3,           # 3x longer
    num_generations=8,            # more variety per step
    beta=0.1,                     # KL penalty against base model
)
```
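One way to see why group size matters: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified version of that normalization (TRL's exact implementation may differ in details such as the standard-deviation estimator and epsilon), but it shows the key property: if every completion in a group earns the same reward, every advantage is zero and the reward contributes no learning signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Simplified group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A collapsed policy: 4 generations, all the same safe answer.
collapsed = group_advantages([0.14, 0.14, 0.14, 0.14])

# A healthier group of 8 where a couple of samples explored.
diverse = group_advantages([0.14, 0.14, 0.40, 0.14, 0.05, 0.14, 0.14, 0.30])
```

With only four generations, a group of identical outputs is more likely, which is one plausible reason a collapsed policy stays collapsed. Eight generations give more chances for at least one sample to differ and produce a non-zero advantage.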

I also added a "reload clean model" cell before training. The collapsed run's gradient updates were still baked into the in-memory weights, and I didn't want to start from a broken policy.

Then I hit Run All and waited 90 minutes.

## The second run

I came back to find the cell still running — 218 of 270 steps. I ran the evaluation cells anyway (the trained weights were already in memory) and held my breath while the bars rendered.

Easy task: 0.345 → **0.995**.  
Medium: 0.227 → **0.745**.  
Hard: 0.249 → **0.737**.

I made a noise.

Nine out of ten samples on the easy task scored a perfect 1.0. The trained model wasn't getting lucky; it had learned the *structure* of the task. Baseline scores ranged from 0.0 to 0.65 (the model was rolling dice). Trained scores sat in tight bands: 0.95 to 1.0 on easy, 0.68 to 0.82 on medium, 0.65 to 0.80 on hard.

The training curve told the same story. Reward in the first quartile of training averaged 0.390. Last quartile averaged 0.648. A 66% lift *during* training, on top of the much larger gap between trained and untrained at evaluation.
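The quartile comparison is simple enough to reproduce from any reward log. This is a small sketch of the calculation; `quartile_lift` is a hypothetical helper name, and the only real numbers used are the two quartile averages reported above.

```python
from statistics import mean


def quartile_lift(rewards):
    """Compare mean reward in the first and last quarter of a training log.

    Returns (first_quartile_mean, last_quartile_mean, relative_lift).
    """
    q = len(rewards) // 4
    first = mean(rewards[:q])
    last = mean(rewards[-q:])
    return first, last, (last - first) / first


# Checking the reported figures: 0.390 in the first quartile, 0.648 in the last.
reported_lift = (0.648 - 0.390) / 0.390
print(f"relative lift: {reported_lift:.0%}")
```

Comparing quartile means rather than single steps smooths out the step-to-step noise that GRPO reward curves tend to have.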

![Training results — three panels showing reward curve with moving averages, baseline vs trained per task, and reward variance over time](./training_results.png)

## The interesting part — the model tried to cheat

Because I had multiple independent reward components instead of a single scalar, I could see exactly *how* the model tried to game the reward during training. Going through the early-step rollouts:

- **Around step 8–15:** the model started outputting just the JSON action with no actual email body. The short-email penalty (-0.30 if `< 20` words) caught this. After ~30 steps, every output had a real email.
- **Around step 25:** it tried scheduling meetings at 8am or 6pm — outside working hours. The scheduling-correctness grader returned 0 for the `within_working_hours` check. The model learned working hours within the next 50 steps.
- **Around step 50:** generic templated phrasing started showing up, such as "Thank you for your email. I will check the calendar and get back to you." Vague, polite, and useless. The generic-phrasing detector docked these, and specific responses came back.
- **Throughout:** occasionally the model would "forget" the `meeting_details` block entirely. The -0.40 penalty for missing meeting details made this a non-strategy fast.
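The checks in this list are small functions. Here is a hedged sketch of three of them. The -0.30 and -0.40 values and the 20-word threshold come from the list above; the 9:00 to 17:00 working day, the dictionary keys, and the rule that only `book` actions need `meeting_details` are assumptions for illustration.

```python
from datetime import datetime, time, timedelta


def short_email_penalty(email_body):
    """-0.30 when the email has fewer than 20 whitespace-separated words."""
    return -0.30 if len(email_body.split()) < 20 else 0.0


def missing_details_penalty(action):
    """-0.40 when a booking omits meeting details (assumed to apply to 'book' only)."""
    needs_details = action.get("calendar_action") == "book"
    return -0.40 if needs_details and not action.get("meeting_details") else 0.0


def within_working_hours(start, duration_minutes,
                         day_start=time(9, 0), day_end=time(17, 0)):
    """True if the whole meeting fits inside the working day (assumed 9:00-17:00)."""
    end = start + timedelta(minutes=duration_minutes)
    same_day = start.date() == end.date()
    return same_day and day_start <= start.time() and end.time() <= day_end


# The two out-of-hours attempts described above, plus a valid slot.
too_early = within_working_hours(datetime(2026, 4, 20, 8, 0), 30)
too_late = within_working_hours(datetime(2026, 4, 20, 18, 0), 30)
fine = within_working_hours(datetime(2026, 4, 20, 10, 0), 30)
```

Because each check returns its own number, a log of which penalties fired at which step falls out for free, and that log is what makes the timeline above reconstructable.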

This is the part most submissions don't have. People say *"we have anti-reward-hacking penalties"* in their writeups. Showing the penalties firing during real training, on a real curve, is rare. And it's the difference between *claiming* a multi-grader rubric works and *demonstrating* it does.

## Why this matters beyond the numbers

There's a sanity check I'd run earlier in the day. Same three tasks, same scoring, but evaluated against an untuned **Nemotron 120B** model called via OpenRouter through the standard `inference.py` baseline. It scored an average of **0.337** across the three tasks.

After 90 minutes of GRPO, a model **240× smaller** is hitting **0.83 average** on the same environment, on a free Colab T4 and a $0 cloud bill.
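For anyone checking the arithmetic, both headline numbers follow directly from figures already quoted in this post (the per-task trained scores, the 0.337 Nemotron average, and the nominal parameter counts of 120B and 0.5B):

```python
# Per-task scores for the trained 0.5B model, from the second run.
trained = {"easy": 0.995, "medium": 0.745, "hard": 0.737}
trained_avg = sum(trained.values()) / len(trained)

# Untuned Nemotron 120B average on the same three tasks.
nemotron_avg = 0.337

# Nominal parameter counts: 120B vs 0.5B.
size_ratio = 120e9 / 0.5e9

print(f"trained average: {trained_avg:.2f}")
print(f"size ratio: {size_ratio:.0f}x")
print(f"score ratio vs Nemotron: {trained_avg / nemotron_avg:.2f}x")
```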

That's the point of training-environment design. A well-shaped reward signal, a multi-grader rubric that can't be easily hacked, and a small model that's allowed to actually train against it — and you get a result that, on this specific task, beats a frontier model running unmodified.

I think there's a research-shaped argument here. Frontier LLMs are notoriously bad at structured calendar reasoning (try asking any production agent to find a 30-minute slot that doesn't conflict with your standups). ExecAssist isolates that specific failure mode into a tractable RL target, with a reward signal that's hard to game. The result suggests that for a class of structured personal-task workflows, task-specific RL on small models is a legitimate alternative to scaling up. That's a workshop paper, maybe.

## Try it yourself

- **Live environment:** [`devanshudon-exec-assist.hf.space/docs`](https://devanshudon-exec-assist.hf.space/docs) — interact with the API directly via Swagger. Hit `POST /reset?task=easy` then `POST /step` with your action.
- **Baseline:** `python inference.py` reproduces the untuned baseline (0.337 average in my run; expect some run-to-run variation).
- **Training:** the Colab notebook is in the repo. Set the runtime to T4 and Run All; expect roughly 90 minutes of training on the working config, plus evaluation.
- **Repo:** all the code, the working hyperparameters, the broken hyperparameters, the results JSON, and this writeup.
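If you'd rather script it than click through Swagger, here is a minimal client sketch using only the standard library. The base URL and the `/reset?task=...` and `/step` routes are the ones listed above; the shape of the action body and of the JSON responses is an assumption, so check the `/docs` page for the real schema before relying on it.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://devanshudon-exec-assist.hf.space"


def reset_request(task):
    """Build the POST /reset?task=<task> request (not sent yet)."""
    url = f"{BASE_URL}/reset?{urlencode({'task': task})}"
    return urllib.request.Request(url, data=b"", method="POST")


def step_request(action):
    """Build the POST /step request with a JSON body (body schema is assumed)."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires network access; field names are hypothetical):
# observation = send(reset_request("easy"))
# result = send(step_request({"email_reply": "...", "calendar_action": "decline"}))
```

Building the requests separately from sending them makes it easy to inspect exactly what will go over the wire before hitting the live Space.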

The environment is genuinely hard in interesting ways. Try to break it. I'd be curious what the model learns to game next.

---

*Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2 — Personalized Tasks. Models, environment, and training results all on the [HF Space](https://huggingface.co/spaces/DevanshuDon/exec-assist).*