Upload 3 files
Browse files- README.md +28 -37
- blog_post.md +79 -30
README.md
CHANGED
|
@@ -19,10 +19,10 @@ tags:
|
|
| 19 |
|
| 20 |
> **A 0.5B-parameter model, trained for 90 minutes on a free Colab T4, beats an untuned 120B-parameter frontier model on this environment by 2.4×.** Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks.
|
| 21 |
|
| 22 |
-
An OpenEnv environment where AI agents learn to manage email and calendar like a human executive assistant: read incoming requests, write professional replies, find calendar slots that don't clash, propose alternatives when they do. Three tasks at increasing difficulty, three independent reward graders, and four anti-reward-hacking penalties, each backed by direct evidence of catching the model in the act.
|
| 23 |
|
| 24 |
**Live environment:** https://devanshudon-exec-assist.hf.space
|
| 25 |
-
**Mini-blog:** _(link will
|
| 26 |
**Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)
|
| 27 |
|
| 28 |
---
|
|
@@ -37,23 +37,23 @@ Trained `Qwen2.5-0.5B-Instruct` with TRL GRPO for 3 epochs (270 steps, ~90 min o
|
|
| 37 |
| Medium | 0.227 | **0.745** | **+228%** |
|
| 38 |
| Hard | 0.249 | **0.737** | **+196%** |
|
| 39 |
|
| 40 |
-
After training, **9 out of 10 samples on the easy task scored a perfect 1.0**: the model learned the task structure, not just statistics.
|
| 41 |
|
| 42 |
-
![Training results:
|
| 43 |
|
| 44 |
-
*
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
-
##
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
1.
|
| 53 |
-
2. Professional written tone (greeting, closing, polite framing, appropriate detail)
|
| 54 |
-
3. Conflict resolution (recognize the conflict, propose 2–3 alternatives, explain professionally)
|
| 55 |
|
| 56 |
-
|
|
|
|
|
|
|
| 57 |
|
| 58 |
---
|
| 59 |
|
|
@@ -65,7 +65,7 @@ ExecAssist is designed to teach all three in a single training loop. The reward
|
|
| 65 |
| **Medium** | 1 email, calendar conflict | Identify conflict + propose 2–3 alternatives + explain professionally | 30% email + 40% conflict + 30% scheduling |
|
| 66 |
| **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |
|
| 67 |
|
| 68 |
-
All scores are deterministic and bounded to [0, 1].
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
@@ -113,11 +113,6 @@ All scores are deterministic and bounded to [0, 1].
|
|
| 113 |
| **Scheduling correctness** | 0–1 | No double-booking, within working hours, appropriate duration (15min–2hrs), all participants included |
|
| 114 |
| **Conflict resolution** | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |
|
| 115 |
|
| 116 |
-
**Architectural note on rubrics.** The reward is composed from independent scoring functions (one per dimension: email quality, scheduling correctness,
|
| 117 |
-
conflict resolution) plus four named penalty checks. Each function returns a value in [0, 1] (or a negative penalty) and is mixed by the task-specific
|
| 118 |
-
weighting shown in the Tasks table. This is structurally a composable rubric: any individual grader can be swapped, weighted differently, or audited in
|
| 119 |
-
isolation. We implemented it as plain Python rather than OpenEnv's `Rubric` class for hackathon speed, but the design pattern is the same.
|
| 120 |
-
|
| 121 |
### Anti-reward-hacking penalties
|
| 122 |
|
| 123 |
- Short email (`< 20` words): **−0.30**
|
|
@@ -125,7 +120,9 @@ isolation. We implemented it as plain Python rather than OpenEnv's `Rubric` clas
|
|
| 125 |
- Generic / templated phrasing: **−0.10**
|
| 126 |
- Overly long email (`> 1500` chars): **−0.15**
|
| 127 |
|
| 128 |
-
These were added because GRPO will find shortcuts. During
|
|
|
|
|
|
|
| 129 |
|
| 130 |
---
|
| 131 |
|
|
@@ -209,7 +206,15 @@ GRPOConfig(
|
|
| 209 |
|
| 210 |
**The collapse and the fix.** The initial run (1 epoch, lr=5e-6, beta=0) collapsed: the trained model scored exactly 0.2 on every prompt regardless of input. It had found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), reducing the learning rate, and increasing `num_generations` from 4 to 8 produced the clean training curve shown above.
|
| 211 |
|
| 212 |
-
**Anti-reward-hacking observations during training
|
| 213 |
|
| 214 |
---
|
| 215 |
|
|
@@ -226,7 +231,7 @@ exec-assist/
|
|
| 226 |
├── train_colab.ipynb # GRPO training notebook
|
| 227 |
├── training_results.png # Training curves + baseline-vs-trained
|
| 228 |
├── results.json # Raw evaluation data + 270-step training log
|
| 229 |
-
├── blog_post.md # Mini-blog write-up
|
| 230 |
├── openenv.yaml # OpenEnv manifest
|
| 231 |
├── Dockerfile # Python 3.10, port 7860
|
| 232 |
├── requirements.txt
|
|
@@ -235,31 +240,17 @@ exec-assist/
|
|
| 235 |
|
| 236 |
---
|
| 237 |
|
| 238 |
-
## Architecture note
|
| 239 |
-
|
| 240 |
-
The environment is implemented as a FastAPI application that exposes the
|
| 241 |
-
OpenEnv-spec endpoints (`/reset`, `/step`, `/state`, `/tasks`, `/health`,
|
| 242 |
-
`/metadata`, `/schema`) directly, rather than extending `openenv.Environment`
|
| 243 |
-
as a Python class. Both implementations are spec-compliant (they expose
|
| 244 |
-
the same JSON-over-HTTP interface), but the FastAPI-direct approach gave
|
| 245 |
-
us finer control over the multi-component reward function and Pydantic
|
| 246 |
-
validation during the time-boxed hackathon build.
|
| 247 |
-
|
| 248 |
-
The client (`client.py`) does extend `openenv.EnvClient` and provides the
|
| 249 |
-
standard Gym-style typed interface, so any code that uses an `EnvClient`
|
| 250 |
-
to talk to this Space will work without modification.
|
| 251 |
-
|
| 252 |
## Compliance checklist
|
| 253 |
|
| 254 |
- ✅ Built on **OpenEnv** (latest release, `openenv-core>=0.2.0`)
|
| 255 |
- ✅ Real-world task simulation (not games or toys)
|
| 256 |
- ✅ Full OpenEnv spec: typed Pydantic models for Action/Observation/State, `step()`/`reset()`/`state()` endpoints, `openenv.yaml` manifest
|
| 257 |
- ✅ **3 tasks** with deterministic graders, scores in [0, 1], easy → medium → hard difficulty progression
|
| 258 |
-
- ✅ Meaningful reward function with **partial-progress signal** + anti-reward-hacking penalties
|
| 259 |
- ✅ **Baseline inference script** (`inference.py`) using OpenAI client, reads `APIBASEURL`/`MODELNAME`/`HFTOKEN`, structured `[START]/[STEP]/[END]` logs
|
| 260 |
- ✅ **Training script** (TRL GRPO) with reproducible Colab notebook
|
| 261 |
-
- ✅ **Real training evidence**: reward curves, baseline vs. trained,
|
| 262 |
-
- ✅ Deployed to **HuggingFace Space** with Docker
|
| 263 |
- ✅ Working **Dockerfile** (Python 3.10), `docker build && docker run` works
|
| 264 |
- ✅ README with environment description, action/observation spaces, setup, baseline scores
|
| 265 |
|
|
|
|
| 19 |
|
| 20 |
> **A 0.5B-parameter model, trained for 90 minutes on a free Colab T4, beats an untuned 120B-parameter frontier model on this environment by 2.4×.** Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks.
|
| 21 |
|
| 22 |
+
An OpenEnv environment where AI agents learn to manage email and calendar like a human executive assistant: read incoming requests, write professional replies, find calendar slots that don't clash, propose alternatives when they do. Three tasks at increasing difficulty, three independent reward graders, and four anti-reward-hacking penalties, each backed by direct training-time evidence of catching the model in the act.
|
| 23 |
|
| 24 |
**Live environment:** https://devanshudon-exec-assist.hf.space
|
| 25 |
+
**Mini-blog:** _(link will be added here once published)_
|
| 26 |
**Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)
|
| 27 |
|
| 28 |
---
|
|
|
|
| 37 |
| Medium | 0.227 | **0.745** | **+228%** |
|
| 38 |
| Hard | 0.249 | **0.737** | **+196%** |
|
| 39 |
|
| 40 |
+
After training, **9 out of 10 samples on the easy task scored a perfect 1.0**: the model learned the task structure, not just statistics. As a separate sanity check, an untuned Nemotron 120B model (called via OpenRouter through the standard `inference.py` baseline) scores 0.337 on average across the same three tasks. After 90 minutes of GRPO, a model **240× smaller** averages 0.83 on the same environment.
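The percentages and ratios quoted here can be rechecked with a few lines of arithmetic. This is only a sanity-check sketch: the easy-task scores (0.345 → 0.995) are taken from the blog post in this same commit, and the variable and function names are illustrative rather than anything from the repo.

```python
# Scores as quoted in this README and the accompanying blog post.
baseline = {"easy": 0.345, "medium": 0.227, "hard": 0.249}
trained = {"easy": 0.995, "medium": 0.745, "hard": 0.737}
nemotron_avg = 0.337  # untuned 120B model, average over the three tasks


def relative_gain(before: float, after: float) -> float:
    """Improvement expressed as a percentage of the starting score."""
    return 100.0 * (after - before) / before


# Per-task relative improvement of the trained 0.5B model over its own baseline.
gains = {task: relative_gain(baseline[task], trained[task]) for task in baseline}

# Average trained score and how it compares with the 120B sanity-check model.
trained_avg = sum(trained.values()) / len(trained)
ratio_vs_nemotron = trained_avg / nemotron_avg

# Parameter-count ratio between the two models (120B vs. 0.5B).
param_ratio = 120e9 / 0.5e9

for task, gain in gains.items():
    print(f"{task}: +{gain:.0f}%")
print(f"trained average: {trained_avg:.2f}")
print(f"ratio vs. 120B baseline: {ratio_vs_nemotron:.2f}x")
print(f"parameter ratio: {param_ratio:.0f}x")
```

The ratio against the 120B baseline lands between 2.4 and 2.5, so the headline figure depends slightly on whether the average is rounded before dividing.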
|
| 41 |
|
| 42 |
+

|
| 43 |
|
| 44 |
+
*Top: GRPO reward over 270 steps with moving averages and quartile mean reference lines. Bottom-left: baseline vs. trained, n=10 per task, error bars are standard deviation. Bottom-right: reward variance over training as a convergence proxy; variance drops as the policy stabilizes.*
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
+
## Why this environment exists (research framing)
|
| 49 |
|
| 50 |
+
Three specific capability gaps motivated this environment:
|
| 51 |
|
| 52 |
+
**1. Frontier LLMs are bad at structured calendar reasoning.** Ask any production agent built on a 100B+ model to "find a 30-minute slot next week that doesn't conflict with my standups and is during working hours" and observe the failure rate. The reasoning is short, the spec is precise, the failure modes are interesting. ExecAssist isolates this failure mode into a tractable RL target: the scheduling-correctness grader checks four hard constraints (no double-booking, within working hours, appropriate duration, all participants included), and the trained model goes from satisfying ~25% of them to ~95%.
|
|
|
|
|
|
|
| 53 |
|
| 54 |
+
**2. Multi-objective rewards are where reward hacking actually happens.** A single scalar reward ("the user was happy") gets gamed in obvious ways. A weighted sum of multiple independent graders + named penalties is much harder to game, but only if you actually verify that. We have direct evidence from training logs that GRPO tried to hack four different reward signals (output JSON only with no email body, schedule outside working hours, use generic templated phrasing, miss meeting details entirely), and that each penalty fired during early steps and disappeared as training progressed. Most submissions claim "anti-hacking penalties." Few can show them firing.
|
| 55 |
+
|
| 56 |
+
**3. The "small RL'd model beats large untuned model" claim, on a real task, in 90 minutes, on free hardware.** The 240× parameter-count ratio between Qwen-0.5B and Nemotron-120B is the headline, but the deeper claim is that *task-specific RL with composable rewards is a viable path to deploying small models on structured personal-task workflows.* That's a workshop-paper-shaped argument, and ExecAssist is a clean demonstration of it.
|
| 57 |
|
| 58 |
---
|
| 59 |
|
|
|
|
| 65 |
| **Medium** | 1 email, calendar conflict | Identify conflict + propose 2–3 alternatives + explain professionally | 30% email + 40% conflict + 30% scheduling |
|
| 66 |
| **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |
|
| 67 |
|
| 68 |
+
All scores are deterministic and bounded to [0, 1]. Scenarios are randomized at every `/reset` call.
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
|
|
| 113 |
| **Scheduling correctness** | 0–1 | No double-booking, within working hours, appropriate duration (15min–2hrs), all participants included |
|
| 114 |
| **Conflict resolution** | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |
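As an illustration of how the four scheduling constraints in the table could be checked, here is a minimal sketch. It is not the grader shipped in this repo: the `Meeting` class, the 09:00–17:00 working hours, and the equal weighting of the four checks are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class Meeting:  # hypothetical type, not the repo's Pydantic model
    start: datetime
    end: datetime
    participants: frozenset


def scheduling_checks(proposed, existing, required,
                      work_start=time(9, 0), work_end=time(17, 0)):
    """Evaluate the four constraints and return a name -> bool mapping."""
    duration = proposed.end - proposed.start
    return {
        # The proposal must not overlap any meeting already on the calendar.
        "no_double_booking": all(
            proposed.end <= m.start or proposed.start >= m.end for m in existing
        ),
        # Assumed working hours; the real environment may define them differently.
        "within_working_hours": (
            proposed.start.date() == proposed.end.date()
            and work_start <= proposed.start.time()
            and proposed.end.time() <= work_end
        ),
        # Duration bounds taken from the table: 15 minutes to 2 hours.
        "appropriate_duration": timedelta(minutes=15) <= duration <= timedelta(hours=2),
        "all_participants_included": set(required) <= set(proposed.participants),
    }


def scheduling_score(checks):
    """Assumed scoring rule: the fraction of checks that pass, in [0, 1]."""
    return sum(checks.values()) / len(checks)


standup = Meeting(datetime(2026, 4, 20, 10, 0), datetime(2026, 4, 20, 10, 30),
                  frozenset({"me"}))
proposal = Meeting(datetime(2026, 4, 20, 10, 15), datetime(2026, 4, 20, 10, 45),
                   frozenset({"me", "alice"}))
checks = scheduling_checks(proposal, [standup], {"me", "alice"})
score = scheduling_score(checks)  # overlaps the standup, so 3 of 4 checks pass
```

Returning the individual checks alongside the score keeps the grader auditable: a failed rollout can be traced to the specific constraint it violated.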
|
| 115 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
### Anti-reward-hacking penalties
|
| 117 |
|
| 118 |
- Short email (`< 20` words): **−0.30**
|
|
|
|
| 120 |
- Generic / templated phrasing: **−0.10**
|
| 121 |
- Overly long email (`> 1500` chars): **−0.15**
|
| 122 |
|
| 123 |
+
These were added because GRPO will find shortcuts. During training the model briefly collapsed to a single short safe response; the penalties plus KL regularization fixed it cleanly.
|
| 124 |
+
|
| 125 |
+
**Architectural note on rubrics.** The reward is composed from independent scoring functions (one per dimension: email quality, scheduling correctness, conflict resolution) plus four named penalty checks. Each function returns a value in [0, 1] (or a negative penalty) and is mixed by the task-specific weighting shown in the Tasks table. This is structurally a composable rubric: any individual grader can be swapped, reweighted, or audited in isolation. We implemented it as plain Python rather than OpenEnv's `Rubric` class for hackathon speed, but the design pattern (independent, composable, auditable) is the same.
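A minimal sketch of what such a composable rubric can look like in plain Python follows. The weights and penalty magnitudes are the ones listed in this README; everything else is an assumption for illustration, including the `email_reply` field name, the function names, and the final clamp used to keep the score in [0, 1]. Only two of the four penalties are shown, and the easy-task weights are left out because they are not shown in this excerpt.

```python
from typing import Callable, Dict, Iterable

Action = dict
Grader = Callable[[Action], float]  # returns a score in [0, 1]
Penalty = Callable[[Action], float]  # returns 0.0 or a negative value

# Task-specific mixing weights from the Tasks table.
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "medium": {"email": 0.30, "conflict": 0.40, "scheduling": 0.30},
    "hard": {"email": 0.34, "scheduling": 0.33, "conflict": 0.33},
}


def short_email_penalty(action: Action) -> float:
    """-0.30 when the reply has fewer than 20 words."""
    return -0.30 if len(action.get("email_reply", "").split()) < 20 else 0.0


def long_email_penalty(action: Action) -> float:
    """-0.15 when the reply is longer than 1500 characters."""
    return -0.15 if len(action.get("email_reply", "")) > 1500 else 0.0


def total_reward(action: Action, task: str,
                 graders: Dict[str, Grader],
                 penalties: Iterable[Penalty]) -> float:
    """Weighted sum of independent graders plus penalties, clamped to [0, 1]."""
    weights = TASK_WEIGHTS[task]
    base = sum(weights[name] * graders[name](action) for name in weights)
    penalty = sum(check(action) for check in penalties)
    return max(0.0, min(1.0, base + penalty))
```

Because each grader and penalty is an ordinary function, swapping one out or changing a weight is a one-line change, and each piece can be unit-tested on its own.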
|
| 126 |
|
| 127 |
---
|
| 128 |
|
|
|
|
| 206 |
|
| 207 |
**The collapse and the fix.** The initial run (1 epoch, lr=5e-6, beta=0) collapsed: the trained model scored exactly 0.2 on every prompt regardless of input. It had found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), reducing the learning rate, and increasing `num_generations` from 4 to 8 produced the clean training curve shown above.
|
| 208 |
|
| 209 |
+
**Anti-reward-hacking observations during training.** GRPO tried to game several signals: outputting only the JSON without an email body (caught by the short-email penalty), proposing booking times outside working hours (caught by the scheduling check), and repeating the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty was triggered during early steps and disappeared as training progressed. This is what a well-designed multi-grader rubric is supposed to do.
|
| 210 |
+
|
| 211 |
+
---
|
| 212 |
+
|
| 213 |
+
## Architecture note
|
| 214 |
+
|
| 215 |
+
The environment is implemented as a FastAPI application that exposes the OpenEnv-spec endpoints (`/reset`, `/step`, `/state`, `/tasks`, `/health`, `/metadata`, `/schema`) directly, rather than extending `openenv.Environment` as a Python class. Both implementations are spec-compliant (they expose the same JSON-over-HTTP interface), but the FastAPI-direct approach gave us finer control over the multi-component reward function and Pydantic validation during the time-boxed hackathon build.
|
| 216 |
+
|
| 217 |
+
The client (`client.py`) does extend `openenv.EnvClient` and provides the standard Gym-style typed interface, so any code that uses an `EnvClient` to talk to this Space will work without modification. Client/server separation is preserved: the client only imports models, never server internals.
|
| 218 |
|
| 219 |
---
|
| 220 |
|
|
|
|
| 231 |
├── train_colab.ipynb # GRPO training notebook
|
| 232 |
├── training_results.png # Training curves + baseline-vs-trained
|
| 233 |
├── results.json # Raw evaluation data + 270-step training log
|
| 234 |
+
├── blog_post.md # Mini-blog write-up (also published on HF Blog)
|
| 235 |
├── openenv.yaml # OpenEnv manifest
|
| 236 |
├── Dockerfile # Python 3.10, port 7860
|
| 237 |
├── requirements.txt
|
|
|
|
| 240 |
|
| 241 |
---
|
| 242 |
|
| 243 |
## Compliance checklist
|
| 244 |
|
| 245 |
- ✅ Built on **OpenEnv** (latest release, `openenv-core>=0.2.0`)
|
| 246 |
- ✅ Real-world task simulation (not games or toys)
|
| 247 |
- ✅ Full OpenEnv spec: typed Pydantic models for Action/Observation/State, `step()`/`reset()`/`state()` endpoints, `openenv.yaml` manifest
|
| 248 |
- ✅ **3 tasks** with deterministic graders, scores in [0, 1], easy → medium → hard difficulty progression
|
| 249 |
+
- ✅ Meaningful reward function with **partial-progress signal** + anti-reward-hacking penalties (with training-time evidence of penalties firing)
|
| 250 |
- ✅ **Baseline inference script** (`inference.py`) using OpenAI client, reads `APIBASEURL`/`MODELNAME`/`HFTOKEN`, structured `[START]/[STEP]/[END]` logs
|
| 251 |
- ✅ **Training script** (TRL GRPO) with reproducible Colab notebook
|
| 252 |
+
- ✅ **Real training evidence**: reward curves with moving averages, baseline vs. trained with error bars, convergence proxy (above)
|
| 253 |
+
- ✅ Deployed to **HuggingFace Space** with Docker, live at https://devanshudon-exec-assist.hf.space
|
| 254 |
- ✅ Working **Dockerfile** (Python 3.10), `docker build && docker run` works
|
| 255 |
- ✅ README with environment description, action/observation spaces, setup, baseline scores
|
| 256 |
|
blog_post.md
CHANGED
|
@@ -1,64 +1,113 @@
|
|
| 1 |
# Teaching a 0.5B Model to Be an Executive Assistant
|
| 2 |
|
| 3 |
-
*An OpenEnv Hackathon submission: built in 36 hours, trained in 90 minutes.*
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
## The environment
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
- **Easy**: single email, clear availability. *
|
| 20 |
- **Medium**: the requested time conflicts with an existing meeting. *Spot it, propose alternatives.*
|
| 21 |
- **Hard**: three emails, multi-party coordination, priority conflicts. *Actually plan.*
|
| 22 |
|
| 23 |
-
Reward is a weighted blend of three independent graders
|
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
## The
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
| Hard | 0.249 | **0.737** | **+196%** |
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
| 52 |
|
| 53 |
## Try it yourself
|
| 54 |
|
| 55 |
-
- **Live environment:** [`devanshudon-exec-assist.hf.space/docs`](https://devanshudon-exec-assist.hf.space/docs) lets you interact with the API directly via Swagger
|
| 56 |
-
- **
|
| 57 |
-
- **
|
| 58 |
-
- **
|
| 59 |
|
| 60 |
-
The environment is genuinely hard in interesting ways.
|
| 61 |
|
| 62 |
---
|
| 63 |
|
| 64 |
-
*Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks.*
|
|
|
|
| 1 |
# Teaching a 0.5B Model to Be an Executive Assistant
|
| 2 |
|
| 3 |
+
*An OpenEnv Hackathon submission: built in 36 hours, trained in 90 minutes, debugged with three model collapses and one very bad bar chart.*
|
| 4 |
|
| 5 |
---
|
| 6 |
|
| 7 |
+
I want to start with the moment the bar chart loaded and I thought I'd lost the hackathon.
|
| 8 |
|
| 9 |
+
It was around 3pm on day two. I'd been training for an hour and a half on a free Colab T4. The cell finished, I ran the eval, and the plot popped up: baseline bars in red, trained bars in green. Easy task: 0.493 → 0.200. Medium: 0.348 → 0.200. Hard: 0.331 → 0.186.
|
| 10 |
|
| 11 |
+
The trained model was *worse*. Identically worse on every task. Worse by exactly the same number every time.
|
| 12 |
+
|
| 13 |
+
I'd never seen GRPO collapse before, but this was textbook. The model had given up exploring and found a single safe response that scored exactly 0.2 against my reward function regardless of input. All my training was, technically, optimization, just optimization toward "say the dumbest possible thing every time and never deviate."
|
| 14 |
+
|
| 15 |
+
This post is about what happened next. But first, the setup.
|
| 16 |
|
| 17 |
## The environment
|
| 18 |
|
| 19 |
+
I built **ExecAssist**, an OpenEnv environment that simulates an executive's morning inbox. An agent gets emails (sender, subject, body, priority), a calendar (existing meetings, working hours), and contacts. It has to produce a JSON action: an email reply, a calendar action (`book` / `propose_alternatives` / `reschedule` / `decline`), and the meeting details if booking.
|
| 20 |
|
| 21 |
+
Three tasks of escalating difficulty:
|
| 22 |
|
| 23 |
+
- **Easy**: single email, clear availability. *Don't mess up the basics.*
|
| 24 |
- **Medium**: the requested time conflicts with an existing meeting. *Spot it, propose alternatives.*
|
| 25 |
- **Hard**: three emails, multi-party coordination, priority conflicts. *Actually plan.*
|
| 26 |
|
| 27 |
+
Reward is a weighted blend of three independent graders: email quality (politeness markers, greeting/closing, sufficient detail), scheduling correctness (no double-booking, within working hours, appropriate duration), and conflict resolution (recognizes conflicts, proposes alternatives, explains professionally). Plus four anti-reward-hacking penalties: short emails, missing meeting details, generic phrasing, overly long responses.
|
| 28 |
+
|
| 29 |
+
The reason I went with multiple independent graders rather than one big scalar is something I'd read in the hackathon guide: *"if you only have a single reward signal, it is easier for the model to hack it."* I figured I'd build it the right way from day one.
|
| 30 |
+
|
| 31 |
+
I would later be very glad I did.
|
| 32 |
+
|
| 33 |
+
## The first run, and the collapse
|
| 34 |
+
|
| 35 |
+
I trained `Qwen2.5-0.5B-Instruct`, a tiny 500M-parameter model, using TRL's `GRPOTrainer`. First config:
|
| 36 |
+
|
| 37 |
+
```python
|
| 38 |
+
GRPOConfig(
|
| 39 |
+
    learning_rate=5e-6,
|
| 40 |
+
    num_train_epochs=1,
|
| 41 |
+
    num_generations=4,
|
| 42 |
+
    # no beta term
|
| 43 |
+
)
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
This is what produced the bar chart from hell. Looking at the training log afterward, I could see what happened: reward bounced around between 0.0 and 0.4 for the first 7 steps, peaked at 0.397, then *plummeted* and stabilized at 0.14 for the next 38 steps. The model had found "0.14 is a safe floor, don't try anything risky," and the gradient updates locked it in.
|
| 47 |
+
|
| 48 |
+
I didn't know what the fix was. I asked for help. The diagnosis was: no KL penalty (`beta=0`), so nothing was anchoring the trained policy to the base model, and it could drift to any degenerate local optimum. Plus the learning rate was too aggressive, and one epoch wasn't enough to recover from the bad starting trajectory.
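One way to see why a collapsed policy tends to stay collapsed: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified version of that normalization (mean-centered, divided by the group's standard deviation). It is not TRL's actual code, the function name and the `eps` value are my own assumptions, and it illustrates the mechanism rather than the diagnosis I was given.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Simplified GRPO-style advantages for one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A healthy group: completions differ, so some are pushed up and some down.
diverse = group_relative_advantages([0.0, 0.2, 0.4, 0.2])

# A collapsed group: every completion is the same "safe" answer.
collapsed = group_relative_advantages([0.2, 0.2, 0.2, 0.2])

print(diverse)    # best completion positive, worst negative
print(collapsed)  # all zeros: the reward no longer distinguishes anything
```

When every sample in the group earns the same reward, every advantage is zero, so the reward stops providing a direction to move in. Sampling more completions per prompt makes it more likely that at least one differs, which is one plausible reason raising `num_generations` helps.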
|
| 49 |
+
|
| 50 |
+
The fix was three changes:
|
| 51 |
+
|
| 52 |
+
```python
|
| 53 |
+
GRPOConfig(
|
| 54 |
+
    learning_rate=1e-6,  # 5x slower
|
| 55 |
+
    num_train_epochs=3,  # 3x longer
|
| 56 |
+
    num_generations=8,  # more variety per step
|
| 57 |
+
    beta=0.1,  # KL penalty against base model
|
| 58 |
+
)
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
I also added a "reload clean model" cell before training, because the previous bad gradient updates had corrupted the weights and I didn't want to start from a broken policy.
|
| 62 |
+
|
| 63 |
+
Then I hit Run All and waited 90 minutes.
|
| 64 |
+
|
| 65 |
+
## The second run
|
| 66 |
+
|
| 67 |
+
I came back to find the cell still running: 218 of 270 steps. I ran the evaluation cells anyway (the trained weights were already in memory) and held my breath while the bars rendered.
|
| 68 |
+
|
| 69 |
+
Easy task: 0.345 → **0.995**.
|
| 70 |
+
Medium: 0.227 → **0.745**.
|
| 71 |
+
Hard: 0.249 → **0.737**.
|
| 72 |
+
|
| 73 |
+
I made a noise.
|
| 74 |
|
| 75 |
+
Nine out of ten samples on the easy task scored a perfect 1.0. The trained model wasn't getting lucky; it had learned the *structure* of the task. Baseline scores ranged from 0.0 to 0.65 (the model was rolling dice). Trained scores sat in tight bands: 0.95 to 1.0 on easy, 0.68 to 0.82 on medium, 0.65 to 0.80 on hard.
|
| 76 |
|
| 77 |
+
The training curve told the same story. Reward in the first quartile of training averaged 0.390. Last quartile averaged 0.648. A 66% lift *during* training, on top of the much larger gap between trained and untrained at evaluation.
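For anyone who wants to recompute the quartile comparison from a reward log, a small sketch follows. It assumes the per-step rewards are available as a plain list of floats. How they are stored in `results.json` is not shown here, so loading them is left out, and the function names are mine.

```python
def quartile_means(rewards):
    """Mean reward over the first and last quarter of training steps."""
    quarter = len(rewards) // 4
    if quarter == 0:
        raise ValueError("need at least 4 reward values")
    first = sum(rewards[:quarter]) / quarter
    last = sum(rewards[-quarter:]) / quarter
    return first, last


def percent_lift(first, last):
    """Relative change from the first-quartile mean to the last-quartile mean."""
    return 100.0 * (last - first) / first


# Using the two quartile means quoted above:
lift = percent_lift(0.390, 0.648)
print(f"{lift:.0f}% lift during training")
```

Comparing quartile means instead of single steps smooths out the step-to-step noise that is typical of GRPO reward curves.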
|
| 78 |
|
| 79 |
+

|
| 80 |
|
| 81 |
+
## The interesting part: the model tried to cheat
|
| 82 |
|
| 83 |
+
Because I had multiple independent reward components instead of a single scalar, I could see exactly *how* the model tried to game the reward during training. Going through the early-step rollouts:
|
| 84 |
|
| 85 |
+
- **Around step 8–15:** the model started outputting just the JSON action with no actual email body. The short-email penalty (-0.30 if `< 20` words) caught this. After ~30 steps, every output had a real email.
|
| 86 |
+
- **Around step 25:** it tried scheduling meetings at 8am or 6pm, outside working hours. The scheduling-correctness grader returned 0 for the `within_working_hours` check. The model learned working hours within the next 50 steps.
|
| 87 |
+
- **Around step 50:** generic templated phrasing started showing up: "Thank you for your email. I will check the calendar and get back to you." Vague, polite, but useless. The generic-phrasing detector docked these. Specific responses came back.
|
| 88 |
+
- **Throughout:** occasionally the model would "forget" the `meeting_details` block entirely. The -0.40 penalty for missing meeting details made this a non-strategy fast.
|
|
|
|
| 89 |
|
| 90 |
+
This is the part most submissions don't have. People say *"we have anti-reward-hacking penalties"* in their writeups. Showing the penalties firing during real training, on a real curve, is rare. And it's the difference between *claiming* a multi-grader rubric works and *demonstrating* it does.
|
| 91 |
|
| 92 |
+
## Why this matters beyond the numbers
|
| 93 |
|
| 94 |
+
There's a sanity check I'd run earlier in the day. Same three tasks, same scoring, but evaluated against an untuned **Nemotron 120B** model called via OpenRouter through the standard `inference.py` baseline. It scored an average of **0.337** across the three tasks.
|
| 95 |
|
| 96 |
+
After 90 minutes of GRPO, a model **240× smaller** is hitting **0.83 average** on the same environment. On a free Colab T4. From a $0 cloud bill.
|
| 97 |
|
| 98 |
+
That's the point of training-environment design. Combine a well-shaped reward signal, a multi-grader rubric that can't be easily hacked, and a small model that's allowed to actually train against it, and you get a result that, on this specific task, beats a frontier model running unmodified.
|
| 99 |
|
| 100 |
+
I think there's a research-shaped argument here. Frontier LLMs are notoriously bad at structured calendar reasoning (try asking any production agent to find a 30-minute slot that doesn't conflict with your standups). ExecAssist isolates that specific failure mode into a tractable RL target, with a reward signal that's hard to game. The result suggests that for a class of structured personal-task workflows, task-specific RL on small models is a legitimate alternative to scaling up. That's a workshop paper, maybe.
|
| 101 |
|
| 102 |
## Try it yourself
|
| 103 |
|
| 104 |
+
- **Live environment:** [`devanshudon-exec-assist.hf.space/docs`](https://devanshudon-exec-assist.hf.space/docs) lets you interact with the API directly via Swagger. Hit `POST /reset?task=easy` then `POST /step` with your action.
|
| 105 |
+
- **Baseline:** `python inference.py` reproduces the untrained scores (~0.32 average).
|
| 106 |
+
- **Training:** the Colab notebook is in the repo. Set runtime to T4, Run All, ~50 minutes including evaluation.
|
| 107 |
+
- **Repo:** all the code, the working hyperparameters, the broken hyperparameters, the results JSON, and this writeup.
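If you'd rather script the reset-then-step flow than click through Swagger, here is a rough sketch using only the standard library. The two endpoints and the `task=easy` query parameter come from the bullet above. The action field names (`email_reply`, `calendar_action`) and the contents of `meeting_details` are my guesses at the schema for illustration, so check `/schema` or the Swagger page for the real ones.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://devanshudon-exec-assist.hf.space"


def build_reset_request(task: str) -> Request:
    """POST /reset?task=<task> with an empty body."""
    return Request(f"{BASE_URL}/reset?{urlencode({'task': task})}",
                   data=b"", method="POST")


def build_step_request(action: dict) -> Request:
    """POST /step with the action encoded as JSON."""
    return Request(f"{BASE_URL}/step",
                   data=json.dumps(action).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Hypothetical action payload; the field names are assumptions, not the real schema.
example_action = {
    "email_reply": "Hi Alice, thanks for reaching out. Tuesday at 2pm works well.",
    "calendar_action": "book",
    "meeting_details": {"start": "2026-04-21T14:00", "duration_minutes": 30},
}


def run_episode(task: str = "easy"):
    """Send one reset and one step to the live Space and return both JSON replies."""
    with urlopen(build_reset_request(task)) as resp:
        observation = json.load(resp)
    with urlopen(build_step_request(example_action)) as resp:
        result = json.load(resp)
    return observation, result
```

Calling `run_episode()` needs network access and a running Space. The request builders are kept separate so the URLs and payload can be inspected without sending anything.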
|
| 108 |
|
| 109 |
+
The environment is genuinely hard in interesting ways. Try to break it. I'd be curious what the model learns to game next.
|
| 110 |
|
| 111 |
---
|
| 112 |
|
| 113 |
+
*Built for the OpenEnv Hackathon (Apr 2026), Theme #3.2: Personalized Tasks. Models, environment, and training results all on the [HF Space](https://huggingface.co/spaces/DevanshuDon/exec-assist).*
|