---
title: ExecAssist
emoji: 📧
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - openenv
  - rl
  - executive-assistant
  - grpo
  - trl
---

# ExecAssist

> A 0.5B-parameter model, trained for 90 minutes on a free Colab T4, beats an untuned 120B-parameter frontier model on this environment by 2.4×. Built for the OpenEnv Hackathon (April 2026), Theme #3.2: Personalized Tasks.

ExecAssist is an OpenEnv environment where AI agents learn to manage email and calendar the way a human executive assistant does. The agent reads incoming requests, writes professional replies, finds calendar slots that don't clash, and proposes alternatives when they do. Three tasks at increasing difficulty, three independent reward graders, and four anti-reward-hacking penalties that fired during real training (we have logs).

**Live environment:** https://devanshudon-exec-assist.hf.space  
**Mini-blog:** https://devanshudon-exec-assist.hf.space/blog  
**Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)  
**Video walkthrough:** [Watch on YouTube](https://youtu.be/iit3PgGBR4Y)

---

## 🏆 Headline result

Trained `Qwen2.5-0.5B-Instruct` with TRL GRPO for 3 epochs (270 steps, ~90 min on free Colab T4):

| Task   | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
|--------|---------------------------|----------------|-------------|
| Easy   | 0.345                     | **0.995**      | **+188%**   |
| Medium | 0.227                     | **0.745**      | **+228%**   |
| Hard   | 0.249                     | **0.737**      | **+196%**   |

Nine out of ten samples on the easy task scored a perfect 1.0 after training. The model learned the structure of the task, not just statistics. As a separate sanity check, we ran an untuned Nemotron 120B through the standard `inference.py` baseline (via OpenRouter) and it scored 0.337 average across the same three tasks. After 90 minutes of GRPO, a model 240× smaller is hitting 0.83 average on the same environment.
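
The improvement column and the 0.83 average can be re-derived from the table. The short check below uses only the per-task means reported above; nothing else is assumed.

```python
# Per-task mean scores copied from the table above: (baseline, trained).
scores = {
    "easy": (0.345, 0.995),
    "medium": (0.227, 0.745),
    "hard": (0.249, 0.737),
}

# Relative improvement in percent: (trained / baseline - 1) * 100.
improvement = {
    task: round((trained / baseline - 1) * 100)
    for task, (baseline, trained) in scores.items()
}

# Unweighted mean of the three trained scores.
trained_mean = sum(trained for _, trained in scores.values()) / len(scores)

print(improvement)             # {'easy': 188, 'medium': 228, 'hard': 196}
print(round(trained_mean, 2))  # 0.83
```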

![Training results: top panel shows reward curve with 10-step and 30-step moving averages, Q1 mean 0.390 to Q4 mean 0.648; bottom-left shows baseline vs trained per-task with error bars and improvement percentages; bottom-right shows reward variance decreasing as the policy converges](./training_results.png)

*Top: GRPO reward over 270 steps with moving averages and quartile mean reference lines. Bottom-left: baseline vs. trained, n=10 per task, error bars are standard deviation. Bottom-right: reward variance over training as a convergence proxy. Lower variance means the policy is stabilizing.*

---

## Why this environment exists

Three specific capability gaps motivated this build.

**1. Frontier LLMs are bad at structured calendar reasoning.** Ask any production agent built on a 100B+ model to find a 30-minute slot next week that doesn't conflict with your standups and is during working hours. Watch it fail. The reasoning is short, the spec is precise, and the failure modes are interesting. ExecAssist isolates this failure mode into something tractable: the scheduling-correctness grader checks four hard constraints (no double-booking, within working hours, appropriate duration, all participants included). The trained model goes from satisfying about 25% of those to about 95%.

**2. Multi-objective rewards are where reward hacking actually happens.** A single scalar reward like "the user was happy" gets gamed in obvious ways. A weighted sum of multiple independent graders plus named penalties is much harder to game, but only if you actually verify it. We have direct evidence from the training logs that GRPO tried to hack four different reward signals: outputting JSON only with no email body, scheduling outside working hours, using generic templated phrasing, and missing meeting details entirely. Each penalty fired during early steps and disappeared as training progressed. Most submissions claim "anti-hacking penalties." Few can show them firing.

**3. Small RL'd model beats large untuned model, on a real task, in 90 minutes, on free hardware.** The 240× parameter-count gap between Qwen-0.5B and Nemotron-120B is the headline. The deeper claim is that task-specific RL with composable rewards is a real path to deploying small models on structured personal-task workflows. That's a workshop-paper-shaped argument, and ExecAssist is a clean demonstration of it.

---

## Tasks

| Task | Difficulty | Description | Reward weighting |
|------|-----------|-------------|-------------------|
| **Easy** | 1 email, clear availability | Draft polite reply, book meeting in open slot | 50% email + 50% scheduling |
| **Medium** | 1 email, calendar conflict | Identify conflict, propose 2–3 alternatives, explain professionally | 30% email + 40% conflict + 30% scheduling |
| **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |

All scores are deterministic and bounded to [0, 1]. Scenarios are randomized at every `/reset` call.

---

## Environment design

### Observation space

```python
{
  "task": "easy" | "medium" | "hard",
  "description": str,
  "emails": [{"sender", "subject", "body", "priority", "timestamp"}, ...],
  "calendar": {
    "existing_meetings": [{"id", "participants", "start_time", "end_time", "subject", "priority"}, ...],
    "working_hours": {"monday": "9-17", ...},
    "executive_name": str
  },
  "contacts": {email: {"name", "email", "timezone", "title"}, ...},
  "action_required": str
}
```

### Action space

```python
{
  "email_reply": str,
  "calendar_action": "book" | "propose_alternatives" | "reschedule" | "decline",
  "meeting_details": {
    "participants": [str, ...],
    "start_time": "ISO-8601",
    "end_time": "ISO-8601",
    "subject": str,
    "location": str | None,
    "proposed_alternatives": [...] | None
  }
}
```
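
As an illustration of the schema, here is what a well-formed `book` action might look like. Every value (names, addresses, times, subject) is invented for the example, and the `ALLOWED_CALENDAR_ACTIONS` helper is a hypothetical convenience, not part of the environment.

```python
import json

# The four calendar_action values listed in the schema above.
ALLOWED_CALENDAR_ACTIONS = {"book", "propose_alternatives", "reschedule", "decline"}

# All field values below are made up for illustration.
action = {
    "email_reply": (
        "Hi Priya,\n\n"
        "Thank you for reaching out. I would be glad to set up the roadmap sync. "
        "I have booked 30 minutes on Tuesday at 2:00 PM and sent an invitation "
        "to everyone involved.\n\n"
        "Best regards,\nAlex"
    ),
    "calendar_action": "book",
    "meeting_details": {
        "participants": ["exec@example.com", "priya@example.com"],
        "start_time": "2026-04-21T14:00:00",
        "end_time": "2026-04-21T14:30:00",
        "subject": "Q2 roadmap sync",
        "location": None,
        "proposed_alternatives": None,
    },
}

assert action["calendar_action"] in ALLOWED_CALENDAR_ACTIONS

# Python None becomes JSON null when the action is serialized for /step.
payload = json.dumps(action)
```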

### Reward function (multiple independent graders)

| Component | Range | What it checks |
|-----------|-------|----------------|
| **Email quality** | 0–1 | Politeness markers, greeting/closing, sufficient detail (20+ words), professional tone, optional LLM-judge for nuance |
| **Scheduling correctness** | 0–1 | No double-booking, within working hours, appropriate duration (15min to 2hrs), all participants included |
| **Conflict resolution** | 0–1 | Recognizes conflicts, proposes 2–3 alternatives, explains professionally, prioritizes correctly |

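The four scheduling constraints in the table can be sketched as independent boolean checks. This is a minimal illustration, not the grader in `server/data.py`. It assumes naive ISO-8601 timestamps (no timezone offset), working hours given as `"9-17"` strings, and an equal-weight score over the four checks. The function names are hypothetical.

```python
from datetime import datetime, time, timedelta

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")


def scheduling_checks(meeting, calendar, required_participants):
    """Return one boolean per hard constraint for a proposed meeting."""
    start = datetime.fromisoformat(meeting["start_time"])
    end = datetime.fromisoformat(meeting["end_time"])

    # 1. No double-booking: the new interval must not overlap an existing one.
    no_double_booking = all(
        end <= datetime.fromisoformat(m["start_time"])
        or start >= datetime.fromisoformat(m["end_time"])
        for m in calendar["existing_meetings"]
    )

    # 2. Within working hours: parse e.g. "9-17" for the meeting's weekday.
    hours = calendar["working_hours"].get(WEEKDAYS[start.weekday()])
    within_hours = False
    if hours and start.date() == end.date():
        open_hour, close_hour = (int(part) for part in hours.split("-"))
        midnight = datetime.combine(start.date(), time.min)
        within_hours = (
            midnight + timedelta(hours=open_hour) <= start
            and end <= midnight + timedelta(hours=close_hour)
        )

    # 3. Appropriate duration: 15 minutes to 2 hours, inclusive.
    duration_ok = timedelta(minutes=15) <= end - start <= timedelta(hours=2)

    # 4. Every required participant is on the invite.
    all_included = set(required_participants) <= set(meeting["participants"])

    return {
        "no_double_booking": no_double_booking,
        "within_working_hours": within_hours,
        "duration_ok": duration_ok,
        "all_participants_included": all_included,
    }


def scheduling_score(checks):
    """Equal-weight fraction of satisfied constraints (an assumption)."""
    return sum(checks.values()) / len(checks)
```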
### Anti-reward-hacking penalties

- Short email (`< 20` words): **−0.30**
- Missing `meeting_details`: **−0.40**
- Generic / templated phrasing: **−0.10**
- Overly long email (`> 1500` chars): **−0.15**

These are here because GRPO will absolutely find shortcuts if you leave them open. During training the model briefly collapsed to a single short safe response. The penalties plus KL regularization fixed it cleanly.
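
A minimal sketch of how the four penalty checks could be expressed as named, individually auditable rules. The thresholds and magnitudes come from the list above. The `GENERIC_PHRASES` list and the substring-matching approach are assumptions made for illustration, since the actual generic-phrasing detector is not described here.

```python
# Hypothetical phrase list; the real detector may work differently.
GENERIC_PHRASES = (
    "i hope this email finds you well",
    "please let me know if you have any questions",
)


def fired_penalties(action):
    """Return a {penalty_name: negative_value} dict for one action."""
    reply = action.get("email_reply") or ""
    fired = {}

    # Short email: fewer than 20 whitespace-separated words.
    if len(reply.split()) < 20:
        fired["short_email"] = -0.30

    # Missing meeting_details: absent, None, or empty.
    if not action.get("meeting_details"):
        fired["missing_meeting_details"] = -0.40

    # Generic phrasing: case-insensitive substring match (assumed rule).
    if any(phrase in reply.lower() for phrase in GENERIC_PHRASES):
        fired["generic_phrasing"] = -0.10

    # Overly long email: more than 1500 characters.
    if len(reply) > 1500:
        fired["long_email"] = -0.15

    return fired
```

Returning the penalties by name, rather than a single summed number, makes it possible to log which ones fire at each training step.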

**A note on rubric design.** The reward is composed from independent scoring functions, one per dimension (email quality, scheduling correctness, conflict resolution), plus four named penalty checks. Each function returns a value in [0, 1] (or a negative penalty) and is mixed by the task-specific weighting shown in the Tasks table. Structurally this is a composable rubric. Any individual grader can be swapped, reweighted, or audited in isolation.

We checked at submission time whether OpenEnv exposes a `Rubric` base class to subclass directly. Running `from openenv import Rubric` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. The plain-Python implementation here produces the same composable, auditable behavior at the function level.
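
To make the composition concrete, here is a sketch of the mixing step. The weights are taken from the Tasks table. How penalties are folded in (added to the weighted sum, then clamped to [0, 1]) is an assumption, and `combine` is a hypothetical name rather than the function used in `server/data.py`.

```python
# Task-specific weights from the Tasks table.
TASK_WEIGHTS = {
    "easy": {"email": 0.50, "scheduling": 0.50},
    "medium": {"email": 0.30, "conflict": 0.40, "scheduling": 0.30},
    "hard": {"email": 0.34, "scheduling": 0.33, "conflict": 0.33},
}


def combine(task, grader_scores, penalties=()):
    """Weighted sum of grader scores plus penalties, clamped to [0, 1].

    grader_scores maps grader name -> score in [0, 1].
    penalties is an iterable of negative numbers (assumed to be additive).
    """
    weights = TASK_WEIGHTS[task]
    base = sum(weight * grader_scores[name] for name, weight in weights.items())
    return max(0.0, min(1.0, base + sum(penalties)))
```

Because each grader only contributes through its own key, a single grader can be swapped or reweighted by editing one entry in `TASK_WEIGHTS`.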

---

## API endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset?task=easy\|medium\|hard` | POST | Start new episode, returns observation |
| `/step` | POST | Submit action, returns observation/reward/done/info |
| `/state` | GET | Current state |
| `/tasks` | GET | List all tasks |
| `/health` | GET | Health check |
| `/metadata` | GET | Environment info |
| `/schema` | GET | Action / observation / state schemas |

Full interactive docs: https://devanshudon-exec-assist.hf.space/docs
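
A minimal standard-library client for one episode might look like the sketch below. The `/reset` query parameter and the POST methods come from the table above. Sending the action dict directly as the `/step` JSON body is an assumption, so check `/schema` or the interactive docs for the exact request shape. The helper names are hypothetical.

```python
import json
from urllib import parse, request

BASE_URL = "https://devanshudon-exec-assist.hf.space"


def reset_url(task, base_url=BASE_URL):
    """Build the /reset URL for a task ("easy", "medium" or "hard")."""
    return f"{base_url}/reset?{parse.urlencode({'task': task})}"


def post_json(url, payload=None, timeout=30):
    """POST a JSON body and decode the JSON response."""
    body = json.dumps(payload if payload is not None else {}).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(task, action, base_url=BASE_URL):
    """Reset the environment, submit one action, return both responses."""
    observation = post_json(reset_url(task, base_url))
    # Assumption: /step accepts the action object as the request body.
    result = post_json(f"{base_url}/step", action)
    return observation, result
```

Calling `run_episode("easy", action)` against a local server only requires passing `base_url="http://127.0.0.1:8000"`.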

---

## Setup & usage

### Run the environment locally

```bash
git clone https://huggingface.co/spaces/DevanshuDon/exec-assist
cd exec-assist
pip install -r requirements.txt
uvicorn server.app:app --port 8000
# open http://127.0.0.1:8000/docs
```

### Reproduce the baseline

```bash
export APIBASEURL=https://openrouter.ai/api/v1
export MODELNAME=nvidia/nemotron-3-super-120b-a12b:free
export HFTOKEN=your-openrouter-key
python inference.py
```

Expected output (structured `[START] / [STEP] / [END]` logs as required):

```
[START] task=easy env=exec-assist model=...
[STEP] step=1 action=assistant(easy) reward=0.32 done=true error=null
[END] success=false steps=1 score=0.315 rewards=0.32
```
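
Since the log lines are plain `key=value` pairs after a bracketed tag, they are easy to post-process. The parser below is a small sketch and assumes no value contains a space, which holds for the sample above but may not hold for a non-null `error` field.

```python
def parse_log_line(line):
    """Split '[TAG] k1=v1 k2=v2 ...' into ('TAG', {'k1': 'v1', ...})."""
    tag, _, rest = line.strip().partition(" ")
    # Split each pair on the first '=' only, so values may contain '='.
    fields = dict(pair.split("=", 1) for pair in rest.split())
    return tag.strip("[]"), fields


tag, fields = parse_log_line(
    "[END] success=false steps=1 score=0.315 rewards=0.32"
)
print(tag, float(fields["score"]))  # END 0.315
```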

### Run the trained model

Open `train_colab.ipynb` in Google Colab, set the runtime to T4 GPU, then choose Run All. Training takes about 90 minutes on a free T4, with evaluation on top of that. The notebook outputs `training_results.png` and `results.json`.

### Docker

```bash
docker build -t exec-assist .
docker run -p 7860:7860 exec-assist
```

---

## Training pipeline

**Stack:** TRL `GRPOTrainer` plus HuggingFace Transformers, Qwen2.5-0.5B-Instruct, free Colab T4.

**Approach:** pre-collect 90 scenarios from the deployed HF Space (30 each across easy / medium / hard) into a `Dataset`, with each scenario stored as a column. The reward function receives the scenario as a kwarg and scores deterministically without API calls during training. This decouples training from environment latency and means the loop runs at GPU speed instead of HTTP speed.
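
TRL's `GRPOTrainer` calls a custom reward function with `prompts`, `completions`, and any extra dataset columns as keyword arguments, and expects a list of floats back. The sketch below shows that shape under a few assumptions: completions arrive as plain strings, the `scenario` column holds JSON strings, and `score_action` is a hypothetical stand-in for the real graders.

```python
import json


def score_action(action, scenario):
    """Hypothetical stand-in for the deterministic graders."""
    has_reply = bool(action.get("email_reply"))
    has_details = bool(action.get("meeting_details"))
    return 0.5 * has_reply + 0.5 * has_details


def exec_assist_reward(prompts, completions, scenario, **kwargs):
    """Score each completion against its pre-collected scenario, offline."""
    rewards = []
    for completion, scenario_json in zip(completions, scenario):
        try:
            action = json.loads(completion)
        except (TypeError, ValueError):
            # Unparseable output gets the minimum reward.
            rewards.append(0.0)
            continue
        if not isinstance(action, dict):
            rewards.append(0.0)
            continue
        rewards.append(float(score_action(action, json.loads(scenario_json))))
    return rewards
```

No HTTP request appears anywhere in the function, which is the point of pre-collecting the scenarios.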

**Hyperparameters (the version that actually works):**

```python
GRPOConfig(
    learning_rate=1e-6,           # critical, 5e-6 caused collapse
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=8,            # diversity within group
    num_train_epochs=3,
    beta=0.1,                     # KL penalty, prevents mode collapse
    fp16=False, bf16=False,       # fp32 for stable gradients
    gradient_checkpointing=True,
)
```
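
These settings are consistent with the 270 steps reported above. The arithmetic below assumes a single GPU, that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` completions, and that each prompt contributes `num_generations` completions. That matches our understanding of recent TRL releases but is worth verifying against the installed version.

```python
num_prompts = 90                  # pre-collected scenarios
num_generations = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_train_epochs = 3

# Completions consumed per optimizer step on one device.
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps

# Each prompt yields num_generations completions, so one prompt per step here.
prompts_per_step = completions_per_step // num_generations

steps_per_epoch = num_prompts // prompts_per_step
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 270
```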

**The collapse and the fix.** First run (1 epoch, `lr=5e-6`, no `beta`) collapsed hard. Trained model scored exactly 0.2 on every prompt regardless of input. The model had found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), dropping the learning rate by 5×, and bumping `num_generations` from 4 to 8 produced the clean training curve shown above.
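
One way to see why a collapsed policy stops learning: GRPO scores each completion relative to the other completions for the same prompt. The sketch below uses the commonly described group-relative advantage, (reward minus group mean) divided by group standard deviation. TRL's exact implementation may differ in details such as the epsilon value, and the reward values here are invented.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Collapsed policy: all 8 completions get the same reward.
collapsed = group_advantages([0.2] * 8)

# Diverse group: some completions are clearly better than others.
diverse = group_advantages([0.2, 0.2, 0.9, 0.4, 0.2, 0.6, 0.2, 0.3])

print(collapsed)  # every advantage is 0.0, so there is nothing to reinforce
print(diverse.index(max(diverse)))  # 2, the 0.9-reward completion
```

When every completion in a group scores the same, all advantages are zero and the policy gradient carries no signal, which is why more generations per prompt and a lower learning rate help keep the group diverse.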

**Anti-reward-hacking observations during training.** GRPO went after several signals before the penalties pinned it down. It emitted JSON without an email body (caught by the short-email penalty). It proposed booking times outside working hours (caught by the scheduling check). It repeated the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty fired during early steps and disappeared as training progressed. That's what a well-designed multi-grader rubric is supposed to do.

---

## Architecture note

The environment is implemented as a FastAPI application that exposes the OpenEnv-spec endpoints (`/reset`, `/step`, `/state`, `/tasks`, `/health`, `/metadata`, `/schema`) directly. We checked whether OpenEnv exposes an `Environment` base class to subclass: running `from openenv import Environment` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. FastAPI gives us complete control over the JSON-over-HTTP interface, which is what the spec actually requires.

The client (`client.py`) does extend `openenv.EnvClient` (which IS exposed in the published package) and provides the standard Gym-style typed interface. Any code that uses an `EnvClient` to talk to this Space will work without modification. Client/server separation is preserved. The client imports typed models only, never server internals.

---

## Repository structure

```
exec-assist/
├── server/
│   ├── app.py            # FastAPI app + environment logic + landing page + blog endpoint
│   ├── models.py         # Pydantic Action/Observation/State models
│   └── data.py           # Scenario generation, scoring functions, LLM judge
├── client.py             # EnvClient wrapper (Gym-style)
├── inference.py          # Baseline inference (required, structured logs)
├── train_colab.ipynb     # GRPO training notebook
├── training_results.png  # Training curves + baseline-vs-trained
├── results.json          # Raw evaluation data + 270-step training log
├── blog_post.md          # Mini-blog write-up (also live at /blog)
├── openenv.yaml          # OpenEnv manifest
├── Dockerfile            # Python 3.10, port 7860
├── requirements.txt
└── README.md             # This file
```

---

## Notes for reviewers

A few things worth pointing out for anyone evaluating this:

- The 270-step training log in `results.json` is the actual `trainer.state.log_history` from the run that produced these results, not a curated subset.
- The `inference.py` baseline emits the structured `[START] / [STEP] / [END]` log format the rubric specifies, and reads `APIBASEURL` / `MODELNAME` / `HFTOKEN` as documented. The 0.337 average is reproducible.
- The training notebook (`train_colab.ipynb`) ships with the *working* hyperparameters, not the broken first attempt. `lr=1e-6`, `beta=0.1`, 3 epochs. Anyone re-running it on a free T4 should land within ~5% of the numbers above.
- The `Dockerfile` builds cleanly from a fresh clone (verified). Python 3.10 because `openenv-core>=0.2.0` requires it.
- GRPO loss values were logged to TensorBoard during training but weren't exported to the published `results.json` because of Colab session limits. The reward signal is the primary training metric for RLVR-style training (which GRPO is), and the variance plot in the figure above serves as a convergence diagnostic showing the policy stabilizing over time.
- Architecture decisions and tradeoffs (FastAPI-direct vs. `Environment` base class, plain Python vs. `Rubric` class) are discussed in the two architecture notes above. Both base classes were verified to not be exposed in the published `openenv-core` package at submission time.
- The training notebook ships with `report_to='wandb'` enabled for experiment tracking. Run `wandb login` once before executing the training cell to log the run to your W&B account. Loss, reward, KL, and gradient norms are all tracked there in real time. The original training session's W&B run wasn't retained due to Colab session limits, but anyone re-running the notebook will get a fresh tracked run.

---

## Author

**Devanshu** ([@DevanshuDon](https://huggingface.co/DevanshuDon)). Built for OpenEnv Hackathon, April 2026.