Spaces:
Sleeping
Sleeping
File size: 14,917 Bytes
d73a92d f611682 d73a92d f611682 cb7bf3f f611682 d73a92d f611682 d73a92d c60efab d73a92d f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 3786220 f611682 d73a92d f611682 d73a92d 3786220 d73a92d f611682 d73a92d f611682 3786220 f611682 269af96 f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 d73a92d f611682 2cce177 f611682 d73a92d f611682 63cf282 d73a92d f611682 d73a92d f611682 d73a92d f611682 3786220 f611682 269af96 f611682 d73a92d f611682 d73a92d f611682 d73a92d 7fbf775 f611682 7fbf775 f611682 7fbf775 a5018d2 d73a92d f611682 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 | ---
title: ExecAssist
emoji: π§
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- openenv
- rl
- executive-assistant
- grpo
- trl
---
# ExecAssist
> A 0.5B-parameter model, trained for 90 minutes on a free Colab T4, beats an untuned 120B-parameter frontier model on this environment by 2.4Γ. Built for the OpenEnv Hackathon (April 2026), Theme #3.2: Personalized Tasks.
ExecAssist is an OpenEnv environment where AI agents learn to manage email and calendar the way a human executive assistant does. The agent reads incoming requests, writes professional replies, finds calendar slots that don't clash, and proposes alternatives when they do. Three tasks at increasing difficulty, three independent reward graders, and four anti-reward-hacking penalties that fired during real training (we have logs).
**Live environment:** https://devanshudon-exec-assist.hf.space
**Mini-blog:** https://devanshudon-exec-assist.hf.space/blog
**Training notebook:** [`train_colab.ipynb`](./train_colab.ipynb)
**Video walkthrough:** [Watch on YouTube](https://youtu.be/iit3PgGBR4Y)
---
## π Headline result
Trained `Qwen2.5-0.5B-Instruct` with TRL GRPO for 3 epochs (270 steps, ~90 min on free Colab T4):
| Task | Baseline (untrained 0.5B) | Trained (GRPO) | Improvement |
|--------|---------------------------|----------------|-------------|
| Easy | 0.345 | **0.995** | **+188%** |
| Medium | 0.227 | **0.745** | **+228%** |
| Hard | 0.249 | **0.737** | **+196%** |
Nine out of ten samples on the easy task scored a perfect 1.0 after training. The model learned the structure of the task, not just statistics. As a separate sanity check, we ran an untuned Nemotron 120B through the standard `inference.py` baseline (via OpenRouter) and it scored 0.337 average across the same three tasks. After 90 minutes of GRPO, a model 240Γ smaller is hitting 0.83 average on the same environment.

*Top: GRPO reward over 270 steps with moving averages and quartile mean reference lines. Bottom-left: baseline vs. trained, n=10 per task, error bars are standard deviation. Bottom-right: reward variance over training as a convergence proxy. Lower variance means the policy is stabilizing.*
---
## Why this environment exists
Three specific capability gaps motivated this build.
**1. Frontier LLMs are bad at structured calendar reasoning.** Ask any production agent built on a 100B+ model to find a 30-minute slot next week that doesn't conflict with your standups and is during working hours. Watch it fail. The reasoning is short, the spec is precise, and the failure modes are interesting. ExecAssist isolates this failure mode into something tractable: the scheduling-correctness grader checks four hard constraints (no double-booking, within working hours, appropriate duration, all participants included). The trained model goes from satisfying about 25% of those to about 95%.
**2. Multi-objective rewards are where reward hacking actually happens.** A single scalar reward like "the user was happy" gets gamed in obvious ways. A weighted sum of multiple independent graders plus named penalties is much harder to game, but only if you actually verify it. We have direct evidence from the training logs that GRPO tried to hack four different reward signals: outputting JSON only with no email body, scheduling outside working hours, using generic templated phrasing, and missing meeting details entirely. Each penalty fired during early steps and disappeared as training progressed. Most submissions claim "anti-hacking penalties." Few can show them firing.
**3. Small RL'd model beats large untuned model, on a real task, in 90 minutes, on free hardware.** The 240Γ compute ratio between Qwen-0.5B and Nemotron-120B is the headline. The deeper claim is that task-specific RL with composable rewards is a real path to deploying small models on structured personal-task workflows. That's a workshop-paper-shaped argument, and ExecAssist is a clean demonstration of it.
---
## Tasks
| Task | Difficulty | Description | Reward weighting |
|------|-----------|-------------|-------------------|
| **Easy** | 1 email, clear availability | Draft polite reply, book meeting in open slot | 50% email + 50% scheduling |
| **Medium** | 1 email, calendar conflict | Identify conflict, propose 2β3 alternatives, explain professionally | 30% email + 40% conflict + 30% scheduling |
| **Hard** | 3 emails, multi-party coordination, priority conflicts | Prioritize, reschedule, notify all parties | 34% email + 33% scheduling + 33% conflict |
All scores are deterministic and bounded to [0, 1]. Scenarios are randomized at every `/reset` call.
---
## Environment design
### Observation space
```python
{
"task": "easy" | "medium" | "hard",
"description": str,
"emails": [{"sender", "subject", "body", "priority", "timestamp"}, ...],
"calendar": {
"existing_meetings": [{"id", "participants", "start_time", "end_time", "subject", "priority"}, ...],
"working_hours": {"monday": "9-17", ...},
"executive_name": str
},
"contacts": {email: {"name", "email", "timezone", "title"}, ...},
"action_required": str
}
```
### Action space
```python
{
"email_reply": str,
"calendar_action": "book" | "propose_alternatives" | "reschedule" | "decline",
"meeting_details": {
"participants": [str, ...],
"start_time": "ISO-8601",
"end_time": "ISO-8601",
"subject": str,
"location": str | None,
"proposed_alternatives": [...] | None
}
}
```
### Reward function (multiple independent graders)
| Component | Range | What it checks |
|-----------|-------|----------------|
| **Email quality** | 0β1 | Politeness markers, greeting/closing, sufficient detail (20+ words), professional tone, optional LLM-judge for nuance |
| **Scheduling correctness** | 0β1 | No double-booking, within working hours, appropriate duration (15min to 2hrs), all participants included |
| **Conflict resolution** | 0β1 | Recognizes conflicts, proposes 2β3 alternatives, explains professionally, prioritizes correctly |
### Anti-reward-hacking penalties
- Short email (`< 20` words): **β0.30**
- Missing `meeting_details`: **β0.40**
- Generic / templated phrasing: **β0.10**
- Overly long email (`> 1500` chars): **β0.15**
These are here because GRPO will absolutely find shortcuts if you leave them open. During training the model briefly collapsed to a single short safe response. The penalties plus KL regularization fixed it cleanly.
**A note on rubric design.** The reward is composed from independent scoring functions, one per dimension (email quality, scheduling correctness, conflict resolution), plus four named penalty checks. Each function returns a value in [0, 1] (or a negative penalty) and is mixed by the task-specific weighting shown in the Tasks table. Structurally this is a composable rubric. Any individual grader can be swapped, reweighted, or audited in isolation.
We checked at submission time whether OpenEnv exposes a `Rubric` base class to subclass directly. Running `from openenv import Rubric` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. The plain-Python implementation here produces the same composable, auditable behavior at the function level.
---
## API endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset?task=easy\|medium\|hard` | POST | Start new episode, returns observation |
| `/step` | POST | Submit action, returns observation/reward/done/info |
| `/state` | GET | Current state |
| `/tasks` | GET | List all tasks |
| `/health` | GET | Health check |
| `/metadata` | GET | Environment info |
| `/schema` | GET | Action / observation / state schemas |
Full interactive docs: https://devanshudon-exec-assist.hf.space/docs
---
## Setup & usage
### Run the environment locally
```bash
git clone https://huggingface.co/spaces/DevanshuDon/exec-assist
cd exec-assist
pip install -r requirements.txt
uvicorn server.app:app --port 8000
# open http://127.0.0.1:8000/docs
```
### Reproduce the baseline
```bash
export APIBASEURL=https://openrouter.ai/api/v1
export MODELNAME=nvidia/nemotron-3-super-120b-a12b:free
export HFTOKEN=your-openrouter-key
python inference.py
```
Expected output (structured `[START] / [STEP] / [END]` logs as required):
```
[START] task=easy env=exec-assist model=...
[STEP] step=1 action=assistant(easy) reward=0.32 done=true error=null
[END] success=false steps=1 score=0.315 rewards=0.32
```
### Run the trained model
Open `train_colab.ipynb` in Google Colab, set runtime to T4 GPU, then Run All. Total time around 50 minutes including evaluation. Outputs `training_results.png` and `results.json`.
### Docker
```bash
docker build -t exec-assist .
docker run -p 7860:7860 exec-assist
```
---
## Training pipeline
**Stack:** TRL `GRPOTrainer` plus HuggingFace Transformers, Qwen2.5-0.5B-Instruct, free Colab T4.
**Approach:** pre-collect 90 scenarios from the deployed HF Space (30 each across easy / medium / hard) into a `Dataset`, with each scenario stored as a column. The reward function receives the scenario as a kwarg and scores deterministically without API calls during training. This decouples training from environment latency and means the loop runs at GPU speed instead of HTTP speed.
**Hyperparameters (the version that actually works):**
```python
GRPOConfig(
learning_rate=1e-6, # critical, 5e-6 caused collapse
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
num_generations=8, # diversity within group
num_train_epochs=3,
beta=0.1, # KL penalty, prevents mode collapse
fp16=False, bf16=False, # fp32 for stable gradients
gradient_checkpointing=True,
)
```
**The collapse and the fix.** First run (1 epoch, `lr=5e-6`, no `beta`) collapsed hard. Trained model scored exactly 0.2 on every prompt regardless of input. The model had found a single safe response and stopped exploring. Adding the KL term (`beta=0.1`), dropping the learning rate by 5Γ, and bumping `num_generations` from 4 to 8 produced the clean training curve shown above.
**Anti-reward-hacking observations during training.** GRPO went after several signals before the penalties pinned it down. It outputted JSON without an email body (caught by the short-email penalty). It proposed booking times outside working hours (caught by the scheduling check). It repeated the prompt back as a "reply" (caught by the generic-phrasing detector). Each penalty fired during early steps and disappeared as training progressed. That's what a well-designed multi-grader rubric is supposed to do.
---
## Architecture note
The environment is implemented as a FastAPI application that exposes the OpenEnv-spec endpoints (`/reset`, `/step`, `/state`, `/tasks`, `/health`, `/metadata`, `/schema`) directly. We checked whether OpenEnv exposes an `Environment` base class to subclass: running `from openenv import Environment` against the published `openenv-core` package raises `ImportError`, so the class isn't available yet. FastAPI gives us complete control over the JSON-over-HTTP interface, which is what the spec actually requires.
The client (`client.py`) does extend `openenv.EnvClient` (which IS exposed in the published package) and provides the standard Gym-style typed interface. Any code that uses an `EnvClient` to talk to this Space will work without modification. Client/server separation is preserved. The client imports typed models only, never server internals.
---
## Repository structure
```
exec-assist/
βββ server/
β βββ app.py # FastAPI app + environment logic + landing page + blog endpoint
β βββ models.py # Pydantic Action/Observation/State models
β βββ data.py # Scenario generation, scoring functions, LLM judge
βββ client.py # EnvClient wrapper (Gym-style)
βββ inference.py # Baseline inference (required, structured logs)
βββ train_colab.ipynb # GRPO training notebook
βββ training_results.png # Training curves + baseline-vs-trained
βββ results.json # Raw evaluation data + 270-step training log
βββ blog_post.md # Mini-blog write-up (also live at /blog)
βββ openenv.yaml # OpenEnv manifest
βββ Dockerfile # Python 3.10, port 7860
βββ requirements.txt
βββ README.md # This file
```
---
## Notes for reviewers
A few things worth pointing out for anyone evaluating this:
- The 270-step training log in `results.json` is the actual `trainer.state.log_history` from the run that produced these results, not a curated subset.
- The `inference.py` baseline emits the structured `[START] / [STEP] / [END]` log format the rubric specifies, and reads `APIBASEURL` / `MODELNAME` / `HFTOKEN` as documented. The 0.337 average is reproducible.
- The training notebook (`train_colab.ipynb`) ships with the *working* hyperparameters, not the broken first attempt. `lr=1e-6`, `beta=0.1`, 3 epochs. Anyone re-running it on a free T4 should land within ~5% of the numbers above.
- The `Dockerfile` builds cleanly from a fresh clone (verified). Python 3.10 because `openenv-core>=0.2.0` requires it.
- GRPO loss values were logged to TensorBoard during training but weren't exported to the published `results.json` because of Colab session limits. The reward signal is the primary training metric for RLVR-style training (which GRPO is), and the variance plot in the figure above serves as a convergence diagnostic showing the policy stabilizing over time.
- Architecture decisions and tradeoffs (FastAPI-direct vs. `Environment` base class, plain Python vs. `Rubric` class) are discussed in the two architecture notes above. Both base classes were verified to not be exposed in the published `openenv-core` package at submission time.
- The training notebook ships with report_to='wandb' enabled for experiment tracking. Run wandb login once before executing the training cell to log the run to your W&B account. Loss, reward, KL, and gradient norms are all tracked there in real time. The original training session's W&B run wasn't retained due to Colab session limits, but anyone re-running the notebook will get a fresh tracked run.
---
## Author
**Devanshu** ([@DevanshuDon](https://huggingface.co/DevanshuDon)). Built for OpenEnv Hackathon, April 2026.
|