# AutoDataLab++: Teaching a Chief of Staff to Know When to Stop

**Team:** Peaky Blinders - Sai Kamal Nannuri ([Uchihakamal1816](https://github.com/Uchihakamal1816)) & Ajeet Kumar ([ajeetkartikay](https://github.com/ajeetkartikay))
**Live demo:** [uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
**Code:** [github.com/Uchihakamal1816/AutoDataLab-](https://github.com/Uchihakamal1816/AutoDataLab-) (the trailing dash is the actual repo slug)

---

## TL;DR

We built an OpenEnv environment where a **Chief of Staff (CoS)** policy must route work across four specialists (Data Analyst, Finance, Strategy, HR) and **submit a complete CEO brief inside a step budget**. Skipping an expert leaves the brief incomplete and the grader penalises it; over-consulting wastes the step budget and the shaped reward penalises that too. Either error hurts. The policy has to learn the *capability gap* between knowing facts and knowing when it has enough.

We trained a **2-layer MLP routing policy with REINFORCE** (600 episodes, lr 0.003) and benchmarked it against an untrained baseline and a handcoded oracle (fallback off, 5 seeds, 3 hard tasks).

![Headline result: terminal reward with fallback disabled, base vs trained MLP CoS vs oracle upper bound](training/evidence/plots/headline_terminal_reward.png)

| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief`   | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |

*Cells show RAG-off / RAG-on. Numbers are terminal grader scores in `[0, 1]`, mean over 5 seeds, with the env's expert-auto-fill safety net OFF (`eval_mode=true`). Raw runs in [`training/evidence/headline_benchmark.json`](training/evidence/headline_benchmark.json); reproduce with `python3 training/scripts/run_headline_benchmark.py`.*

The trained MLP closes **~75 %** of the base→oracle gap (+0.46 absolute reward over the naive baseline). The remaining ~0.15 of headroom is what a future GRPO/SFT LLM run can chase. We also wired up a Qwen2.5-1.5B SFT/DPO/GRPO pipeline as Experiment B; we treat it as a working pipeline, not a converged result, and we're explicit about that throughout.
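The gap-closure figure follows from the RAG-off column of the table. A minimal sketch of the arithmetic, using the rounded table values (the helper name `gap_closed` is ours, for illustration only):

```python
def gap_closed(base: float, trained: float, oracle: float) -> float:
    """Fraction of the base-to-oracle gap recovered by the trained policy."""
    return (trained - base) / (oracle - base)


# Rounded RAG-off values from the headline table.
base, trained, oracle = 0.27, 0.73, 0.88

print(f"absolute gain over base: {trained - base:.2f}")                      # 0.46
print(f"fraction of gap closed:  {gap_closed(base, trained, oracle):.0%}")   # 75%
print(f"remaining headroom:      {oracle - trained:.2f}")                    # 0.15
```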

---

## Why this is interesting

Most LLM benchmarks test single-turn reasoning. Real enterprise work isn't like that. A CEO doesn't ask one question; they ask *"look at the data, project the quarter, recommend an action, and draft the memo."* That's a **routing problem**: which expert, in what order, with what info, and, critically, when to stop.

Next-token training rewards plausible-sounding text. **It does not reward stopping.** That's why a prompted-only LLM tends to either spam "consult strategy" or invent a plausible answer instead of asking HR. Our hypothesis is that a *small* policy trained with RL on the routing decision can outperform a much larger one that's just prompted to orchestrate, because the small one is being shaped exactly on the bit that matters.

So we made an environment where that bit *is* the reward, and where ignoring HR doesn't get hidden behind a polished sentence.

---

## The environment

Six tasks; the headline benchmark uses three of the hard ones (`hard_brief`, `expert_brief`, `crisis_brief`):

| Task | Difficulty | Required experts |
|---|---|---|
| `easy_brief`    | easy   | analyst, finance, hr |
| `medium_brief`  | medium | analyst, finance, strategy, hr |
| `hard_brief`    | hard   | analyst, finance, strategy, hr |
| `expert_brief`  | hard   | analyst, finance, strategy, hr |
| `risk_brief`    | hard   | analyst, finance, strategy, hr |
| `crisis_brief`  | hard   | analyst, finance, strategy, hr |

The CoS has a small discrete action space: `consult(expert)`, `ask(expert, sub_question)` (a focused consult), `summarize`, `submit`, `noop`. Episodes max out at 12–14 steps. The reward is shaped:

- **+0.10** for a productive first consult of a required expert
- **+0.02** for any other useful action (e.g. summarising once)
- **−0.05** for repeating a consult, **−0.10** for a third+ consult of the same expert (`shaping="strict"`)
- **−0.05** for summarising before required experts have reported
- **−0.02** per step (efficiency pressure)
- **+ terminal grader score** in `(0.001, 0.999)` on submit
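The shaping terms above can be sketched as a single per-step function. This is a toy reading of the list, not the env's actual code: the function name, its arguments, the additive combination of terms, and the treatment of a first consult of a non-required expert as "other useful action" are all assumptions.

```python
REQUIRED = {"analyst", "finance", "strategy", "hr"}
STEP_COST = -0.02  # efficiency pressure, charged on every step


def shaped_reward(action, expert, consult_counts, required=REQUIRED,
                  summarized=False, grader_score=0.0):
    """Toy per-step reward.

    consult_counts maps expert -> number of consults BEFORE this action.
    grader_score is only used when action == "submit".
    """
    reward = STEP_COST
    if action == "consult":
        prior = consult_counts.get(expert, 0)
        if prior == 0:
            # Productive first consult; non-required experts get the small bonus (assumption).
            reward += 0.10 if expert in required else 0.02
        elif prior == 1:
            reward += -0.05  # second consult of the same expert
        else:
            reward += -0.10  # third or later consult (strict shaping)
    elif action == "summarize":
        all_reported = all(consult_counts.get(e, 0) > 0 for e in required)
        if not all_reported:
            reward += -0.05  # summarising too early
        elif not summarized:
            reward += 0.02   # first useful summary
    elif action == "submit":
        reward += grader_score
    return reward


print(round(shaped_reward("consult", "hr", {}), 2))                  # 0.08
print(round(shaped_reward("consult", "strategy", {"strategy": 2}), 2))  # -0.12
```

Under this reading, a repeated consult is net-negative once the step cost is included, which is the pressure that teaches the policy to stop.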

The four specialists are real subenvs:

- **Data Analyst**: wraps `subenvs/autodatalab/analytics.py` (cleaning, KPIs, revenue derivation, data-quality scoring).
- **Finance**: deterministic projection, variance vs plan, break-even sensitivity.
- **Strategy**: 3-bullet operating plan with citations to upstream numbers; emits structured `Present / Future` buy-sell-hold-trim guidance on the NVDA / AAPL / JPM watchlist.
- **HR / Comms**: wraps `subenvs/email/`; drafts memos and scores them on a blended structure + tone + audience rubric.

Each specialist returns a typed `ExpertReport`. The CoS composes them into a final `Brief`. Schemas live in `ceo_brief_env/models.py`.

### The honest-evaluation flag

The env ships with a *production* mode that auto-completes any missing required expert before grading, so end-users always see a full brief in the UI. That's a nice product, but it lies to a researcher: a policy that never thought to consult HR still gets HR's report stapled in.

So `/reset` and `/visualize/run` both accept **`eval_mode=true`**. When set, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is *allowed*, the grader sees the gap, and the terminal score reflects the policy's own routing competence. **All headline numbers in this post are reported in this regime.** It's also our headline contribution as an environment: we believe this is the first OpenEnv env whose reward explicitly punishes the LLM for not knowing when to stop.
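The difference between the two regimes can be shown with a toy model of what reaches the grader. The function below is hypothetical and only mirrors the behaviour described above; the env's real internals are not shown in this post.

```python
REQUIRED = ("analyst", "finance", "strategy", "hr")


def experts_seen_by_grader(consulted, auto_fill_required):
    """Return the required experts whose reports reach the grader."""
    if auto_fill_required:
        # Production mode: missing required experts are filled in before grading.
        return list(REQUIRED)
    # eval_mode=true: the grader sees only what the policy actually consulted.
    return [e for e in REQUIRED if e in consulted]


policy_consulted = {"analyst", "finance", "strategy"}  # never asked HR

print(experts_seen_by_grader(policy_consulted, auto_fill_required=True))
# ['analyst', 'finance', 'strategy', 'hr']
print(experts_seen_by_grader(policy_consulted, auto_fill_required=False))
# ['analyst', 'finance', 'strategy']
```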

---

## What we trained: Experiment A (the headline number)

The trained model in the headline is **a 2-layer MLP routing policy with REINFORCE**, trained on featurised observations of the env. Code: [`training/scripts/train_cos_local.py`](training/scripts/train_cos_local.py). Checkpoint: [`training/checkpoints/cos_final.pt`](training/checkpoints/cos_final.pt).

- **Input features**: a small fixed-length vector β€” task one-hot, expert-consulted bits, current step, data-quality score, whether each required expert has reported.
- **Output**: logits over the discrete action space (consult-each-expert, ask-each-expert, summarize, submit, noop).
- **Algorithm**: REINFORCE with a small entropy bonus, batched over 600 episodes, lr 0.003.
- **Training time**: ~5 minutes on CPU.
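For readers new to REINFORCE, here is a dependency-free sketch of the update. It is not the code in `train_cos_local.py`: it swaps the 2-layer PyTorch MLP for a linear softmax policy, omits the entropy bonus, and uses made-up features and rewards, so treat every name in it as illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def returns_to_go(rewards, gamma=1.0):
    """Discounted return G_t for every step of one episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def action_probs(weights, features):
    """Linear softmax policy: one weight row per action."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)


def reinforce_update(weights, episode, lr=0.003):
    """Gradient-ascent step on sum_t G_t * log pi(a_t | s_t), applied in place.

    episode: list of (features, action_index, return_to_go) tuples.
    """
    grads = [[0.0] * len(row) for row in weights]
    for features, action, ret in episode:
        probs = action_probs(weights, features)
        for a in range(len(weights)):
            # d log pi(action) / d logit_a = 1[a == action] - pi(a)
            grad_logit = (1.0 if a == action else 0.0) - probs[a]
            for i, f in enumerate(features):
                grads[a][i] += ret * grad_logit * f
    for row, grad_row in zip(weights, grads):
        for i, g in enumerate(grad_row):
            row[i] += lr * g


# Toy episode: 3 actions, 2 features, action 0 taken three times with positive return.
weights = [[0.0, 0.0] for _ in range(3)]
features = [1.0, 0.0]
before = action_probs(weights, features)[0]
episode = [(features, 0, g) for g in returns_to_go([0.08, 0.08, 0.88])]
reinforce_update(weights, episode)
after = action_probs(weights, features)[0]
print(after > before)  # True: the rewarded action became more likely
```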

We chose this as the headline trained model because it's **honest evidence that the env is trainable end-to-end**: the policy improves from random to near-oracle performance under shaped reward. Before-vs-after numbers, evaluated under the *production* env (safety net ON, all 6 tasks):

| Task | Before training | After training |
|---|---|---|
| `easy_brief`    | 0.451 | 0.866 |
| `medium_brief`  | 0.467 | 0.887 |
| `hard_brief`    | 0.399 | 0.883 |
| `expert_brief`  | 0.523 | 0.883 |
| `risk_brief`    | 0.283 | 0.868 |
| `crisis_brief`  | 0.307 | 0.881 |
| **mean**        | **0.405** | **0.878** |

(Source: [`training/reward_curves/before_after.json`](training/reward_curves/before_after.json). Curve in [`training/reward_curves/reward_curve.png`](training/reward_curves/reward_curve.png).)

The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks) lands at **~0.73 terminal reward**. Both numbers are real: same model, two regimes. The 0.88-with-safety-net number tells you *the brief content is correct when the policy gets all four experts on board*; the 0.73-without-safety-net number tells you *the policy gets ~3.7 of the 4 required experts right on its own*. We report the second one as the headline because it's the one that doesn't depend on the env helping.

---

## Experiment B: Qwen2.5-1.5B routing pipeline (working, not converged)

We also built a full HF-TRL pipeline to train **Qwen2.5-1.5B-Instruct** as the CoS using SFT, DPO, and a thin custom GRPO loop on top of HF transformers + PEFT (LoRA).

**Pipeline:** [`training/scripts/kaggle_run_all_1p5b_experiments.py`](training/scripts/kaggle_run_all_1p5b_experiments.py)
- SFT on demonstration trajectories (`trl.SFTTrainer`)
- DPO on chosen-vs-rejected routing pairs (`trl.DPOTrainer`)
- GRPO + RLVR on top of the SFT adapter, with an oracle log-prob anchor (`--sft-anchor`) to prevent collapse
- Optional GRPO and PPO variants for ablation
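As background on the GRPO step: the core idea is to score each sampled rollout against the other rollouts drawn for the same prompt. A minimal sketch of that group-relative advantage follows; our custom loop may differ in details (for example, this sketch uses the population standard deviation, and the function name is illustrative).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one task: two incomplete briefs, two complete ones.
advantages = group_relative_advantages([0.27, 0.88, 0.88, 0.27])
print([round(a, 2) for a in advantages])  # [-1.0, 1.0, 1.0, -1.0]
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is one reason a short run on a low-variance routing task can look flat.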

**Why we don't put this in the headline.** Free-tier compute capped us at ~70 GRPO update steps. That's enough to verify the loop runs end-to-end and that the SFT + GRPO trajectories converge to the canonical routing pattern; it is **not** enough to claim a converged RL policy. The metric streams ([`training/evidence/rl_training_metrics/`](training/evidence/rl_training_metrics/)) reflect that: reward is flat-ish around zero with high variance, which is what a 70-step run on a 4-action problem should look like.

What the pipeline *did* deliver, evaluated under the production env:

| Method | Routes all 4 required experts? | Needed env fallback? | Terminal score (RAG off / on) |
|---|---|---|---|
| DPO-only             | 3 / 4 (HR missing)        | yes, every run | 0.882 / 0.892 |
| SFT                  | 4 / 4 | no  | 0.882 / 0.892 |
| SFT + DPO            | 4 / 4 | no  | 0.882 / 0.892 |
| SFT + GRPO + RLVR    | 4 / 4 | no  | 0.882 / 0.892 |

(Source: per-method `evidence.json` files under [`training/evidence/`](training/evidence/). Decoded at low temperature, so the rollouts are deterministic; they show the policy *route*, not stochastic variance. The headline benchmark uses 5 seeds for that.)

The interesting line is the **DPO-only row**: the policy converges to "consult analyst → finance → strategy → strategy → strategy → strategy" and never picks HR. The env's safety net then auto-completes HR before grading, so the *terminal* score still looks fine, but the *trajectory* tells you the policy hasn't actually solved the routing problem. That gap between "score" and "behaviour" is exactly why we added `eval_mode=true`: under that flag the DPO row would collapse to ~0.27, the same as the naive baseline, and the SFT/GRPO row would stay near 0.88.
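Checking the trajectory rather than the score is easy to automate. A small sketch, assuming the route is available as a plain list of expert names (the helper `audit_route` is hypothetical, not part of the repo):

```python
from collections import Counter

REQUIRED = {"analyst", "finance", "strategy", "hr"}


def audit_route(consults, required=REQUIRED):
    """Return (missing required experts, experts consulted more than once)."""
    counts = Counter(consults)
    missing = sorted(required - set(counts))
    repeated = {expert: n for expert, n in counts.items() if n > 1}
    return missing, repeated


dpo_route = ["analyst", "finance", "strategy", "strategy", "strategy", "strategy"]
missing, repeated = audit_route(dpo_route)
print(missing)   # ['hr']
print(repeated)  # {'strategy': 4}
```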

If a judge wants the LLM-driven version of the headline benchmark, point any LoRA adapter at [`inference.py`](inference.py) (or load it inside [`training/scripts/kaggle_three_llms_text.py`](training/scripts/kaggle_three_llms_text.py)) and re-run with `eval_mode=true`. The same script that produced our headline plot will slot the LLM into a 5th bar between `trained_mlp` and `oracle_router`.

---

## RAG, and why it's the right fit for this env

The `memory/` package contains a small corpus of company artifacts: SOPs (data quality, finance forecasting, comms style, strategy playbook), policies (compliance constraints), and history (last-quarter review, exemplar memos). The retriever is **BM25 over the markdown corpus** β€” no neural embeddings. We chose that on purpose:

1. **It externalises domain knowledge from the policy.** The CoS doesn't memorise SOPs; it just learns *when to invoke the expert that knows them*. That keeps the action space small and the RL credit assignment clean.
2. **It's deterministic.** RL rollouts need reproducibility. Hallucinated specifics ruin reward signal. Pinning experts to retrieved company text makes the brief auditable and the reward consistent across seeds.
3. **It separates *what* to say from *who* should say it.** The policy learns the latter; the corpus owns the former. New SOPs ship without retraining.
4. **It gives the policy a measurable, exploitable axis.** From the headline benchmark, RAG-on is **+0.013 terminal reward** for both `trained_mlp` and `oracle_router`, across all 3 hard tasks. The naive baseline gets +0.057 from RAG, but it's still capped at ~0.32 because the brief is incomplete β€” RAG doesn't fix bad routing.
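To make the determinism point concrete, here is a self-contained Okapi BM25 scorer. The repo uses the `rank-bm25` package; this sketch reimplements the scoring from scratch (with the `+1` inside the log, which keeps IDF non-negative) so it runs without dependencies, and the file names and texts are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus standing in for the markdown files under memory/.
corpus = {
    "sop_finance.md": "forecast revenue variance against plan each quarter",
    "sop_comms.md": "memo tone should match the audience",
    "policy_compliance.md": "compliance constraints for external statements",
}
names = list(corpus)
docs = [text.split() for text in corpus.values()]
scores = bm25_scores("revenue forecast".split(), docs)
best = names[max(range(len(names)), key=scores.__getitem__)]
print(best)  # sop_finance.md
```

The same query against the same corpus always returns the same ranking, which is the property the RL rollouts rely on.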

RAG isn't a magic boost here. It's a steady, citable lift of roughly 0.01 terminal reward (about one point on a 100-point scale) that the policy can learn to prefer when it's available, which is exactly what you want when the policy is small.

---

## Honest known gaps

We want the calibration to be visible:

- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run. We do *not* put GRPO in the headline.
- **Per-method evidence rollouts are deterministic.** They demonstrate the routing pattern under each training method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** Production `/reset` defaults to `eval_mode=false` so the demo UI always shows a complete brief. The policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence. Both modes are documented and reproducible.
- **`memory/` corpus is intentionally small.** RAG is a working axis, not a knowledge-base experiment. Scaling the corpus is straightforward (drop more markdown into `memory/`); it just isn't part of this submission.
- **Oracle is an upper bound, not a model.** It is the handcoded canonical-order policy. We label it that way in the plot, the table, and the JSON schema. It's there so judges can see how much routing headroom remains.

---

## What we'd do with more time

- **Path-quality reward.** The current grader is content-driven; a routing-efficiency component (penalising extra steps even when the brief is complete) would steepen the gradient for RL.
- **Multi-turn refinement.** Today each expert returns once. Letting the CoS re-query an expert with a refined sub-question opens up real planning behaviour; the `ask(expert, sub_question)` action exists for this and is partially wired up.
- **Adversarial tasks.** A task where one expert lies (data quality is bad but Analyst doesn't catch it) would force the CoS to learn validation routing.
- **Longer Qwen RL.** The pipeline is ready; we just need a few hours of free-tier GPU to push GRPO past the smoke-run regime.
- **Cross-org RAG.** BM25 over a single repo is the simplest case; a per-tenant retriever would let the same env sit in front of any company's SOP corpus.

---

## The stack

- **OpenEnv**: environment spec, validation (`openenv validate` passes on all 6 tasks), deployment.
- **Qwen2.5-1.5B-Instruct**: base policy model for Experiment B.
- **Hugging Face TRL**: `SFTTrainer` and `DPOTrainer` for Experiment B; thin custom GRPO/PPO loops on top of HF transformers + PEFT (LoRA) for the RL ablations.
- **PyTorch**: the MLP routing policy in Experiment A.
- **FastAPI + uvicorn**: serving (port 7860, matches `openenv.yaml`).
- **Hugging Face Spaces (Docker)**: deployment.
- **Pydantic**: typed action / observation / report / brief schemas.
- **BM25 (rank-bm25)**: RAG retriever. Deterministic, fast, no embeddings.

---

## Try it in 30 seconds

```bash
# Smoke-test the live deployment
curl https://uchihamadara1816-autodatalab2-0.hf.space/health
# {"status":"healthy"}

# Honest-evaluation episode (fallback OFF, the headline regime)
curl -X POST https://uchihamadara1816-autodatalab2-0.hf.space/reset \
     -H 'content-type: application/json' \
     -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'

# Or open the office UI
# https://uchihamadara1816-autodatalab2-0.hf.space/ui/
```
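The same reset call can be made from Python with only the standard library. This sketch builds the request with the payload from the curl example and leaves the network call commented out; the helper name is ours, and the response schema is not documented in this post, so nothing is assumed about it.

```python
import json
import urllib.request

BASE_URL = "https://uchihamadara1816-autodatalab2-0.hf.space"


def build_reset_request(task, use_rag=True, eval_mode=True, base_url=BASE_URL):
    """Build (but do not send) the POST /reset request shown in the curl example."""
    payload = {"task": task, "use_rag": use_rag, "eval_mode": eval_mode}
    return urllib.request.Request(
        f"{base_url}/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reset_request("hard_brief")
print(req.get_method(), req.full_url)

# To actually send it (needs network access):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     body = json.loads(resp.read())
```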

In the UI: pick `crisis_brief`, toggle RAG on, run **`MLP trained CoS`** then **`naive baseline`** side-by-side. The trained MLP lights up all four office boxes and submits cleanly; the naive baseline stops after the analyst and ships an incomplete brief. That's the headline capability gap, visible in 30 seconds.

To reproduce the full headline plot locally:

```bash
git clone https://github.com/Uchihakamal1816/AutoDataLab-.git autodatalab-plus
cd autodatalab-plus && pip install -e .
python3 training/scripts/run_headline_benchmark.py
# writes training/evidence/headline_benchmark.json
# writes training/evidence/plots/headline_terminal_reward.{png,svg}
```

---

## Acknowledgments

Built for the OpenEnv hackathon. Thanks to the OpenEnv team for the spec and the validation tooling, Hugging Face for the Spaces hosting and the TRL stack, and Kaggle for the free GPU minutes that ran our SFT and short GRPO passes.