AutoDataLab++: Teaching a Chief of Staff to Know When to Stop
Team: Peaky Blinders - Sai Kamal Nannuri (Uchihakamal1816) & Ajeet Kumar (ajeetkartikay)
Live demo: uchihamadara1816-autodatalab2-0.hf.space/ui/
Code: github.com/Uchihakamal1816/AutoDataLab- (the trailing dash is the actual repo slug)
TL;DR
We built an OpenEnv environment where a Chief of Staff (CoS) policy must route work across four specialists (Data Analyst, Finance, Strategy, HR) and submit a complete CEO brief inside a step budget. Skipping an expert leaves the brief incomplete and the grader penalises it; over-consulting wastes the step budget and the shaped reward penalises that too. Either error hurts. The policy has to learn the capability gap between knowing facts and knowing when it has enough.
We trained a 2-layer MLP routing policy with REINFORCE (600 episodes, lr 0.003) and benchmarked it against an untrained baseline and a handcoded oracle, with fallback off, over 5 seeds and 3 hard tasks.
| Task | Base (naive) | Trained MLP CoS (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |
Cells show RAG-off / RAG-on. Numbers are terminal grader scores in [0, 1], mean over 5 seeds, with the env's expert-auto-fill safety net OFF (eval_mode=true). Raw runs in training/evidence/headline_benchmark.json; reproduce with python3 training/scripts/run_headline_benchmark.py.
The trained MLP closes ~75% of the base-to-oracle gap (+0.46 absolute terminal score over the naive baseline, RAG off). The remaining ~0.15 of headroom is what a future GRPO/SFT LLM run can chase. We also wired up a Qwen2.5-1.5B SFT/DPO/GRPO pipeline as Experiment B; we treat it as a working pipeline, not a converged result, and we're explicit about that throughout.
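The gap-closure figure follows directly from the RAG-off means in the headline table. A quick check, using only those reported numbers (the helper name `gap_closed` is ours, for illustration):

```python
def gap_closed(base: float, trained: float, oracle: float) -> float:
    """Fraction of the base-to-oracle gap recovered by the trained policy."""
    return (trained - base) / (oracle - base)


# RAG-off means from the headline table (identical across the 3 hard tasks).
base, trained, oracle = 0.27, 0.73, 0.88

print(f"absolute gain: {trained - base:.2f}")
print(f"gap closed:    {gap_closed(base, trained, oracle):.0%}")
print(f"headroom left: {oracle - trained:.2f}")
```

Plugging in the RAG-on means instead gives a slightly different ratio, so the ~75% figure should be read as the RAG-off number.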
Why this is interesting
Most LLM benchmarks test single-turn reasoning. Real enterprise work isn't like that. A CEO doesn't ask one question; they ask "look at the data, project the quarter, recommend an action, and draft the memo." That's a routing problem: which expert, in what order, with what info, and, critically, when to stop.
Next-token training rewards plausible-sounding text. It does not reward stopping. That's why a prompted-only LLM tends to either spam "consult strategy" or invent a plausible answer instead of asking HR. Our hypothesis is that a small policy trained with RL on the routing decision can outperform a much larger one that's just prompted to orchestrate, because the small one is being shaped exactly on the bit that matters.
So we made an environment where that bit is the reward, and where ignoring HR doesn't get hidden behind a polished sentence.
The environment
Six tasks, three of them on the critical path:
| Task | Difficulty | Required experts |
|---|---|---|
| `easy_brief` | easy | analyst, finance, hr |
| `medium_brief` | medium | analyst, finance, strategy, hr |
| `hard_brief` | hard | analyst, finance, strategy, hr |
| `expert_brief` | hard | analyst, finance, strategy, hr |
| `risk_brief` | hard | analyst, finance, strategy, hr |
| `crisis_brief` | hard | analyst, finance, strategy, hr |
The CoS has a small discrete action space: consult(expert), ask(expert, sub_question) (a focused consult), summarize, submit, noop. Episodes max out at 12-14 steps. The reward is shaped:
- +0.10 for a productive first consult of a required expert
- +0.02 for any other useful action (e.g. summarising once)
- −0.05 for repeating a consult, −0.10 for a third+ consult of the same expert (`shaping="strict"`)
- −0.05 for summarising before required experts have reported
- −0.02 per step (efficiency pressure)
- + terminal grader score in `(0.001, 0.999)` on submit
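The shaping terms above can be sketched as a single step-reward function. This is a minimal illustration, not the env's actual code: the function name, its arguments, and the assumption that the terms simply add up (with the per-step penalty applied on every step) are ours. The `ask` and `noop` actions receive only the step penalty here, because the list does not say how they are shaped.

```python
REQUIRED = frozenset({"analyst", "finance", "strategy", "hr"})


def step_reward(action, expert=None, *, prior_consults=0, reported=frozenset(),
                already_summarized=False, grader_score=None, required=REQUIRED):
    """Shaped reward for one step under shaping="strict" (illustrative only)."""
    reward = -0.02  # efficiency pressure, assumed to apply on every step
    if action == "consult":
        if prior_consults == 0 and expert in required:
            reward += 0.10   # productive first consult of a required expert
        elif prior_consults == 1:
            reward -= 0.05   # repeated consult
        elif prior_consults >= 2:
            reward -= 0.10   # third or later consult of the same expert
    elif action == "summarize":
        if not required <= set(reported):
            reward -= 0.05   # summarising before all required experts reported
        elif not already_summarized:
            reward += 0.02   # "other useful action", e.g. summarising once
    elif action == "submit":
        if grader_score is None:
            raise ValueError("submit needs a terminal grader score")
        reward += min(max(grader_score, 0.001), 0.999)  # clamp to the stated range
    return reward


# Shaped reward for a route that consults strategy three times and never reaches HR.
route = ["analyst", "finance", "strategy", "strategy", "strategy"]
seen = {}
total = 0.0
for expert in route:
    total += step_reward("consult", expert, prior_consults=seen.get(expert, 0))
    seen[expert] = seen.get(expert, 0) + 1
print(round(total, 2))
```

Under these assumptions the repeated strategy consults cancel most of the reward earned by the first three productive consults, which is the pressure against over-consulting described above.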
The four specialists are real subenvs:
- Data Analyst: wraps `subenvs/autodatalab/analytics.py` for cleaning, KPIs, revenue derivation, and data-quality scoring.
- Finance: deterministic projection, variance vs plan, break-even sensitivity.
- Strategy: 3-bullet operating plan with citations to upstream numbers; emits structured `Present / Future` buy-sell-hold-trim guidance on the NVDA / AAPL / JPM watchlist.
- HR / Comms: wraps `subenvs/email/`; drafts memos and scores them on a blended structure + tone + audience rubric.
Each specialist returns a typed ExpertReport. The CoS composes them into a final Brief. Schemas live in ceo_brief_env/models.py.
The honest-evaluation flag
The env ships with a production mode that auto-completes any missing required expert before grading, so end-users always see a full brief in the UI. That's a nice product, but it lies to a researcher: a policy that never thought to consult HR still gets HR's report stapled in.
So /reset and /visualize/run both accept eval_mode=true. When set, the env runs with auto_fill_required=False and shaping="strict". Submitting an incomplete brief is allowed, the grader sees the gap, and the terminal score reflects the policy's own routing competence. All headline numbers in this post are reported in this regime. It's also our headline contribution as an environment: we believe this is the first OpenEnv env whose reward explicitly punishes the LLM for not knowing when to stop.
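The difference between the two regimes can be pictured as one branch that runs just before grading. The sketch below is hypothetical: `finalize_reports`, the `run_expert` callback, and the dict-of-strings report shape are illustrative stand-ins, and only the `auto_fill_required` flag name comes from the env.

```python
from typing import Callable, Dict, Iterable


def finalize_reports(reports: Dict[str, str], required: Iterable[str],
                     auto_fill_required: bool,
                     run_expert: Callable[[str], str]) -> Dict[str, str]:
    """Return the reports the grader sees; only production mode fills gaps."""
    final = dict(reports)
    if auto_fill_required:
        for expert in required:
            if expert not in final:
                final[expert] = run_expert(expert)  # safety net: the env consults for the policy
    return final


required = ["analyst", "finance", "strategy", "hr"]
# A policy that never consulted HR.
policy_reports = {"analyst": "analyst report", "finance": "finance report",
                  "strategy": "strategy report"}


def auto_fill(expert: str) -> str:
    return f"auto-filled {expert} report"


production = finalize_reports(policy_reports, required, True, auto_fill)
honest = finalize_reports(policy_reports, required, False, auto_fill)
missing = [e for e in required if e not in honest]
print("production sees:", sorted(production))
print("eval_mode is missing:", missing)
```

In the production branch the grader cannot tell that HR was never consulted; in the strict branch the gap stays visible, which is the property the headline numbers rely on.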
What we trained: Experiment A (the headline number)
The trained model in the headline is a 2-layer MLP routing policy with REINFORCE, trained on featurised observations of the env. Code: training/scripts/train_cos_local.py. Checkpoint: training/checkpoints/cos_final.pt.
- Input features: a small fixed-length vector containing the task one-hot, expert-consulted bits, current step, data-quality score, and whether each required expert has reported.
- Output: logits over the discrete action space (consult-each-expert, ask-each-expert, summarize, submit, noop).
- Algorithm: REINFORCE with a small entropy bonus, batched over 600 episodes, lr 0.003.
- Training time: ~5 minutes on CPU.
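The observation encoding and the training objective described in this list can be sketched in plain Python. Everything here is an assumption made for illustration: the feature order, the `consult:`/`ask:` action naming, the step normalisation, `entropy_coef=0.01`, and `gamma=1.0` are not taken from `train_cos_local.py`. The loss function only evaluates the REINFORCE surrogate numerically; in PyTorch the same expression would be built from tensors so autograd can differentiate it.

```python
import math

TASKS = ["easy_brief", "medium_brief", "hard_brief",
         "expert_brief", "risk_brief", "crisis_brief"]
EXPERTS = ["analyst", "finance", "strategy", "hr"]
ACTIONS = ([f"consult:{e}" for e in EXPERTS] + [f"ask:{e}" for e in EXPERTS]
           + ["summarize", "submit", "noop"])


def featurise(task, consulted, reported, step, max_steps, data_quality):
    """Fixed-length observation vector (layout is an assumption)."""
    task_onehot = [1.0 if task == t else 0.0 for t in TASKS]
    consulted_bits = [1.0 if e in consulted else 0.0 for e in EXPERTS]
    reported_bits = [1.0 if e in reported else 0.0 for e in EXPERTS]
    return task_onehot + consulted_bits + reported_bits + [step / max_steps, data_quality]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_loss(step_logits, actions, rewards, entropy_coef=0.01, gamma=1.0):
    """One-episode surrogate: -sum(log pi(a_t) * G_t) - entropy_coef * sum(H_t)."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # reward-to-go, computed back to front
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    loss = 0.0
    for logits, a, g_t in zip(step_logits, actions, returns):
        probs = softmax(logits)
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        loss += -math.log(probs[a]) * g_t - entropy_coef * entropy
    return loss


obs = featurise("hard_brief", {"analyst"}, {"analyst"}, step=3, max_steps=12, data_quality=0.8)
print(len(obs), len(ACTIONS))
```

With this layout the policy maps a 16-dimensional input to 11 logits, which is small enough that CPU-only training in minutes is plausible.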
We chose this as the headline trained model because it's honest evidence that the env is trainable end-to-end: the policy improves from random to near-oracle performance under shaped reward. Before-vs-after numbers, evaluated under the production env (safety net ON, all 6 tasks):
| Task | Before training | After training |
|---|---|---|
| `easy_brief` | 0.451 | 0.866 |
| `medium_brief` | 0.467 | 0.887 |
| `hard_brief` | 0.399 | 0.883 |
| `expert_brief` | 0.523 | 0.883 |
| `risk_brief` | 0.283 | 0.868 |
| `crisis_brief` | 0.307 | 0.881 |
| mean | 0.405 | 0.878 |
(Source: training/reward_curves/before_after.json. Curve in training/reward_curves/reward_curve.png.)
The same checkpoint, evaluated under the honest regime (safety net OFF, 5 seeds, 3 hard tasks) lands at ~0.73 terminal reward. Both numbers are real: same model, two regimes. The 0.88-with-safety-net number tells you the brief content is correct when the policy gets all four experts on board; the 0.73-without-safety-net number tells you the policy gets ~3.7 of the 4 required experts right on its own. We report the second one as the headline because it's the one that doesn't depend on the env helping.
Experiment B: Qwen2.5-1.5B routing pipeline (working, not converged)
We also built a full HF-TRL pipeline to train Qwen2.5-1.5B-Instruct as the CoS using SFT, DPO, and a thin custom GRPO loop on top of HF transformers + PEFT (LoRA).
Pipeline: training/scripts/kaggle_run_all_1p5b_experiments.py
- SFT on demonstration trajectories (`trl.SFTTrainer`)
- DPO on chosen-vs-rejected routing pairs (`trl.DPOTrainer`)
- GRPO + RLVR on top of the SFT adapter, with an oracle log-prob anchor (`--sft-anchor`) to prevent collapse
- Optional GRPO and PPO variants for ablation
Why we don't put this in the headline. Free-tier compute capped us at ~70 GRPO update steps. That's enough to verify the loop runs end-to-end and that the SFT + GRPO trajectories converge to the canonical routing pattern; it is not enough to claim a converged RL policy. The metric streams (training/evidence/rl_training_metrics/) reflect that: reward is flat-ish around zero with high variance, which is what a 70-step run on a 4-expert routing problem should look like.
What the pipeline did deliver, evaluated under the production env:
| Method | Routes all 4 required experts? | Needed env fallback? | Terminal score (RAG off / on) |
|---|---|---|---|
| DPO-only | 3 / 4 (HR missing) | yes, every run | 0.882 / 0.892 |
| SFT | 4 / 4 | no | 0.882 / 0.892 |
| SFT + DPO | 4 / 4 | no | 0.882 / 0.892 |
| SFT + GRPO + RLVR | 4 / 4 | no | 0.882 / 0.892 |
(Source: per-method evidence.json files under training/evidence/. Decoded at low temperature, so the rollouts are deterministic: they show the policy route, not stochastic variance. The headline benchmark uses 5 seeds for that.)
The interesting line is the DPO-only row: the policy converges to "consult analyst → finance → strategy → strategy → strategy → strategy" and never picks HR. The env's safety net then auto-completes HR before grading, so the terminal score still looks fine, but the trajectory tells you the policy hasn't actually solved the routing problem. That gap between "score" and "behaviour" is exactly why we added eval_mode=true: under that flag the DPO row would collapse to ~0.27, the same as the naive baseline, and the SFT/GRPO row would stay near 0.88.
If a judge wants the LLM-driven version of the headline benchmark, point any LoRA adapter at inference.py (or load it inside training/scripts/kaggle_three_llms_text.py) and re-run with eval_mode=true. The same script that produced our headline plot will slot the LLM into a 5th bar between trained_mlp and oracle_router.
RAG, and why it's the right fit for this env
The memory/ package contains a small corpus of company artifacts: SOPs (data quality, finance forecasting, comms style, strategy playbook), policies (compliance constraints), and history (last-quarter review, exemplar memos). The retriever is BM25 over the markdown corpus, with no neural embeddings. We chose that on purpose:
- It externalises domain knowledge from the policy. The CoS doesn't memorise SOPs; it just learns when to invoke the expert that knows them. That keeps the action space small and the RL credit assignment clean.
- It's deterministic. RL rollouts need reproducibility. Hallucinated specifics ruin reward signal. Pinning experts to retrieved company text makes the brief auditable and the reward consistent across seeds.
- It separates what to say from who should say it. The policy learns the latter; the corpus owns the former. New SOPs ship without retraining.
- It gives the policy a measurable, exploitable axis. From the headline benchmark, RAG-on is +0.013 terminal reward for both `trained_mlp` and `oracle_router`, across all 3 hard tasks. The naive baseline gets +0.057 from RAG, but it's still capped at ~0.32 because the brief is incomplete: RAG doesn't fix bad routing.
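To make the "deterministic, no embeddings" point concrete, here is a self-contained Okapi BM25 scorer over a toy corpus. It is a sketch of the scoring idea, not the project's retriever: the document names and texts, the whitespace tokeniser, the non-negative IDF variant, and the `k1`/`b` defaults are all illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(t) for t in tokenised) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for t in tokenised for term in set(t))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy stand-ins for SOP and policy documents (not the real memory/ corpus).
corpus = {
    "sop_finance": "finance forecasting sop project revenue against plan and report variance",
    "sop_comms": "comms style sop keep memos short and match tone to the audience",
    "policy_compliance": "compliance policy constraints on public statements",
}
names = list(corpus)
scores = bm25_scores("revenue variance forecasting", list(corpus.values()))
best = names[scores.index(max(scores))]
print(best)
```

The same query against the same corpus always yields the same ranking, which is the property that keeps rollouts reproducible across seeds.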
RAG isn't a magic boost here. It's a steady, citable lift of roughly 0.01 terminal score that the policy can learn to prefer when it's available, which is exactly what you want when the policy is small.
Honest known gaps
We want the calibration to be visible:
- GRPO runs are short. ~70 update steps on a 4-expert routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run. We do not put GRPO in the headline.
- Per-method evidence rollouts are deterministic. They demonstrate the routing pattern under each training method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- `eval_mode` is opt-in. Production `/reset` defaults to `eval_mode=false` so the demo UI always shows a complete brief. The policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence. Both modes are documented and reproducible.
- `memory/` corpus is intentionally small. RAG is a working axis, not a knowledge-base experiment. Scaling the corpus is straightforward (drop more markdown into `memory/`); it just isn't part of this submission.
- Oracle is an upper bound, not a model. It is the handcoded canonical-order policy. We label it that way in the plot, the table, and the JSON schema. It's there so judges can see how much routing headroom remains.
What we'd do with more time
- Path-quality reward. The current grader is content-driven; a routing-efficiency component (penalising extra steps even when the brief is complete) would steepen the gradient for RL.
- Multi-turn refinement. Today each expert returns once. Letting the CoS re-query an expert with a refined sub-question opens up real planning behaviour; the `ask(expert, sub_question)` action exists for this and is partially wired up.
- Adversarial tasks. A task where one expert lies (data quality is bad but Analyst doesn't catch it) would force the CoS to learn validation routing.
- Longer Qwen RL. The pipeline is ready; we just need a few hours of free-tier GPU to push GRPO past the smoke-run regime.
- Cross-org RAG. BM25 over a single repo is the simplest case; a per-tenant retriever would let the same env sit in front of any company's SOP corpus.
The stack
- OpenEnv: environment spec, validation (`openenv validate` passes on all 6 tasks), deployment.
- Qwen2.5-1.5B-Instruct: base policy model for Experiment B.
- Hugging Face TRL: `SFTTrainer` and `DPOTrainer` for Experiment B; thin custom GRPO/PPO loops on top of HF transformers + PEFT (LoRA) for the RL ablations.
- PyTorch: the MLP routing policy in Experiment A.
- FastAPI + uvicorn: serving (port 7860, matches `openenv.yaml`).
- Hugging Face Spaces (Docker): deployment.
- Pydantic: typed action / observation / report / brief schemas.
- BM25 (rank-bm25): RAG retriever. Deterministic, fast, no embeddings.
Try it in 30 seconds
# Smoke-test the live deployment
curl https://uchihamadara1816-autodatalab2-0.hf.space/health
# {"status":"healthy"}
# Honest-evaluation episode (fallback OFF, the headline regime)
curl -X POST https://uchihamadara1816-autodatalab2-0.hf.space/reset \
-H 'content-type: application/json' \
-d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
# Or open the office UI
# https://uchihamadara1816-autodatalab2-0.hf.space/ui/
In the UI: pick crisis_brief, toggle RAG on, run MLP trained CoS then naive baseline side-by-side. The trained MLP lights up all four office boxes and submits cleanly; the naive baseline stops after the analyst and ships an incomplete brief. That's the headline capability gap, visible in 30 seconds.
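For scripted use, the same honest-evaluation reset can be issued from Python's standard library. The URL and payload fields mirror the curl call above; the helper names are ours, and treating the response body as JSON is an assumption, since the observation schema is not shown in this post.

```python
import json
import urllib.request

BASE_URL = "https://uchihamadara1816-autodatalab2-0.hf.space"


def build_reset_request(task="hard_brief", use_rag=True, eval_mode=True):
    """Build (but do not send) the POST /reset request."""
    payload = {"task": task, "use_rag": use_rag, "eval_mode": eval_mode}
    return urllib.request.Request(
        f"{BASE_URL}/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_episode(**kwargs):
    """Send the request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(build_reset_request(**kwargs), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the Space to be awake):
# observation = reset_episode(task="crisis_brief")
# print(json.dumps(observation, indent=2))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before hitting the live deployment.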
To reproduce the full headline plot locally:
git clone https://github.com/Uchihakamal1816/AutoDataLab-.git autodatalab-plus
cd autodatalab-plus && pip install -e .
python3 training/scripts/run_headline_benchmark.py
# writes training/evidence/headline_benchmark.json
# writes training/evidence/plots/headline_terminal_reward.{png,svg}
Acknowledgments
Built for the OpenEnv hackathon. Thanks to the OpenEnv team for the spec and the validation tooling, Hugging Face for the Spaces hosting and the TRL stack, and Kaggle for the free GPU minutes that ran our SFT and short GRPO passes.
