Building a cross-format office-document RL environment

Community Article Published April 26, 2026

What this is

bpHigh/financial-task-env is an OpenEnv environment for training agents to edit office documents with code. It started as the online-round submission with 10 hand-picked Finch spreadsheet tasks. For Round 2 (the offline hack phase) it expanded to 119 tasks across three file formats, Excel (xlsx), Word (docx), and PowerPoint (pptx), with a multi-layer reward function designed to be hard to hack and a four-defense anti-exploit stack hardened against agents that try to game it.

The environment is live as a Hugging Face Space at https://bphigh-financial-task-env.hf.space. The TRL/OpenEnv-compatible client lives in client.py; the dashboard shows a leaderboard, training plots, anti-hacking design, and step-by-step replays of the teacher solving tasks. Everything in the repo is reproducible.
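As a rough picture of how an episode is driven from Python, here is a hypothetical usage sketch. It assumes the usual OpenEnv/Gym shape of reset() and step(); the class name, constructor arguments, and action schema below are illustrative, and client.py in the repo is the source of truth.

```python
# Hypothetical sketch only: the class name and action schema are illustrative,
# not the repo's actual API. See client.py for the real interface.
from client import FinancialTaskEnv  # assumed class name

env = FinancialTaskEnv(base_url="https://bphigh-financial-task-env.hf.space")

obs = env.reset()                 # picks a task; obs carries the prompt and working file path
done, total = False, 0.0
while not done:
    # the agent writes code that edits the working file with openpyxl / python-docx / python-pptx
    action = {"type": "run_code", "code": "# model-generated edit code goes here"}
    obs, reward, done, info = env.step(action)
    total += reward               # shaped, clamped step reward (max 0.10 per step)
# the episode ends with a submit / submit_file action, which triggers final grading
```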

The design that still holds up

The task split

| Family | Source | Train | Eval | Total | What it tests |
|---|---|---|---|---|---|
| xlsx | Finch (FinWorkBench), hand-curated (Round 1) | 10 | 0 | 10 | Diverse financial workbook tasks from the original submission |
| xlsx | Finch, stratified pull (Round 2) | 40 | 10 | 50 | Stratified across 7 task-type tags |
| docx | OSWorld-Verified libreoffice_writer | 17 | 4 | 21 | 16 distinct evaluator functions ported from OSWorld |
| pptx | PPTArena | 30 | 8 | 38 | 16 distinct edit_types, including singletons (transitions, animations, A/V) |
| Total | | 97 | 22 | 119 | |

The eval split is stratified so that every tag bucket in every family contributes at least one task, which keeps the environment from being gamed by overfitting to a single task type.
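For illustration, one simple way to build a tag-stratified split looks like the sketch below; the real split was prepared offline, and the field names and eval fraction here are assumptions rather than the repo's actual code.

```python
# Sketch of per-(family, tag) stratification: every bucket contributes at
# least one eval task. Field names and the eval fraction are illustrative.
import random
from collections import defaultdict

def stratified_split(tasks, eval_fraction=0.2, seed=0):
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for task in tasks:                          # each task dict has "family" and "tag" keys
        buckets[(task["family"], task["tag"])].append(task)

    train, eval_split = [], []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        n_eval = max(1, round(len(bucket) * eval_fraction))  # at least one per bucket
        eval_split.extend(bucket[:n_eval])
        train.extend(bucket[n_eval:])
    return train, eval_split
```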

The 6-component reward

Each step the agent runs code and gets back stdout/stderr plus an inline reward decomposition. Six components, all clamped, totalling 0.10/step max:

| Component | Range | Fires when |
|---|---|---|
| exec_health | 0-0.020 | Subprocess exits with code 0; bonus if stdout is non-empty |
| lib_engagement | 0-0.015 | Code uses the family's expected library (openpyxl / python-docx / python-pptx) |
| mutation | 0-0.025 | The working file's SHA-256 changed since the last step |
| validity | 0-0.015 | The mutated file still parses with its loader |
| progress | 0-0.015 | Structural distance to the gold reference decreased |
| eval_check | 0-0.010 | The per-task evaluator score went up (docx only) |

The final reward at submit is not the sum of step rewards; it's the grader's score on the submitted file (or text answer). The step rewards exist to give shaped signal for mid-episode learning. The clamps matter: an agent can't farm tiny rewards for 1,000 steps and beat someone who actually solves the task.
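To make the shaping concrete, here is a minimal sketch of how a clamped six-component step reward could be assembled. The caps follow the table above; the helper signatures and exact scoring inside the real environment are assumptions.

```python
# Illustrative sketch of the clamped 6-component step reward. Caps match the
# table above; everything else (signatures, inputs) is an assumption.
import hashlib
from pathlib import Path

CAPS = {"exec_health": 0.020, "lib_engagement": 0.015, "mutation": 0.025,
        "validity": 0.015, "progress": 0.015, "eval_check": 0.010}

def sha256(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def step_reward(exit_code, stdout, code, expected_lib, work_file, prev_hash,
                still_parses, dist_prev, dist_now, eval_prev, eval_now):
    r = {}
    # exec_health: subprocess exited cleanly, small bonus for non-empty stdout
    r["exec_health"] = (0.015 + (0.005 if stdout.strip() else 0.0)) if exit_code == 0 else 0.0
    # lib_engagement: the code uses the family's expected library
    r["lib_engagement"] = CAPS["lib_engagement"] if expected_lib in code else 0.0
    # mutation: the working file's SHA-256 changed since the last step
    mutated = sha256(work_file) != prev_hash
    r["mutation"] = CAPS["mutation"] if mutated else 0.0
    # validity: the mutated file still parses with its loader
    r["validity"] = CAPS["validity"] if (mutated and still_parses) else 0.0
    # progress: structural distance to the gold reference decreased
    r["progress"] = CAPS["progress"] if dist_now < dist_prev else 0.0
    # eval_check (docx only): the per-task evaluator score went up
    r["eval_check"] = CAPS["eval_check"] if eval_now > eval_prev else 0.0
    # each component is individually clamped, so the step total can never exceed 0.10
    total = sum(min(r[k], CAPS[k]) for k in CAPS)
    return total, r
```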

Anti-hacking - 4 independent defenses

This is the part I'm proud of. After watching teacher trajectories, we found the model had figured out it could call submit_file with the source file path on step 1 and still collect some reward, because the source parses, has the right size, and so on, without making any edits. So we built four overlapping defenses, each of which independently kills that exploit:

  1. Phase 4 - Per-episode gold stash: at task reset, gold files are moved to a uuid4-randomized tmp directory. An agent globbing data/ finds nothing.
  2. Phase 7 - Grader byte-equality check: if the submitted file is byte-equal to the source, the grader returns the floor reward.
  3. Phase 9 - Env early-submit gate: the env refuses submit / submit_file until at least one code step has run. Configured via FINANCIAL_ENV_MIN_CODE_STEPS=1 (default).
  4. Phase 8 - SFT corpus filter: any teacher trajectory that did submit_file on step 1 was dropped before SFT, so the student never learns the cheat from the teacher.

Each layer covers a slightly different attack surface.
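As a concrete illustration, here is a minimal sketch of what defenses 2 and 3 could look like. The environment variable is the one named above; the function names and the floor value are illustrative.

```python
# Sketch of the grader byte-equality check (Phase 7) and the early-submit gate
# (Phase 9). Function names are illustrative, not the repo's actual code.
import os
from pathlib import Path

REWARD_FLOOR = 0.005  # the env's floor reward, as in the evaluation section below
MIN_CODE_STEPS = int(os.environ.get("FINANCIAL_ENV_MIN_CODE_STEPS", "1"))

def allow_submit(code_steps_run: int) -> bool:
    # Phase 9: refuse submit / submit_file until at least one code step has run
    return code_steps_run >= MIN_CODE_STEPS

def grade_submission(submitted_path: str, source_path: str, grader) -> float:
    # Phase 7: a byte-identical copy of the source earns only the floor reward
    if Path(submitted_path).read_bytes() == Path(source_path).read_bytes():
        return REWARD_FLOOR
    return grader(submitted_path)
```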


SFT training (cheap, converged, unhelpful)

Two SFT runs on 1× L40S 48GB at $1.80/hr. Same recipe both times: LoRA r=32 on all-linear targets, bf16, assistant-only loss masking, TRL SFTTrainer. The only difference: 4K vs 8K context length.
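For reference, a minimal sketch of that recipe with TRL and PEFT is below. Exact argument names vary across TRL versions, and the dataset path, LoRA alpha, batch size, and epoch count are assumptions, not the values used in the runs.

```python
# Minimal sketch of the SFT recipe (LoRA r=32, all-linear targets, bf16, TRL
# SFTTrainer). Dataset path, lora_alpha, batch size, and epochs are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="teacher_trajectories.jsonl", split="train")  # assumed file

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,               # not stated in the post; 2*r is a common choice
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="qwen3b-office-sft",
    bf16=True,
    max_seq_length=4096,         # 4096 for the 4K run, 8192 for the 8K run
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    # the real runs also mask the loss to assistant turns only; TRL supports this
    # via its completion-only collator or the assistant_only_loss option,
    # depending on version
)
trainer.train()
```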

| | 4K context | 8K context |
|---|---|---|
| Hardware | L40S 48GB | L40S 48GB |
| Runtime | 198s | 354s |
| Loss start → end | 0.412 → 0.069 | 0.384 → 0.103 |
| Final train_loss | 0.196 | 0.193 |
| Cost | ~$0.50 | ~$0.80 |

Loss converges nicely. The 8K run sees the long debugging trajectories from Kimi (5-8 of 53 episodes get truncated at 4K), so it's the more interesting comparison point.

Then we evaluated

| Model | Avg score | Success rate | xlsx (n=10) | docx (n=4) | pptx (n=8) |
|---|---|---|---|---|---|
| MiniMaxAI/MiniMax-M2.1 (frontier baseline) | 0.390 | 41% | 0.293 | 0.445 | 0.485 |
| moonshotai/Kimi-K2.5 (teacher) | 0.481 | 52% | 0.370 | 0.472 | 0.673 |
| Qwen/Qwen2.5-Coder-3B-Instruct (vanilla student) | 0.001 | 0% | 0.001 | 0.001 | 0.001 |
| Qwen2.5-Coder-3B + LoRA SFT (4K) | 0.006 | 0% | 0.007 | 0.005 | 0.005 |
| Qwen2.5-Coder-3B + LoRA SFT (8K) | 0.011 | 0% | 0.018 | 0.005 | 0.005 |

The SFT lift is real, 6×-11× over vanilla, but every episode still bottoms out at the env's 0.005 reward floor. The model produces parseable code, but it doesn't mutate files in ways the grader rewards. The SFT loss is well converged on the training distribution; the gap is generalization, not under-training. We trained on 53 trajectories, and the eval has 22 tasks, all out-of-distribution. That's a brutal regime for a 3B model with LoRA.


Pointers

If you're a judge reading this and want to see the env do something in 30 seconds, open the dashboard's "Replay" widget and watch the docx task; it's the shortest end-to-end demonstration of the 6-component reward in motion.

The headers in edits.md (Phases 1 through 13) are the long version of this blog post.
