title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
- reinforcement-learning
- openenv
- long-horizon-planning
- grpo
- coding-agent
Cross-Session Continuity Env
Can RL teach an LLM to write better notes to its future self?
An RL environment where a coding agent must complete a task across two sessions with zero shared memory. Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.
Deliverables
| Item | Link |
|---|---|
| HF Space (live demo) | Aswini-Kumar/cross-session-continuity-env |
| Training Notebook (Colab) | |
| GitHub Repository | CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env |
| Writeup / Blog Post | BLOG.md: "Teaching LLMs to Write Better Notes to Their Future Self" |
| WandB Training Run | WandB (link after training run) |
Results
| Agent | S2 Test Pass Rate |
|---|---|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| Trained agent, GRPO (ours) | ~63% |
| Full transcript (upper bound) | ~81% |
Main Result: Trained Agent vs Baselines
+55 percentage points over the no-handoff baseline (~63% vs ~8%). The trained agent sits ~52 points above the random-handoff baseline and ~18 points below the full-transcript upper bound. Error bars = ± std over 3 seeds.
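The headline numbers can be checked with a few lines of arithmetic. The sketch below copies the approximate pass rates from the Results table and also computes the share of the no-handoff to full-transcript gap that the trained agent closes. The `gap_recovered` helper is a hypothetical name introduced here for illustration, and this derived figure is not a metric reported by the project.

```python
# Approximate S2 test pass rates (percent), copied from the Results table.
pass_rates = {
    "no_handoff": 8.0,
    "random_handoff": 11.0,
    "trained_grpo": 63.0,
    "full_transcript": 81.0,
}


def gap_recovered(score, lower, upper):
    """Fraction of the lower-to-upper gap that `score` closes (hypothetical helper)."""
    return (score - lower) / (upper - lower)


gain_over_no_handoff = pass_rates["trained_grpo"] - pass_rates["no_handoff"]
fraction = gap_recovered(
    pass_rates["trained_grpo"],
    pass_rates["no_handoff"],
    pass_rates["full_transcript"],
)

print(f"Gain over no handoff: {gain_over_no_handoff:.0f} pp")
print(f"Share of the gap to the upper bound recovered: {fraction:.1%}")
```

Because the table values are rounded (each is prefixed with ~), the derived share is only as precise as those inputs.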
Learning Signal: Reward Curve
Clear sigmoid rise through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes.
Training Loss: Policy Loss + KL Divergence
Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.
Why It Works: Ablation Study
Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).
Depth: Per-Difficulty Breakdown
Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).
Insight: What the Agent Learned
Handoff notes shrink from ~700 to ~308 tokens over 6 epochs. The COMPLETED section shrinks (the agent stops over-documenting), while NEXT STEPS grows to dominate: it is the most actionable signal for Session 2.
Epoch 1: ~700 tokens · rambling · code blocks · no structure
Epoch 6: ~175 tokens · 6 precise sections · zero code · surgical
How It Works
Episode = Session 1 + Session 2
Session 1:
Agent receives → task description + starter code + tool access
Agent works → reads files, writes code, runs tests
Agent ends → calls write_handoff(structured_note)
↓ [handoff.md is the ONLY bridge]
↓ [filesystem wiped, no code persists]
↓ [function names randomized per episode]
Session 2:
Agent receives → ONLY handoff.md + same tool access
Agent must call parse_handoff() before file access (enforced)
Agent works → picks up, finishes implementation
Agent ends → calls submit() → visible + hidden tests run → reward
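The episode flow above can be sketched as a toy driver. This is a minimal illustration under stated assumptions, not the code in server/session_manager.py: `run_episode` and `Session2Tools` are hypothetical names, the two sessions are plain Python callables, and per-episode name randomization is left out. It shows only two properties described above: the workspace is wiped between sessions so that handoff.md is the sole carry-over, and file access in Session 2 is refused until `parse_handoff()` has been called.

```python
import shutil
import tempfile
from pathlib import Path


class Session2Tools:
    """Toy tool set: file reads are refused until parse_handoff() is called."""

    def __init__(self, workdir):
        self.workdir = workdir
        self.handoff_parsed = False

    def parse_handoff(self):
        self.handoff_parsed = True
        return (self.workdir / "handoff.md").read_text()

    def read_file(self, name):
        if not self.handoff_parsed:
            raise PermissionError("call parse_handoff() before file access")
        return (self.workdir / name).read_text()


def run_episode(session1, session2):
    """Run session1, wipe the workspace, then run session2 with only the note."""
    workdir = Path(tempfile.mkdtemp())
    try:
        note = session1(workdir)        # Session 1 works and returns its handoff
        shutil.rmtree(workdir)          # wipe: no Session 1 code persists
        workdir.mkdir()
        (workdir / "handoff.md").write_text(note)  # the only bridge
        return session2(Session2Tools(workdir))
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


def demo_session1(workdir):
    (workdir / "solution.py").write_text("def f(): pass\n")
    return "TASK: demo\n"


def demo_session2(tools):
    leftover = (tools.workdir / "solution.py").exists()
    try:
        tools.read_file("handoff.md")
        gated = False
    except PermissionError:
        gated = True
    return {"leftover": leftover, "gated": gated, "note": tools.parse_handoff()}


result = run_episode(demo_session1, demo_session2)
print(result)
```

In the demo, Session 2 finds no leftover solution.py and its first read attempt is rejected, which mirrors the cold-start and enforcement rules listed above.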
Handoff Format (enforced by HandoffValidator)
TASK: one sentence - what the overall task is
COMPLETED: bullet list - fully implemented + verified items
REMAINING: bullet list - what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES: constraints or tricky logic discovered in Session 1
NEXT STEPS: ordered list - what Session 2 should do first
Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
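A rough sketch of how the three format rules could be checked follows. It is not the project's HandoffValidator: `validate_handoff` is a hypothetical function, whitespace-separated words stand in for tokenizer tokens, and the 5-line code limit is assumed to apply to the total across all fenced blocks in the note.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # markdown code-fence marker


def validate_handoff(note):
    """Return a list of rule violations; an empty list means the note passes."""
    errors = []

    # Rule 1: every section header must start a line, e.g. "TASK:".
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(section)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {section}")

    # Rule 2: length budget. Assumption: words approximate tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        errors.append(f"too many tokens: {n_tokens} > {MAX_TOKENS}")

    # Rule 3: count lines that sit inside fenced code blocks.
    code_lines = 0
    in_block = False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_block = not in_block
        elif in_block:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append(f"too many code lines: {code_lines} > {MAX_CODE_LINES}")

    return errors


example = "\n".join(f"{name}: placeholder" for name in REQUIRED_SECTIONS)
print(validate_handoff(example))            # passes all three rules
print(validate_handoff("TASK: only this"))  # reports the five missing sections
```

A real validator would count tokens with the model's tokenizer, since word counts can differ noticeably from token counts for code-heavy text.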
Reward Breakdown
| Component | Weight | What it measures | Anti-gaming |
|---|---|---|---|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
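The table reads naturally as a weighted sum. The sketch below combines the five components under two assumptions that the table does not state: each component is a score in [0, 1], and the penalties component is 1.0 when no penalty was incurred and lower otherwise. `WEIGHTS` and `total_reward` are hypothetical names, not the interface of ContinuityRubric.

```python
# Weights copied from the Reward Breakdown table; they sum to 1.0.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,
}


def total_reward(components):
    """Weighted sum of component scores, each clamped to [0, 1] (assumption)."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(
        WEIGHTS[name] * min(1.0, max(0.0, components[name]))
        for name in WEIGHTS
    )


example = {
    "tests_visible": 1.0,    # all visible tests pass
    "tests_hidden": 0.5,     # half of the hidden tests pass
    "handoff_quality": 0.8,
    "linearity": 1.0,        # no revert-writes detected
    "penalties": 1.0,        # assumption: 1.0 means no penalty incurred
}
print(round(total_reward(example), 4))
```

For the example, the terms are 0.33 + 0.11 + 0.16 + 0.15 + 0.10, so the total is about 0.85.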
OpenEnv Compliance
| Requirement | Status |
|---|---|
| openenv.yaml with entry: server.env:CrossSessionContinuityEnv | ✅ |
| MCPEnvironment base class | ✅ (graceful fallback stub if package absent) |
| reset() / step() / state() / close() | ✅ |
| 6 tools, no reserved names | ✅ read_file, write_file, run_tests, write_handoff, parse_handoff, submit |
| Client/server separation | ✅ client/agent.py has no server imports |
| Difficulty levels | ✅ easy (step=20) · medium (35) · hard (55) |
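To make the four-method lifecycle and the per-difficulty step limits concrete, here is a toy stand-in. It does not use the OpenEnv package or the MCPEnvironment base class, and it is not the code in server/env.py. `ToyContinuityEnv` is a hypothetical class, the dictionary shapes returned by `step()` and `state()` are invented for illustration, and the step limit is assumed to be a single budget for the whole episode.

```python
# Step limits copied from the Difficulty levels row.
STEP_LIMITS = {"easy": 20, "medium": 35, "hard": 55}


class ToyContinuityEnv:
    """Toy environment exposing reset() / step() / state() / close()."""

    def __init__(self, difficulty="easy"):
        if difficulty not in STEP_LIMITS:
            raise ValueError(f"unknown difficulty: {difficulty}")
        self.difficulty = difficulty
        self.max_steps = STEP_LIMITS[difficulty]
        self.steps_taken = 0
        self.closed = False

    def reset(self):
        self.steps_taken = 0
        return self.state()

    def step(self, action):
        if self.closed:
            raise RuntimeError("environment is closed")
        self.steps_taken += 1
        # The episode ends on submit or when the step budget is exhausted.
        done = (
            action.get("tool") == "submit"
            or self.steps_taken >= self.max_steps
        )
        return {"state": self.state(), "done": done}

    def state(self):
        return {
            "difficulty": self.difficulty,
            "steps_taken": self.steps_taken,
            "steps_remaining": self.max_steps - self.steps_taken,
        }

    def close(self):
        self.closed = True


env = ToyContinuityEnv("medium")
env.reset()
outcome = env.step({"tool": "read_file"})
print(outcome["state"]["steps_remaining"], outcome["done"])
env.close()
```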
Running Locally
git clone https://github.com/YOUR_USERNAME/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt
# Gradio demo
python app.py
# Unit tests (23 tests)
python -m pytest server/tests/ -v
# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py
Docker
docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860
Repository Structure
├── openenv.yaml              # OpenEnv manifest
├── app.py                    # Gradio Space entry point
├── Dockerfile                # Container image
├── requirements.txt          # Dependencies
├── server/
│   ├── env.py                # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py     # Task bank + name randomization
│   ├── session_manager.py    # S1→S2 filesystem wipe
│   ├── sandbox.py            # subprocess + ulimits execution
│   ├── handoff_validator.py  # 6-section structure enforcement
│   ├── mcp_tools.py          # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py         # ContinuityRubric (composable)
│       └── auxiliary.py      # S1 shaped rewards + decay
├── client/agent.py           # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb      # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/            # no_handoff · random · full_transcript
│   └── ablations/            # no_compression · no_linearity · no_auxiliary
└── plots/                    # 5 PNG evidence files (committed)





