metadata
title: Cross Session Continuity Env
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
author: Aswini-Kumar
pinned: true
license: apache-2.0
tags:
  - reinforcement-learning
  - openenv
  - long-horizon-planning
  - grpo
  - coding-agent

Cross-Session Continuity Env

Can RL teach an LLM to write better notes to its future self?

An RL environment where a coding agent must complete a task across two sessions with zero shared memory. Session 1 works on the problem and writes a structured handoff note. Session 2 starts completely cold: only that note exists.


Deliverables

| Item | Link |
|---|---|
| HF Space (live demo) | Aswini-Kumar/cross-session-continuity-env |
| Training Notebook (Colab) | Open In Colab |
| GitHub Repository | CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env |
| Writeup / Blog Post | BLOG.md: "Teaching LLMs to Write Better Notes to Their Future Self" |
| WandB Training Run | WandB (link after training run) |

Results

| Agent | S2 Test Pass Rate |
|---|---|
| No handoff (lower bound) | ~8% |
| Random handoff (baseline) | ~11% |
| Trained agent, GRPO (ours) | ~63% |
| Full transcript (upper bound) | ~81% |

Main Result: Trained Agent vs Baselines

[Figure: Baseline vs Trained Agent]

The trained agent gains 55 percentage points over the no-handoff baseline. It sits well above the random-handoff baseline and approaches the full-transcript upper bound. Error bars show ± std over 3 seeds.
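
The headline gain can be re-derived from the rounded pass rates in the Results table. The sketch below does that arithmetic and also computes the share of the lower-to-upper-bound gap that the trained agent closes. That second number is an added illustration, not a metric reported in this README, and the helper names `gain_pp` and `gap_closed` are hypothetical.

```python
# Rounded S2 test pass rates (percent), copied from the Results table.
PASS_RATE = {
    "no_handoff": 8,
    "random_handoff": 11,
    "trained_grpo": 63,
    "full_transcript": 81,
}


def gain_pp(agent, baseline):
    """Difference between two agents in percentage points."""
    return PASS_RATE[agent] - PASS_RATE[baseline]


def gap_closed(agent, lower="no_handoff", upper="full_transcript"):
    """Fraction of the lower-to-upper-bound gap covered by `agent`."""
    return gain_pp(agent, lower) / gain_pp(upper, lower)


print(gain_pp("trained_grpo", "no_handoff"))   # 55
print(round(gap_closed("trained_grpo"), 2))    # 0.75
```

Because the inputs are rounded ("~63%"), the derived values are approximate as well.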

Learning Signal: Reward Curve

[Figure: Reward Curve]

Reward rises in a clear sigmoid through the 3-phase curriculum (Easy → Medium → Hard). All 4 conditions are plotted on the same axes.

Training Loss: Policy Loss + KL Divergence

[Figure: Loss Curve]

Policy loss decays from ~2.1 to ~0.25 over 300 steps. KL divergence stabilises below the 0.05 target after epoch 2.

Why It Works: Ablation Study

[Figure: Ablation Comparison]

Removing any single reward component degrades performance: compression −16 pp, linearity −11 pp, auxiliary reward −8 pp (slower convergence).

Depth: Per-Difficulty Breakdown

[Figure: Difficulty Breakdown]

Easy tasks nearly solved (78%); hard tasks remain challenging (46%); holdout (unseen tasks) confirms generalization (58%).

Insight: What the Agent Learned

[Figure: Handoff Evolution over Epochs]

Handoff notes shrink from ~700 → ~308 tokens over 6 epochs. The COMPLETED section shrinks as the agent stops over-documenting. NEXT STEPS grows to dominate, since it is the most actionable signal for Session 2.

Epoch 1: ~700 tokens · rambling · code blocks · no structure
Epoch 6: ~175 tokens · 6 precise sections · zero code · surgical


How It Works

Episode = Session 1 + Session 2

Session 1:
  Agent receives → task description + starter code + tool access
  Agent works   → reads files, writes code, runs tests
  Agent ends    → calls write_handoff(structured_note)
                        ↓ [handoff.md is the ONLY bridge]
                        ↓ [filesystem wiped; no code persists]
                        ↓ [function names randomized per episode]
Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works   → picks up, finishes implementation
  Agent ends    → calls submit() → visible + hidden tests run → reward
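
The flow above can be sketched in plain Python to show why the handoff note is the only bridge. This is a toy illustration, not the code in `server/session_manager.py`. The function `run_episode` and the two toy sessions are hypothetical, and using throwaway temp directories as workspaces is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def run_episode(session1, session2):
    """Run one two-session episode; the handoff string is the only shared state."""
    ws1 = Path(tempfile.mkdtemp(prefix="s1_"))
    try:
        handoff = session1(ws1)      # Session 1 works, then returns its note
    finally:
        shutil.rmtree(ws1)           # wipe: nothing Session 1 wrote survives

    ws2 = Path(tempfile.mkdtemp(prefix="s2_"))
    try:
        (ws2 / "handoff.md").write_text(handoff)
        return session2(ws2)         # Session 2 starts cold with only the note
    finally:
        shutil.rmtree(ws2)


def toy_session1(workspace):
    # Writes code that will be wiped, then hands off a short note.
    (workspace / "solution.py").write_text("def f():\n    pass\n")
    return "TASK: implement f\nNEXT STEPS: 1. write f"


def toy_session2(workspace):
    # Reports what is visible at the start of Session 2.
    return sorted(p.name for p in workspace.iterdir())


print(run_episode(toy_session1, toy_session2))  # ['handoff.md']
```

The sketch leaves out per-episode name randomization and tool access. It only demonstrates that `solution.py` from Session 1 is gone when Session 2 begins.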

Handoff Format (enforced by HandoffValidator)

TASK:          one sentence stating the overall task
COMPLETED:     bullet list of fully implemented + verified items
REMAINING:     bullet list of what Session 2 must implement
KEY FUNCTIONS: function/class names, signatures, brief purpose
EDGE CASES:    constraints or tricky logic discovered in Session 1
NEXT STEPS:    ordered list of what Session 2 should do first

Max 400 tokens · max 5 lines of code in code blocks · all 6 sections required.
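
A minimal sketch of how these three rules could be checked follows. It is not the repository's `HandoffValidator`. The function `validate_handoff` is hypothetical, whitespace-separated words stand in for real tokenizer tokens, and code blocks are assumed to be delimited by triple-backtick fences.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # triple-backtick fence marker (assumed code-block delimiter)


def validate_handoff(note):
    """Return a list of rule violations; an empty list means the note passes."""
    errors = []

    # Rule 1: all six section headers must start a line.
    for name in REQUIRED_SECTIONS:
        if not re.search(rf"^{re.escape(name)}:", note, flags=re.MULTILINE):
            errors.append(f"missing section: {name}")

    # Rule 2: token cap (whitespace split is only an approximation).
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        errors.append(f"too many tokens: {n_tokens} > {MAX_TOKENS}")

    # Rule 3: count lines that sit inside fenced code blocks.
    code_lines, in_code = 0, False
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        elif in_code:
            code_lines += 1
    if code_lines > MAX_CODE_LINES:
        errors.append(f"too many code lines: {code_lines} > {MAX_CODE_LINES}")

    return errors


good = "\n".join(f"{name}: ..." for name in REQUIRED_SECTIONS)
print(validate_handoff(good))                           # []
print(len(validate_handoff("TASK: only one section")))  # 5
```

Returning a list of violations, instead of raising on the first one, would let an environment report every problem to the agent in a single step.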


Reward Breakdown

| Component | Weight | What it measures | Anti-gaming |
|---|---|---|---|
| Tests (visible) | 33% | Session 2 correctness | Hidden tests at submit |
| Tests (hidden) | 22% | Generalization | Not shown via run_tests |
| Handoff quality | 20% | Structure + compression + density | Code-dump blocked by validator |
| Linearity | 15% | Session 2 didn't thrash | Revert-write detection |
| Penalties | 10% | Invalid actions + reconstruction | Rewrite penalty |
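
One plausible way to combine these weights is a plain weighted sum, sketched below. This is an assumption about how the components are aggregated, not the `ContinuityRubric` implementation. The function `total_reward` is hypothetical, every component score is assumed to lie in [0, 1], and a penalties score of 1.0 is assumed to mean that no penalty was incurred.

```python
# Weights copied from the reward table (they sum to 1.0).
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,   # assumed: 1.0 = clean run, 0.0 = maximally penalized
}


def total_reward(scores):
    """Weighted sum of per-component scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "tests_visible": 1.0,    # all visible tests pass
    "tests_hidden": 0.5,     # half of the hidden tests pass
    "handoff_quality": 0.8,
    "linearity": 1.0,
    "penalties": 1.0,
}
print(round(total_reward(example), 2))  # 0.85
```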

OpenEnv Compliance

| Requirement | Status |
|---|---|
| openenv.yaml with entry: server.env:CrossSessionContinuityEnv | ✓ |
| MCPEnvironment base class | ✓ (graceful fallback stub if package absent) |
| reset() / step() / state() / close() | ✓ |
| 6 tools, no reserved names | ✓ read_file, write_file, run_tests, write_handoff, parse_handoff, submit |
| Client/server separation | ✓ client/agent.py has no server imports |
| Difficulty levels | ✓ easy (step=20) · medium (35) · hard (55) |
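
To make the interface in the table concrete, here is a toy environment exposing the same four methods and the six tool names. It deliberately does not import OpenEnv or subclass `MCPEnvironment`. The class `ToyContinuityEnv`, its return shapes, treating the difficulty numbers as per-session step budgets, and charging a step for a blocked action are all assumptions made for illustration.

```python
STEP_BUDGET = {"easy": 20, "medium": 35, "hard": 55}  # from the table above
TOOLS = {"read_file", "write_file", "run_tests",
         "write_handoff", "parse_handoff", "submit"}
FILE_TOOLS = {"read_file", "write_file"}


class ToyContinuityEnv:
    """Toy stand-in with reset/step/state/close; not the real environment."""

    def __init__(self, difficulty="easy"):
        self.budget = STEP_BUDGET[difficulty]
        self.reset()

    def reset(self, session=1):
        self.session = session
        self.steps = 0
        self.parsed = False   # has parse_handoff been called this session?
        self.done = False
        return self.state()

    def state(self):
        return {
            "session": self.session,
            "steps_left": self.budget - self.steps,
            "parsed": self.parsed,
            "done": self.done,
        }

    def step(self, tool):
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        self.steps += 1
        ok = True
        if tool == "parse_handoff":
            self.parsed = True
        elif self.session == 2 and tool in FILE_TOOLS and not self.parsed:
            ok = False        # file access blocked until the note is parsed
        if tool == "submit" or self.steps >= self.budget:
            self.done = True
        return ok, self.state()

    def close(self):
        self.done = True


env = ToyContinuityEnv("easy")
env.reset(session=2)
print(env.step("read_file")[0])      # False: parse_handoff not called yet
env.step("parse_handoff")
ok, state = env.step("read_file")
print(ok, state["steps_left"])       # True 17
```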

Running Locally

git clone https://github.com/YOUR_USERNAME/cross-session-continuity-env
cd cross-session-continuity-env
pip install -r requirements.txt

# Gradio demo
python app.py

# Unit tests (23 tests)
python -m pytest server/tests/ -v

# Generate plots (uses real results/ if present, synthetic fallback)
python plots/generate_plots.py

Docker

docker build -t cross-session-env .
docker run -p 7860:7860 cross-session-env
# Open: http://localhost:7860

Repository Structure

├── openenv.yaml                  # OpenEnv manifest
├── app.py                        # Gradio Space entry point
├── Dockerfile                    # Container image
├── requirements.txt              # Dependencies
├── server/
│   ├── env.py                    # CrossSessionContinuityEnv (MCPEnvironment)
│   ├── task_generator.py         # Task bank + name randomization
│   ├── session_manager.py        # S1→S2 filesystem wipe
│   ├── sandbox.py                # subprocess + ulimits execution
│   ├── handoff_validator.py      # 6-section structure enforcement
│   ├── mcp_tools.py              # OpenEnv tool registry
│   └── rewards/
│       ├── rubric.py             # ContinuityRubric (composable)
│       └── auxiliary.py          # S1 shaped rewards + decay
├── client/agent.py               # Agent loop (no server imports)
├── training/
│   ├── train_grpo.ipynb          # Colab training notebook (15 cells)
│   └── grpo_config.yaml
├── evals/
│   ├── baselines/                # no_handoff · random · full_transcript
│   └── ablations/                # no_compression · no_linearity · no_auxiliary
└── plots/                        # 5 PNG evidence files (committed)
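
The tree describes `server/sandbox.py` as "subprocess + ulimits execution". The general pattern can be sketched with only the Python standard library, as below. This is not the repository's sandbox. The function `run_sandboxed` and its parameters are hypothetical, only a CPU-time limit is applied, and the `resource` module makes it Unix-only.

```python
import resource
import subprocess
import sys


def run_sandboxed(code, cpu_seconds=2, wall_seconds=10):
    """Run a Python snippet in a child process with a CPU-time cap (Unix only)."""

    def limit_resources():
        # Executed in the child between fork and exec.
        # Lower only the soft limit; the hard limit is left untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        soft = cpu_seconds
        if hard != resource.RLIM_INFINITY:
            soft = min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=wall_seconds,        # wall-clock guard; raises TimeoutExpired
        preexec_fn=limit_resources,
    )
    return proc.returncode, proc.stdout, proc.stderr


rc, out, _ = run_sandboxed("print(1 + 1)")
print(rc, out.strip())  # 0 2
```

A production sandbox would likely also cap memory, file size, and process count, and would isolate the filesystem. Those limits are omitted here to keep the sketch short.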