# Cross-Session Continuity Env: Implementation Plan (v2)

> **Changelog from v1:** Addressed 20 potential failure modes identified in review.
> Each section marked [UPDATED], [NEW], or [UNCHANGED] for traceability.

---

## 1. Problem Statement [UNCHANGED]

**Capability Gap:** LLMs have no persistent memory across sessions. When a session ends, everything is gone. In real-world usage this is a critical failure mode: long tasks (codebases, research, planning) rarely fit in a single context window.

**What we train:** Can RL teach an LLM to write surgical, information-dense handoff notes to its future self, such that a cold-start agent in session 2 can complete the task successfully using only those notes?

**Why it's novel:** To our knowledge, no existing RL environment specifically trains or benchmarks cross-session state transfer behavior. This is underexplored and publishable.

**Theme:** Primarily Theme 2 (Long-Horizon Planning). Secondary fit with Theme 3.1: the agent uses real tools (file I/O, test runner) in a dynamic coding environment.

---

## 2. High-Level Architecture [UPDATED]

```
Episode = Session 1 + Session 2  (ONE training episode, ONE reward signal)

Session 1:
  Agent receives → task description + starter code + tool access
  Agent works    → reads files, writes code, runs tests
                   [Auxiliary rewards fire here, see Section 8]
  Agent ends     → calls write_handoff(structured_note) → session 1 terminates

        ↓  [handoff.md is the ONLY bridge]
        ↓  [filesystem wiped, no code persists]
        ↓  [function/variable names randomized per episode]

Session 2:
  Agent receives → ONLY handoff.md + same tool access
  Agent must call parse_handoff() before file access (enforced)
  Agent works    → picks up, finishes implementation
  Agent ends     → calls submit() → visible + hidden tests run → reward computed

Reward flows back through both sessions via GRPO (with normalization)
PPO run in parallel as stability baseline
```

---

## 3. Repository Structure [UPDATED]

```
cross-session-continuity-env/
│
├── openenv.yaml
├── README.md
├── requirements.txt                 # pinned: openenv==x.y.z
│
├── server/
│   ├── env.py                       # MCPEnvironment subclass
│   ├── task_generator.py            # task + test generation with name randomization
│   ├── session_manager.py           # session 1 → 2 transition, filesystem wipe
│   ├── sandbox.py                   # safe execution, strict ulimits
│   ├── handoff_validator.py         # NEW: validates handoff structure
│   └── rewards/
│       ├── rubric.py                # composable rubrics (UPDATED)
│       └── auxiliary.py             # NEW: session 1 auxiliary rewards
│
├── client/
│   └── agent.py                     # agent loop, no server imports, with retry logic
│
├── tasks/
│   ├── easy/                        # single file, 3 visible + 1 hidden test
│   ├── medium/                      # 2-3 files, 5 visible + 2 hidden tests
│   ├── hard/                        # 5 files, 8 visible + 3 hidden tests
│   └── eval_holdout/                # NEW: unseen tasks for evaluation only
│
├── training/
│   ├── train_grpo.ipynb             # primary training (GRPO)
│   ├── train_ppo.ipynb              # NEW: PPO baseline for stability comparison
│   └── grpo_config.yaml
│
├── evals/
│   ├── baselines/
│   │   ├── no_handoff.py            # NEW: session 2 with no note at all
│   │   ├── random_handoff.py        # NEW: random text as handoff
│   │   └── full_transcript.py       # NEW: upper bound, full S1 transcript
│   ├── ablations/
│   │   ├── no_compression_reward.py # NEW: ablation
│   │   ├── no_linearity_reward.py   # NEW: ablation
│   │   └── no_auxiliary_reward.py   # NEW: ablation
│   └── trained_run.py
│
├── plots/                           # all committed as PNG with captions
│   ├── reward_curve.png
│   ├── handoff_length_curve.png
│   ├── baseline_vs_trained.png      # 3 baselines + trained agent on same axes
│   ├── ablation_comparison.png      # NEW
│   ├── difficulty_breakdown.png     # NEW: easy/medium/hard separately
│   └── handoff_diff_over_epochs.png # NEW: interpretability
│
└── demos/
    └── recorded_run_seed42.url      # URL only, no large files in repo
```

---

## 4. OpenEnv Compliance [UNCHANGED]

### 4.1 openenv.yaml

```yaml
name: cross-session-continuity-env
version: 0.1.0
theme: long-horizon-planning
description: >
  An RL environment where an LLM agent must complete a coding task
  across two sessions with zero shared memory. The agent writes a
  structured handoff note at the end of session 1; session 2 receives
  only that note. Reward depends entirely on session 2 success.
entry: server/env.py
tools:
  - read_file
  - write_file
  - run_tests
  - write_handoff
  - parse_handoff
  - submit
sessions: 2
difficulty_levels:
  - easy
  - medium
  - hard
```

### 4.2 Reserved Tool Names: Avoided

`reset`, `step`, `state`, `close` are reserved by OpenEnv; none are used as tool names. Our tools: `read_file`, `write_file`, `run_tests`, `write_handoff`, `parse_handoff`, `submit`. All clear.

### 4.3 Client/Server Separation

- `client/agent.py` talks to env via MCP protocol only
- Client never imports from `server/`
- All state lives server-side

### 4.4 Gym-style API

```python
env.reset()        # starts episode, returns session 1 observation
env.step(action)   # action → (obs, reward, done, info)
env.state()        # current env state dict
```

---

## 5. Environment Implementation [UPDATED]

Key changes from v1:

- Dynamic step limits by difficulty
- Auxiliary reward hooks in session 1
- Handoff structure validation before session 2 starts
- Invalid action handling with retry budget
- Agent must call `parse_handoff()` before file access in session 2
- Filesystem wiped on session transition

```python
# server/env.py
from openenv import MCPEnvironment

from .task_generator import TaskGenerator
from .session_manager import SessionManager
from .sandbox import Sandbox
from .rewards.rubric import ContinuityRubric
from .rewards.auxiliary import AuxiliaryRewarder
from .handoff_validator import HandoffValidator

STEP_LIMITS = {"easy": 20, "medium": 35, "hard": 55}


class CrossSessionContinuityEnv(MCPEnvironment):
    def __init__(self, difficulty="medium"):
        self.task_gen = TaskGenerator(difficulty)
        self.session_mgr = SessionManager()
        self.sandbox = Sandbox(timeout=10)
        self.rubric = ContinuityRubric()
        self.aux = AuxiliaryRewarder()
        self.validator = HandoffValidator()
        self.difficulty = difficulty
        self.step_limit = STEP_LIMITS[difficulty]

    def reset(self, task_id=None, seed=None):
        self.task = self.task_gen.sample(task_id, seed=seed)  # names randomized
        self.session = 1
        self.handoff = None
        self.step_count = 0
        self.invalid_action_count = 0
        self.retry_budget = 3
        self.s1_test_history = []
        self.s2_edit_history = []
        self.handoff_parsed = False
        self.s2_failed_runs = 0
        return {
            "session": 1,
            "task": self.task.description,
            "starter_code": self.task.starter_code,
            "message": "Session 1 started. Complete what you can, then call write_handoff().",
            "step_limit": self.step_limit
        }

    def step(self, action):
        self.step_count += 1

        # Step limit enforcement: past the limit, session 1 accepts only
        # write_handoff(); without this exemption the handoff could never be written.
        if (self.session == 1 and self.step_count > self.step_limit
                and action.tool != "write_handoff"):
            return {
                "warning": "Step limit reached. Call write_handoff() now or episode terminates.",
                "penalty": -0.1
            }

        # Invalid action guard
        if not self._is_valid_action(action):
            self.invalid_action_count += 1
            self.retry_budget -= 1
            if self.retry_budget <= 0:
                return {"done": True, "reward": 0.0, "error": "Retry budget exhausted"}
            return {"error": f"Invalid action '{action.tool}'. Retries left: {self.retry_budget}"}

        if action.tool == "read_file":
            if self.session == 2 and not self.handoff_parsed:
                return {"error": "Call parse_handoff() before accessing files in session 2."}
            content = self.task.files.get(action.path, "File not found.")
            return {"output": content, "session": self.session}

        if action.tool == "parse_handoff":
            if self.session != 2:
                return {"error": "parse_handoff only available in session 2"}
            self.handoff_parsed = True
            return {"output": self.handoff, "session": 2}

        if action.tool == "write_file":
            prev = self.task.files.get(action.path, "")
            self.task.files[action.path] = action.content
            if self.session == 2:
                # Edit history feeds the linearity score and the rewrite penalty
                self.s2_edit_history.append(
                    {"path": action.path, "prev": prev, "new": action.content})
            return {"output": f"Written to {action.path}", "session": self.session}

        if action.tool == "run_tests":
            result = self.sandbox.run_tests(self.task.files, self.task.test_code)
            if self.session == 1:
                self.s1_test_history.append(result.passed)
                aux = self.aux.s1_reward(result, self.task)
                return {"output": result.summary, "passed": result.passed,
                        "auxiliary_reward": aux, "session": 1}
            else:
                if result.passed == 0:
                    self.s2_failed_runs += 1
                return {"output": result.summary, "passed": result.passed, "session": 2}

        if action.tool == "write_handoff":
            if self.session != 1:
                return {"error": "write_handoff only available in session 1"}
            validation = self.validator.validate(action.content)
            if not validation.valid:
                return {"error": f"Handoff rejected: {validation.reason}. "
                                 f"Required sections: {self.validator.REQUIRED_SECTIONS}"}
            self.handoff = action.content
            self.session = 2
            self.handoff_parsed = False
            self.task = self.session_mgr.transition(self.task)  # wipe filesystem
            self.retry_budget = 3
            return {
                "session": 2,
                "message": "Session 2 started. Call parse_handoff() first."
            }

        if action.tool == "submit":
            if self.session != 2:
                return {"error": "submit only available in session 2"}
            visible = self.sandbox.run_tests(self.task.files, self.task.test_code)
            hidden = self.sandbox.run_tests(self.task.files, self.task.hidden_test_code)
            scores = self.rubric.score(
                visible_results=visible,
                hidden_results=hidden,
                handoff=self.handoff,
                s2_edit_history=self.s2_edit_history,
                s2_failed_runs=self.s2_failed_runs,
                invalid_actions=self.invalid_action_count
            )
            # The rubric returns a dict; expose the scalar total as the reward
            # and keep the per-component breakdown for logging.
            return {"done": True, "reward": scores["total"], "reward_breakdown": scores,
                    "visible": visible.summary, "hidden": hidden.summary}

    def state(self):
        return {
            "session": self.session,
            "step_count": self.step_count,
            "step_limit": self.step_limit,
            "handoff_written": self.handoff is not None,
            "handoff_length": len(self.handoff.split()) if self.handoff else 0,
            "difficulty": self.difficulty,
            "invalid_actions": self.invalid_action_count
        }

    def _is_valid_action(self, action):
        s1_tools = {"read_file", "write_file", "run_tests", "write_handoff"}
        s2_tools = {"parse_handoff", "read_file", "write_file", "run_tests", "submit"}
        return action.tool in (s1_tools if self.session == 1 else s2_tools)
```

---

## 6. Handoff Format: Standardized [NEW]

**Issue addressed (#19):** Free-form text leads to inconsistent quality and lets the agent game the compression metric with dense-but-useless prose.

**Fix:** Enforce a required 6-section structure. `HandoffValidator` rejects a non-conforming note and returns an error (not a penalty), so the agent can revise the note and call `write_handoff()` again.
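To make the gaming concern concrete, here is a minimal sketch (illustrative only, not part of the environment). `unique_word_ratio` and `missing_sections` are hypothetical helper names, and both sample notes are invented. It shows how a short word salad can max out a naive density measure while carrying none of the six required section headers.

```python
# Hypothetical helpers for illustration; not part of the environment code.
REQUIRED_SECTIONS = ["TASK:", "COMPLETED:", "REMAINING:",
                     "KEY FUNCTIONS:", "EDGE CASES:", "NEXT STEPS:"]


def unique_word_ratio(note):
    """Fraction of whitespace-separated words that are distinct."""
    words = note.split()
    return len(set(words)) / max(len(words), 1)


def missing_sections(note):
    """Required section headers that do not appear anywhere in the note."""
    return [s for s in REQUIRED_SECTIONS if s not in note]


# Invented example: short, no repetition, and useless to session 2.
word_salad = "parser queue lambda retry bucket window merge sorted token drift"

# Invented example: a note that follows the six-section structure.
structured_note = "\n".join([
    "TASK: implement fuse_spans for overlapping ranges",
    "COMPLETED:",
    "- input sorted by start value",
    "REMAINING:",
    "- merge loop for overlapping spans",
    "KEY FUNCTIONS: fuse_spans(spans) returns a list of (start, end)",
    "EDGE CASES: empty input returns []",
    "NEXT STEPS:",
    "1. write the merge loop, then run tests",
])

print(unique_word_ratio(word_salad))      # every word is distinct
print(missing_sections(word_salad))       # all six headers are absent
print(missing_sections(structured_note))  # nothing missing
```

A density-only metric cannot tell these two notes apart in the right direction, which is why structure is checked before any quality score is computed.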
### 6.1 Required handoff template

```
TASK: [one sentence: what the overall task is]
COMPLETED: [bullet list: what is fully implemented and verified by tests]
REMAINING: [bullet list: what session 2 must still implement]
KEY FUNCTIONS: [function/class names, signatures, and brief purpose]
EDGE CASES: [constraints or tricky logic discovered in session 1]
NEXT STEPS: [ordered list: what session 2 should do first]
```

### 6.2 HandoffValidator

```python
# server/handoff_validator.py
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: bool
    reason: str = ""


class HandoffValidator:
    REQUIRED_SECTIONS = ["TASK:", "COMPLETED:", "REMAINING:",
                         "KEY FUNCTIONS:", "EDGE CASES:", "NEXT STEPS:"]
    MAX_CODE_BLOCK_LINES = 5   # prevents code dumping
    MAX_TOKENS = 400           # hard ceiling (whitespace-separated words, not model tokens)

    def validate(self, content: str) -> ValidationResult:
        # 1. Every required section header must be present
        for section in self.REQUIRED_SECTIONS:
            if section not in content:
                return ValidationResult(valid=False,
                    reason=f"Missing required section: '{section}'")

        # 2. Total lines inside fenced code blocks must stay under the cap
        code_lines = self._count_code_block_lines(content)
        if code_lines > self.MAX_CODE_BLOCK_LINES:
            return ValidationResult(valid=False,
                reason=f"Code block too long ({code_lines} lines, max {self.MAX_CODE_BLOCK_LINES}).")

        # 3. Overall length ceiling
        token_count = len(content.split())
        if token_count > self.MAX_TOKENS:
            return ValidationResult(valid=False,
                reason=f"Handoff too long ({token_count} tokens, max {self.MAX_TOKENS}).")

        return ValidationResult(valid=True)

    def _count_code_block_lines(self, content):
        # Counts lines between ``` fences, summed over all blocks
        in_block, count = False, 0
        for line in content.split("\n"):
            if line.strip().startswith("```"):
                in_block = not in_block
            elif in_block:
                count += 1
        return count
```

**Why this limits gaming:** Code dumps are blocked, so the agent must write structured prose. The reconstruction penalty in the rubric targets the remaining shortcut: session 2 ignoring the note and reconstructing from pretrained priors.

---

## 7. Task Generator [UPDATED]

### 7.1 Name Randomization (addresses issue #5: session separation)

Each episode, function and variable names are remapped so the agent cannot reconstruct the solution from pretrained knowledge alone without reading the handoff.

```python
# server/task_generator.py
import random

NAME_BANK = {
    "merge_intervals": ["combine_ranges", "fuse_spans", "join_segments"],
    "RateLimiter": ["ThrottleGuard", "RequestBucket", "AccessGate"],
    "process_data": ["transform_records", "handle_payload", "digest_input"],
    # expanded for each task in the bank
}


class TaskGenerator:
    def sample(self, task_id=None, seed=None):
        if seed is not None:  # "if seed:" would silently ignore seed=0
            random.seed(seed)
        task = self._load_template(task_id)
        # Inject hidden tests before renaming so they receive the same name mapping
        task = self._inject_hidden_tests(task)
        task = self._randomize_names(task)
        return task

    def _randomize_names(self, task):
        for canonical, variants in NAME_BANK.items():
            replacement = random.choice(variants)
            task.description = task.description.replace(canonical, replacement)
            task.starter_code = {k: v.replace(canonical, replacement)
                                 for k, v in task.starter_code.items()}
            task.test_code = task.test_code.replace(canonical, replacement)
            # Hidden tests must reference the renamed symbols too
            task.hidden_test_code = task.hidden_test_code.replace(canonical, replacement)
        return task
```

### 7.2 Hidden Tests (addresses issue #4: test suite exploitability)

Every task has visible tests (shown via `run_tests`) and hidden tests (only run at `submit`), so overfitting to the visible test surface is not enough to score well.

```
easy:    3 visible + 1 hidden adversarial
medium:  5 visible + 2 hidden adversarial
hard:    8 visible + 3 hidden adversarial
```

Hidden tests are hand-written: empty inputs, max-size inputs, concurrent calls, type coercions. These are cases a template-following agent won't naturally handle.

### 7.3 Handoff-Critical Task Design (addresses issue #7: difficulty calibration)

All tasks are designed so session 1 **cannot** finish within the step limit. Step limits are calibrated empirically so that session 1 reaches roughly 60-70% task completion. Any task where session 1 finishes fully is moved to a warmup set and excluded from training.
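One way to operationalize the warmup rule is sketched below. It assumes each task has been probed with a few session-1 runs and that a completion fraction per run is available; the function name `split_handoff_critical`, the task IDs, and the numbers are all hypothetical.

```python
def split_handoff_critical(s1_completion):
    """Split tasks into (training, warmup) lists.

    s1_completion maps task_id -> list of session-1 completion fractions
    (0.0 to 1.0) observed in probe runs. A task goes to the warmup set if
    any probe run finished it completely within the step limit.
    """
    training, warmup = [], []
    for task_id in sorted(s1_completion):
        fractions = s1_completion[task_id]
        if fractions and max(fractions) >= 1.0:
            warmup.append(task_id)
        else:
            training.append(task_id)
    return training, warmup


# Invented probe results for illustration only.
probe_results = {
    "easy_01": [0.60, 0.70, 0.65],
    "easy_02": [0.90, 1.00, 0.80],    # finished once inside session 1
    "medium_01": [0.55, 0.60, 0.70],
}

training, warmup = split_handoff_critical(probe_results)
print("training:", training)
print("warmup:", warmup)
```

Using the maximum over probe runs is the conservative choice: a task that can be finished in session 1 even occasionally gives the agent an episode where the handoff does not matter.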
### 7.4 Eval Holdout Set (addresses issue #11: template overfitting)

`tasks/eval_holdout/` contains 10 tasks never seen during training. They are used only for final evaluation to check generalization, never in the curriculum or in hyperparameter tuning.

---

## 8. Reward Rubric [UPDATED]

### 8.1 Session 1 Auxiliary Rewards (addresses issue #1: credit assignment)

Session 1 has no direct reward; credit assignment across two sessions is the core RL challenge here. With only the delayed end-of-episode reward, GRPO tends to plateau early.

**Fix:** Shaped auxiliary rewards during session 1, decaying over training.

```python
# server/rewards/auxiliary.py
class AuxiliaryRewarder:
    def s1_reward(self, test_result, task):
        reward = 0.0
        if test_result.compiled:
            reward += 0.05
        reward += 0.02 * test_result.passed  # small per-test bonus
        return reward

    def decay_factor(self, epoch, total_epochs):
        # Fades out at 60% of training; the agent then relies on the final reward signal
        return max(0.0, 1.0 - (epoch / (total_epochs * 0.6)))
```

These rewards are multiplied by `decay_factor`, so early training gets a denser signal and late training relies on the real reward. This keeps the agent from over-optimizing partial pass rates at the expense of handoff quality.
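As a worked example of the decay schedule, the sketch below tabulates the factor per epoch. It copies the `decay_factor` formula from this plan; the assumptions are 0-indexed epochs, `TOTAL_EPOCHS = 6` (taken from `num_train_epochs` in the GRPO config), and a sample run that compiles and passes two tests.

```python
def decay_factor(epoch, total_epochs):
    # Same formula as AuxiliaryRewarder.decay_factor in this plan
    return max(0.0, 1.0 - (epoch / (total_epochs * 0.6)))


def decayed_aux(raw_aux, epoch, total_epochs):
    """Auxiliary reward actually credited at a given epoch."""
    return raw_aux * decay_factor(epoch, total_epochs)


TOTAL_EPOCHS = 6            # assumption: matches num_train_epochs
RAW_AUX = 0.05 + 0.02 * 2   # compiled bonus + two passing tests

for epoch in range(TOTAL_EPOCHS):
    factor = decay_factor(epoch, TOTAL_EPOCHS)
    credited = decayed_aux(RAW_AUX, epoch, TOTAL_EPOCHS)
    print(f"epoch {epoch}: factor={factor:.3f} credited={credited:.4f}")
```

With six epochs the cutoff falls at 0.6 × 6 = 3.6, so under this indexing the factor is positive for epochs 0 through 3 and zero from epoch 4 onward.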
### 8.2 Main Rubric (addresses issues #2, #3, #4, #6)

```python
# server/rewards/rubric.py
from openenv import Rubric

HANDOFF_TOKEN_BUDGET = 300


class ContinuityRubric(Rubric):
    def score(self, visible_results, hidden_results, handoff,
              s2_edit_history, s2_failed_runs, invalid_actions):
        # Component 1: Test score, visible + hidden weighted
        v_score = visible_results.passed / max(visible_results.total, 1)
        h_score = hidden_results.passed / max(hidden_results.total, 1)
        test_score = 0.6 * v_score + 0.4 * h_score  # hidden tests carry real weight

        # Component 2: Handoff quality (replaces naive token count)
        quality_score = self._handoff_quality(handoff)

        # Component 3: Linearity (replaces re-read counting, see issue #3)
        linearity_score = self._linearity(s2_edit_history, s2_failed_runs)

        # Reconstruction penalty (addresses issue #2 shortcut)
        rewrite_penalty = self._rewrite_penalty(s2_edit_history)

        # Invalid action penalty
        action_penalty = min(invalid_actions * 0.02, 0.1)

        # Positive weights sum to 0.90, so the maximum total is 0.90
        total = (
            0.55 * test_score +
            0.20 * quality_score +
            0.15 * linearity_score -
            rewrite_penalty -
            action_penalty
        )
        return {
            "total": round(max(0.0, total), 4),
            "test_score": test_score,
            "quality_score": quality_score,
            "linearity_score": linearity_score,
            "rewrite_penalty": rewrite_penalty,
            "action_penalty": action_penalty
        }

    def _handoff_quality(self, handoff):
        # Replaces naive token count: measures structure + density + compression
        if not handoff:
            return 0.0
        score = 0.0
        tokens = handoff.split()
        token_count = len(tokens)

        # Compression: full credit within budget, linear falloff above it
        if token_count <= HANDOFF_TOKEN_BUDGET:
            score += 0.4
        else:
            overage = token_count - HANDOFF_TOKEN_BUDGET
            score += max(0.0, 0.4 - (overage / HANDOFF_TOKEN_BUDGET) * 0.4)

        # Structure: reward presence of the four core sections
        # (the validator already enforces all six before session 2 starts)
        sections = ["COMPLETED:", "REMAINING:", "KEY FUNCTIONS:", "NEXT STEPS:"]
        score += 0.3 * (sum(1 for s in sections if s in handoff) / len(sections))

        # Information density: unique word ratio penalizes repetition
        unique_ratio = len(set(tokens)) / max(token_count, 1)
        score += 0.2 * min(unique_ratio * 2, 1.0)

        # Structural formatting bonus
        has_bullets = any(l.strip().startswith(("-", "*", "1.", "TODO"))
                          for l in handoff.split("\n"))
        score += 0.1 if has_bullets else 0.0

        return round(score, 4)

    def _linearity(self, edit_history, failed_runs):
        # Track thrashing (reverting writes) and failed test runs
        # Better signal than counting re-reads (addresses issue #3)
        if not edit_history:
            return 0.5
        # A thrash is a write that restores a file to its content before the previous write
        thrash_count = sum(
            1 for i in range(1, len(edit_history))
            if edit_history[i]["path"] == edit_history[i-1]["path"]
            and edit_history[i]["new"] == edit_history[i-1]["prev"]
        )
        thrash_penalty = min(thrash_count * 0.1, 0.5)
        run_penalty = min(failed_runs * 0.05, 0.3)
        return round(max(0.0, 1.0 - thrash_penalty - run_penalty), 4)

    def _rewrite_penalty(self, edit_history):
        # If session 2 wrote large volumes to previously-empty files,
        # it likely reconstructed from pretrained priors, not the handoff
        if not edit_history:
            return 0.0
        total_written = sum(len(e["new"]) for e in edit_history)
        total_previous = sum(len(e["prev"]) for e in edit_history)
        if total_previous == 0 and total_written > 500:
            return 0.15
        return 0.0
```

### 8.3 Why the revised rubric is hard to game

| Game attempt | Why it fails |
|---|---|
| Dump code into handoff | HandoffValidator rejects code blocks > 5 lines |
| Write minimal/empty handoff | Validator rejects notes missing sections; a hollow note leaves session 2 without what it needs, so tests fail |
| Session 2 rewrites from pretrained priors | rewrite_penalty fires |
| Thrash writes in session 2 | linearity thrash detection penalizes |
| Pass visible tests, ignore edge cases | hidden tests weighted 40% of test_score |
| Rely on consistent tool patterns | name randomization breaks pattern reliance |

---

## 9. Sandbox [UPDATED: stricter ulimits]

```python
# server/sandbox.py
import os
import resource
import subprocess
import sys
import tempfile


class Sandbox:
    def __init__(self, timeout=10):
        self.timeout = timeout

    def run_tests(self, files, test_code):
        with tempfile.TemporaryDirectory() as tmpdir:
            self._write_files(tmpdir, files, test_code)

            def set_limits():
                # Applied in the child process just before exec (POSIX only)
                resource.setrlimit(resource.RLIMIT_CPU, (8, 8))
                resource.setrlimit(resource.RLIMIT_AS, (256*1024*1024,)*2)  # 256MB RAM
                resource.setrlimit(resource.RLIMIT_NOFILE, (20, 20))        # 20 file handles
                resource.setrlimit(resource.RLIMIT_NPROC, (10, 10))         # no fork bombs

            try:
                result = subprocess.run(
                    # sys.executable avoids depending on "python" being on the reduced PATH
                    [sys.executable, "-m", "pytest", "test_solution.py",
                     "--tb=short", "-q", "--no-header"],
                    capture_output=True, text=True,
                    timeout=self.timeout,
                    cwd=tmpdir,
                    preexec_fn=set_limits,
                    # Minimal environment; this alone does NOT block network access
                    env={"PATH": "/usr/bin:/bin"}
                )
                return self._parse_result(result.stdout, result.returncode)
            except subprocess.TimeoutExpired:
                return TestResult(passed=0, total=1, compiled=False,
                                  summary="Timeout: likely infinite loop")
            except Exception as e:
                return TestResult(passed=0, total=1, compiled=False,
                                  summary=f"Sandbox error: {e}")
```

Note: If on-site infrastructure permits, upgrade to Docker container isolation for the full training run. Subprocess + ulimits is sufficient for dev and demo.

---

## 10. Training Pipeline [UPDATED]

### 10.1 Model

`unsloth/Qwen2.5-Coder-7B-Instruct`: coding-specialized, expected to fit a Colab T4 in 4-bit, with roughly 2x speedup from Unsloth over vanilla HF.

### 10.2 Algorithm: GRPO primary, PPO backup (addresses issue #15)

GRPO can be unstable with small batches and noisy rewards. Run PPO in parallel as a sanity check. If GRPO diverges, PPO gives a usable training curve to show.
**Reward normalization (critical):**

```python
def normalize_rewards(rewards):
    # Standardize a group of episode rewards to zero mean and unit variance
    mean = sum(rewards) / len(rewards)
    std = (sum((r-mean)**2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

**GRPO config:**

```yaml
num_train_epochs: 6
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 2e-5
reward_normalization: true
clip_range: 0.2
kl_coeff: 0.05        # limits drift from the reference policy, mitigating reward hacking
warmup_steps: 50
```

### 10.3 Episode rollout (handles stuck agents and invalid actions)

```python
def rollout(env, agent, aux_rewarder, epoch, total_epochs):
    # Assumes the env wrapper exposes step() results as (obs, reward, done, info)
    obs = env.reset()
    done = False
    reward = 0.0
    trajectory = []
    total_aux = 0.0
    decay = aux_rewarder.decay_factor(epoch, total_epochs)

    # Session 1
    for _ in range(env.step_limit + 2):  # +2 buffer for late handoff warning
        action = agent.act(obs)
        obs, reward, done, info = env.step(action)
        if "auxiliary_reward" in info:
            total_aux += info["auxiliary_reward"] * decay
        trajectory.append((obs, action, reward, info))
        if done or info.get("session") == 2:
            break

    if env.state()["session"] == 1:
        return trajectory, 0.0  # hit step limit without handoff

    # Session 2: start from a cold observation, then follow the env's observations
    obs = {"session": 2, "message": "Call parse_handoff() to retrieve your note."}
    for _ in range(env.step_limit):
        action = agent.act(obs)
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, info))
        if done:
            break

    # Return the raw episode reward; normalize_rewards() is applied across
    # the whole group of rollouts, since a single value cannot be normalized.
    final_reward = (reward or 0.0) + total_aux
    return trajectory, final_reward
```

### 10.4 Curriculum (addresses issue #7)

```
Epochs 1-2:  easy tasks only    → learn basic handoff structure
Epochs 3-4:  easy + medium      → learn compression under step pressure
Epochs 5-6:  medium + hard      → learn surgical prioritization
Eval only:   holdout set        → generalization check, never in training
```

### 10.5 Colab notebook outline

```
Cell 1:  Install: openenv unsloth trl transformers wandb pytest
Cell 2:  Load env from HF Space
Cell 3:  Load Qwen2.5-Coder-7B-Instruct (Unsloth 4-bit)
Cell 4:  Run all 3 baselines → save baseline_results.json
Cell 5:  GRPO training loop with rollout → log to wandb
Cell 6:  Run PPO for comparison
Cell 7:  Eval on holdout set (trained model vs baselines)
Cell 8:  Save all plots as PNG to /plots/
Cell 9:  Ablation runs (3 configs)
Cell 10: Print first-epoch vs final-epoch handoff notes side by side
```

---

## 11. Baselines [NEW: addresses issue #12]

All four rows go on the same plot. Without these reference points, a reward improvement cannot be interpreted.

| Baseline | Description | Expected S2 pass rate |
|---|---|---|
| No handoff | Session 2 starts with blank note | ~5-10% |
| Random handoff | Gibberish as the handoff note | ~8-12% |
| **Trained agent (ours)** | Our GRPO-trained model | Target: >60% |
| Full S1 transcript | Upper bound, all context given | ~75-85% |

The trained agent should be comfortably above random and approaching (not matching) the full transcript upper bound. That gap tells the story clearly.

---

## 12. Ablation Studies [NEW: addresses issue #17]

Three ablations to justify each reward component to judges:

| Ablation | Removed component | Expected degradation |
|---|---|---|
| No compression reward | quality_score = 0 | Handoffs become bloated |
| No linearity reward | linearity_score = 0 | Session 2 thrashes more |
| No auxiliary S1 reward | AuxiliaryRewarder disabled | Slower convergence |

Plot all ablations vs full model on same axes in `plots/ablation_comparison.png`. One-line caption per plot. Axes labeled: "Training Episode" (x) / "Total Reward" (y).

---

## 13. Evaluation Reporting [NEW: addresses issue #8]

Don't aggregate across difficulties; it hides where the agent struggles. Report separately per difficulty and across seeds:

```
easy tasks:     pass rate | avg handoff tokens | avg S2 steps
medium tasks:   same
hard tasks:     same
holdout tasks:  same  ← generalization signal

Run 3 seeds minimum. Report mean ± std.
```

---

## 14. Interpretability [NEW: addresses issue #16]

Show *what the agent learned to keep vs drop* across training epochs.
```python
# Track which handoff sections grow or shrink over training
def analyze_handoff_evolution(handoff_log):
    # handoff_log maps epoch -> list of handoff notes written in that epoch.
    # extract_section(note, header) is assumed to return the text of one section.
    section_lengths = {}
    for epoch, handoffs in handoff_log.items():
        section_lengths[epoch] = {}
        for section in ["COMPLETED:", "REMAINING:", "KEY FUNCTIONS:", "NEXT STEPS:"]:
            lengths = [len(extract_section(h, section)) for h in handoffs]
            # Mean section length in characters; guard against an empty epoch
            section_lengths[epoch][section] = sum(lengths) / max(len(lengths), 1)
    return section_lengths
```

Plot as stacked bar chart (`plots/handoff_diff_over_epochs.png`).

Expected learning signal visible in the chart:

- COMPLETED section shrinks (agent stops over-documenting finished work)
- REMAINING section gets more precise (specific function names, not vague prose)
- NEXT STEPS section grows and becomes the highest-value section for session 2

This is the interpretability story for the blog and pitch.

---

## 15. Agent Loop (Client) [UPDATED: addresses issue #13]

```python
# client/agent.py (no server imports)
from dataclasses import dataclass

S1_SYSTEM_PROMPT = """You are working on a coding task in Session 1.
Complete as much as possible. When approaching your step limit, call write_handoff()
with a structured note following this format:
TASK: / COMPLETED: / REMAINING: / KEY FUNCTIONS: / EDGE CASES: / NEXT STEPS:
You have a retry budget for invalid actions. Use it wisely."""

S2_SYSTEM_PROMPT = """You are in Session 2. You have NO memory of Session 1.
Your ONLY information is the handoff note. Start by calling parse_handoff(),
then use the note to continue the task. Do not rewrite everything from scratch."""


@dataclass
class Action:
    # Minimal client-side action record; fields mirror what the env reads
    tool: str
    path: str = ""
    content: str = ""


class Agent:
    def __init__(self, model, tokenizer, retry_budget=3):
        self.model = model
        self.tokenizer = tokenizer
        self.retry_budget = retry_budget
        self.context = []
        self.session = 1  # last session seen in an observation

    def act(self, obs):
        prompt = self._build_prompt(obs)
        for attempt in range(self.retry_budget):
            response = self._generate(prompt)
            action = self._parse_action(response)
            if action is not None:
                self.context.append({"obs": obs, "action": action})
                return action
            prompt = self._build_retry_prompt(prompt, response, attempt)
        return Action(tool="noop", content="")  # graceful no-op on exhaustion

    def _build_prompt(self, obs):
        # Error observations carry no "session" key, so fall back to the last known one
        self.session = obs.get("session", self.session)
        system = S1_SYSTEM_PROMPT if self.session == 1 else S2_SYSTEM_PROMPT
        return system + "\n\n" + format_obs(obs)
```

---

## 16. Risk Register [UPDATED: full 20-issue resolution]

| # | Issue | Severity | Status | Resolution |
|---|---|---|---|---|
| 1 | Credit assignment: S1 has no direct reward | HIGH | FIXED | Auxiliary shaped rewards + decay schedule |
| 2 | Handoff gaming: code dumps / hinting | HIGH | FIXED | HandoffValidator + code block limit + rewrite penalty |
| 3 | Linearity metric weak (re-read counting) | MEDIUM | FIXED | Thrash detection on edit history + failed run rate |
| 4 | Test suite exploitable | MEDIUM | FIXED | Hidden adversarial tests at submit |
| 5 | Session separation weak | MEDIUM | FIXED | Name randomization per episode seed |
| 6 | Compression metric naive | MEDIUM | FIXED | Multi-factor quality score: structure + density + ratio |
| 7 | Task difficulty miscalibrated | MEDIUM | FIXED | Step limits verified empirically, handoff-critical design |
| 8 | Evaluation hides per-difficulty gaps | MEDIUM | FIXED | Separate easy/medium/hard/holdout reporting |
| 9 | Sandbox not fully isolated | MEDIUM | FIXED | Strict ulimits: CPU, RAM, file handles, forks |
| 10 | Step limit too tight or too loose | LOW | FIXED | Dynamic by difficulty, late-handoff warning |
| 11 | Template overfitting | MEDIUM | FIXED | Name randomization + holdout eval set |
| 12 | No baselines | HIGH | FIXED | 3 baselines (including the full-transcript upper bound) + trained agent, all on same plot |
| 13 | Agent gets stuck / invalid actions | LOW | FIXED | Retry budget, invalid action penalty, noop fallback |
| 14 | Tool pattern exploitation | LOW | ACCEPTED | Name randomization covers most of this; minor risk |
| 15 | GRPO instability | MEDIUM | FIXED | Reward normalization, KL coeff, PPO backup |
| 16 | No interpretability | MEDIUM | FIXED | Handoff section evolution tracking + diff plot |
| 17 | No ablation studies | MEDIUM | FIXED | 3 ablations with plots |
| 18 | Demo risk | LOW | FIXED | Deterministic seeds, pre-recorded run URL |
| 19 | Handoff format inconsistent | HIGH | FIXED | Mandatory 6-section structure enforced by validator |
| 20 | Tests don't capture understanding | LOW | PARTIAL | Hidden adversarial tests cover this adequately for hackathon scope |

**Issue #14 accepted as low-risk:** name randomization already breaks most pattern exploitation. Full tool response variation adds complexity with marginal gain.

**Issue #20 partial:** mutation testing is a research-grade addition, out of scope for the hackathon timeline.

---

## 17. Demo Preparation [NEW: addresses issue #18]

- **Deterministic seed**: `env.reset(seed=42)` gives the same task and the same names, so the run is reproducible
- **Pre-recorded run**: screen recording of a successful trained-agent episode, hosted as URL (not committed to repo). Linked from README.
- **Fallback slide**: screenshot of first-epoch vs final-epoch handoff side by side, which shows the learning visually to a non-technical audience

**Never end the live demo on `submit()`:** it is too unpredictable. End on the handoff note being written and displayed. That's the visual payoff.

---

## 18. Submission Checklist [UPDATED]

| Requirement | How satisfied | Status |
|---|---|---|
| OpenEnv latest release | `MCPEnvironment` subclass, `openenv.yaml`, pinned version in requirements.txt | [ ] |
| Training script (Unsloth/TRL) | `training/train_grpo.ipynb`, Colab T4, re-runnable in <30 min | [ ] |
| Training evidence | `plots/`: reward, length, 4-way baseline comparison, ablations, interpretability, all PNG | [ ] |
| Mini blog OR video | HF blog post + <2 min YouTube video | [ ] |
| HF Space | `yourteam/cross-session-continuity-env`, live and runnable | [ ] |
| README with all links | Space, notebook, blog, video, WandB run | [ ] |
| No large files in repo | Videos as `.url` text files only | [ ] |
| Baselines | 3 baselines (including upper bound) documented and plotted against the trained agent | [ ] |
| Ablations | 3 ablations documented and plotted | [ ] |
| Holdout eval | Generalization results on 10 unseen tasks | [ ] |
| Per-difficulty breakdown | easy / medium / hard results reported separately | [ ] |

---

## 19. README Template [UPDATED]

The pass rates in the template are the targets from Section 11 and must be replaced with measured values before submission.

```markdown
# Cross-Session Continuity Env

> Can RL teach an LLM to write better notes to its future self?

## Problem
LLMs forget everything when a session ends. For long coding tasks that span
multiple sessions this is critical. No existing RL environment trains for this.

## How It Works
[diagram: session1 → handoff.md → session2 → reward]

Session 1: agent gets task + starter code. Works until step limit.
Must write a structured 6-section handoff note before session ends.

Session 2: starts completely cold. Only the handoff note exists.
Must complete the task and pass tests.

Reward = test correctness (visible + hidden) + handoff quality + session 2 linearity.

## Reward Breakdown
| Component         | Weight      | What it measures                                   |
|-------------------|-------------|----------------------------------------------------|
| Tests (visible)   | 33%         | Session 2 correctness                              |
| Tests (hidden)    | 22%         | Generalization, no test overfitting                |
| Handoff quality   | 20%         | Structure, density, compression                    |
| Linearity         | 15%         | Session 2 didn't thrash                            |
| Penalties         | subtractive | Invalid actions (up to 0.10), reconstruction (0.15) |

Positive components sum to 90%; penalties are subtracted from that total.

## Results
| Agent                  | S2 Test Pass Rate |
|------------------------|-------------------|
| No handoff (baseline)  | ~8%               |
| Random handoff         | ~11%              |
| Trained (ours)         | ~65%              |
| Full transcript (UB)   | ~80%              |

![reward curve](plots/reward_curve.png)
*Total reward over training episodes, all baselines on same axes*

![ablations](plots/ablation_comparison.png)
*Each reward component contribution: ablation study*

![handoff evolution](plots/handoff_diff_over_epochs.png)
*What the agent learned to keep vs drop over training*

## Before / After
**First epoch:** 900 tokens, rambling, full code blocks, no structure
**Final epoch:** 180 tokens, 6 clear sections, precise function names, zero code

## Links
- HF Space: [url]
- Colab Notebook: [url]
- HF Blog Post: [url]
- YouTube Demo (<2 min): [url]
- WandB Training Run: [url]
```

---

## 20. Pitch Story [UPDATED]

> "Every developer has hit this wall. You're deep into a coding task with an AI
> assistant. The session ends. You come back the next day, and the AI remembers
> nothing. You start over from scratch.
>
> We asked a different question: what if we trained the AI to leave a perfect
> briefing for its future self?
>
> Cross-Session Continuity Env is an RL environment where an agent must complete
> a coding task split across two sessions with zero shared memory. Session 1
> works on the problem, then writes a structured handoff note. Session 2 starts
> completely cold: only that note exists.
>
> The agent is rewarded not for session 1 performance, but for how well its
> future self performs using only the note it left behind.
>
> After training, the agent learned something we didn't expect. It stopped writing
> long rambling summaries. It started writing surgical briefings: 180 words,
> six sections, exactly what session 2 needs and nothing it doesn't.
>
> Test pass rates went from 8% (no handoff at all) to 65%.
>
> No one has trained this behavior explicitly before. We think it matters."

---

## 21. Timeline [UPDATED]

| Day | Task | Risk & Contingency |
|---|---|---|
| Day 1 (pre-onsite) | Task bank: 20 tasks + holdout set. Sandbox + ulimits tested. HandoffValidator working. | Sandbox is highest-risk, so do it first. Fallback: relax ulimits if resource module unavailable |
| Day 2 (pre-onsite) | Env class, session manager, rubric, auxiliary rewarder. Full unit tests on each. | Rubric edge cases: budget 2h for test coverage |
| Day 3 (pre-onsite) | End-to-end episode: agent completes 2-session run. Client/server separation verified. | Integration bugs: if stuck, simplify tool set |
| Day 4 (onsite 25th) | Colab notebook. All 3 baseline runs. First GRPO curves. WandB connected. | Compute time: run baselines overnight if needed |
| Day 5 (onsite 26th am) | Full training run on HF credits. Ablations. Plots committed. | GRPO divergence: fall back to PPO results |
| Day 5 (onsite 26th pm) | HF Space live. README + blog done. Demo recorded. Final checklist. | Deployment issues: test HF Space access 24h early |

---

## 22. What Good Looks Like at Submission

1. Judge visits HF Space → watches a live 2-session run with trained agent
2. Reward curve shows clear upward trend, with the trained agent and all 3 baselines on the same plot
3. Ablation plot shows each component contributes something measurable
4. First-epoch vs final-epoch handoff note is visibly, strikingly different
5. Per-difficulty breakdown shows where the agent is strong vs weak
6. Colab notebook re-runs in under 30 minutes on a T4
7. Holdout eval confirms generalization, not just memorization

All seven = strong submission that covers every judging criterion.