Spaces:

Aswini-Kumar
/

cross-session-continuity-env

Sleeping

App Files Files Community

cross-session-continuity-env / implementation (1).md

Aswini-Kumar

upload: implementation (1).md

c3defd1 verified about 1 month ago

preview code

raw

history blame contribute delete

40.5 kB

	# Cross-Session Continuity Env — Implementation Plan (v2)

	> Changelog from v1: Addressed 20 potential failure modes identified in review.
	> Each section marked [UPDATED], [NEW], or [UNCHANGED] for traceability.

	---

	## 1. Problem Statement [UNCHANGED]

	Capability Gap: LLMs have no persistent memory across sessions. When a session ends,
	everything is gone. In real-world usage this is a critical failure mode — long tasks
	(codebases, research, planning) rarely fit in a single context window.

	What we train: Can RL teach an LLM to write surgical, information-dense handoff notes
	to its future self, such that a cold-start agent in session 2 can complete the task
	successfully using only those notes?

	Why it's novel: No existing RL environment specifically trains or benchmarks
	cross-session state transfer behavior. This is underexplored and publishable.

	Theme: Primarily Theme 2 (Long-Horizon Planning). Secondary fit with Theme 3.1 —
	agent uses real tools (file I/O, test runner) in a dynamic coding environment.

	---

	## 2. High-Level Architecture [UPDATED]

	```
	Episode = Session 1 + Session 2 (ONE training episode, ONE reward signal)

	Session 1:
	Agent receives → task description + starter code + tool access
	Agent works → reads files, writes code, runs tests
	[Auxiliary rewards fire here — see Section 8]
	Agent ends → calls write_handoff(structured_note) → session 1 terminates

	↓ [handoff.md is the ONLY bridge]
	↓ [filesystem wiped — no code persists]
	↓ [function/variable names randomized per episode]

	Session 2:
	Agent receives → ONLY handoff.md + same tool access
	Agent must call parse_handoff() before file access (enforced)
	Agent works → picks up, finishes implementation
	Agent ends → calls submit() → visible + hidden tests run → reward computed

	Reward flows back through both sessions via GRPO (with normalization)
	PPO run in parallel as stability baseline
	```

	---

	## 3. Repository Structure [UPDATED]

	```
	cross-session-continuity-env/
	│
	├── openenv.yaml
	├── README.md
	├── requirements.txt # pinned: openenv==x.y.z
	│
	├── server/
	│ ├── env.py # MCPEnvironment subclass
	│ ├── task_generator.py # task + test generation with name randomization
	│ ├── session_manager.py # session 1 → 2 transition, filesystem wipe
	│ ├── sandbox.py # safe execution, strict ulimits
	│ ├── handoff_validator.py # NEW: validates handoff structure
	│ └── rewards/
	│ ├── rubric.py # composable rubrics (UPDATED)
	│ └── auxiliary.py # NEW: session 1 auxiliary rewards
	│
	├── client/
	│ └── agent.py # agent loop — no server imports, with retry logic
	│
	├── tasks/
	│ ├── easy/ # single file, 3 visible + 1 hidden test
	│ ├── medium/ # 2-3 files, 5 visible + 2 hidden tests
	│ ├── hard/ # 5 files, 8 visible + 3 hidden tests
	│ └── eval_holdout/ # NEW: unseen tasks for evaluation only
	│
	├── training/
	│ ├── train_grpo.ipynb # primary training (GRPO)
	│ ├── train_ppo.ipynb # NEW: PPO baseline for stability comparison
	│ └── grpo_config.yaml
	│
	├── evals/
	│ ├── baselines/
	│ │ ├── no_handoff.py # NEW: session 2 with no note at all
	│ │ ├── random_handoff.py # NEW: random text as handoff
	│ │ └── full_transcript.py # NEW: upper bound — full S1 transcript
	│ ├── ablations/
	│ │ ├── no_compression_reward.py # NEW: ablation
	│ │ ├── no_linearity_reward.py # NEW: ablation
	│ │ └── no_auxiliary_reward.py # NEW: ablation
	│ └── trained_run.py
	│
	├── plots/ # all committed as PNG with captions
	│ ├── reward_curve.png
	│ ├── handoff_length_curve.png
	│ ├── baseline_vs_trained.png # all 4 baselines on same axes
	│ ├── ablation_comparison.png # NEW
	│ ├── difficulty_breakdown.png # NEW: easy/medium/hard separately
	│ └── handoff_diff_over_epochs.png # NEW: interpretability
	│
	└── demos/
	└── recorded_run_seed42.url # URL only — no large files in repo
	```

	---

	## 4. OpenEnv Compliance [UNCHANGED]

	### 4.1 openenv.yaml

	```yaml
	name: cross-session-continuity-env
	version: 0.1.0
	theme: long-horizon-planning
	description: >
	An RL environment where an LLM agent must complete a coding task across two
	sessions with zero shared memory. The agent writes a structured handoff note
	at the end of session 1; session 2 receives only that note. Reward depends
	entirely on session 2 success.
	entry: server/env.py
	tools:
	- read_file
	- write_file
	- run_tests
	- write_handoff
	- parse_handoff
	- submit
	sessions: 2
	difficulty_levels:
	- easy
	- medium
	- hard
	```

	### 4.2 Reserved Tool Names — Avoided

	`reset`, `step`, `state`, `close` are OpenEnv reserved — none used.
	Our tools: `read_file`, `write_file`, `run_tests`, `write_handoff`, `parse_handoff`, `submit` — all clear.

	### 4.3 Client/Server Separation

	- `client/agent.py` talks to env via MCP protocol only
	- Client never imports from `server/`
	- All state lives server-side

	### 4.4 Gym-style API

	```python
	env.reset() # starts episode, returns session 1 observation
	env.step() # action → (obs, reward, done, info)
	env.state() # current env state dict
	```

	---

	## 5. Environment Implementation [UPDATED]

	Key changes from v1:
	- Dynamic step limits by difficulty
	- Auxiliary reward hooks in session 1
	- Handoff structure validation before session 2 starts
	- Invalid action handling with retry budget
	- Agent must call `parse_handoff()` before file access in session 2
	- Filesystem wiped on session transition

	```python
	# server/env.py
	from openenv import MCPEnvironment
	from .task_generator import TaskGenerator
	from .session_manager import SessionManager
	from .sandbox import Sandbox
	from .rewards.rubric import ContinuityRubric
	from .rewards.auxiliary import AuxiliaryRewarder
	from .handoff_validator import HandoffValidator

	STEP_LIMITS = {"easy": 20, "medium": 35, "hard": 55}

	class CrossSessionContinuityEnv(MCPEnvironment):

	def __init__(self, difficulty="medium"):
	self.task_gen = TaskGenerator(difficulty)
	self.session_mgr = SessionManager()
	self.sandbox = Sandbox(timeout=10)
	self.rubric = ContinuityRubric()
	self.aux = AuxiliaryRewarder()
	self.validator = HandoffValidator()
	self.difficulty = difficulty
	self.step_limit = STEP_LIMITS[difficulty]

	def reset(self, task_id=None, seed=None):
	self.task = self.task_gen.sample(task_id, seed=seed) # names randomized
	self.session = 1
	self.handoff = None
	self.step_count = 0
	self.invalid_action_count = 0
	self.retry_budget = 3
	self.s1_test_history = []
	self.s2_edit_history = []
	self.handoff_parsed = False
	self.s2_failed_runs = 0

	return {
	"session": 1,
	"task": self.task.description,
	"starter_code": self.task.starter_code,
	"message": "Session 1 started. Complete what you can, then call write_handoff().",
	"step_limit": self.step_limit
	}

	def step(self, action):
	self.step_count += 1

	# Step limit enforcement
	if self.step_count > self.step_limit and self.session == 1:
	return {
	"warning": "Step limit reached. Call write_handoff() now or episode terminates.",
	"penalty": -0.1
	}

	# Invalid action guard
	if not self._is_valid_action(action):
	self.invalid_action_count += 1
	self.retry_budget -= 1
	if self.retry_budget <= 0:
	return {"done": True, "reward": 0.0, "error": "Retry budget exhausted"}
	return {"error": f"Invalid action '{action.tool}'. Retries left: {self.retry_budget}"}

	if action.tool == "read_file":
	if self.session == 2 and not self.handoff_parsed:
	return {"error": "Call parse_handoff() before accessing files in session 2."}
	content = self.task.files.get(action.path, "File not found.")
	return {"output": content, "session": self.session}

	if action.tool == "parse_handoff":
	if self.session != 2:
	return {"error": "parse_handoff only available in session 2"}
	self.handoff_parsed = True
	return {"output": self.handoff, "session": 2}

	if action.tool == "write_file":
	prev = self.task.files.get(action.path, "")
	self.task.files[action.path] = action.content
	if self.session == 2:
	self.s2_edit_history.append({"path": action.path,
	"prev": prev, "new": action.content})
	return {"output": f"Written to {action.path}", "session": self.session}

	if action.tool == "run_tests":
	result = self.sandbox.run_tests(self.task.files, self.task.test_code)
	if self.session == 1:
	self.s1_test_history.append(result.passed)
	aux = self.aux.s1_reward(result, self.task)
	return {"output": result.summary, "passed": result.passed,
	"auxiliary_reward": aux, "session": 1}
	else:
	if result.passed == 0:
	self.s2_failed_runs += 1
	return {"output": result.summary, "passed": result.passed, "session": 2}

	if action.tool == "write_handoff":
	if self.session != 1:
	return {"error": "write_handoff only available in session 1"}
	validation = self.validator.validate(action.content)
	if not validation.valid:
	return {"error": f"Handoff rejected: {validation.reason}. "
	f"Required sections: {self.validator.REQUIRED_SECTIONS}"}
	self.handoff = action.content
	self.session = 2
	self.handoff_parsed = False
	self.task = self.session_mgr.transition(self.task) # wipe filesystem
	self.retry_budget = 3
	return {
	"session": 2,
	"message": "Session 2 started. Call parse_handoff() first."
	}

	if action.tool == "submit":
	if self.session != 2:
	return {"error": "submit only available in session 2"}
	visible = self.sandbox.run_tests(self.task.files, self.task.test_code)
	hidden = self.sandbox.run_tests(self.task.files, self.task.hidden_test_code)
	reward = self.rubric.score(
	visible_results=visible,
	hidden_results=hidden,
	handoff=self.handoff,
	s2_edit_history=self.s2_edit_history,
	s2_failed_runs=self.s2_failed_runs,
	invalid_actions=self.invalid_action_count
	)
	return {"done": True, "reward": reward,
	"visible": visible.summary, "hidden": hidden.summary}

	def state(self):
	return {
	"session": self.session,
	"step_count": self.step_count,
	"step_limit": self.step_limit,
	"handoff_written": self.handoff is not None,
	"handoff_length": len(self.handoff.split()) if self.handoff else 0,
	"difficulty": self.difficulty,
	"invalid_actions": self.invalid_action_count
	}

	def _is_valid_action(self, action):
	s1_tools = {"read_file", "write_file", "run_tests", "write_handoff"}
	s2_tools = {"parse_handoff", "read_file", "write_file", "run_tests", "submit"}
	return action.tool in (s1_tools if self.session == 1 else s2_tools)
	```

	---

	## 6. Handoff Format — Standardized [NEW]

	Issue addressed (#19): Free-form text leads to inconsistent quality and lets the agent
	game the compression metric with dense-but-useless prose.

	Fix: Enforce a required 6-section structure. `HandoffValidator` rejects the note and
	returns an error (not a penalty) so the agent can retry within its retry budget.

	### 6.1 Required handoff template

	```
	TASK:
	[one sentence: what the overall task is]

	COMPLETED:
	[bullet list: what is fully implemented and verified by tests]

	REMAINING:
	[bullet list: what session 2 must still implement]

	KEY FUNCTIONS:
	[function/class names, signatures, and brief purpose]

	EDGE CASES:
	[constraints or tricky logic discovered in session 1]

	NEXT STEPS:
	[ordered list: what session 2 should do first]
	```

	### 6.2 HandoffValidator

	```python
	# server/handoff_validator.py

	class HandoffValidator:
	REQUIRED_SECTIONS = ["TASK:", "COMPLETED:", "REMAINING:",
	"KEY FUNCTIONS:", "EDGE CASES:", "NEXT STEPS:"]
	MAX_CODE_BLOCK_LINES = 5 # prevents code dumping
	MAX_TOKENS = 400 # hard ceiling

	def validate(self, content: str) -> ValidationResult:
	for section in self.REQUIRED_SECTIONS:
	if section not in content:
	return ValidationResult(valid=False,
	reason=f"Missing required section: '{section}'")

	code_lines = self._count_code_block_lines(content)
	if code_lines > self.MAX_CODE_BLOCK_LINES:
	return ValidationResult(valid=False,
	reason=f"Code block too long ({code_lines} lines, max {self.MAX_CODE_BLOCK_LINES}).")

	token_count = len(content.split())
	if token_count > self.MAX_TOKENS:
	return ValidationResult(valid=False,
	reason=f"Handoff too long ({token_count} tokens, max {self.MAX_TOKENS}).")

	return ValidationResult(valid=True)

	def _count_code_block_lines(self, content):
	in_block, count = False, 0
	for line in content.split("\n"):
	if line.strip().startswith("```"):
	in_block = not in_block
	elif in_block:
	count += 1
	return count
	```

	Why this prevents gaming: Code dumps are blocked. The agent must write structured
	prose. The reconstruction penalty in the rubric catches the remaining shortcut —
	session 2 ignoring the note and reconstructing from pretrained priors.

	---

	## 7. Task Generator [UPDATED]

	### 7.1 Name Randomization (addresses issue #5 — session separation)

	Each episode, function and variable names are remapped so the agent cannot reconstruct
	the solution from pretrained knowledge alone without reading the handoff.

	```python
	# server/task_generator.py
	import random

	NAME_BANK = {
	"merge_intervals": ["combine_ranges", "fuse_spans", "join_segments"],
	"RateLimiter": ["ThrottleGuard", "RequestBucket", "AccessGate"],
	"process_data": ["transform_records", "handle_payload", "digest_input"],
	# expanded for each task in the bank
	}

	class TaskGenerator:
	def sample(self, task_id=None, seed=None):
	if seed:
	random.seed(seed)
	task = self._load_template(task_id)
	task = self._randomize_names(task)
	task = self._inject_hidden_tests(task)
	return task

	def _randomize_names(self, task):
	for canonical, variants in NAME_BANK.items():
	replacement = random.choice(variants)
	task.description = task.description.replace(canonical, replacement)
	task.starter_code = {k: v.replace(canonical, replacement)
	for k, v in task.starter_code.items()}
	task.test_code = task.test_code.replace(canonical, replacement)
	return task
	```

	### 7.2 Hidden Tests (addresses issue #4 — test suite exploitability)

	Every task has visible tests (shown via `run_tests`) and hidden tests (only run at `submit`).
	The agent cannot overfit to the visible test surface.

	```
	easy: 3 visible + 1 hidden adversarial
	medium: 5 visible + 2 hidden adversarial
	hard: 8 visible + 3 hidden adversarial
	```

	Hidden tests are hand-written: empty inputs, max-size inputs, concurrent calls, type
	coercions — things a template-following agent won't naturally handle.

	### 7.3 Handoff-Critical Task Design (addresses issue #7 — difficulty calibration)

	All tasks are designed so session 1 cannot finish within the step limit. Verified
	empirically: step limits allow ~60-70% task completion in session 1. Any task where
	session 1 finishes fully is moved to a warmup set and excluded from training.

	### 7.4 Eval Holdout Set (addresses issue #11 — template overfitting)

	`tasks/eval_holdout/` — 10 tasks never seen during training. Used only for final
	evaluation to check generalization. Never used in curriculum or hyperparameter tuning.

	---

	## 8. Reward Rubric [UPDATED]

	### 8.1 Session 1 Auxiliary Rewards (addresses issue #1 — credit assignment)

	Session 1 has no direct reward — credit assignment across two sessions is the core
	RL challenge here. Pure GRPO on delayed reward causes early plateau.

	Fix: Shaped auxiliary rewards during session 1, decaying over training.

	```python
	# server/rewards/auxiliary.py

	class AuxiliaryRewarder:

	def s1_reward(self, test_result, task):
	reward = 0.0
	if test_result.compiled:
	reward += 0.05
	reward += 0.02 * test_result.passed # small per-test bonus
	return reward

	def decay_factor(self, epoch, total_epochs):
	# Fades out at 60% of training — agent transitions to final reward signal
	return max(0.0, 1.0 - (epoch / (total_epochs * 0.6)))
	```

	These are multiplied by `decay_factor` so early training gets denser signal,
	and late training relies on the real reward. This prevents the agent from
	over-optimizing partial pass rates at the expense of handoff quality.

	### 8.2 Main Rubric (addresses issues #3, #6, #2, #4)

	```python
	# server/rewards/rubric.py
	from openenv import Rubric

	HANDOFF_TOKEN_BUDGET = 300

	class ContinuityRubric(Rubric):

	def score(self, visible_results, hidden_results, handoff,
	s2_edit_history, s2_failed_runs, invalid_actions):

	# Component 1: Test score — visible + hidden weighted
	v_score = visible_results.passed / max(visible_results.total, 1)
	h_score = hidden_results.passed / max(hidden_results.total, 1)
	test_score = 0.6 * v_score + 0.4 * h_score # hidden tests carry real weight

	# Component 2: Handoff quality (replaces naive token count)
	quality_score = self._handoff_quality(handoff)

	# Component 3: Linearity (replaces re-read counting — see issue #3)
	linearity_score = self._linearity(s2_edit_history, s2_failed_runs)

	# Reconstruction penalty (addresses issue #2 shortcut)
	rewrite_penalty = self._rewrite_penalty(s2_edit_history)

	# Invalid action penalty
	action_penalty = min(invalid_actions * 0.02, 0.1)

	total = (
	0.55 * test_score
	+ 0.20 * quality_score
	+ 0.15 * linearity_score
	- rewrite_penalty
	- action_penalty
	)

	return {
	"total": round(max(0.0, total), 4),
	"test_score": test_score,
	"quality_score": quality_score,
	"linearity_score": linearity_score,
	"rewrite_penalty": rewrite_penalty,
	"action_penalty": action_penalty
	}

	def _handoff_quality(self, handoff):
	# Replaces naive token count — measures structure + density + compression
	if not handoff:
	return 0.0
	score = 0.0
	tokens = handoff.split()
	token_count = len(tokens)

	# Compression
	if token_count <= HANDOFF_TOKEN_BUDGET:
	score += 0.4
	else:
	overage = token_count - HANDOFF_TOKEN_BUDGET
	score += max(0.0, 0.4 - (overage / HANDOFF_TOKEN_BUDGET) * 0.4)

	# Structure: reward presence of all required sections
	sections = ["COMPLETED:", "REMAINING:", "KEY FUNCTIONS:", "NEXT STEPS:"]
	score += 0.3 * (sum(1 for s in sections if s in handoff) / len(sections))

	# Information density: unique word ratio penalizes repetition
	unique_ratio = len(set(tokens)) / max(token_count, 1)
	score += 0.2 * min(unique_ratio * 2, 1.0)

	# Structural formatting bonus
	has_bullets = any(l.strip().startswith(("-", "*", "1.", "TODO"))
	for l in handoff.split("\n"))
	score += 0.1 if has_bullets else 0.0

	return round(score, 4)

	def _linearity(self, edit_history, failed_runs):
	# Track thrashing (reverting writes) and failed test runs
	# Better signal than counting re-reads (addresses issue #3)
	if not edit_history:
	return 0.5

	thrash_count = sum(
	1 for i in range(1, len(edit_history))
	if edit_history[i]["new"] == edit_history[i-1]["prev"]
	)
	thrash_penalty = min(thrash_count * 0.1, 0.5)
	run_penalty = min(failed_runs * 0.05, 0.3)

	return round(max(0.0, 1.0 - thrash_penalty - run_penalty), 4)

	def _rewrite_penalty(self, edit_history):
	# If session 2 wrote large volumes to previously-empty files,
	# it likely reconstructed from pretrained priors, not the handoff
	if not edit_history:
	return 0.0
	total_written = sum(len(e["new"]) for e in edit_history)
	total_previous = sum(len(e["prev"]) for e in edit_history)
	if total_previous == 0 and total_written > 500:
	return 0.15
	return 0.0
	```

	### 8.3 Why the revised rubric is hard to game

	\| Game attempt \| Why it fails \|
	\|---\|---\|
	\| Dump code into handoff \| HandoffValidator rejects code blocks > 5 lines \|
	\| Write minimal/empty handoff \| quality_score = 0, session 2 fails tests \|
	\| Session 2 rewrites from pretrained priors \| rewrite_penalty fires \|
	\| Thrash writes in session 2 \| linearity thrash detection penalizes \|
	\| Pass visible tests, ignore edge cases \| hidden tests weighted 40% of test_score \|
	\| Rely on consistent tool patterns \| name randomization breaks pattern reliance \|

	---

	## 9. Sandbox [UPDATED — stricter ulimits]

	```python
	# server/sandbox.py
	import subprocess, tempfile, os, resource

	class Sandbox:
	def __init__(self, timeout=10):
	self.timeout = timeout

	def run_tests(self, files, test_code):
	with tempfile.TemporaryDirectory() as tmpdir:
	self._write_files(tmpdir, files, test_code)

	def set_limits():
	resource.setrlimit(resource.RLIMIT_CPU, (8, 8))
	resource.setrlimit(resource.RLIMIT_AS, (25610241024,)*2) # 256MB RAM
	resource.setrlimit(resource.RLIMIT_NOFILE, (20, 20)) # 20 file handles
	resource.setrlimit(resource.RLIMIT_NPROC, (10, 10)) # no fork bombs

	try:
	result = subprocess.run(
	["python", "-m", "pytest", "test_solution.py",
	"--tb=short", "-q", "--no-header"],
	capture_output=True, text=True,
	timeout=self.timeout, cwd=tmpdir,
	preexec_fn=set_limits,
	env={"PATH": "/usr/bin:/bin"} # no network access
	)
	return self._parse_result(result.stdout, result.returncode)
	except subprocess.TimeoutExpired:
	return TestResult(passed=0, total=1, compiled=False,
	summary="Timeout — likely infinite loop")
	except Exception as e:
	return TestResult(passed=0, total=1, compiled=False,
	summary=f"Sandbox error: {e}")
	```

	Note: If on-site infrastructure permits, upgrade to Docker container isolation for
	the full training run. Subprocess + ulimits is sufficient for dev and demo.

	---

	## 10. Training Pipeline [UPDATED]

	### 10.1 Model

	`unsloth/Qwen2.5-Coder-7B-Instruct` — coding-specialized, fits Colab T4 in 4-bit,
	2x speedup from Unsloth over vanilla HF.

	### 10.2 Algorithm: GRPO primary, PPO backup (addresses issue #15)

	GRPO can be unstable with small batches and noisy rewards. Run PPO in parallel as
	a sanity check. If GRPO diverges, PPO gives a usable training curve to show.

	Reward normalization — critical:
	```python
	def normalize_rewards(rewards):
	mean = sum(rewards) / len(rewards)
	std = (sum((r-mean)2 for r in rewards) / len(rewards)) 0.5
	return [(r - mean) / (std + 1e-8) for r in rewards]
	```

	GRPO config:
	```yaml
	num_train_epochs: 6
	per_device_train_batch_size: 2
	gradient_accumulation_steps: 8
	learning_rate: 2e-5
	reward_normalization: true
	clip_range: 0.2
	kl_coeff: 0.05 # prevents reward hacking
	warmup_steps: 50
	```

	### 10.3 Episode rollout (handles stuck agents and invalid actions)

	```python
	def rollout(env, agent, epoch, total_epochs):
	obs = env.reset()
	done = False
	trajectory = []
	total_aux = 0.0
	decay = aux_rewarder.decay_factor(epoch, total_epochs)

	# Session 1
	for _ in range(env.step_limit + 2): # +2 buffer for late handoff warning
	action = agent.act(obs)
	obs, reward, done, info = env.step(action)
	if "auxiliary_reward" in info:
	total_aux += info["auxiliary_reward"] * decay
	trajectory.append((obs, action, reward, info))
	if done or info.get("session") == 2:
	break

	if env.state()["session"] == 1:
	return trajectory, 0.0 # hit step limit without handoff

	# Session 2
	s2_obs = {"session": 2, "message": "Call parse_handoff() to retrieve your note."}
	for _ in range(env.step_limit):
	action = agent.act(s2_obs)
	obs, reward, done, info = env.step(action)
	trajectory.append((obs, action, reward, info))
	if done:
	break

	final_reward = (reward or 0.0) + total_aux
	return trajectory, normalize_reward(final_reward)
	```

	### 10.4 Curriculum (addresses issue #7)

	```
	Epochs 1-2: easy tasks only → learn basic handoff structure
	Epochs 3-4: easy + medium → learn compression under step pressure
	Epochs 5-6: medium + hard → learn surgical prioritization
	Eval only: holdout set → generalization check, never in training
	```

	### 10.5 Colab notebook outline

	```
	Cell 1: Install: openenv unsloth trl transformers wandb pytest
	Cell 2: Load env from HF Space
	Cell 3: Load Qwen2.5-Coder-7B-Instruct (Unsloth 4-bit)
	Cell 4: Run all 3 baselines → save baseline_results.json
	Cell 5: GRPO training loop with rollout → log to wandb
	Cell 6: Run PPO for comparison
	Cell 7: Eval on holdout set (trained model vs baselines)
	Cell 8: Save all plots as PNG to /plots/
	Cell 9: Ablation runs (3 configs)
	Cell 10: Print epoch 1 vs epoch 20 handoff notes side by side
	```

	---

	## 11. Baselines [NEW — addresses issue #12]

	All four on the same plot. Without this, reward improvement is meaningless.

	\| Baseline \| Description \| Expected S2 pass rate \|
	\|---\|---\|---\|
	\| No handoff \| Session 2 starts with blank note \| ~5-10% \|
	\| Random handoff \| Gibberish as the handoff note \| ~8-12% \|
	\| Trained agent (ours) \| Our GRPO-trained model \| Target: >60% \|
	\| Full S1 transcript \| Upper bound — all context given \| ~75-85% \|

	The trained agent should be comfortably above random and approaching (not matching)
	the full transcript upper bound. That gap tells the story clearly.

	---

	## 12. Ablation Studies [NEW — addresses issue #17]

	Three ablations to justify each reward component to judges:

	\| Ablation \| Removed component \| Expected degradation \|
	\|---\|---\|---\|
	\| No compression reward \| quality_score = 0 \| Handoffs become bloated \|
	\| No linearity reward \| linearity_score = 0 \| Session 2 thrashes more \|
	\| No auxiliary S1 reward \| AuxiliaryRewarder disabled \| Slower convergence \|

	Plot all ablations vs full model on same axes in `plots/ablation_comparison.png`.
	One-line caption per plot. Axes labeled: "Training Episode" (x) / "Total Reward" (y).

	---

	## 13. Evaluation Reporting [NEW — addresses issue #8]

	Don't aggregate across difficulties — it hides where the agent struggles.

	Report separately per difficulty and across seeds:

	```
	easy tasks: pass rate \| avg handoff tokens \| avg S2 steps
	medium tasks: same
	hard tasks: same
	holdout tasks: same ← generalization signal

	Run 3 seeds minimum. Report mean ± std.
	```

	---

	## 14. Interpretability [NEW — addresses issue #16]

	Show what the agent learned to keep vs drop across training epochs.

	```python
	# Track which handoff sections grow or shrink over training
	def analyze_handoff_evolution(handoff_log):
	section_lengths = {}
	for epoch, handoffs in handoff_log.items():
	section_lengths[epoch] = {}
	for section in ["COMPLETED:", "REMAINING:", "KEY FUNCTIONS:", "NEXT STEPS:"]:
	lengths = [len(extract_section(h, section)) for h in handoffs]
	section_lengths[epoch][section] = sum(lengths) / len(lengths)
	return section_lengths
	```

	Plot as stacked bar chart (`plots/handoff_diff_over_epochs.png`).

	Expected learning signal visible in the chart:
	- COMPLETED section shrinks (agent stops over-documenting finished work)
	- REMAINING section gets more precise (specific function names, not vague prose)
	- NEXT STEPS section grows and becomes the highest-value section for session 2

	This is the interpretability story for the blog and pitch.

	---

	## 15. Agent Loop (Client) [UPDATED — addresses issue #13]

	```python
	# client/agent.py — no server imports

	S1_SYSTEM_PROMPT = """You are working on a coding task in Session 1.
	Complete as much as possible. When approaching your step limit, call write_handoff()
	with a structured note following this format:
	TASK: / COMPLETED: / REMAINING: / KEY FUNCTIONS: / EDGE CASES: / NEXT STEPS:
	You have a retry budget for invalid actions. Use it wisely."""

	S2_SYSTEM_PROMPT = """You are in Session 2. You have NO memory of Session 1.
	Your ONLY information is the handoff note. Start by calling parse_handoff(),
	then use the note to continue the task. Do not rewrite everything from scratch."""

	class Agent:
	def __init__(self, model, tokenizer, retry_budget=3):
	self.model = model
	self.tokenizer = tokenizer
	self.retry_budget = retry_budget
	self.context = []

	def act(self, obs):
	prompt = self._build_prompt(obs)
	for attempt in range(self.retry_budget):
	response = self._generate(prompt)
	action = self._parse_action(response)
	if action is not None:
	self.context.append({"obs": obs, "action": action})
	return action
	prompt = self._build_retry_prompt(prompt, response, attempt)
	return Action(tool="noop", content="") # graceful no-op on exhaustion

	def _build_prompt(self, obs):
	system = S1_SYSTEM_PROMPT if obs.get("session") == 1 else S2_SYSTEM_PROMPT
	return system + "\n\n" + format_obs(obs)
	```

	---

	## 16. Risk Register [UPDATED — full 20-issue resolution]

	\| # \| Issue \| Severity \| Status \| Resolution \|
	\|---\|---\|---\|---\|---\|
	\| 1 \| Credit assignment — S1 no direct reward \| HIGH \| FIXED \| Auxiliary shaped rewards + decay schedule \|
	\| 2 \| Handoff gaming — code dumps / hinting \| HIGH \| FIXED \| HandoffValidator + code block limit + rewrite penalty \|
	\| 3 \| Linearity metric weak (re-read counting) \| MEDIUM \| FIXED \| Thrash detection on edit history + failed run rate \|
	\| 4 \| Test suite exploitable \| MEDIUM \| FIXED \| Hidden adversarial tests at submit \|
	\| 5 \| Session separation weak \| MEDIUM \| FIXED \| Name randomization per episode seed \|
	\| 6 \| Compression metric naive \| MEDIUM \| FIXED \| Multi-factor quality score: structure + density + ratio \|
	\| 7 \| Task difficulty miscalibrated \| MEDIUM \| FIXED \| Step limits verified empirically, handoff-critical design \|
	\| 8 \| Evaluation hides per-difficulty gaps \| MEDIUM \| FIXED \| Separate easy/medium/hard/holdout reporting \|
	\| 9 \| Sandbox not fully isolated \| MEDIUM \| FIXED \| Strict ulimits: CPU, RAM, file handles, forks \|
	\| 10 \| Step limit too tight or too loose \| LOW \| FIXED \| Dynamic by difficulty, late-handoff warning \|
	\| 11 \| Template overfitting \| MEDIUM \| FIXED \| Name randomization + holdout eval set \|
	\| 12 \| No baselines \| HIGH \| FIXED \| 3 baselines + upper bound, all on same plot \|
	\| 13 \| Agent gets stuck / invalid actions \| LOW \| FIXED \| Retry budget, invalid action penalty, noop fallback \|
	\| 14 \| Tool pattern exploitation \| LOW \| ACCEPTED \| Name randomization covers most of this; minor risk \|
	\| 15 \| GRPO instability \| MEDIUM \| FIXED \| Reward normalization, KL coeff, PPO backup \|
	\| 16 \| No interpretability \| MEDIUM \| FIXED \| Handoff section evolution tracking + diff plot \|
	\| 17 \| No ablation studies \| MEDIUM \| FIXED \| 3 ablations with plots \|
	\| 18 \| Demo risk \| LOW \| FIXED \| Deterministic seeds, pre-recorded run URL \|
	\| 19 \| Handoff format inconsistent \| HIGH \| FIXED \| Mandatory 6-section structure enforced by validator \|
	\| 20 \| Tests don't capture understanding \| LOW \| PARTIALLY \| Hidden adversarial tests cover this adequately for hackathon scope \|

	Issue #14 accepted as low-risk — name randomization already breaks most pattern
	exploitation. Full tool response variation adds complexity with marginal gain.

	Issue #20 partial — mutation testing is a research-grade addition, out of scope
	for the hackathon timeline.

	---

	## 17. Demo Preparation [NEW — addresses issue #18]

	- Deterministic seed: `env.reset(seed=42)` — same task, same names, reproducible
	- Pre-recorded run: screen recording of a successful trained-agent episode, hosted
	as URL (not committed to repo). Linked from README.
	- Fallback slide: screenshot of epoch 1 vs epoch 20 handoff side by side — shows
	the learning visually to a non-technical audience

	Never end the live demo on `submit()` — too unpredictable. End on the handoff note
	being written and displayed. That's the visual payoff.

	---

	## 18. Submission Checklist [UPDATED]

	\| Requirement \| How satisfied \| Status \|
	\|---\|---\|---\|
	\| OpenEnv latest release \| `MCPEnvironment` subclass, `openenv.yaml`, pinned version in requirements.txt \| [ ] \|
	\| Training script (Unsloth/TRL) \| `training/train_grpo.ipynb` — Colab T4, re-runnable in <30 min \| [ ] \|
	\| Training evidence \| `plots/` — reward, length, 4-way baseline, ablations, interpretability — all PNG \| [ ] \|
	\| Mini blog OR video \| HF blog post + <2 min YouTube video \| [ ] \|
	\| HF Space \| `yourteam/cross-session-continuity-env` — live and runnable \| [ ] \|
	\| README with all links \| Space, notebook, blog, video, WandB run \| [ ] \|
	\| No large files in repo \| Videos as `.url` text files only \| [ ] \|
	\| Baselines \| 3 baselines + upper bound documented and plotted \| [ ] \|
	\| Ablations \| 3 ablations documented and plotted \| [ ] \|
	\| Holdout eval \| Generalization results on 10 unseen tasks \| [ ] \|
	\| Per-difficulty breakdown \| easy / medium / hard results reported separately \| [ ] \|

	---

	## 19. README Template [UPDATED]

	```markdown
	# Cross-Session Continuity Env

	> Can RL teach an LLM to write better notes to its future self?

	## Problem
	LLMs forget everything when a session ends. For long coding tasks that span
	multiple sessions this is critical. No existing RL environment trains for this.

	## How It Works
	[diagram: session1 → handoff.md → session2 → reward]

	Session 1: agent gets task + starter code. Works until step limit.
	Must write a structured 6-section handoff note before session ends.

	Session 2: starts completely cold. Only the handoff note exists.
	Must complete the task and pass tests.

	Reward = test correctness (visible + hidden) + handoff quality + session 2 linearity.

	## Reward Breakdown
	\| Component \| Weight \| What it measures \|
	\|-------------------\|--------\|-------------------------------------\|
	\| Tests (visible) \| 33% \| Session 2 correctness \|
	\| Tests (hidden) \| 22% \| Generalization, no test overfitting \|
	\| Handoff quality \| 20% \| Structure, density, compression \|
	\| Linearity \| 15% \| Session 2 didn't thrash \|
	\| Penalties \| 10% \| Invalid actions, reconstruction \|

	## Results
	\| Agent \| S2 Test Pass Rate \|
	\|------------------------\|-------------------\|
	\| No handoff (baseline) \| ~8% \|
	\| Random handoff \| ~11% \|
	\| Trained (ours) \| ~65% \|
	\| Full transcript (UB) \| ~80% \|

	![reward curve](plots/reward_curve.png)
	Total reward over training episodes — all baselines on same axes

	![ablations](plots/ablation_comparison.png)
	Each reward component contribution — ablation study

	![handoff evolution](plots/handoff_diff_over_epochs.png)
	What the agent learned to keep vs drop over training

	## Before / After
	Epoch 1: 900 tokens, rambling, full code blocks, no structure
	Epoch 20: 180 tokens, 6 clear sections, precise function names, zero code

	## Links
	- HF Space: [url]
	- Colab Notebook: [url]
	- HF Blog Post: [url]
	- YouTube Demo (<2 min): [url]
	- WandB Training Run: [url]
	```

	---

	## 20. Pitch Story [UPDATED]

	> "Every developer has hit this wall. You're deep into a coding task with an AI
	> assistant. The session ends. You come back the next day — and the AI remembers
	> nothing. You start over from scratch.
	>
	> We asked a different question: what if we trained the AI to leave a perfect
	> briefing for its future self?
	>
	> Cross-Session Continuity Env is an RL environment where an agent must complete
	> a coding task split across two sessions with zero shared memory. Session 1
	> works on the problem, then writes a structured handoff note. Session 2 starts
	> completely cold — only that note exists.
	>
	> The agent is rewarded not for session 1 performance, but for how well its
	> future self performs using only the note it left behind.
	>
	> After training, the agent learned something we didn't expect. It stopped writing
	> long rambling summaries. It started writing surgical briefings — 180 words,
	> six sections, exactly what session 2 needs and nothing it doesn't.
	>
	> Test pass rates went from 8% (no handoff at all) to 65%.
	>
	> No one has trained this behavior explicitly before. We think it matters."

	---

	## 21. Timeline [UPDATED]

	\| Day \| Task \| Risk & Contingency \|
	\|---\|---\|---\|
	\| Day 1 (pre-onsite) \| Task bank: 20 tasks + holdout set. Sandbox + ulimits tested. HandoffValidator working. \| Sandbox is highest-risk — do first. Fallback: relax ulimits if resource module unavailable \|
	\| Day 2 (pre-onsite) \| Env class, session manager, rubric, auxiliary rewarder. Full unit tests on each. \| Rubric edge cases — budget 2h for test coverage \|
	\| Day 3 (pre-onsite) \| End-to-end episode: agent completes 2-session run. Client/server separation verified. \| Integration bugs — if stuck, simplify tool set \|
	\| Day 4 (onsite 25th) \| Colab notebook. All 3 baseline runs. First GRPO curves. WandB connected. \| Compute time — run baselines overnight if needed \|
	\| Day 5 (onsite 26th am) \| Full training run on HF credits. Ablations. Plots committed. \| GRPO divergence — fall back to PPO results \|
	\| Day 5 (onsite 26th pm) \| HF Space live. README + blog done. Demo recorded. Final checklist. \| Deployment issues — test HF Space access 24h early \|

	---

	## 22. What Good Looks Like at Submission

	1. Judge visits HF Space → watches a live 2-session run with trained agent
	2. Reward curve shows clear upward trend with all 4 baselines on the same plot
	3. Ablation plot shows each component contributes something measurable
	4. Epoch 1 vs epoch 20 handoff note is visibly, strikingly different
	5. Per-difficulty breakdown shows where the agent is strong vs weak
	6. Colab notebook re-runs in under 30 minutes on a T4
	7. Holdout eval confirms generalization, not just memorization

	All seven = strong submission that covers every judging criterion.