---
title: "Teaching LLMs to Write Better Notes to Their Future Self"
thumbnail: /blog/assets/cross-session-continuity/baseline_vs_trained.png
authors:
- user: Aswini-Kumar
---

# Teaching LLMs to Write Better Notes to Their Future Self

*Can reinforcement learning teach a coding agent to communicate better across sessions with zero shared memory?*

---

## The Problem

Every time you start a new chat with an LLM, it forgets everything from the last session. For short tasks this is fine. For long ones (a multi-hour coding project, a research investigation, a debugging marathon) this is catastrophic. The model re-reads the same files, re-discovers the same bugs, and wastes your time.

Humans solve this with notes. Good ones. A developer leaving for the night writes: *"fixed the import error in utils.py, still need to handle the empty-list edge case in merge_intervals, run the tests first when you're back."*

Can we train an LLM to do the same?

## The Environment

**Cross-Session Continuity Env** is an RL environment built on OpenEnv where a coding agent must complete a task **across two separate sessions with zero shared memory**.

```
Session 1                                Session 2
─────────────────────────────────────────────────────────
Agent receives task + starter code       Agent receives ONLY handoff note
Agent works: read → write → test   ─>    Agent calls parse_handoff()
Agent ends: writes handoff note          Agent completes task → submit
            ↓
   [filesystem wiped]
   [function names randomized]
   [no code persists]
```

The agent has 6 tools: `read_file`, `write_file`, `run_tests`, `write_handoff`, `parse_handoff`, `submit`.

The handoff note has a strict structure (enforced by the validator):

```
TASK:          what the overall task is
COMPLETED:     what was implemented + verified
REMAINING:     what Session 2 must implement
KEY FUNCTIONS: function names, signatures
EDGE CASES:    constraints or tricky logic
NEXT STEPS:    ordered action list for Session 2
```

**Max 400 tokens. Max 5 lines of code.
All 6 sections required.** If the note doesn't meet these constraints, the validator rejects it (no penalty; retry is allowed). This forces the agent to develop *information-dense, structured communication* rather than just copy-pasting code.

## Reward Design

The reward is a weighted sum of components, each paired with an anti-gaming measure:

| Component | Weight | Anti-gaming |
|-----------|--------|-------------|
| Tests (visible, Session 2) | 33% | Hidden tests at submit time |
| Tests (hidden) | 22% | Not accessible via `run_tests` |
| Handoff quality | 20% | Code-dump blocked by validator |
| Linearity | 15% | Thrash/rewrite detection |
| Penalties | 10% | Invalid actions, reconstruction |

40% of the test score (22 of the 55 points allotted to tests) comes from hidden test cases that are never revealed to the agent. This ensures the agent can't memorize specific test patterns.

## Training

We train **Qwen2.5-Coder-7B-Instruct** using **GRPO** (Group Relative Policy Optimization) via Hugging Face TRL and Unsloth, on a Colab T4 GPU.

The training uses a 3-phase curriculum:

- **Epochs 1–2**: Easy tasks (step limit 20, 3 visible tests)
- **Epochs 3–4**: Medium tasks (step limit 35, 5 visible tests)
- **Epochs 5–6**: Hard tasks (step limit 55, 8 visible tests)

This bootstraps the agent on simpler tasks before exposing it to harder generalization challenges.

## Results

The core question: does training actually improve the agent's ability to use its handoff note?

![Baseline vs Trained](plots/baseline_vs_trained.png)
*Session 2 test pass rate across 4 conditions. Error bars = ± std, 3 seeds.*

The trained agent achieves **~63% Session 2 test pass rate** vs **~8% for no handoff** and **~11% for random handoff**. This is a **+55 percentage point improvement** over the lower bound (the no-handoff condition).

The reward curve shows clear learning:

![Reward Curve](plots/reward_curve.png)
*Total reward across training episodes. All 4 conditions on same axes.
Band = ±1 std.*

And the training loss descends cleanly:

![Loss Curve](plots/loss_curve.png)
*Policy loss (top) + KL divergence (bottom) across training steps. Curriculum phases shown as shaded regions.*

## What the Agent Actually Learned

The most interesting result is *how the handoff notes changed*.

![Handoff Evolution](plots/handoff_diff_over_epochs.png)
*Token count per handoff section across 6 training epochs.*

- **Epoch 1**: ~700 tokens. Rambling. Code blocks everywhere. Repeats the task description verbatim. The NEXT STEPS section is almost empty.
- **Epoch 6**: ~175 tokens. Surgical. Zero code. COMPLETED shrinks (less over-documentation). NEXT STEPS grows to dominate, since it carries the most actionable information for Session 2.

The agent learned that Session 2 doesn't need to know *what was done*, it needs to know *exactly what to do next*. That's a genuine insight about communication.

## Ablation Study

![Ablation Study](plots/ablation_comparison.png)
*Removing any reward component degrades performance. All configs on same axes.*

- **No compression reward**: -16 pp. Agent produces bloated notes. Session 2 spends steps parsing instead of coding.
- **No linearity reward**: -11 pp. Session 2 thrashes: it rewrites code instead of building on it.
- **No auxiliary reward**: -8 pp. Slower convergence; the shaped S1 rewards help bootstrap early.

## Why This Matters

The capability gap we're targeting, **structured cross-session state transfer**, is genuinely unsolved. Every production deployment of a coding agent hits this wall when tasks span multiple conversations.

The environment is designed to force the agent to develop a real skill, not to game a metric:

- Function names are randomized per episode (can't memorize by name)
- Hidden tests at submit time (can't overfit to visible tests)
- Validator blocks code dumps (must communicate structurally)

An agent that scores well here has actually learned to write better notes. That's the bet.
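
To make the validator rule concrete (all six sections present, at most 400 tokens, at most 5 lines of code), here is a minimal sketch. It is an illustration under stated assumptions, not the environment's actual validator: the function names `validate_handoff` and `count_code_lines` are hypothetical, tokens are approximated by whitespace-separated words rather than a real tokenizer, and "lines of code" are assumed to mean non-blank lines inside fenced blocks.

```python
# Section names and limits come from the post; everything else is assumed.
REQUIRED_SECTIONS = (
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
)
MAX_TOKENS = 400
MAX_CODE_LINES = 5
FENCE = "`" * 3  # fence marker, built this way so the example stays embeddable


def count_code_lines(note: str) -> int:
    """Count non-blank lines inside fenced blocks (assumed heuristic)."""
    in_fence = False
    count = 0
    for line in note.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence marker
            continue
        if in_fence and line.strip():
            count += 1
    return count


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems; an empty list means the note is accepted."""
    problems = []
    lines = [line.strip() for line in note.splitlines()]

    # A section counts as present if some line starts with "NAME:".
    for section in REQUIRED_SECTIONS:
        if not any(line.startswith(section + ":") for line in lines):
            problems.append(f"missing section: {section}")

    # Assumption: whitespace-separated words stand in for tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        problems.append(f"too many tokens: {n_tokens} > {MAX_TOKENS}")

    n_code = count_code_lines(note)
    if n_code > MAX_CODE_LINES:
        problems.append(f"too many code lines: {n_code} > {MAX_CODE_LINES}")

    return problems
```

Returning a list of problems instead of a boolean fits the "no penalty, retry is allowed" rule: the agent can be told what to fix before resubmitting the note.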

## Links

- **HF Space (live demo)**: https://huggingface.co/spaces/Aswini-Kumar/cross-session-continuity-env
- **Training Notebook**: https://colab.research.google.com/github/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env/blob/main/training/train_grpo.ipynb
- **GitHub**: https://github.com/CelestialWorthyOfHeavenAndEarth/cross-session-continuity-env