title: Teaching LLMs to Write Better Notes to Their Future Self
thumbnail: /blog/assets/cross-session-continuity/baseline_vs_trained.png
authors:
  - user: Aswini-Kumar

Teaching LLMs to Write Better Notes to Their Future Self

Can reinforcement learning teach a coding agent to communicate better across sessions with zero shared memory?


The Problem

Every time you start a new chat with an LLM, it forgets everything from the last session.

For short tasks this is fine. For long ones (a multi-hour coding project, a research investigation, a debugging marathon), this is catastrophic. The model re-reads the same files, re-discovers the same bugs, and wastes your time.

Humans solve this with notes. Good ones. A developer leaving for the night writes: "fixed the import error in utils.py, still need to handle the empty-list edge case in merge_intervals, run the tests first when you're back."

Can we train an LLM to do the same?

The Environment

Cross-Session Continuity Env is an RL environment built on OpenEnv where a coding agent must complete a task across two separate sessions with zero shared memory.

Session 1                              Session 2
─────────────────────────────────────────────────────────
Agent receives task + starter code     Agent receives ONLY handoff note
Agent works: read → write → test  ──>  Agent calls parse_handoff()
Agent ends: writes handoff note        Agent completes task → submit
                ↓
    [filesystem wiped]
    [function names randomized]
    [no code persists]

The agent has 6 tools: read_file, write_file, run_tests, write_handoff, parse_handoff, submit.

The handoff note has a strict structure (enforced by the validator):

TASK:          what the overall task is
COMPLETED:     what was implemented + verified
REMAINING:     what Session 2 must implement
KEY FUNCTIONS: function names, signatures
EDGE CASES:    constraints or tricky logic
NEXT STEPS:    ordered action list for Session 2

Max 400 tokens. Max 5 lines of code. All 6 sections required.

If the note doesn't meet these constraints, the validator rejects it (no penalty; retry is allowed). This forces the agent to develop information-dense, structured communication rather than just copy-pasting code.
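To make the three constraints concrete, here is a minimal sketch of what such a validator could look like. It is not the environment's actual implementation: the function name `validate_handoff`, the whitespace-based token count, and the regex used to spot code lines are all assumptions made for illustration.

```python
import re

REQUIRED_SECTIONS = [
    "TASK", "COMPLETED", "REMAINING",
    "KEY FUNCTIONS", "EDGE CASES", "NEXT STEPS",
]
MAX_TOKENS = 400
MAX_CODE_LINES = 5

# Hypothetical heuristic: a line "looks like code" if it starts with a
# common Python statement. The real validator may use a different rule.
CODE_LINE = re.compile(r"^\s*(def |class |import |from \S+ import |return\b|@\w)")


def validate_handoff(note: str) -> list[str]:
    """Return a list of problems; an empty list means the note is accepted."""
    problems = []
    lines = note.splitlines()

    # All 6 sections must appear as "NAME:" at the start of some line.
    for name in REQUIRED_SECTIONS:
        if not any(line.strip().startswith(name + ":") for line in lines):
            problems.append(f"missing section: {name}")

    # Assumption: whitespace-separated words stand in for tokenizer tokens.
    n_tokens = len(note.split())
    if n_tokens > MAX_TOKENS:
        problems.append(f"too long: {n_tokens} tokens (max {MAX_TOKENS})")

    n_code = sum(1 for line in lines if CODE_LINE.match(line))
    if n_code > MAX_CODE_LINES:
        problems.append(f"too much code: {n_code} lines (max {MAX_CODE_LINES})")

    return problems
```

Returning a list of problems rather than a bare boolean fits the retry-without-penalty design: the agent can read the rejection reasons and fix the note before trying again.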

Reward Design

The reward is composed of weighted components, each paired with an anti-gaming safeguard:

| Component | Weight | Anti-gaming |
|---|---|---|
| Tests (visible, Session 2) | 33% | Hidden tests at submit time |
| Tests (hidden) | 22% | Not accessible via run_tests |
| Handoff quality | 20% | Code-dump blocked by validator |
| Linearity | 15% | Thrash/rewrite detection |
| Penalties | 10% | Invalid actions, reconstruction |

40% of the test score comes from hidden test cases that are never revealed to the agent. This ensures the agent can't memorize specific test patterns.
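The arithmetic behind these numbers can be sketched as a weighted sum. The weights below are the ones listed above; the component key names, the assumption that each component score lies in [0, 1], and the assumption that the penalties component scores 1.0 when no penalty is incurred are illustrative choices, not details confirmed by the environment.

```python
# Weights from the reward components listed above.
WEIGHTS = {
    "tests_visible": 0.33,
    "tests_hidden": 0.22,
    "handoff_quality": 0.20,
    "linearity": 0.15,
    "penalties": 0.10,  # assumption: score is 1.0 when no penalty is incurred
}


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def hidden_share_of_test_score() -> float:
    """Fraction of the combined test weight carried by hidden tests."""
    test_weight = WEIGHTS["tests_visible"] + WEIGHTS["tests_hidden"]
    return WEIGHTS["tests_hidden"] / test_weight
```

The 40% figure falls out directly: hidden tests carry 22 of the 55 points assigned to tests, and 22 / 55 = 0.4.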

Training

We train Qwen2.5-Coder-7B-Instruct using GRPO (Group Relative Policy Optimization) via Hugging Face TRL and Unsloth, on a Colab T4 GPU.
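For readers new to GRPO: instead of learning a separate value function, it samples a group of completions for the same prompt and scores each one relative to the rest of its group. The sketch below shows only that normalization step in plain Python. It does not use TRL's API and is not the training code for this project; the function name and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A completion that beats its group average gets a positive advantage and is reinforced; one that falls below gets a negative advantage. If every completion in the group earns the same reward, all advantages are zero and that group contributes no learning signal.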

The training uses a 3-phase curriculum:

  • Epochs 1–2: Easy tasks (step limit 20, 3 visible tests)
  • Epochs 3–4: Medium tasks (step limit 35, 5 visible tests)
  • Epochs 5–6: Hard tasks (step limit 55, 8 visible tests)

This bootstraps the agent on simpler tasks before exposing it to harder generalization challenges.
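The schedule above could be expressed as a small lookup table. This is a sketch only: the `Phase` class, the `CURRICULUM` mapping, and `phase_for_epoch` are hypothetical names, while the epoch ranges, step limits, and visible-test counts come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    step_limit: int
    visible_tests: int


# Keys are inclusive (first_epoch, last_epoch) ranges.
CURRICULUM = {
    (1, 2): Phase("easy", step_limit=20, visible_tests=3),
    (3, 4): Phase("medium", step_limit=35, visible_tests=5),
    (5, 6): Phase("hard", step_limit=55, visible_tests=8),
}


def phase_for_epoch(epoch: int) -> Phase:
    """Look up the curriculum phase that covers a 1-indexed epoch."""
    for (first, last), phase in CURRICULUM.items():
        if first <= epoch <= last:
            return phase
    raise ValueError(f"epoch {epoch} is outside the 6-epoch curriculum")
```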

Results

The core question: does training actually improve the agent's ability to use its handoff note?

[Figure: Baseline vs Trained] Session 2 test pass rate across 4 conditions. Error bars = ± std, 3 seeds.

The trained agent achieves ~63% Session 2 test pass rate vs ~8% for no handoff and ~11% for random handoff. This is a 55 percentage point improvement over the no-handoff lower bound.

The reward curve shows clear learning:

[Figure: Reward Curve] Total reward across training episodes. All 4 conditions on same axes. Band = ±1 std.

And the training loss descends cleanly:

[Figure: Loss Curve] Policy loss (top) + KL divergence (bottom) across training steps. Curriculum phases shown as shaded regions.

What the Agent Actually Learned

The most interesting result is how the handoff notes changed.

[Figure: Handoff Evolution] Token count per handoff section across 6 training epochs.

  • Epoch 1: ~700 tokens. Rambling. Code blocks everywhere. Repeats the task description verbatim. The NEXT STEPS section is almost empty.
  • Epoch 6: ~175 tokens. Surgical. Zero code. COMPLETED shrinks (less over-documentation). NEXT STEPS grows to dominate, since it carries the most actionable information for Session 2.

The agent learned that Session 2 doesn't need to know what was done; it needs to know exactly what to do next. That's a genuine insight about communication.

Ablation Study

[Figure: Ablation Study] Removing any reward component degrades performance. All configs on same axes.

  • No compression reward: -16 pp. Agent produces bloated notes. Session 2 spends steps parsing instead of coding.
  • No linearity reward: -11 pp. Session 2 thrashes: it rewrites code instead of building on it.
  • No auxiliary reward: -8 pp. Slower convergence; the shaped S1 rewards help bootstrap early.

Why This Matters

The capability gap we're targeting, structured cross-session state transfer, is genuinely unsolved. Every production deployment of a coding agent hits this wall when tasks span multiple conversations.

The environment is designed to force the agent to develop a real skill, not to game a metric:

  • Function names are randomized per episode (can't memorize by name)
  • Hidden tests at submit time (can't overfit to visible tests)
  • Validator blocks code dumps (must communicate structurally)

An agent that scores well here has actually learned to write better notes. That's the bet.

Links