<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MiniGridEnv: An OpenEnv Benchmark for Text-Grounded Navigation with Cross-Episodic Memory</title>
<meta name="description" content="MiniGridEnv: an OpenEnv-native wrap of Farama MiniGrid/BabyAI for LLM post-training, extended with cross-episodic LLM-rewritten markdown memory and branch-stable GRPO semantics.">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
<!-- Mermaid for inline diagrams -->
<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({
    startOnLoad: true,
    theme: 'dark',
    themeVariables: {
      primaryColor: '#6366f1',
      primaryTextColor: '#e2e8f0',
      primaryBorderColor: '#818cf8',
      lineColor: '#818cf8',
      secondaryColor: '#1e293b',
      tertiaryColor: '#172033',
      background: '#0f172a',
      mainBkg: '#1e293b',
      nodeBorder: '#818cf8',
      clusterBkg: '#172033',
      clusterBorder: '#334155',
      titleColor: '#e2e8f0',
      edgeLabelBackground: '#1e293b',
      nodeTextColor: '#e2e8f0'
    },
    flowchart: { curve: 'basis', htmlLabels: true },
    fontFamily: 'Inter, sans-serif'
  });
</script>
<script>
  window.MathJax = {
    tex: {
      inlineMath: [['$', '$'], ['\\(', '\\)']],
      displayMath: [['$$', '$$'], ['\\[', '\\]']]
    },
    svg: { fontCache: 'global' }
  };
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
  :root {
    --bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
    --text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
    --accent2: #818cf8; --green: #22c55e; --red: #ef4444;
    --orange: #f59e0b; --radius: 12px;
  }
  * { margin: 0; padding: 0; box-sizing: border-box; }
  html { scroll-behavior: smooth; }
  body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
         background: var(--bg); color: var(--text); line-height: 1.7;
         -webkit-font-smoothing: antialiased; }
  .container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
  /* Top nav */
  .topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
            backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
            padding: .9rem 1.5rem; display: flex; justify-content: space-between;
            align-items: center; font-size: .88rem; }
  .topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
                   display: flex; align-items: center; gap: .5rem; }
  .topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
                        background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
  .topnav .links { display: flex; gap: 1.25rem; }
  .topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .topnav .links a:hover { color: var(--accent2); }
  /* Hero */
  .hero { text-align: center; padding: 4rem 0 2.5rem; }
  .hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
                padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
                letter-spacing: .08em; margin-bottom: 1.25rem;
                border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
  .hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
             line-height: 1.15;
             background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
             -webkit-background-clip: text; -webkit-text-fill-color: transparent;
             background-clip: text; }
  .hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px;
                    margin: 1rem auto 0; }
  .hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
                  font-style: italic; }
  .banner-figure { margin: 2rem 0 3rem; }
  .banner-figure img.banner { width: 100%; border-radius: var(--radius);
            border: 1px solid var(--border); display: block; }
  /* Badges row */
  .badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
            margin: 1.5rem 0; }
  .badges img { height: 22px; }
  /* Button group */
  .btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
               flex-wrap: wrap; }
  .btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
         background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
         font-weight: 600; text-decoration: none; transition: all .2s; }
  .btn:hover { background: var(--accent2); transform: translateY(-1px); }
  .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
  .btn-outline:hover { border-color: var(--accent); color: var(--accent2);
                       background: rgba(99,102,241,.08); }
  /* TOC */
  .toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
         padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
  .toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
            color: var(--accent2); margin-bottom: .85rem; }
  .toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
            margin: 0; padding: 0; }
  .toc ol li { counter-increment: toc; font-size: .88rem; }
  .toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
                       font-size: .8rem; margin-right: .3rem; }
  .toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .toc ol li a:hover { color: var(--accent2); }
  /* Sections */
  section { margin: 3.5rem 0; }
  section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
               margin-bottom: 1rem; color: var(--text);
               border-left: 3px solid var(--accent); padding-left: .9rem; }
  section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
               color: var(--accent2); }
  section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
  section p strong { color: var(--text); }
  section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
  section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
  section ul li strong, section ol li strong { color: var(--text); }
  /* Pull-quote */
  blockquote { border-left: 3px solid var(--accent2);
               background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
               margin: 1.5rem 0; border-radius: 0 8px 8px 0;
               color: #e2e8f0; font-size: 1.02rem; }
  /* Tables */
  .table-wrap { margin: 1.5rem 0; overflow-x: auto;
                background: var(--surface); border: 1px solid var(--border);
                border-radius: var(--radius); }
  table { width: 100%; border-collapse: collapse; font-size: .92rem; }
  th { background: rgba(99,102,241,.1); color: var(--accent2);
       font-size: .72rem; font-weight: 700; letter-spacing: .06em;
       text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
  td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
  td.num { text-align: right; font-variant-numeric: tabular-nums;
           font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
  tr:hover td { background: rgba(99,102,241,.04); }
  td strong, th strong { color: var(--text); }
  .task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700;
             color: var(--accent2); font-size: .85rem; }
  tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
                  color: var(--text); }
  tr.novel td:first-child { color: #fca5a5; }
  /* Code */
  pre { background: #0b1120; border: 1px solid var(--border);
        border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
        margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
        font-size: .85rem; line-height: 1.6; color: #d1d5db; }
  pre .c { color: #64748b; }
  code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
         background: rgba(99,102,241,.12); color: var(--accent2);
         padding: .1em .35em; border-radius: 4px; }
  pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
  /* Figure */
  figure { margin: 2rem 0; }
  figure img { width: 100%; border-radius: var(--radius);
               border: 1px solid var(--border); }
  figcaption { text-align: center; color: var(--muted); font-size: .85rem;
               margin-top: .75rem; }
  mjx-container { overflow-x: auto; max-width: 100%; }
  /* Mermaid diagram wrapper */
  .mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
                  border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
  .mermaid-wrap .mermaid { display: flex; justify-content: center; }
  .mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
                     margin-top: .75rem; }
  /* Episode trace */
  .episode-trace { background: var(--surface); border: 1px solid var(--border);
                   border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
                   position: relative; }
  .episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
                           bottom: 1.25rem; width: 2px; background: var(--border); }
  .trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
  .trace-step:last-child { margin-bottom: 0; }
  .trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
                             height: 12px; border-radius: 50%; border: 2px solid var(--accent);
                             background: var(--bg); z-index: 1; }
  .trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
  .trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
  .trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
                            color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
  .trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
  .trace-step .step-content code { font-size: .82em; }
  .trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
                   font-size: .9rem; font-weight: 600; }
  .trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
                       color: #fca5a5; }
  .trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
                        color: #86efac; }
  /* Callout for the closing question */
  .callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
             background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
             border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
  .callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
                font-style: italic; margin-bottom: .5rem; }
  .callout .sub { color: var(--muted); font-size: .95rem; }
  /* Footer */
  .footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
            font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
  .footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
  .footer a:hover { text-decoration: underline; }
  @media (max-width: 640px) {
    .container { padding: 1rem 1rem 3rem; }
    .hero { padding: 2.5rem 0 1.5rem; }
    .topnav .links { display: none; }
    section h2 { font-size: 1.3rem; }
    table { font-size: .82rem; }
    th, td { padding: .55rem .6rem; }
    .toc ol { flex-direction: column; }
    .episode-trace { padding: 1rem; }
    .episode-trace::before { left: 1rem; }
  }
  /* Memory-file card (qualitative memory evolution gallery) */
  .memory-card { background: var(--surface); border: 1px solid var(--border);
                 border-radius: var(--radius); padding: 1rem 1.2rem; margin: 1rem 0; }
  .memory-card .mem-header { display: flex; justify-content: space-between;
                             font-family: 'JetBrains Mono', monospace; font-size: .78rem;
                             color: var(--accent2); margin-bottom: .65rem; }
  .memory-card .mem-header .mem-meta { color: var(--muted); }
  .memory-card pre { margin: 0; padding: .8rem 1rem; font-size: .8rem; background: var(--surface-2);
                     border-color: var(--border); }
  .memory-gallery { display: grid; grid-template-columns: 1fr; gap: 1rem; }
  @media (min-width: 720px) { .memory-gallery { grid-template-columns: 1fr 1fr; } }
</style>
</head>
<body>

<nav class="topnav">
  <a href="#top" class="brand"><span class="dot"></span> MiniGridEnv Blog</a>
  <div class="links">
    <a href="#why">Why</a>
    <a href="#design">Design</a>
    <a href="#memory">Memory</a>
    <a href="#memory-evolution">Memory gallery</a>
    <a href="#results">Results</a>
    <a href="#engineering">Engineering</a>
    <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">Live Space &#8599;</a>
  </div>
</nav>

<div class="container" id="top">

  <div class="hero">
    <div class="hero-badge">OpenEnv &middot; AgentX Phase 2</div>
    <h1>MiniGridEnv</h1>
    <p class="subtitle">An OpenEnv-native wrap of Farama <strong>MiniGrid/BabyAI</strong> for text-grounded navigation, extended with <strong>cross-episodic, LLM-rewritten markdown memory</strong> and branch-stable GRPO.</p>
    <div class="badges">
      <a href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/MiniGridEnv-GitHub-181717?logo=github" alt="MiniGridEnv on GitHub"/></a>
      <a href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/MiniGridPT-GitHub-181717?logo=github" alt="MiniGridPT on GitHub"/></a>
      <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
      <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
      <img src="https://img.shields.io/badge/BabyAI-10%20levels-brightgreen" alt="10 BabyAI levels"/>
      <img src="https://img.shields.io/badge/Training-GRPO%20%2B%20Memory-orange" alt="GRPO + Memory"/>
    </div>
    <div class="byline">AgentX Phase 2 &middot; OpenEnv Challenge Submission &nbsp;|&nbsp; Yashaswi Sharma (University of Southern California)&nbsp;|&nbsp; Dongze Ye (USC) &nbsp;|&nbsp; Defu Cao (USC) &nbsp;|&nbsp; Muyan Weng (USC)</div>
  </div>

  <figure class="banner-figure">
    <img src="banner.png" alt="MiniGridEnv: observe a 7x7 egocentric grid, act via Thought/Action, remember via cross-episodic markdown memory" class="banner"/>
    <figcaption><strong>Figure 0.</strong> Three-stage loop: <strong>Observe</strong> (7&times;7 egocentric grid as natural language), <strong>Act</strong> (<code>Thought:</code> / <code>Action:</code> parsed to <code>Discrete(7)</code>, stepped over OpenEnv WebSocket), <strong>Remember</strong> (line-limited markdown $M$, rewritten by the same LLM after each episode; Section 7).</figcaption>
  </figure>

  <div class="btn-group">
    <a class="btn" href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">Live Environment Space &rarr;</a>
    <a class="btn btn-outline" href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer">MiniGridEnv (GitHub)</a>
    <a class="btn btn-outline" href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer">MiniGridPT (GitHub)</a>
  </div>

  <!-- Table of Contents -->
  <nav class="toc" id="toc">
    <h3>Contents</h3>
    <ol>
      <li><a href="#why">Text-Grounded Navigation with a Self-Curated Notebook</a></li>
      <li><a href="#matters">Why This Benchmark Matters</a></li>
      <li><a href="#prior-work">Prior Work &amp; Novelty</a></li>
      <li><a href="#design">What MiniGridEnv + MiniGridPT Are</a></li>
      <li><a href="#env-design">Environment Design</a></li>
      <li><a href="#openenv">Why OpenEnv</a></li>
      <li><a href="#memory">Cross-Episodic Memory</a></li>
      <li><a href="#scoring">Scoring &amp; Reward Shaping</a></li>
      <li><a href="#architecture">Architecture &amp; Training Pipeline</a></li>
      <li><a href="#memory-evolution">Memory-evolution gallery (illustrative)</a></li>
      <li><a href="#results">Results: What We Found</a></li>
      <li><a href="#engineering">Engineering Lessons</a></li>
      <li><a href="#positioning">Where This Submission Sits</a></li>
      <li><a href="#gigpo">Next Step: GiGPO</a></li>
      <li><a href="#foundations">Foundations &amp; Citations</a></li>
      <li><a href="#quickstart">Quick Start</a></li>
      <li><a href="#future">Future Work</a></li>
      <li><a href="#conclusion">Conclusion</a></li>
    </ol>
  </nav>

  <!-- 1. WHY -->
  <section id="why">
    <h2>Text-grounded navigation with a self-curated notebook</h2>
    <p>Most LLM benchmarks ask what a model <strong>can say</strong>. Few ask whether it can <strong>act in a grounded compositional world while curating its own persistent notebook</strong>. <strong>MiniGridEnv</strong> is an OpenEnv-native wrap of Farama's <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">MiniGrid / BabyAI</a> that gives an LLM a 7&times;7 egocentric world rendered as natural language, natural-language actions (<code>"go forward"</code>, <code>"pickup"</code>, <code>"turn left"</code>), and BabyAI's ten-stage compositional instruction curriculum from <code>GoToRedBall</code> to <code>BossLevel</code>.</p>
    <p><strong>This blog is about the extension.</strong> The base environment is a faithful OpenEnv wrap of MiniGrid/BabyAI (existing work, now interoperable). The novel contribution is <strong>cross-episodic memory</strong>: a line-limited markdown file the agent reads before each action and <em>rewrites</em> at the end of each episode, plus <strong>branch-stable GRPO file naming</strong> so each parallel rollout chain keeps one stable file to compact across optimizer steps.</p>
    <p>Every reward signal is <strong>ground-truth arithmetic</strong> from the underlying BabyAI bot-verifiable success criterion. There is no LLM judge in the loop.</p>
    <p>The falsifiable claims:</p>
    <ol>
      <li><em>GRPO post-training on grounded navigation produces monotonically increasing completion rates across BabyAI's curriculum.</em></li>
      <li><em>Cross-episodic memory measurably improves completion rate over stateless play, and the memory content evolves from random notes into structured strategies as training progresses.</em></li>
    </ol>
    <p><strong>Figure 0</strong> (the banner above) encodes the contribution at a glance: the <em>Observe</em> panel matches the text-observation stack in <a href="#env-design" style="color:var(--accent2)">Environment design</a>; the <em>Act</em> panel matches NL actions, parsing, and OpenEnv stepping in the same section; the <em>Remember</em> panel matches cross-episodic memory $M$ in <a href="#memory" style="color:var(--accent2)">Cross-episodic memory</a> and the training loop in <a href="#architecture" style="color:var(--accent2)">Architecture &amp; training pipeline</a>.</p>
  </section>

  <!-- 2. WHY IT MATTERS -->
  <section id="matters">
    <h2>Why this benchmark matters</h2>
    <p>Grounded navigation with compositional language is a load-bearing capability for embodied agents, web agents, and any LLM that must <em>act under an observation budget</em>. BabyAI has been the reference curriculum for this since 2019, but its native interface is a raw gym environment, not a WebSocket contract a GRPO trainer can consume across machines, Docker containers, and HF Spaces with a single code path.</p>
    <p>The methodology is <strong>transferable</strong>. Any text-grounded sequential task with a sparse terminal reward and compositional instructions (web navigation, tool-use, interactive debugging, embodied robotics simulators) fits the same MDP template. Memory is also transferable: line-limited LLM-rewritten markdown is a general mechanism for <em>self-directed state</em> that is not specific to BabyAI.</p>
    <p>The environment is <strong>engineering-cheap to scale</strong>. MiniGrid steps execute in microseconds; an instance occupies 1&ndash;5&nbsp;MB; the OpenEnv wrapper sets <code>max_concurrent_envs=256</code> out of the box. An LLM-backed environment cannot match that density.</p>
  </section>

  <!-- 3. PRIOR WORK & NOVELTY -->
  <section id="prior-work">
    <h2>Prior work &amp; novelty</h2>
    <p>The space of &quot;LLMs + text-grounded navigation + memory&quot; spans three buckets of prior work. None occupies the cell we target:</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Prior work bucket</th><th>What it does</th><th>What it does not</th></tr></thead>
        <tbody>
          <tr><td><strong>BabyAI / MiniGrid (base)</strong><br><span style="font-size:.85em;color:var(--muted)">Chevalier-Boisvert et al., <a href="https://arxiv.org/abs/1810.08272" target="_blank" style="color:var(--accent2)">arXiv:1810.08272</a> (ICLR 2019); <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">Farama-Foundation/Minigrid</a></span></td><td>Compositional language-conditioned navigation as a gym environment with a reference bot and a 10-stage difficulty curriculum</td><td>No OpenEnv/WebSocket contract; no text observation; no LLM post-training pipeline; no memory</td></tr>
          <tr><td><strong>Memory-augmented LLM agents</strong><br><span style="font-size:.85em;color:var(--muted)">Voyager (<a href="https://arxiv.org/abs/2305.16291" target="_blank" style="color:var(--accent2)">arXiv:2305.16291</a>); Reflexion (<a href="https://arxiv.org/abs/2303.11366" target="_blank" style="color:var(--accent2)">arXiv:2303.11366</a>); Generative Agents (<a href="https://arxiv.org/abs/2304.03442" target="_blank" style="color:var(--accent2)">arXiv:2304.03442</a>)</span></td><td>Cross-episode skill libraries, verbal reflection, structured long-term memory, all <em>prompt-engineered</em> at inference time</td><td>No RL post-training; no branch-stable memory semantics under GRPO; not connected to OpenEnv</td></tr>
          <tr><td><strong>RLVR on language environments</strong><br><span style="font-size:.85em;color:var(--muted)">DeepSeekMath / GRPO (<a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a>); TRL &times; OpenEnv (<a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL docs</a>)</span></td><td>Critic-free RL with verifiable rewards; standard WebSocket env contract and <code>rollout_func</code></td><td>No persistent agent state across episodes; no first-class notion of branch-stable rollout chains</td></tr>
          <tr class="novel"><td><strong>MiniGridEnv + MiniGridPT (ours)</strong></td><td>OpenEnv wrap of MiniGrid/BabyAI + GRPO + <em>cross-episodic LLM-rewritten markdown memory</em> + <em>branch-stable per-chain file naming</em></td><td>Not a human study; memory is text-only (no retrieval index)</td></tr>
        </tbody>
      </table>
    </div>
    <blockquote>To our knowledge, no prior work combines an <strong>OpenEnv-native BabyAI environment</strong> with <strong>GRPO post-training</strong>, <strong>line-limited LLM-rewritten cross-episodic memory</strong>, and <strong>branch-stable memory-file naming</strong> that keeps each parallel GRPO chain anchored to a stable file across optimizer steps. The env-contract, memory semantics, and training package are the contribution; MiniGrid/BabyAI are the shoulders we stand on.</blockquote>
  </section>

  <!-- 4. WHAT IT IS -->
  <section id="design">
    <h2>What MiniGridEnv + MiniGridPT are</h2>
    <blockquote>Two strictly separated packages. <strong>MiniGridEnv</strong> (the OpenEnv-compatible environment) and <strong>MiniGridPT</strong> (the GRPO training client) communicate exclusively over WebSocket. No shared Python imports. The training container is pure-GPU; the environment container is CPU-only.</blockquote>
    <p>Each episode:</p>
    <ul>
      <li>The env samples a BabyAI level (<code>GoToRedBall</code> &hellip; <code>BossLevel</code>), seeds procedural generation, and emits a mission like <code>&quot;go to the red ball&quot;</code> or <code>&quot;open the door on your left, then put the green ball next to the yellow key&quot;</code>.</li>
      <li>On turn <em>t</em>, the agent sees a natural-language description of its 7&times;7 egocentric view plus the mission, and emits <code>Thought: &hellip;\nAction: &lt;one of 7 actions&gt;</code>.</li>
      <li>A local parser normalizes the action into MiniGrid's <code>Discrete(7)</code> space; the gym env steps; the wrapper builds the next text observation.</li>
      <li><strong>Mid-episode reward is zero.</strong> On success the env emits <code>+1.0</code> (binary reward, the GRPO-friendly default).</li>
      <li><strong>Memory mode only:</strong> at episode end the LLM reads a post-episode prompt and rewrites its persistent <code>memory/*.md</code> file for the next episode.</li>
    </ul>
    <p>The agent's interface is deliberately minimal: plain <code>Thought:/Action:</code> text, no tool-call protocol, no JSON schema. The training client parses and steps the environment over WebSocket.</p>
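    <p>A minimal sketch of that loop from the client side (<code>generate()</code> and <code>parse_thought_action()</code> are illustrative stand-ins and the field access is schematic; the real loop lives in MiniGridPT's <code>rollout_func</code>):</p>
    <pre><code><span class="c"># Sketch: one episode over the OpenEnv WebSocket (helper names are stand-ins).</span>
env = MiniGridClient(base_url="http://localhost:8000").sync()
obs = env.reset(level="GoToRedBall", seed=0)
while not obs.done:
    prompt = f"Mission: {obs.mission}\n{obs.text}"
    reply = generate(prompt)                        <span class="c"># LLM emits "Thought: ...\nAction: ..."</span>
    thought, command = parse_thought_action(reply)  <span class="c"># e.g. ("...", "go forward")</span>
    obs = env.step({"command": command, "thought": thought})
print(obs.reward)                                   <span class="c"># +1.0 on mission success, else 0.0</span></code></pre>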
  </section>

  <!-- 5. ENVIRONMENT DESIGN -->
  <section id="env-design">
    <h2>Environment design</h2>
    <p>The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (<code>MiniGridEnv/env/models.py</code>):</p>
    <pre><code><span class="c"># Action (agent -> env)</span>
class MiniGridAction(Action):
    command: str                 <span class="c"># "go forward", "turn left", "pickup", ...</span>
    thought: Optional[str] = None <span class="c"># logged for analysis, not executed</span>

<span class="c"># Observation (env -> agent)</span>
class MiniGridObservation(Observation):
    text: str                    <span class="c"># NL description of the 7x7 egocentric view</span>
    mission: str                 <span class="c"># "go to the red ball", ...</span>
    step_idx: int; steps_remaining: int; max_steps: int
    history: list[dict]          <span class="c"># recent step summaries</span>
    level_name: str
    last_action: Optional[str]
    action_success: Optional[bool]
    done: bool; reward: Optional[float]; metadata: dict

<span class="c"># State (hidden from agent; logging / eval only; field types shown are representative)</span>
class MiniGridState(State):
    level_name: str; level_difficulty: int
    completed: bool; truncated: bool
    total_reward: float; steps_taken: int
    optimal_steps: int; efficiency_ratio: float
    valid_actions: int; invalid_actions: int   <span class="c"># parse counters</span>
    action_distribution: dict</code></pre>

    <h3>The text observation (quality lever #1)</h3>
    <p>MiniGrid's raw observation is a <code>(7, 7, 3)</code> numpy grid of (object type, color, door state) with the agent fixed at row=6 col=3 facing &quot;up&quot;. <code>env/grid_to_text.py</code> turns that into a layered NL description:</p>
    <ol>
      <li><code>Mission: &hellip;</code></li>
      <li><code>You are facing {east,south,west,north}.</code></li>
      <li>Immediate surroundings: ahead / left / right single-cell descriptions.</li>
      <li>Path ahead: compresses runs of empty cells (e.g. <em>&quot;empty for 3 steps, then a closed red door, then a wall&quot;</em>).</li>
      <li>Notable objects: interactive items (key, ball, box, goal, door, lava) with relative phrases (<em>&quot;2 steps ahead and 1 to your right&quot;</em>), sorted by Manhattan distance.</li>
      <li>Carrying state.</li>
    </ol>
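    <p>Put together, a single rendered observation might read like this (an illustrative composite of the six layers, not a verbatim capture from the env):</p>
    <pre><code>Mission: go to the red ball
You are facing east.
Ahead: empty floor. To your left: a wall. To your right: empty floor.
The path ahead is empty for 3 steps, then a closed red door, then a wall.
Notable objects: a red ball 2 steps ahead and 1 to your right;
  a grey key 4 steps ahead and 2 to your left.
You are carrying: nothing.</code></pre>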
    <p>The internal design note is blunt: <em>&quot;the quality of the text observation is the single biggest lever on training success.&quot;</em> Everything else in the environment is a thin layer over the gym loop.</p>

    <h3>Actions (quality lever #2): NL &rarr; Discrete(7)</h3>
    <p><code>env/action_parser.py</code> maps natural-language strings to MiniGrid's discrete action index. The same logic is duplicated (intentionally) in <code>MiniGridPT/training/openenv_runtime.py</code> so the PT package remains standalone; a parity test guards the two copies.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Canonical</th><th>Index</th><th>Accepted aliases</th></tr></thead>
        <tbody>
          <tr><td><code>turn left</code></td><td class="num">0</td><td><code>left</code></td></tr>
          <tr><td><code>turn right</code></td><td class="num">1</td><td><code>right</code></td></tr>
          <tr><td><code>go forward</code></td><td class="num">2</td><td><code>move forward</code>, <code>forward</code>, <code>ahead</code>, <code>step</code>, <code>walk</code></td></tr>
          <tr><td><code>pickup</code></td><td class="num">3</td><td><code>pick up</code>, <code>grab</code>, <code>take</code>, <code>get</code></td></tr>
          <tr><td><code>drop</code></td><td class="num">4</td><td><code>release</code>, <code>put down</code></td></tr>
          <tr><td><code>toggle</code></td><td class="num">5</td><td><code>open</code>, <code>close</code>, <code>unlock</code>, <code>switch</code></td></tr>
          <tr><td><code>done</code></td><td class="num">6</td><td><code>wait</code>, <code>noop</code>, <code>stop</code></td></tr>
        </tbody>
      </table>
    </div>
    <p>An <strong>unparseable string falls back to <code>go forward</code></strong>, not to <code>done</code>. Rationale: early in training, exploration beats noop; every invalid parse increments a counter so we can watch parse-rate climb with training.</p>
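    <p>A minimal sketch of the lookup-plus-fallback shape (the table above is the data; <code>env/action_parser.py</code> and its PT twin are the source of truth for messier strings):</p>
    <pre><code>ALIASES = {
    "turn left": 0, "left": 0,
    "turn right": 1, "right": 1,
    "go forward": 2, "move forward": 2, "forward": 2, "ahead": 2, "step": 2, "walk": 2,
    "pickup": 3, "pick up": 3, "grab": 3, "take": 3, "get": 3,
    "drop": 4, "release": 4, "put down": 4,
    "toggle": 5, "open": 5, "close": 5, "unlock": 5, "switch": 5,
    "done": 6, "wait": 6, "noop": 6, "stop": 6,
}
INVALID_PARSES = 0  <span class="c"># incremented on fallback so parse-rate can be tracked over training</span>

def parse_action(command: str) -> int:
    global INVALID_PARSES
    key = command.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    INVALID_PARSES += 1
    return ALIASES["go forward"]  <span class="c"># fallback: exploration beats noop early in training</span></code></pre>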

    <h3>BabyAI curriculum (10 levels)</h3>
    <p><code>env/levels.py</code> registers the full BabyAI ladder with candidate gym IDs (so minigrid version drift between <code>BabyAI-GoToRedBallGrey-v0</code> and <code>BabyAI-GoToRedBall-v0</code> doesn't brick a run):</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Stage</th><th>Level</th><th>Gym ID</th><th>Max steps</th><th>Optimal</th></tr></thead>
        <tbody>
          <tr><td class="num">0</td><td><strong>GoToRedBall</strong></td><td><code>BabyAI-GoToRedBallGrey-v0</code></td><td class="num">64</td><td class="num">~10</td></tr>
          <tr><td class="num">1</td><td>GoToObj</td><td><code>BabyAI-GoToObj-v0</code></td><td class="num">64</td><td class="num">~12</td></tr>
          <tr><td class="num">1</td><td>GoToLocal</td><td><code>BabyAI-GoToLocal-v0</code></td><td class="num">64</td><td class="num">~15</td></tr>
          <tr><td class="num">2</td><td>PickupLoc</td><td><code>BabyAI-PickupLoc-v0</code></td><td class="num">64</td><td class="num">~14</td></tr>
          <tr><td class="num">2</td><td>OpenDoor</td><td><code>BabyAI-OpenDoor-v0</code></td><td class="num">64</td><td class="num">~12</td></tr>
          <tr><td class="num">2</td><td>UnlockLocal</td><td><code>BabyAI-UnlockLocal-v0</code></td><td class="num">128</td><td class="num">~25</td></tr>
          <tr><td class="num">3</td><td>GoTo</td><td><code>BabyAI-GoTo-v0</code></td><td class="num">128</td><td class="num">~30</td></tr>
          <tr><td class="num">3</td><td>PutNextLocal</td><td><code>BabyAI-PutNextLocal-v0</code></td><td class="num">128</td><td class="num">~20</td></tr>
          <tr><td class="num">4</td><td>Synth</td><td><code>BabyAI-Synth-v0</code></td><td class="num">128</td><td class="num">~40</td></tr>
          <tr><td class="num">4</td><td><strong>BossLevel</strong></td><td><code>BabyAI-BossLevel-v0</code></td><td class="num">128</td><td class="num">~80</td></tr>
        </tbody>
      </table>
    </div>
    <p>A single Docker container serves every stage. <code>env.reset(level=&quot;BossLevel&quot;)</code> switches the underlying gym env per-reset. A fix replaced the original <code>del kwargs</code> in <code>reset()</code> with a <code>kwargs.pop(&quot;level&quot;, None)</code>, which is what unlocked single-server curriculum training. Per-level <code>max_steps</code> are defined in our <code>LevelConfig</code> registry (<code>env/levels.py</code>); Synth and BossLevel are <strong>capped at 128</strong> steps in this repo so episode length (and vLLM server-mode padding budgets) stay bounded for training.</p>
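    <p>Schematically, the per-reset switch looks like this (method and helper names are illustrative, not the server's actual code):</p>
    <pre><code>def reset(self, **kwargs):
    level = kwargs.pop("level", None)             <span class="c"># the fix: pop, never `del kwargs`</span>
    if level is not None and level != self.level_name:
        self.gym_env = make_babyai_env(level)     <span class="c"># hypothetical factory over env/levels.py</span>
        self.level_name = level
    raw_obs, info = self.gym_env.reset(**kwargs)  <span class="c"># remaining kwargs (e.g. seed) pass through</span>
    return self.to_text_observation(raw_obs)      <span class="c"># grid_to_text + metadata</span></code></pre>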

    <h3>Reward</h3>
    <p>Default: <strong>binary</strong>. <code>+1.0</code> on completion, <code>0.0</code> otherwise. GRPO works best with clean sparse signals. <code>RewardConfig</code> also supports <code>shaped</code> (step penalty + invalid-action penalty) and <code>efficiency</code> (bonus scaled to <code>optimal_steps/steps_taken</code>) modes if a stage stalls.</p>
    <p>Let $r_t$ denote the per-step environment reward (binary default). With horizon $T$ (our capped <code>max_steps</code>), mission success at termination gives a single $+1$ spike:</p>
    $$r_t = \begin{cases} +1 & \text{if the BabyAI mission is satisfied when the episode ends at step } t \\ 0 & \text{otherwise} \end{cases}$$
    <p>In the default mode, $r_t = 0$ for all $t &lt; T$ unless the mission completes early; shaping modes spread signal across steps via <code>RewardConfig</code> in <code>env/reward.py</code>.</p>
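    <p>A side-by-side sketch of the three modes (penalty and bonus constants are illustrative; <code>RewardConfig</code> in <code>env/reward.py</code> holds the shipped values):</p>
    <pre><code>def episode_reward(mode: str, completed: bool, steps_taken: int,
                   optimal_steps: int, invalid_actions: int) -> float:
    base = 1.0 if completed else 0.0
    if mode == "binary":            <span class="c"># default: clean sparse signal for GRPO</span>
        return base
    if mode == "shaped":            <span class="c"># small step / invalid-action penalties</span>
        return base - 0.01 * steps_taken - 0.02 * invalid_actions
    if mode == "efficiency":        <span class="c"># bonus scaled to optimal_steps / steps_taken</span>
        return base * (optimal_steps / max(steps_taken, 1))
    raise ValueError(f"unknown reward mode: {mode}")</code></pre>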
  </section>

  <!-- 6. WHY OPENENV -->
  <section id="openenv">
    <h2>Why OpenEnv</h2>
    <p>OpenEnv gives us three things that matter for this submission:</p>
    <ol>
      <li>A <strong>standard WebSocket environment contract</strong> consumable by TRL's <code>rollout_func</code> with typed Pydantic payloads and Gym-style <code>reset</code>/<code>step</code> semantics.</li>
      <li><strong>Per-session state with <code>SUPPORTS_CONCURRENT_SESSIONS=True</code></strong> and <code>max_concurrent_envs=256</code>. DDP ranks can hammer the same Space without cross-talk because each WebSocket session gets a fresh <code>gym.Env</code> instance (MiniGrid is not thread-safe; factory mode is mandatory).</li>
      <li><strong>Uniform deployment</strong>. Identical env code runs in-process for tests, as a Docker container for development (<code>server/Dockerfile</code>, <code>openenv-base</code>, port 8000), and as a Hugging Face Space during training and evaluation.</li>
    </ol>
    <p>No new abstractions were invented. Base types only: <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Action</code> / <code>Observation</code> / <code>State</code>. Curriculum level, history, and per-episode metrics ride on <code>metadata</code> and <code>state</code>. The environment ships with <code>openenv.yaml</code>, a <code>Dockerfile</code>, and an HF Space.</p>
    <p>Critically, <strong>MiniGridPT does not <code>import MiniGridEnv</code></strong>. Everything crosses the wire. A <code>MiniGridClient(EnvClient)</code> in <code>MiniGridPT/training/openenv_runtime.py</code> sends plain dicts. This is the architectural lynchpin that lets the training node be pure-GPU and the environment node be CPU-only.</p>
  </section>

  <!-- 7. CROSS-EPISODIC MEMORY (THE NOVELTY) -->
  <section id="memory">
    <h2>Cross-episodic memory</h2>
    <p>This is the research contribution. The base MiniGrid/BabyAI world is stateless between episodes: each <code>reset</code> gives the agent a fresh procedurally generated room with no persistent side-channel. We add one:</p>
    <pre><code>from dataclasses import dataclass
from pathlib import Path

@dataclass
class MemoryConfig:
    enabled: bool = False
    max_lines: int = 100            <span class="c"># line-limit, not token-limit</span>
    memory_dir: str = "./memory"
    agent_id: str = "default"
    branch_stable_memory: bool = False  <span class="c"># see below</span>

    @property
    def memory_path(self) -> Path:
        return Path(self.memory_dir) / f"{self.agent_id}.md"</code></pre>

    <p>Four deliberate design choices, each rejecting a plausible alternative:</p>
    <ol>
      <li><strong>Line limit, not token limit.</strong> Lines are visible and countable <em>by the model</em> in the prompt (<code>(42/100 lines)</code>). The model gets a concrete budget it can reason about.</li>
      <li><strong>Full replacement, not append.</strong> At each episode end the LLM rewrites the file from scratch. This forces the agent to decide what to keep vs. evict (the interesting half of curation).</li>
      <li><strong>Unstructured markdown, not schema.</strong> No bullets required, no JSON. The research question is whether the model will <em>self-organize</em> useful knowledge, not whether it can fill in a template.</li>
      <li><strong>Truncation from the top.</strong> Safety net only; if the model overshoots <code>max_lines</code>, keep the most-recently-written lines.</li>
    </ol>

    <h3>Post-episode rewrite via <code>_temporary_vllm_max_tokens</code></h3>
    <p>Action turns need ~128 tokens (<code>Thought: &hellip;\nAction: go forward</code>). The memory rewrite needs ~512 (100 lines at ~5 tokens/line worst case). One global <code>max_completion_length</code> cannot satisfy both. The fix is a context manager:</p>
    <pre><code>@contextmanager
def _temporary_vllm_max_tokens(trainer, max_tokens: int):
    vg = trainer.vllm_generation
    prev = vg.max_completion_length
    vg.max_completion_length = max_tokens
    try:
        yield
    finally:
        vg.max_completion_length = prev

<span class="c"># Used both for the 512-token memory rewrite and for the 1-token</span>
<span class="c"># NCCL-padding dummy generates described in the Engineering section.</span></code></pre>

    <h3>Branch-stable file naming (per-chain compaction)</h3>
    <p>GRPO runs <em>G</em> parallel completions per prompt, each with its own advantage and gradient contribution. If every slot writes to a uniquely-named file, there's no continuity across optimizer steps, so each memory chain is one episode long. If every slot writes to one shared file, writes race and the signal is mush.</p>
    <p>The solution: <strong>branch-stable naming</strong> <code>rank{R}_br{k}_{base}.md</code> with <code>k = slot_idx % num_generations</code>. The <em>k</em>-th parallel generation maps to a <strong>stable file across optimizer steps</strong>, so branch <em>k</em> after prompt group P1 is the same file used by branch <em>k</em> after prompt group P2. Each of the <em>G</em> GRPO branches builds its own evolving notebook, which is what gives the model a training signal to <em>compact and summarize</em> episode-to-episode.</p>
    <p>Requires <code>per_device_train_batch_size == num_generations</code> (otherwise multiple groups in one step hit the same <em>k</em> and a one-time <code>UserWarning</code> fires). A third scheme (a single shared file across all slots and ranks) is sketched but not landed; it needs a decision about concurrent-writer races.</p>

    <p>Let $M_e \in \mathcal{M}$ denote the memory file (markdown string) at the start of episode $e$, let $\tau_e$ be the trajectory (observations, parsed actions, outcomes), and let $\pi_\theta^{\mathrm{mem}}$ be the same LLM invoked on the post-episode memory-update prompt. The write is a full rewrite followed by a line-budget projection $\Pi_L(\cdot)$ that keeps the last $L$ lines (here $L = 100$):</p>
    $$M_{e+1} = \Pi_L\!\left( \pi_\theta^{\mathrm{mem}}(M_e,\, \tau_e,\, \mathrm{outcome}_e) \right).$$
    <p>Branch-stable filenames tie each GRPO branch index $k = s \bmod G$ to a stable path across optimizer steps, for DDP rank $R$, slot index $s$, group size $G = \texttt{num\_generations}$, and basename <code>base</code> (e.g. <code>default</code>):</p>
    $$\mathrm{path}(R,s,\mathrm{base}) \;=\; \texttt{memory/rank}R\texttt{\_br}_{\,k}\texttt{\_}\mathrm{base}\texttt{.md}\,,\quad k = s \bmod G.$$
    <p>This is exactly the <strong>Remember</strong> panel in Figure&nbsp;0: the file card is $M_e$ at read time; the post-episode LLM box is $\pi_\theta^{\mathrm{mem}}$; the curved arrow is the next-episode read of $M_{e+1}$.</p>
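    <p>Both pieces fit in a few lines (a sketch in the notation above; the shipped logic lives in MiniGridPT):</p>
    <pre><code>from pathlib import Path

def memory_path(rank: int, slot_idx: int, num_generations: int, base: str = "default") -> Path:
    k = slot_idx % num_generations  <span class="c"># branch index k = s mod G, stable across optimizer steps</span>
    return Path("memory") / f"rank{rank}_br{k}_{base}.md"

def project_lines(memory: str, max_lines: int = 100) -> str:
    lines = memory.splitlines()     <span class="c"># Pi_L: keep the most recently written L lines</span>
    return "\n".join(lines[-max_lines:])</code></pre>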

    <blockquote>Can an LLM learn to curate its own persistent, line-budgeted notebook such that cross-episodic memory measurably improves completion rate, and the memory content evolves from random notes into structured strategies as training progresses?</blockquote>
  </section>

  <!-- 8. SCORING -->
  <section id="scoring">
    <h2>Scoring &amp; reward shaping</h2>
    <p>The environment reward is terminal and sparse. Everything else is a <strong>small shaping bonus</strong> designed to rule out pathological regimes without dominating the signal.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Component</th><th>Range</th><th>Source</th><th>What it rewards</th></tr></thead>
        <tbody>
          <tr><td><strong>Env reward (binary)</strong></td><td class="num">0 or +1</td><td><code>env/reward.py</code></td><td>Mission completed (BabyAI ground-truth success)</td></tr>
          <tr><td><strong>Format reward</strong></td><td class="num">[&minus;0.1, +0.1]</td><td><code>reward_funcs.reward_format</code></td><td>Both <code>Thought:</code> and <code>Action:</code> present (1.0), one (0.5), neither (0.0), rescaled</td></tr>
          <tr><td><strong>Memory: in-budget</strong></td><td class="num">+0.05</td><td><code>compute_memory_quality_flags</code></td><td>Memory rewrite stayed within <code>max_lines</code> (no truncation)</td></tr>
          <tr><td><strong>Memory: non-empty</strong></td><td class="num">+0.05</td><td><code>compute_memory_quality_flags</code></td><td>Agent is actually writing something</td></tr>
          <tr><td><strong>Memory: not-a-dump</strong></td><td class="num">&minus;0.05</td><td><code>memory_looks_like_observation_dump</code></td><td>Penalty if memory is just a copy of the last observation</td></tr>
        </tbody>
      </table>
    </div>
    <p>Design principle: <strong>env reward dominates</strong>. Format and memory-quality bonuses are at &plusmn;0.1&ndash;0.15 scale, intended as training wheels, removable once the model reliably emits structured output (&gt;90% validity) and writes substantive memory.</p>
    <p>Let $\tau$ denote an episode trajectory and $M_e, M_{e+1}$ memory before/after the episode. Write $R_{\mathrm{env}} = \sum_t r_t \in \{0,1\}$ for the binary BabyAI success signal, $R_{\mathrm{fmt}}(\tau)$ for the rescaled format score in $[-1,1]$ (mapped to $[-0.1,0.1]$ via $\alpha_{\mathrm{fmt}} = 0.1$ in code), and $R_{\mathrm{mem}}(M_{e+1})$ for the memory-quality shaping used by the trainer. The scalar logged to TRL as <code>env_reward</code> is:</p>
    $$R(\tau, M_e, M_{e+1}) \;=\; \underbrace{R_{\mathrm{env}}}_{{\in \{0,1\}}} \;+\; \underbrace{\alpha_{\mathrm{fmt}}\, R_{\mathrm{fmt}}(\tau)}_{{\in [-0.1,\,0.1]}} \;+\; \underbrace{R_{\mathrm{mem}}(M_{e+1})}_{{\in [-0.05,\,0.10]}}.$$
    <p>With indicator $\mathbf{1}[\cdot]$, line budget $L$, and dump detector $\mathrm{dump}(M)$ (true when memory is effectively a copy of the last observation):</p>
    $$R_{\mathrm{mem}}(M) = \beta_{\mathrm{budget}}\,\mathbf{1}\big[\mathrm{lines}(M) \le L\big] + \beta_{\mathrm{ne}}\,\mathbf{1}\big[M \neq \varnothing\big] - \beta_{\mathrm{dump}}\,\mathbf{1}\big[\mathrm{dump}(M)\big],$$
    <p>with $\beta_{\mathrm{budget}} = \beta_{\mathrm{ne}} = \beta_{\mathrm{dump}} = 0.05$ as implemented in <code>MiniGridPT</code> (names may differ slightly in code; the ranges in the table above match the shipped constants).</p>
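    <p>The whole shaping term reduces to a few indicator checks (a sketch; the substring test stands in for the real <code>memory_looks_like_observation_dump</code> heuristic):</p>
    <pre><code>def memory_quality_bonus(memory: str, last_obs: str, max_lines: int = 100) -> float:
    bonus = 0.0
    if len(memory.splitlines()) &lt;= max_lines:
        bonus += 0.05   <span class="c"># in budget: the rewrite survived without truncation</span>
    if memory.strip():
        bonus += 0.05   <span class="c"># non-empty: the agent is actually writing</span>
    if last_obs.strip() and last_obs.strip() in memory:
        bonus -= 0.05   <span class="c"># dump penalty: memory is just the last observation</span>
    return bonus</code></pre>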
    <blockquote><strong>Why memory quality has a negative flag.</strong> Without the <code>memory_looks_like_observation_dump</code> penalty, the shortest-path way to collect the non-empty bonus is to paste the last observation into memory. That gives zero cross-episodic signal. The penalty forces the memory to be <em>compressed / abstracted</em>, which is the interesting behavior.</blockquote>
  </section>

  <!-- 9. ARCHITECTURE & TRAINING PIPELINE -->
  <section id="architecture">
    <h2>Architecture &amp; training pipeline</h2>
    <p>Two strictly separated packages. <strong>MiniGridEnv</strong> (OpenEnv environment) and <strong>MiniGridPT</strong> (GRPO training client) communicate exclusively over WebSocket (no in-process imports).</p>

    <div class="mermaid-wrap">
      <pre class="mermaid">
flowchart LR
    subgraph PT ["MiniGridPT (Training)"]
        GRPO["GRPOTrainer<br/>TRL 1.0.0"]
        RF["rollout_func<br/>(per-episode loop)"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["parse_action<br/>(NL -> Discrete(7))"]
        MEM["memory/rank{R}_br{k}_default.md<br/>(branch-stable)"]
    end
    subgraph ENV ["MiniGridEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        GYM["MiniGrid gym env<br/>(BabyAI level)"]
        TEXT["grid_to_text<br/>(7x7 -> NL)"]
        REW["Reward<br/>binary +1.0"]
    end

    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate Thought/Action"| PARSE
    PARSE -->|"{command, thought}"| WS
    WS --> GYM
    GYM --> TEXT
    TEXT -->|"observation.text"| WS
    WS -->|"obs + reward + done"| RF
    RF -->|"post-episode rewrite"| MEM
    MEM -->|"read at t=0 next episode"| RF
    REW --> WS
      </pre>
      <p class="mermaid-caption">Figure 1. System architecture. PT never imports env-side types. Memory is a per-branch markdown file owned by the training client, rewritten by the LLM at each episode end.</p>
    </div>

    <p>Training uses <strong>GRPO</strong> (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's <code>rollout_func</code> contract for explicit control over the <code>generate &rarr; parse &rarr; env.step</code> loop.</p>

    <h3>The rollout function</h3>
    <p>Per slot, <code>_rollout_one_episode</code> (<code>MiniGridPT/training/rollout_func.py</code>) runs a complete episode inside one training step:</p>
    <ol>
      <li>Build initial chat messages (system + first user observation block, with the current memory folded in if enabled).</li>
      <li>Open a WebSocket session via <code>MiniGridClient(base_url=ENV_BASE_URL).sync()</code> and call <code>env.reset(level=LEVEL, seed=&hellip;)</code>.</li>
      <li><strong>Episode loop</strong> until <code>done</code> or the turn cap: generate with vLLM (<code>max_completion_length=128</code>), append tokens to <code>completion_ids</code> with <code>env_mask=1</code>, parse Thought/Action, call <code>env.step({&quot;command&quot;: canonical, &quot;thought&quot;: thought})</code>, append the rendered next-observation tokens with <code>env_mask=0</code>.</li>
      <li><strong>Post-episode memory rewrite</strong> (memory mode only): build <code>MEMORY_UPDATE_PROMPT</code> with outcome / steps / current memory / line count / budget, call <code>generate()</code> wrapped in <code>_temporary_vllm_max_tokens(trainer, 512)</code>, write to the branch-stable file, append tokens with <code>env_mask=0</code>.</li>
    </ol>

    <p>The return dict is the shape TRL's <code>GRPOTrainer</code> consumes:</p>
    <pre><code>{
    "prompt_ids":     list[list[int]],   <span class="c"># one per slot (fixed initial prompt)</span>
    "completion_ids": list[list[int]],   <span class="c"># full episode (LLM + env user turns)</span>
    "logprobs":       list[list[float]], <span class="c"># from vLLM; zero-filled for env_mask=0</span>
    "env_mask":       list[list[int]],   <span class="c"># 1 = LLM token, 0 = env/context token</span>
    "env_reward":     list[float],       <span class="c"># binary env reward + memory-quality bonus</span>
}</code></pre>

    <p>The <code>env_mask</code> is what lets us mix <strong>LLM-authored tokens</strong> (eligible for the env-reward term in the GRPO objective) with <strong>env-rendered context tokens</strong> (visible for KL-to-reference but excluded from the advantage weighting). Without it, the model would be &quot;rewarded&quot; for tokens it didn't generate.</p>
    <p>At episode boundaries, the rollout reads memory $M_e$ into the prompt, rolls out $\tau$ with environment observations rendered as tokens with mask $0$, then applies $\pi_\theta^{\mathrm{mem}}$ to obtain $M_{e+1}$ as in Section&nbsp;7, the same loop sketched in Figure&nbsp;0 (Remember).</p>
    <p>Let $y_{i,1:T_i}$ be the token sequence for completion $i$ (including env turns), $m_{i,t} \in \{0,1\}$ the env mask, and $\rho_{i,t}(\theta) = \pi_\theta(y_{i,t}\mid y_{i,1:t-1}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid y_{i,1:t-1})$ the importance ratio on LLM-authored tokens. With clipping threshold $\epsilon$, KL coefficient $\beta_{\mathrm{KL}}$, and group-relative advantage $A_i = (R_i - \mu_R)/\sigma_R$ over $G$ parallel completions sharing a prompt, the masked GRPO-style surrogate we target is:</p>
    $$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; -\,\mathbb{E}\left[ \sum_{i=1}^{G} \sum_{t : m_{i,t}=1} \min\!\Big( \rho_{i,t}(\theta)\, A_i,\; \mathrm{clip}\big(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\big)\, A_i \Big) \right] + \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}).$$
    <p>Here $R_i \equiv R(\tau_i, M_e, M_{e+1})$ is the scalar from Section&nbsp;8 (environment return plus shaping). Tokens with $m_{i,t}=0$ contribute to the KL / context loss path in TRL but not to the clipped policy-gradient term above.</p>
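    <p>For concreteness, the group-relative advantage is a per-group standardization (TRL's <code>GRPOTrainer</code> computes this internally; the sketch below only grounds the $A_i$ above):</p>
    <pre><code>import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=float)  <span class="c"># the G env_reward scalars sharing one prompt</span>
    return (r - r.mean()) / (r.std() + eps)</code></pre>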
  </section>

  <!-- 10. MEMORY-EVOLUTION GALLERY -->
  <section id="memory-evolution">
    <h2>Memory-evolution gallery (illustrative)</h2>
    <p>We do not include step-by-step episode traces here: the training logs for the current run are not immediately accessible, and the priority for this submission is the <strong>mechanism</strong> (Figure&nbsp;0, Sections&nbsp;7&ndash;9) rather than cherry-picked rollouts. <strong>Compute for this project was exhausted</strong> before we could finish a converged memory ablation; the author is also concurrently submitting <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so GPU budget is shared across multiple codebases.</p>
    <p>The cards below are <strong>not</strong> verbatim snapshots from a finished training run. They are <em>category placeholders</em> for what we expect to extract once additional compute is available to run memory-structure experiments and to save real <code>memory/rank{R}_br{k}_*.md</code> files at checkpoints. Future work will test alternative memory organizations (Section&nbsp;11) under that budget.</p>

    <h3>Qualitative memory-file evolution</h3>
    <p>When a long run exists, we will snapshot <code>memory/rank0_br0_default.md</code> (or branch-stable peers) and categorize content. For now, each panel illustrates a <em>type</em> of content we expect to see at different training phases:</p>
    <div class="memory-gallery">
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~500 (noise)</span></div>
        <pre><code>ball is red
i saw a door
step 3 turn left
step 4 go forward</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~5000 (action patterns)</span></div>
        <pre><code>- if the Mission says &quot;go to X&quot;, first face X
- turn left/right before go forward if object is
  to the side
- on GoToRedBall the ball is usually 1-3 steps away</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~15000 (level-specific notes)</span></div>
        <pre><code>- UnlockLocal: keys are the same color as doors
- OpenDoor: &quot;toggle&quot; opens closed and locked doors
  (if carrying correct key)
- Synth: mission has multiple clauses -&gt; do them
  left-to-right as written</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~25000 (failure notes)</span></div>
        <pre><code>- if action_success=False on go forward, there is a
  wall/door -&gt; rotate before next step
- pickup with no adjacent object always fails; read
  &quot;carrying: nothing&quot; before attempting</code></pre>
      </div>
    </div>
    <blockquote>Verbatim memory snapshots from a converged or long partial run will replace these placeholders when further compute is available. Until then, the gallery documents the <strong>hypothesis space</strong> for how $M$ should evolve, not empirical outcomes from the current submission.</blockquote>
  </section>

  <!-- 11. RESULTS -->
  <section id="results">
    <h2>Results: what we found</h2>

    <h3>Status (honest scope)</h3>
    <p>The <strong>MiniGridPT</strong> training package was exercised for <strong>correctness</strong> (short runs, parser parity, WebSocket stepping, memory file I/O, vLLM colocate and multi-GPU server mode with NCCL-safe padding). We do <strong>not</strong> report converged learning curves or final completion rates: the policy did not converge on the full curriculum under the available budget, and structured experiments on alternative memory formats are <strong>deferred to a follow-up compute cycle</strong>. <strong>Compute for this line of work is exhausted</strong> for the current submission window; the author is concurrently shipping <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so GPU time is shared across multiple submissions.</p>

    <h3>What we validated (engineering, not leaderboard numbers)</h3>
    <ul>
      <li><strong>Stable multi-turn rollouts</strong> against the live OpenEnv WebSocket from TRL's <code>rollout_func</code>, with <code>env_mask</code> partitioning LLM-authored vs. env-rendered tokens and per-episode logs persisting to the <code>--output_dir</code>.</li>
      <li><strong>Single-A100 colocate smoke runs</strong> on Qwen3-8B and Qwen2.5-1.5B-Instruct (hundreds of optimizer steps, not a converged curriculum); <code>MGPT_VLLM_GPU_UTIL</code> tuned to ~0.45&ndash;0.65 on 40&nbsp;GB (see Engineering lessons).</li>
      <li><strong>Cross-episodic memory I/O</strong> through <code>_temporary_vllm_max_tokens(trainer, 512)</code> (sketched after this list); branch-stable filenames <code>rank{R}_br{k}_default.md</code> persisting across optimizer steps.</li>
      <li><strong>Multi-GPU server-mode training</strong> with fixed-count generate padding (<code>DIST_SERVER_GENERATES_PER_EPISODE</code>) eliminating NCCL desync under variable-length episodes (now bounded by our capped <code>max_steps</code> &le; 128 per level in <code>env/levels.py</code>).</li>
      <li><strong>Lambda runbook</strong>: <code>bootstrap_lambda.sh</code> &rarr; <code>preflight_lambda.sh</code> &rarr; <code>run_grpo_lambda.sh</code>, with <code>MGPT_*</code> env vars and cadence / metrics callbacks writing <code>metrics_scalars.csv</code>, <code>metrics_events.jsonl</code>, <code>cadence.log</code>, <code>diagnostics_cadence.jsonl</code>.</li>
      <li><strong>36 env-side tests and a PT action-parser parity test</strong> enforcing the NL &rarr; <code>Discrete(7)</code> contract across both packages.</li>
    </ul>
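    <p>The <code>_temporary_vllm_max_tokens</code> pattern above is, in spirit, a save/restore context manager. A minimal sketch, assuming the generation budget lives on the trainer's <code>max_completion_length</code> config field (the exact attribute path depends on the TRL version):</p>
    <pre><code>from contextlib import contextmanager

@contextmanager
def _temporary_vllm_max_tokens(trainer, n):
    <span class="c"># Swap the completion budget, restore on exit even if generation raises.</span>
    old = trainer.args.max_completion_length
    trainer.args.max_completion_length = n
    try:
        yield
    finally:
        trainer.args.max_completion_length = old

<span class="c"># usage: give the post-episode memory rewrite a 512-token budget, then restore:</span>
<span class="c">#   with _temporary_vllm_max_tokens(trainer, 512):</span>
<span class="c">#       new_memory = rewrite_memory(...)   # hypothetical call</span></code></pre>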

    <h3>Baseline harness (wired, not benchmarked)</h3>
    <p>The environment bundles three baselines (Random, BabyAI <code>BotAgent</code>, and a caller-provided zero-shot <code>completion_fn</code>), all runnable in-process without a GPU. They are <strong>not</strong> executed at scale in this submission; longer baseline sweeps and GRPO comparisons are explicitly scoped for the next compute allocation.</p>
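    <p>The zero-shot baseline is just a text-in/text-out callable. A deliberately trivial example of the expected shape (the constant-action body is ours, purely for illustration):</p>
    <pre><code>def completion_fn(prompt: str) -&gt; str:
    <span class="c"># Any callable works: an API client, a local pipeline, or a heuristic.</span>
    <span class="c"># This stub always advances; it exercises the harness without a GPU.</span>
    return "go forward"</code></pre>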

    <h3>Memory design space: planned experiments</h3>
    <p>Because the current run did not converge and memory-structure ablations are outstanding, the table below is the <strong>forward-looking experiment matrix</strong> for how $M$ might be organized once additional GPU budget is available. Each row states a hypothesis; all rows require <strong>additional compute</strong>.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Variant</th><th>Hypothesis tested</th><th>Notes</th></tr></thead>
        <tbody>
          <tr><td><strong>Structured schema</strong> (JSON / YAML / fixed markdown sections)</td><td>Schema &gt; free-form markdown for stable curation</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Append + periodic compaction</strong></td><td>Full-episode rewrite cost limits the learning signal</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Hierarchical</strong> (in-episode scratchpad + cross-episode long-term)</td><td>Conflating short- and long-term in one file hurts</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Retrieval-indexed</strong> (embed notes, top-<em>k</em> by observation)</td><td>Linear-file recall fails at scale</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Shared single-file</strong> across branches / ranks</td><td>Collective memory beats per-branch curation</td><td>Shared-memory design TBD; requires additional compute + concurrency design</td></tr>
          <tr><td><strong>Success-gated writes</strong></td><td>Failure episodes poison $M$</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Variable line budget</strong> by level difficulty</td><td>Uniform $L$ is too tight for hardest stages</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Dual-memory</strong> (policy vs. world knowledge)</td><td>Unified $M$ conflates two knowledge types</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Token budget</strong> instead of line budget</td><td>Line-count is the wrong self-budgeting unit for LLMs</td><td>Requires additional compute</td></tr>
        </tbody>
      </table>
    </div>

    <p>The research question in Section&nbsp;7 remains the scientific target; this submission's empirical contribution is the <strong>validated pipeline and semantics</strong> for $M$, not yet a table of win-rates.</p>
  </section>

  <!-- 12. ENGINEERING LESSONS -->
  <section id="engineering">
    <h2>Engineering lessons</h2>
    <p>Running GRPO + OpenEnv + vLLM on a multi-turn, memory-augmented environment surfaced three categories of structural issues. We document the ones that are <em>general</em>; the next OpenEnv submission is likely to hit each.</p>

    <h3 id="nccl">NCCL desync under variable-length episodes</h3>
    <p>In <code>vllm_mode=server</code>, each <code>trainer.vllm_generation.generate()</code> call performs <code>gather_object &rarr; all_gather_object &rarr; broadcast_object_list</code>. Our rollout runs <code>while not session.done</code>, so different DDP ranks make different numbers of <code>generate()</code> calls per episode: a short run (few turns) vs. the per-level <code>max_steps</code> cap (64 on early BabyAI stages, up to <strong>128</strong> after our registry cap for Synth and BossLevel). NCCL collectives are sequence-numbered: <em>different call counts per rank = permanent desync</em>.</p>
    <p><strong>Symptoms:</strong> training tqdm stuck, GPU 0&ndash;N pinned, vLLM GPU idle, NCCL watchdog firing after its timeout, <code>UnpicklingError</code> as ranks deserialize off-by-one collective buffers.</p>
    <p><strong>Fix:</strong> fixed-count padding. Every rank performs <em>exactly</em> <code>DIST_SERVER_GENERATES_PER_EPISODE</code> generates per episode, where the count is <code>max_episode_turns + (1 if memory_enabled else 0)</code>. After the real loop terminates, <code>_pad_vllm_server_generates_to_target</code> issues dummy 1-token generates under <code>_temporary_vllm_max_tokens(trainer, 1)</code>, outputs discarded, guarded with <code>try/finally</code>. Active only when <code>vllm_mode == &quot;server&quot;</code> and <code>world_size &gt; 1</code>; reward, logprobs, and credit assignment are byte-identical to the unpadded case.</p>
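    <p>Stripped to its core, the padding loop looks like this (simplified sketch; the real helper also threads request metadata, and the <code>generate()</code> argument shape here is an assumption):</p>
    <pre><code>def _pad_vllm_server_generates_to_target(trainer, calls_made, target):
    <span class="c"># Every rank must reach exactly `target` generate() calls per episode so the</span>
    <span class="c"># NCCL collective sequence stays aligned across DDP ranks.</span>
    for _ in range(target - calls_made):
        with _temporary_vllm_max_tokens(trainer, 1):   <span class="c"># 1-token dummy generate</span>
            trainer.vllm_generation.generate(["pad"])  <span class="c"># output discarded</span></code></pre>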
    <blockquote>This pattern is general. <strong>Any TRL <code>rollout_func</code> user running variable-length rollouts in server mode has this bug latent.</strong> LotteryElicitationEnv/PT (sibling project) hit it first; the same fix ported cleanly here.</blockquote>

    <h3 id="vllm-util">vLLM colocate GPU-memory utilization is a total-VRAM fraction</h3>
    <p><code>MGPT_VLLM_GPU_UTIL</code> is passed to TRL &rarr; vLLM as <code>--vllm_gpu_memory_utilization</code>. vLLM interprets it as the fraction of <strong>total device VRAM</strong> the engine may reserve (weights + KV budget). <em>Not</em> &quot;fraction of what PyTorch left free.&quot;</p>
    <p>In colocate mode, the policy model loads first, then vLLM tries to grab its share <em>on the same GPU</em>. Too high &rarr; vLLM startup <code>ValueError</code> or later <code>torch.OutOfMemoryError</code> on logprob / <code>lm_head</code>. The shipped TRL default of 0.9 is too aggressive on 40&nbsp;GB A100 colocate. <strong>Safe range: 0.45&ndash;0.65.</strong></p>
    <p>Server mode needs &ge;2 GPUs (splits vLLM vs. training devices); <code>MGPT_VLLM_MODE=auto</code> picks <code>server</code> on &ge;2 GPUs else <code>colocate</code>.</p>
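    <p>The <code>auto</code> rule is a one-liner in spirit (sketch of the documented behavior, not the shipped launcher code):</p>
    <pre><code>import os
import torch

mode = os.environ.get("MGPT_VLLM_MODE", "auto")
if mode == "auto":
    <span class="c"># server mode splits vLLM onto its own device(s), so it needs &gt;= 2 GPUs</span>
    mode = "server" if torch.cuda.device_count() &gt;= 2 else "colocate"</code></pre>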

    <h3 id="hygiene">Hygiene table (four smaller issues, their fixes)</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Issue</th><th>Root cause</th><th>Fix</th></tr></thead>
        <tbody>
          <tr>
            <td><strong>Single-server curriculum blocked</strong></td>
            <td>Scaffold <code>reset()</code> did <code>del kwargs</code> before forwarding to the gym env, dropping the <code>level</code> kwarg</td>
            <td><code>level = kwargs.pop(&quot;level&quot;, None)</code> (and <code>level_name</code>) before clearing; now one Docker container serves every stage</td>
          </tr>
          <tr>
            <td><strong>Branch-stable memory races</strong></td>
            <td>When <code>per_device_train_batch_size &gt; num_generations</code>, multiple prompt groups in one step map to the same branch index <em>k</em> and race on the file</td>
            <td>Asserted at startup; a one-time <code>UserWarning</code> if the invariant is ever broken; recommended configuration <code>batch == num_generations</code></td>
          </tr>
          <tr>
            <td><strong>Action-parser drift</strong></td>
            <td>PT package ships its own <code>parse_action</code> (so it doesn't import the env); env-side changes can silently diverge</td>
            <td>Parity test <code>tests/test_action_parser_parity.py</code> in MiniGridPT cross-compares canonical actions + aliases against the env's parser</td>
          </tr>
          <tr>
            <td><strong>&quot;go forward&quot; fallback on unparseable text</strong></td>
            <td>Early-training LLMs emit malformed text; mapping to <code>done</code> kills episodes instantly (zero signal)</td>
            <td>Fallback = <code>go forward</code>, not <code>done</code>; every invalid parse increments <code>invalid_actions</code> so the parse-rate climb is a visible training-progress curve</td>
          </tr>
        </tbody>
      </table>
    </div>
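    <p>Two of those fixes reduce to a few lines each. A simplified sketch (helper names and the substring matching rule are illustrative; the real parser also handles aliases and is pinned by the parity test):</p>
    <pre><code><span class="c"># Fix 1: pop curriculum kwargs *before* the scaffold clears the rest, so the</span>
<span class="c"># level request survives to the gym env (the original bug dropped it).</span>
def extract_level(kwargs):
    level = kwargs.pop("level", None) or kwargs.pop("level_name", None)
    kwargs.clear()
    return level

<span class="c"># Fix 2: unparseable text falls back to "go forward", never "done".</span>
CANONICAL = ["turn left", "turn right", "go forward", "pickup", "drop", "toggle", "done"]

def parse_action(text, counters):
    cleaned = text.strip().lower()
    for name in CANONICAL:
        if name in cleaned:                 <span class="c"># first canonical action mentioned wins</span>
            return name
    counters["invalid_actions"] = counters.get("invalid_actions", 0) + 1
    return "go forward"                     <span class="c"># keeps the episode alive, logs the miss</span></code></pre>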
  </section>

  <!-- 13. POSITIONING QUADRANT -->
  <section id="positioning">
    <h2>Where this submission sits</h2>
    <div class="mermaid-wrap">
      <pre class="mermaid">
quadrantChart
    title Grounded navigation: memory x OpenEnv/RL
    x-axis "Stateless" --> "Memory-augmented"
    y-axis "Gym only" --> "OpenEnv + RL stack"
    quadrant-1 "Our target"
    quadrant-2 "Untouched"
    quadrant-3 "Classical RL"
    quadrant-4 "Prompt-only"
    "BabyAI / MiniGrid": [0.10, 0.14]
    "Lottery (sibling env)": [0.22, 0.90]
    "GRPO, no memory": [0.42, 0.62]
    "Voyager": [0.90, 0.34]
    "Reflexion": [0.74, 0.24]
    "GenAgents": [0.88, 0.14]
    "MiniGridEnv + MiniGridPT": [0.90, 0.88]
      </pre>
      <p class="mermaid-caption">Figure 2. MiniGridEnv + MiniGridPT occupy the memory-augmented + OpenEnv + post-training quadrant that prior work leaves untouched. Voyager / Reflexion / Generative Agents are memory-rich but prompt-only; BabyAI itself is a gym env without an OpenEnv or RL-post-training story; sibling LotteryElicitationEnv is OpenEnv + RL but stateless.</p>
    </div>
  </section>

  <!-- 14. GIGPO UPGRADE -->
  <section id="gigpo">
    <h2>Next step: GiGPO</h2>
    <p>GRPO is good enough to <em>ship</em> this submission: it is critic-free, works out of the box in TRL, and its scalar per-episode advantage is adequate for short-horizon BabyAI stages. But it underweights step-level credit assignment, which is exactly what hurts on 30+ turn episodes and what memory mode needs most (memory episodes are ~2&times; longer).</p>
    <p><strong>GiGPO = GRPO + anchor-state step-level advantages.</strong> Episode-level macro advantage (same group-relative signal as GRPO over $G$ completions):</p>
    $$A^{E}_i \;=\; \frac{R_i - \mu_R}{\sigma_R}.$$
    <p>Step-level micro advantage within anchor-state group $S_k$ (all $(\tau,t')$ pairs whose observation text hashes match step $t$):</p>
    $$A^{S}(a_t) \;=\; \frac{Q(a_t) - \mu_{Q(S_k)}}{\sigma_{Q(S_k)}}\,,\quad S_k = \big\{ (\tau, t') : \mathrm{hash}(o_{t'}) = \mathrm{hash}(o_t) \big\}.$$
    <p>Combined per-token advantage with mixing weight $\omega \ge 0$:</p>
    $$A_t \;=\; A^{E}_i + \omega\, A^{S}(a_t).$$
    <p>When no anchors are found, $A^{S} = 0$ and GiGPO reduces to GRPO (equivalently $\omega = 0$).</p>
    <p>Why this fits MiniGrid: all <em>G</em> rollouts share the same initial observation for a given prompt/seed (guaranteed anchor); corridor navigation revisits the same 7&times;7 egocentric view; BabyAI per-seed determinism creates exact hash matches. The full step-level design is deferred to the GiGPO follow-up (trainer subclass + rollout fields).</p>
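    <p>The grouping itself is cheap. A minimal sketch of anchor-state bucketing and the micro advantage (data layout is assumed; the discounted-return bookkeeping and trainer wiring are the deferred follow-up):</p>
    <pre><code>import hashlib
from collections import defaultdict
from statistics import mean, pstdev

def anchor_groups(episodes):
    <span class="c"># episodes: list of episodes, each a list of (obs_text, Q) pairs, where Q is</span>
    <span class="c"># the discounted return from that step; steps sharing an observation hash</span>
    <span class="c"># form one anchor-state group S_k.</span>
    groups = defaultdict(list)
    for ep in episodes:
        for obs, q in ep:
            groups[hashlib.sha1(obs.encode()).hexdigest()].append(q)
    return groups

def micro_advantage(q, group_qs):
    <span class="c"># A^S = (Q - mu) / sigma within the group; zero for singletons or flat groups,</span>
    <span class="c"># which is also what makes GiGPO reduce to GRPO when no anchors repeat.</span>
    if len(group_qs) &lt; 2:
        return 0.0
    mu, sd = mean(group_qs), pstdev(group_qs)
    return 0.0 if sd == 0 else (q - mu) / sd

<span class="c"># combined per-token advantage: A_t = A_E[i] + omega * micro_advantage(q, S_k)</span></code></pre>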

    <h3>Experimental matrix for the follow-up</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Config</th><th>Algorithm</th><th>Memory</th><th>Flags</th></tr></thead>
        <tbody>
          <tr><td>A</td><td>GRPO</td><td>Off</td><td><code>--loss_type dapo</code></td></tr>
          <tr><td>B</td><td>GRPO</td><td>On (branch-stable)</td><td><code>--loss_type dapo --memory --memory-branch-stable</code></td></tr>
          <tr><td>C</td><td>GiGPO</td><td>Off</td><td><code>--use_gigpo</code></td></tr>
          <tr><td>D</td><td>GiGPO</td><td>On (branch-stable)</td><td><code>--use_gigpo --memory --memory-branch-stable</code></td></tr>
        </tbody>
      </table>
    </div>
    <p><strong>Hypothesis:</strong> D dominates. Step-level anchor-state credit and cross-episodic strategy accumulation are complementary: GiGPO assigns credit <em>within</em> an episode; memory propagates credit <em>across</em> episodes.</p>
  </section>

  <!-- 15. FOUNDATIONS & CITATIONS -->
  <section id="foundations">
    <h2>Foundations &amp; citations</h2>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
        <tbody>
          <tr><td><strong>MiniGrid &amp; BabyAI</strong></td><td>Base gym environment, 10-stage curriculum, reference <code>BotAgent</code> upper bound, procedural level generation</td><td>Chevalier-Boisvert et al., <a href="https://arxiv.org/abs/1810.08272" target="_blank" style="color:var(--accent2)">arXiv:1810.08272</a> (ICLR 2019); <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">Farama-Foundation/Minigrid</a></td></tr>
          <tr><td><strong>GRPO / DeepSeekMath</strong></td><td>Critic-free group-relative policy optimization; our default trainer via TRL's <code>GRPOTrainer</code></td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a></td></tr>
          <tr><td><strong>TRL &times; OpenEnv</strong></td><td><code>rollout_func</code> contract, vLLM colocate/server, <code>loss_type=dapo</code> length-bias handling</td><td><a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL OpenEnv docs</a></td></tr>
          <tr><td><strong>OpenEnv</strong></td><td>Standard WebSocket env contract, per-session state, <code>create_app</code>, HF Space deploy</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr>
          <tr><td><strong>Voyager</strong></td><td>Skill-library / cross-episode knowledge accumulation (closest memory-system analog; ours is RL-trained where Voyager is prompt-engineered)</td><td>Wang et al., <a href="https://arxiv.org/abs/2305.16291" target="_blank" style="color:var(--accent2)">arXiv:2305.16291</a></td></tr>
          <tr><td><strong>Reflexion</strong></td><td>Verbal reflection after episodes; motivates a post-episode LLM rewrite pass over a persistent buffer</td><td>Shinn et al., <a href="https://arxiv.org/abs/2303.11366" target="_blank" style="color:var(--accent2)">arXiv:2303.11366</a></td></tr>
          <tr><td><strong>Generative Agents</strong></td><td>Long-term memory stream with relevance / recency weighting; our line-budgeted rewrite is a deliberately simpler alternative</td><td>Park et al., <a href="https://arxiv.org/abs/2304.03442" target="_blank" style="color:var(--accent2)">arXiv:2304.03442</a></td></tr>
          <tr><td><strong>LotteryElicitationEnv / PT</strong></td><td>Sibling OpenEnv submission; shared structural template for two-repo split, <code>rollout_func</code>, NCCL generate-count padding</td><td>Same monorepo &middot; <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv HF Space</a></td></tr>
          <tr><td><strong>ReasoningEconomicsEnv / PT</strong></td><td>Structural template for <code>_temporary_vllm_max_tokens</code> pattern</td><td>Same monorepo</td></tr>
        </tbody>
      </table>
    </div>
  </section>

  <!-- 16. QUICK START -->
  <section id="quickstart">
    <h2>Quick start</h2>
    <p>Single-A100 Lambda recipe (use MiniGridEnv Docker + MiniGridPT <code>scripts/</code> as the source of truth for env vars and launch order):</p>
    <pre><code><span class="c"># 0. Clone both packages (sibling directories)</span>
git clone https://github.com/sharma-yash01/MiniGridEnv.git
git clone https://github.com/sharma-yash01/MiniGridPT.git

<span class="c"># 1. Build + start MiniGridEnv (Docker on port 8000)</span>
cd MiniGridEnv
sudo docker build -t minigrid-env:latest -f server/Dockerfile .
sudo docker run -d --name minigrid-env -p 8000:8000 \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    minigrid-env:latest
curl -sS "http://127.0.0.1:8000/health"

<span class="c"># 2. Configure MGPT_* (single A100, colocate vLLM)</span>
export ENV_BASE_URL="http://127.0.0.1:8000"
export MGPT_ROOT=$(pwd)/../MiniGridPT
export MGPT_VENV=$HOME/.venvs/minigridpt-lambda
export PYTORCH_WHEEL_INDEX=https://download.pytorch.org/whl/cu121
export MGPT_MODEL=Qwen/Qwen3-8B
export MGPT_LEVEL=GoToRedBall
export MGPT_VLLM_MODE=colocate
export MGPT_VLLM_GPU_UTIL=0.45           <span class="c"># colocate-safe on A100 40GB</span>

<span class="c"># 3. Bootstrap + preflight + train</span>
bash "$MGPT_ROOT/scripts/bootstrap_lambda.sh"
source "$MGPT_VENV/bin/activate"
bash "$MGPT_ROOT/scripts/preflight_lambda.sh"
cd "$MGPT_ROOT" && nohup bash scripts/run_grpo_lambda.sh > train.log 2>&1 &
tail -f train.log

<span class="c"># 4. Memory-mode variant (branch-stable, batch == num_generations)</span>
export MGPT_MEMORY=1
export MGPT_MEMORY_MAX_LINES=100
export MGPT_MEMORY_BRANCH_STABLE=1
export MGPT_NUM_GENERATIONS=8
export MGPT_BATCH_SIZE=8
bash "$MGPT_ROOT/scripts/run_grpo_lambda.sh"

<span class="c"># 5. Full curriculum (GoToRedBall -&gt; BossLevel)</span>
export ENV_URL="${ENV_BASE_URL}"
export MODEL="${MGPT_MODEL}"
export BASE_OUT="${MGPT_OUTPUT_DIR}/curriculum"
export USE_MEMORY=1
bash "$MGPT_ROOT/scripts/launch_curriculum.sh"</code></pre>
    <p>All 36 env-side tests pass with <code>cd MiniGridEnv &amp;&amp; uv run --with pytest pytest tests</code>. The OpenEnv contract is validated with <code>openenv validate</code>.</p>
  </section>

  <div class="callout">
    <div class="q">Compute budget exhausted for this submission</div>
    <div class="sub">The training package is validated for correctness; converged runs, baseline tables, and memory-structure ablations require <strong>more GPU time</strong>. The author is concurrently submitting <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so resources are shared across all three. The open scientific question remains in Section&nbsp;7; what ships now is the pipeline and the formal semantics for $M$.</div>
  </div>

  <!-- 17. FUTURE WORK -->
  <section id="future">
    <h2>Future work</h2>
    <ul>
      <li><strong>Run the full A/B/C/D experimental matrix</strong> to publish the memory-vs-stateless and GRPO-vs-GiGPO comparison across the BabyAI curriculum once additional compute is available (measured numbers to be filled in after those runs).</li>
      <li><strong>Land GiGPO</strong> as a <code>GiGPOTrainer(GRPOTrainer)</code> subclass. Minimum diff: add <code>obs_texts</code> / <code>step_boundaries</code> to the rollout return, compute anchor-state groups, expand step advantages to tokens.</li>
      <li><strong>Close the inference-time gap</strong>: <code>inference/run_episode.py</code> reads memory during play but does not yet mirror training's post-episode LLM memory rewrite. Evaluation should match training end-to-end; add a <strong>post-episode-memory-rewrite eval variant</strong> when more compute is available.</li>
      <li><strong>Baseline harness at scale</strong>: run Random, BabyAI <code>BotAgent</code>, and zero-shot LLM baselines with enough seeds to report completion rates and calibration vs. GRPO / GRPO+memory (deferred for lack of compute).</li>
      <li><strong>Port the NCCL generate-count padding upstream into TRL</strong>: the bug is general, the fix is simple.</li>
      <li><strong>Harder curricula</strong>: extend beyond BabyAI (MiniHack, TextWorld) with the same OpenEnv wrap + memory template.</li>
      <li><strong>Human transfer pilot</strong>: does a memory-trained agent generalize to unseen BabyAI seeds better than stateless, and how much of the memory is environment-specific versus transferable strategy?</li>
    </ul>
  </section>

  <!-- 18. CONCLUSION -->
  <section id="conclusion">
    <h2>Conclusion</h2>
    <p><strong>MiniGridEnv + MiniGridPT</strong> takes the gym-native MiniGrid/BabyAI curriculum and turns it into a complete OpenEnv + GRPO + memory pipeline. The environment is a faithful wrap: text observation, NL action, BabyAI's ten stages. The training package is the extension: branch-stable markdown memory, a post-episode LLM rewrite shaped by <code>_temporary_vllm_max_tokens</code>, and an env-mask-aware rollout loop that makes variable-length multi-turn episodes play nicely with vLLM server mode.</p>
    <p>The infrastructure contributions (NCCL generate-count padding for variable-length rollouts, branch-stable per-chain memory files, the <code>max_completion_length</code> context manager for mixed action/memory generation budgets, per-reset curriculum via <code>reset()</code> kwargs) are lessons the next OpenEnv + TRL 1.0 + multi-turn + memory submission will need.</p>
    <p>Empirical completion tables and memory ablations await the next compute cycle (Section&nbsp;7 for the open question; Section&nbsp;11 for the planned experiment matrix). What ships with this post is the <strong>validated pipeline and the formal semantics for $M$</strong>.</p>
  </section>

  <div class="footer">
    <p>MiniGridEnv &middot; AgentX OpenEnv Track &middot; UC Berkeley RDI</p>
    <p style="margin-top:.5rem;">
      <a href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer">MiniGridEnv</a> &middot;
      <a href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer">MiniGridPT</a> &middot;
      <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">MiniGridEnv HF Space</a> &middot;
      <a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> &middot;
      <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL x OpenEnv</a> &middot;
      <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank">MiniGrid</a>
    </p>
  </div>

</div>
</body>
</html>