<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>MiniGridEnv: An OpenEnv Benchmark for Text-Grounded Navigation with Cross-Episodic Memory</title>
<meta name="description" content="MiniGridEnv: an OpenEnv-native wrap of Farama MiniGrid/BabyAI for LLM post-training, extended with cross-episodic LLM-rewritten markdown memory and branch-stable GRPO semantics.">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700;800&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet">
<!-- Mermaid for inline diagrams -->
<script type="module">
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';
  mermaid.initialize({
    startOnLoad: true,
    theme: 'dark',
    themeVariables: {
      primaryColor: '#6366f1',
      primaryTextColor: '#e2e8f0',
      primaryBorderColor: '#818cf8',
      lineColor: '#818cf8',
      secondaryColor: '#1e293b',
      tertiaryColor: '#172033',
      background: '#0f172a',
      mainBkg: '#1e293b',
      nodeBorder: '#818cf8',
      clusterBkg: '#172033',
      clusterBorder: '#334155',
      titleColor: '#e2e8f0',
      edgeLabelBackground: '#1e293b',
      nodeTextColor: '#e2e8f0'
    },
    flowchart: { curve: 'basis', htmlLabels: true },
    fontFamily: 'Inter, sans-serif'
  });
</script>
<script>
  window.MathJax = {
    tex: {
      inlineMath: [['$', '$'], ['\\(', '\\)']],
      displayMath: [['$$', '$$'], ['\\[', '\\]']]
    },
    svg: { fontCache: 'global' }
  };
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
  :root {
    --bg: #0f172a; --surface: #1e293b; --surface-2: #172033; --border: #334155;
    --text: #e2e8f0; --muted: #94a3b8; --accent: #6366f1;
    --accent2: #818cf8; --green: #22c55e; --red: #ef4444;
    --orange: #f59e0b; --radius: 12px;
  }
  * { margin: 0; padding: 0; box-sizing: border-box; }
  html { scroll-behavior: smooth; }
  body { font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
         background: var(--bg); color: var(--text); line-height: 1.7;
         -webkit-font-smoothing: antialiased; }
  .container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem 4rem; }
  /* Top nav */
  .topnav { position: sticky; top: 0; z-index: 10; background: rgba(15,23,42,.85);
            backdrop-filter: blur(10px); border-bottom: 1px solid var(--border);
            padding: .9rem 1.5rem; display: flex; justify-content: space-between;
            align-items: center; font-size: .88rem; }
  .topnav .brand { font-weight: 700; color: var(--text); text-decoration: none;
                   display: flex; align-items: center; gap: .5rem; }
  .topnav .brand .dot { width: 8px; height: 8px; border-radius: 50%;
                        background: var(--green); box-shadow: 0 0 8px rgba(34,197,94,.6); }
  .topnav .links { display: flex; gap: 1.25rem; }
  .topnav .links a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .topnav .links a:hover { color: var(--accent2); }
  /* Hero */
  .hero { text-align: center; padding: 4rem 0 2.5rem; }
  .hero-badge { display: inline-block; background: rgba(99,102,241,.15); color: var(--accent2);
                padding: .4rem 1.1rem; border-radius: 20px; font-size: .78rem; font-weight: 600;
                letter-spacing: .08em; margin-bottom: 1.25rem;
                border: 1px solid rgba(99,102,241,.3); text-transform: uppercase; }
  .hero h1 { font-size: clamp(2rem, 4.2vw, 3.2rem); font-weight: 800; letter-spacing: -.025em;
             line-height: 1.15;
             background: linear-gradient(135deg, #e2e8f0 25%, #6366f1 100%);
             -webkit-background-clip: text; -webkit-text-fill-color: transparent;
             background-clip: text; }
  .hero .subtitle { color: var(--muted); font-size: 1.15rem; max-width: 640px;
                    margin: 1rem auto 0; }
  .hero .byline { color: var(--muted); font-size: .85rem; margin-top: 1.5rem;
                  font-style: italic; }
  .banner-figure { margin: 2rem 0 3rem; }
  .banner-figure img.banner { width: 100%; border-radius: var(--radius);
            border: 1px solid var(--border); display: block; }
  /* Badges row */
  .badges { display: flex; justify-content: center; gap: .6rem; flex-wrap: wrap;
            margin: 1.5rem 0; }
  .badges img { height: 22px; }
  /* Button group */
  .btn-group { display: flex; gap: .75rem; justify-content: center; margin: 2rem 0;
               flex-wrap: wrap; }
  .btn { display: inline-flex; align-items: center; gap: .45rem; padding: .6rem 1.35rem;
         background: var(--accent); color: white; border-radius: 8px; font-size: .88rem;
         font-weight: 600; text-decoration: none; transition: all .2s; }
  .btn:hover { background: var(--accent2); transform: translateY(-1px); }
  .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
  .btn-outline:hover { border-color: var(--accent); color: var(--accent2);
                       background: rgba(99,102,241,.08); }
  /* TOC */
  .toc { background: var(--surface); border: 1px solid var(--border); border-radius: var(--radius);
         padding: 1.25rem 1.5rem; margin: 0 0 2.5rem; }
  .toc h3 { font-size: .82rem; font-weight: 700; letter-spacing: .08em; text-transform: uppercase;
            color: var(--accent2); margin-bottom: .85rem; }
  .toc ol { list-style: none; counter-reset: toc; display: flex; flex-wrap: wrap; gap: .35rem .8rem;
            margin: 0; padding: 0; }
  .toc ol li { counter-increment: toc; font-size: .88rem; }
  .toc ol li::before { content: counter(toc) "."; color: var(--accent); font-weight: 700;
                       font-size: .8rem; margin-right: .3rem; }
  .toc ol li a { color: var(--muted); text-decoration: none; transition: color .15s; }
  .toc ol li a:hover { color: var(--accent2); }
  /* Sections */
  section { margin: 3.5rem 0; }
  section h2 { font-size: 1.55rem; font-weight: 800; letter-spacing: -.01em;
               margin-bottom: 1rem; color: var(--text);
               border-left: 3px solid var(--accent); padding-left: .9rem; }
  section h3 { font-size: 1.1rem; font-weight: 700; margin: 2rem 0 .75rem;
               color: var(--accent2); }
  section p { color: #cbd5e1; margin-bottom: 1rem; font-size: 1.02rem; }
  section p strong { color: var(--text); }
  section ul, section ol { color: #cbd5e1; margin: 1rem 0 1rem 1.5rem; }
  section ul li, section ol li { margin-bottom: .5rem; font-size: 1rem; }
  section ul li strong, section ol li strong { color: var(--text); }
  /* Pull-quote */
  blockquote { border-left: 3px solid var(--accent2);
               background: rgba(99,102,241,.06); padding: 1.1rem 1.25rem;
               margin: 1.5rem 0; border-radius: 0 8px 8px 0;
               color: #e2e8f0; font-size: 1.02rem; }
  /* Tables */
  .table-wrap { margin: 1.5rem 0; overflow-x: auto;
                background: var(--surface); border: 1px solid var(--border);
                border-radius: var(--radius); }
  table { width: 100%; border-collapse: collapse; font-size: .92rem; }
  th { background: rgba(99,102,241,.1); color: var(--accent2);
       font-size: .72rem; font-weight: 700; letter-spacing: .06em;
       text-transform: uppercase; padding: .85rem 1rem; text-align: left; }
  td { padding: .7rem 1rem; border-top: 1px solid var(--border); color: #cbd5e1; }
  td.num { text-align: right; font-variant-numeric: tabular-nums;
           font-family: 'JetBrains Mono', monospace; font-size: .88rem; }
  tr:hover td { background: rgba(99,102,241,.04); }
  td strong, th strong { color: var(--text); }
  .task-id { font-family: 'JetBrains Mono', monospace; font-weight: 700;
             color: var(--accent2); font-size: .85rem; }
  tr.avg-row td { background: rgba(99,102,241,.08); font-weight: 700;
                  color: var(--text); }
  tr.novel td:first-child { color: #fca5a5; }
  /* Code */
  pre { background: #0b1120; border: 1px solid var(--border);
        border-radius: var(--radius); padding: 1.1rem 1.25rem; overflow-x: auto;
        margin: 1.25rem 0; font-family: 'JetBrains Mono', monospace;
        font-size: .85rem; line-height: 1.6; color: #d1d5db; }
  pre .c { color: #64748b; }
  code { font-family: 'JetBrains Mono', monospace; font-size: .88em;
         background: rgba(99,102,241,.12); color: var(--accent2);
         padding: .1em .35em; border-radius: 4px; }
  pre code { background: none; color: inherit; padding: 0; font-size: 1em; }
  /* Figure */
  figure { margin: 2rem 0; }
  figure img { width: 100%; border-radius: var(--radius);
               border: 1px solid var(--border); }
  figcaption { text-align: center; color: var(--muted); font-size: .85rem;
               margin-top: .75rem; }
  mjx-container { overflow-x: auto; max-width: 100%; }
  /* Mermaid diagram wrapper */
  .mermaid-wrap { margin: 2rem 0; background: var(--surface); border: 1px solid var(--border);
                  border-radius: var(--radius); padding: 1.5rem 1rem; overflow-x: auto; }
  .mermaid-wrap .mermaid { display: flex; justify-content: center; }
  .mermaid-caption { text-align: center; color: var(--muted); font-size: .85rem;
                     margin-top: .75rem; }
  /* Episode trace */
  .episode-trace { background: var(--surface); border: 1px solid var(--border);
                   border-radius: var(--radius); padding: 1.25rem 1.5rem; margin: 1.5rem 0;
                   position: relative; }
  .episode-trace::before { content: ''; position: absolute; left: 1.5rem; top: 2.5rem;
                           bottom: 1.25rem; width: 2px; background: var(--border); }
  .trace-step { position: relative; padding-left: 2rem; margin-bottom: 1.25rem; }
  .trace-step:last-child { margin-bottom: 0; }
  .trace-step .step-marker { position: absolute; left: -.45rem; top: .2rem; width: 12px;
                             height: 12px; border-radius: 50%; border: 2px solid var(--accent);
                             background: var(--bg); z-index: 1; }
  .trace-step .step-marker.terminal { background: var(--red); border-color: var(--red); }
  .trace-step .step-marker.good { background: var(--green); border-color: var(--green); }
  .trace-step .step-label { font-family: 'JetBrains Mono', monospace; font-size: .78rem;
                            color: var(--accent2); font-weight: 700; margin-bottom: .25rem; }
  .trace-step .step-content { font-size: .9rem; color: #cbd5e1; }
  .trace-step .step-content code { font-size: .82em; }
  .trace-verdict { margin-top: 1rem; padding: .75rem 1rem; border-radius: 8px;
                   font-size: .9rem; font-weight: 600; }
  .trace-verdict.bad { background: rgba(239,68,68,.1); border: 1px solid rgba(239,68,68,.3);
                       color: #fca5a5; }
  .trace-verdict.good { background: rgba(34,197,94,.1); border: 1px solid rgba(34,197,94,.3);
                        color: #86efac; }
  /* Callout for the closing question */
  .callout { text-align: center; padding: 2rem 1.5rem; margin: 3rem 0;
             background: linear-gradient(135deg, rgba(99,102,241,.08), rgba(129,140,248,.04));
             border: 1px solid rgba(99,102,241,.25); border-radius: var(--radius); }
  .callout .q { font-size: 1.25rem; font-weight: 700; color: var(--text);
                font-style: italic; margin-bottom: .5rem; }
  .callout .sub { color: var(--muted); font-size: .95rem; }
  /* Footer */
  .footer { text-align: center; padding: 3rem 0 1rem; color: var(--muted);
            font-size: .85rem; border-top: 1px solid var(--border); margin-top: 3rem; }
  .footer a { color: var(--accent2); text-decoration: none; margin: 0 .5rem; }
  .footer a:hover { text-decoration: underline; }
  @media (max-width: 640px) {
    .container { padding: 1rem 1rem 3rem; }
    .hero { padding: 2.5rem 0 1.5rem; }
    .topnav .links { display: none; }
    section h2 { font-size: 1.3rem; }
    table { font-size: .82rem; }
    th, td { padding: .55rem .6rem; }
    .toc ol { flex-direction: column; }
    .episode-trace { padding: 1rem; }
    .episode-trace::before { left: 1rem; }
  }
  /* Memory-file card (qualitative memory evolution gallery) */
  .memory-card { background: var(--surface); border: 1px solid var(--border);
                 border-radius: var(--radius); padding: 1rem 1.2rem; margin: 1rem 0; }
  .memory-card .mem-header { display: flex; justify-content: space-between;
                             font-family: 'JetBrains Mono', monospace; font-size: .78rem;
                             color: var(--accent2); margin-bottom: .65rem; }
  .memory-card .mem-header .mem-meta { color: var(--muted); }
  .memory-card pre { margin: 0; padding: .8rem 1rem; font-size: .8rem; background: var(--surface-2);
                     border-color: var(--border); }
  .memory-gallery { display: grid; grid-template-columns: 1fr; gap: 1rem; }
  @media (min-width: 720px) { .memory-gallery { grid-template-columns: 1fr 1fr; } }
</style>
</head>
<body>

<nav class="topnav">
  <a href="#top" class="brand"><span class="dot"></span> MiniGridEnv Blog</a>
  <div class="links">
    <a href="#why">Why</a>
    <a href="#design">Design</a>
    <a href="#memory">Memory</a>
    <a href="#memory-evolution">Memory gallery</a>
    <a href="#results">Results</a>
    <a href="#engineering">Engineering</a>
    <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">Live Space &#8599;</a>
  </div>
</nav>

<div class="container" id="top">

  <div class="hero">
    <div class="hero-badge">OpenEnv &middot; AgentX Phase 2</div>
    <h1>MiniGridEnv</h1>
    <p class="subtitle">An OpenEnv-native wrap of Farama <strong>MiniGrid/BabyAI</strong> for text-grounded navigation, extended with <strong>cross-episodic, LLM-rewritten markdown memory</strong> and branch-stable GRPO.</p>
    <div class="badges">
      <a href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/MiniGridEnv-GitHub-181717?logo=github" alt="MiniGridEnv on GitHub"/></a>
      <a href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/MiniGridPT-GitHub-181717?logo=github" alt="MiniGridPT on GitHub"/></a>
      <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank"><img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/></a>
      <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
      <img src="https://img.shields.io/badge/BabyAI-10%20levels-brightgreen" alt="10 BabyAI levels"/>
      <img src="https://img.shields.io/badge/Training-GRPO%20%2B%20Memory-orange" alt="GRPO + Memory"/>
    </div>
    <div class="byline">AgentX Phase 2 &middot; OpenEnv Challenge Submission &nbsp;|&nbsp; Yashaswi Sharma (University of Southern California)&nbsp;|&nbsp; Dongze Ye (USC) &nbsp;|&nbsp; Defu Cao (USC) &nbsp;|&nbsp; Muyan Weng (USC)</div>
  </div>

  <figure class="banner-figure">
    <img src="banner.png" alt="MiniGridEnv: observe a 7x7 egocentric grid, act via Thought/Action, remember via cross-episodic markdown memory" class="banner"/>
    <figcaption><strong>Figure 0.</strong> Three-stage loop: <strong>Observe</strong> (7&times;7 egocentric grid as natural language), <strong>Act</strong> (<code>Thought:</code> / <code>Action:</code> parsed to <code>Discrete(7)</code>, stepped over OpenEnv WebSocket), <strong>Remember</strong> (line-limited markdown $M$, rewritten by the same LLM after each episode; Section 7).</figcaption>
  </figure>

  <div class="btn-group">
    <a class="btn" href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">Live Environment Space &rarr;</a>
    <a class="btn btn-outline" href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer">MiniGridEnv (GitHub)</a>
    <a class="btn btn-outline" href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer">MiniGridPT (GitHub)</a>
  </div>

  <!-- Table of Contents -->
  <nav class="toc" id="toc">
    <h3>Contents</h3>
    <ol>
      <li><a href="#why">Text-Grounded Navigation with a Self-Curated Notebook</a></li>
      <li><a href="#matters">Why This Benchmark Matters</a></li>
      <li><a href="#prior-work">Prior Work &amp; Novelty</a></li>
      <li><a href="#design">What MiniGridEnv + MiniGridPT Are</a></li>
      <li><a href="#env-design">Environment Design</a></li>
      <li><a href="#openenv">Why OpenEnv</a></li>
      <li><a href="#memory">Cross-Episodic Memory</a></li>
      <li><a href="#scoring">Scoring &amp; Reward Shaping</a></li>
      <li><a href="#architecture">Architecture &amp; Training Pipeline</a></li>
      <li><a href="#memory-evolution">Memory-evolution gallery (illustrative)</a></li>
      <li><a href="#results">Results: What We Found</a></li>
      <li><a href="#engineering">Engineering Lessons</a></li>
      <li><a href="#positioning">Where This Submission Sits</a></li>
      <li><a href="#gigpo">Next Step: GiGPO</a></li>
      <li><a href="#foundations">Foundations &amp; Citations</a></li>
      <li><a href="#quickstart">Quick Start</a></li>
      <li><a href="#future">Future Work</a></li>
      <li><a href="#conclusion">Conclusion</a></li>
    </ol>
  </nav>

  <!-- 1. WHY -->
  <section id="why">
    <h2>Text-grounded navigation with a self-curated notebook</h2>
    <p>Most LLM benchmarks ask what a model <strong>can say</strong>. Few ask whether it can <strong>act in a grounded compositional world while curating its own persistent notebook</strong>. <strong>MiniGridEnv</strong> is an OpenEnv-native wrap of Farama's <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">MiniGrid / BabyAI</a> that gives an LLM a 7&times;7 egocentric world rendered as natural language, natural-language actions (<code>"go forward"</code>, <code>"pickup"</code>, <code>"turn left"</code>), and BabyAI's ten-stage compositional instruction curriculum from <code>GoToRedBall</code> to <code>BossLevel</code>.</p>
    <p><strong>This blog is about the extension.</strong> The base environment is a faithful OpenEnv wrap of MiniGrid/BabyAI (existing work, now interoperable). The novel contribution is <strong>cross-episodic memory</strong>: a line-limited markdown file the agent reads before each action and <em>rewrites</em> at the end of each episode, plus <strong>branch-stable GRPO file naming</strong> so each parallel rollout chain keeps one stable file to compact across optimizer steps.</p>
    <p>Every reward signal is <strong>ground-truth arithmetic</strong> from the underlying BabyAI bot-verifiable success criterion. There is no LLM judge in the loop.</p>
    <p>The falsifiable claims:</p>
    <ol>
      <li><em>GRPO post-training on grounded navigation produces monotonically increasing completion rates across BabyAI's curriculum.</em></li>
      <li><em>Cross-episodic memory measurably improves completion rate over stateless play, and the memory content evolves from random notes into structured strategies as training progresses.</em></li>
    </ol>
    <p><strong>Figure 0</strong> (the banner above) encodes the contribution at a glance: the <em>Observe</em> panel matches the text-observation stack in <a href="#env-design" style="color:var(--accent2)">Environment design</a>; the <em>Act</em> panel matches NL actions, parsing, and OpenEnv stepping in the same section; the <em>Remember</em> panel matches cross-episodic memory $M$ in <a href="#memory" style="color:var(--accent2)">Cross-episodic memory</a> and the training loop in <a href="#architecture" style="color:var(--accent2)">Architecture &amp; training pipeline</a>.</p>
  </section>

  <!-- 2. WHY IT MATTERS -->
  <section id="matters">
    <h2>Why this benchmark matters</h2>
    <p>Grounded navigation with compositional language is a load-bearing capability for embodied agents, web agents, and any LLM that must <em>act under an observation budget</em>. BabyAI has been the reference curriculum for this since 2019, but its native interface is a raw gym environment, not a WebSocket contract a GRPO trainer can consume across machines, Docker containers, and HF Spaces with a single code path.</p>
    <p>The methodology is <strong>transferable</strong>. Any text-grounded sequential task with a sparse terminal reward and compositional instructions (web navigation, tool-use, interactive debugging, embodied robotics simulators) fits the same MDP template. Memory is also transferable: line-limited LLM-rewritten markdown is a general mechanism for <em>self-directed state</em> that is not specific to BabyAI.</p>
    <p>The environment is <strong>engineering-cheap to scale</strong>. MiniGrid steps execute in microseconds; an instance occupies 1&ndash;5&nbsp;MB; the OpenEnv wrapper sets <code>max_concurrent_envs=256</code> out of the box. An LLM-backed environment cannot match that density.</p>
  </section>

  <!-- 3. PRIOR WORK & NOVELTY -->
  <section id="prior-work">
    <h2>Prior work &amp; novelty</h2>
    <p>The space of &quot;LLMs + text-grounded navigation + memory&quot; spans three buckets of prior work. None occupies the cell we target:</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Prior work bucket</th><th>What it does</th><th>What it does not</th></tr></thead>
        <tbody>
          <tr><td><strong>BabyAI / MiniGrid (base)</strong><br><span style="font-size:.85em;color:var(--muted)">Chevalier-Boisvert et al., <a href="https://arxiv.org/abs/1810.08272" target="_blank" style="color:var(--accent2)">arXiv:1810.08272</a> (ICLR 2019); <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">Farama-Foundation/Minigrid</a></span></td><td>Compositional language-conditioned navigation as a gym environment with a reference bot and a 10-stage difficulty curriculum</td><td>No OpenEnv/WebSocket contract; no text observation; no LLM post-training pipeline; no memory</td></tr>
          <tr><td><strong>Memory-augmented LLM agents</strong><br><span style="font-size:.85em;color:var(--muted)">Voyager (<a href="https://arxiv.org/abs/2305.16291" target="_blank" style="color:var(--accent2)">arXiv:2305.16291</a>); Reflexion (<a href="https://arxiv.org/abs/2303.11366" target="_blank" style="color:var(--accent2)">arXiv:2303.11366</a>); Generative Agents (<a href="https://arxiv.org/abs/2304.03442" target="_blank" style="color:var(--accent2)">arXiv:2304.03442</a>)</span></td><td>Cross-episode skill libraries, verbal reflection, structured long-term memory, all <em>prompt-engineered</em> at inference time</td><td>No RL post-training; no branch-stable memory semantics under GRPO; not connected to OpenEnv</td></tr>
          <tr><td><strong>RLVR on language environments</strong><br><span style="font-size:.85em;color:var(--muted)">DeepSeekMath / GRPO (<a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a>); TRL &times; OpenEnv (<a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL docs</a>)</span></td><td>Critic-free RL with verifiable rewards; standard WebSocket env contract and <code>rollout_func</code></td><td>No persistent agent state across episodes; no first-class notion of branch-stable rollout chains</td></tr>
          <tr class="novel"><td><strong>MiniGridEnv + MiniGridPT (ours)</strong></td><td>OpenEnv wrap of MiniGrid/BabyAI + GRPO + <em>cross-episodic LLM-rewritten markdown memory</em> + <em>branch-stable per-chain file naming</em></td><td>Not a human study; memory is text-only (no retrieval index)</td></tr>
        </tbody>
      </table>
    </div>
    <blockquote>To our knowledge, no prior work combines an <strong>OpenEnv-native BabyAI environment</strong> with <strong>GRPO post-training</strong>, <strong>line-limited LLM-rewritten cross-episodic memory</strong>, and <strong>branch-stable memory-file naming</strong> that keeps each parallel GRPO chain anchored to a stable file across optimizer steps. The env-contract, memory semantics, and training package are the contribution; MiniGrid/BabyAI are the shoulders we stand on.</blockquote>
  </section>

  <!-- 4. WHAT IT IS -->
  <section id="design">
    <h2>What MiniGridEnv + MiniGridPT are</h2>
    <blockquote>Two strictly separated packages. <strong>MiniGridEnv</strong> (the OpenEnv-compatible environment) and <strong>MiniGridPT</strong> (the GRPO training client) communicate exclusively over WebSocket. No shared Python imports. The training container is pure-GPU; the environment container is CPU-only.</blockquote>
    <p>Each episode:</p>
    <ul>
      <li>The env samples a BabyAI level (<code>GoToRedBall</code> &hellip; <code>BossLevel</code>), seeds procedural generation, and emits a mission like <code>&quot;go to the red ball&quot;</code> or <code>&quot;open the door on your left, then put the green ball next to the yellow key&quot;</code>.</li>
      <li>On turn <em>t</em>, the agent sees a natural-language description of its 7&times;7 egocentric view plus the mission, and emits <code>Thought: &hellip;\nAction: &lt;one of 7 actions&gt;</code>.</li>
      <li>A local parser normalizes the action into MiniGrid's <code>Discrete(7)</code> space; the gym env steps; the wrapper builds the next text observation.</li>
      <li><strong>Mid-episode reward is zero.</strong> On success the env emits <code>+1.0</code> (binary reward, the GRPO-friendly default).</li>
      <li><strong>Memory mode only:</strong> at episode end the LLM reads a post-episode prompt and rewrites its persistent <code>memory/*.md</code> file for the next episode.</li>
    </ul>
    <p>The agent's interface is deliberately minimal: plain <code>Thought:/Action:</code> text, no tool-call protocol, no JSON schema. The training client parses and steps the environment over WebSocket.</p>
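    <p>A minimal sketch of that loop from the client side (<code>generate()</code> and <code>parse_thought_action()</code> are illustrative stand-ins and the field access is schematic; the real loop lives in MiniGridPT's <code>rollout_func</code>):</p>
    <pre><code><span class="c"># Sketch: one episode over the OpenEnv WebSocket (helper names are stand-ins).</span>
env = MiniGridClient(base_url="http://localhost:8000").sync()
obs = env.reset(level="GoToRedBall", seed=0)
while not obs.done:
    prompt = f"Mission: {obs.mission}\n{obs.text}"
    reply = generate(prompt)                        <span class="c"># LLM emits "Thought: ...\nAction: ..."</span>
    thought, command = parse_thought_action(reply)  <span class="c"># e.g. ("...", "go forward")</span>
    obs = env.step({"command": command, "thought": thought})
print(obs.reward)                                   <span class="c"># +1.0 on mission success, else 0.0</span></code></pre>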
  </section>

  <!-- 5. ENVIRONMENT DESIGN -->
  <section id="env-design">
    <h2>Environment design</h2>
    <p>The core contract is three Pydantic types exchanged over the OpenEnv WebSocket (<code>MiniGridEnv/env/models.py</code>):</p>
    <pre><code><span class="c"># Action (agent -> env)</span>
class MiniGridAction(Action):
    command: str                 <span class="c"># "go forward", "turn left", "pickup", ...</span>
    thought: Optional[str] = None <span class="c"># logged for analysis, not executed</span>

<span class="c"># Observation (env -> agent)</span>
class MiniGridObservation(Observation):
    text: str                    <span class="c"># NL description of the 7x7 egocentric view</span>
    mission: str                 <span class="c"># "go to the red ball", ...</span>
    step_idx: int; steps_remaining: int; max_steps: int
    history: list[dict]          <span class="c"># recent step summaries</span>
    level_name: str
    last_action: Optional[str]
    action_success: Optional[bool]
    done: bool; reward: Optional[float]; metadata: dict

<span class="c"># State (hidden from agent; logging / eval only; field types shown are representative)</span>
class MiniGridState(State):
    level_name: str; level_difficulty: int
    completed: bool; truncated: bool
    total_reward: float; steps_taken: int
    optimal_steps: int; efficiency_ratio: float
    valid_actions: int; invalid_actions: int   <span class="c"># parse counters</span>
    action_distribution: dict</code></pre>

    <h3>The text observation (quality lever #1)</h3>
    <p>MiniGrid's raw observation is a <code>(7, 7, 3)</code> numpy grid of (object type, color, door state) with the agent fixed at row=6 col=3 facing &quot;up&quot;. <code>env/grid_to_text.py</code> turns that into a layered NL description:</p>
    <ol>
      <li><code>Mission: &hellip;</code></li>
      <li><code>You are facing {east,south,west,north}.</code></li>
      <li>Immediate surroundings: ahead / left / right single-cell descriptions.</li>
      <li>Path ahead: compresses runs of empty cells (e.g. <em>&quot;empty for 3 steps, then a closed red door, then a wall&quot;</em>).</li>
      <li>Notable objects: interactive items (key, ball, box, goal, door, lava) with relative phrases (<em>&quot;2 steps ahead and 1 to your right&quot;</em>), sorted by Manhattan distance.</li>
      <li>Carrying state.</li>
    </ol>
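    <p>Put together, a single rendered observation might read like this (an illustrative composite of the six layers, not a verbatim capture from the env):</p>
    <pre><code>Mission: go to the red ball
You are facing east.
Ahead: empty floor. To your left: a wall. To your right: empty floor.
The path ahead is empty for 3 steps, then a closed red door, then a wall.
Notable objects: a red ball 2 steps ahead and 1 to your right;
  a grey key 4 steps ahead and 2 to your left.
You are carrying: nothing.</code></pre>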
    <p>The internal design note is blunt: <em>&quot;the quality of the text observation is the single biggest lever on training success.&quot;</em> Everything else in the environment is a thin layer over the gym loop.</p>

    <h3>Actions (quality lever #2): NL &rarr; Discrete(7)</h3>
    <p><code>env/action_parser.py</code> maps natural-language strings to MiniGrid's discrete action index. The same logic is duplicated (intentionally) in <code>MiniGridPT/training/openenv_runtime.py</code> so the PT package remains standalone; a parity test guards the two copies.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Canonical</th><th>Index</th><th>Accepted aliases</th></tr></thead>
        <tbody>
          <tr><td><code>turn left</code></td><td class="num">0</td><td><code>left</code></td></tr>
          <tr><td><code>turn right</code></td><td class="num">1</td><td><code>right</code></td></tr>
          <tr><td><code>go forward</code></td><td class="num">2</td><td><code>move forward</code>, <code>forward</code>, <code>ahead</code>, <code>step</code>, <code>walk</code></td></tr>
          <tr><td><code>pickup</code></td><td class="num">3</td><td><code>pick up</code>, <code>grab</code>, <code>take</code>, <code>get</code></td></tr>
          <tr><td><code>drop</code></td><td class="num">4</td><td><code>release</code>, <code>put down</code></td></tr>
          <tr><td><code>toggle</code></td><td class="num">5</td><td><code>open</code>, <code>close</code>, <code>unlock</code>, <code>switch</code></td></tr>
          <tr><td><code>done</code></td><td class="num">6</td><td><code>wait</code>, <code>noop</code>, <code>stop</code></td></tr>
        </tbody>
      </table>
    </div>
    <p>An <strong>unparseable string falls back to <code>go forward</code></strong>, not to <code>done</code>. Rationale: early in training, exploration beats noop; every invalid parse increments a counter so we can watch parse-rate climb with training.</p>
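    <p>A minimal sketch of the lookup-plus-fallback shape (the table above is the data; <code>env/action_parser.py</code> and its PT twin are the source of truth for messier strings):</p>
    <pre><code>ALIASES = {
    "turn left": 0, "left": 0,
    "turn right": 1, "right": 1,
    "go forward": 2, "move forward": 2, "forward": 2, "ahead": 2, "step": 2, "walk": 2,
    "pickup": 3, "pick up": 3, "grab": 3, "take": 3, "get": 3,
    "drop": 4, "release": 4, "put down": 4,
    "toggle": 5, "open": 5, "close": 5, "unlock": 5, "switch": 5,
    "done": 6, "wait": 6, "noop": 6, "stop": 6,
}
INVALID_PARSES = 0  <span class="c"># incremented on fallback so parse-rate can be tracked over training</span>

def parse_action(command: str) -> int:
    global INVALID_PARSES
    key = command.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    INVALID_PARSES += 1
    return ALIASES["go forward"]  <span class="c"># fallback: exploration beats noop early in training</span></code></pre>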

    <h3>BabyAI curriculum (10 levels)</h3>
    <p><code>env/levels.py</code> registers the full BabyAI ladder with candidate gym IDs (so minigrid version drift between <code>BabyAI-GoToRedBallGrey-v0</code> and <code>BabyAI-GoToRedBall-v0</code> doesn't brick a run):</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Stage</th><th>Level</th><th>Gym ID</th><th>Max steps</th><th>Optimal</th></tr></thead>
        <tbody>
          <tr><td class="num">0</td><td><strong>GoToRedBall</strong></td><td><code>BabyAI-GoToRedBallGrey-v0</code></td><td class="num">64</td><td class="num">~10</td></tr>
          <tr><td class="num">1</td><td>GoToObj</td><td><code>BabyAI-GoToObj-v0</code></td><td class="num">64</td><td class="num">~12</td></tr>
          <tr><td class="num">1</td><td>GoToLocal</td><td><code>BabyAI-GoToLocal-v0</code></td><td class="num">64</td><td class="num">~15</td></tr>
          <tr><td class="num">2</td><td>PickupLoc</td><td><code>BabyAI-PickupLoc-v0</code></td><td class="num">64</td><td class="num">~14</td></tr>
          <tr><td class="num">2</td><td>OpenDoor</td><td><code>BabyAI-OpenDoor-v0</code></td><td class="num">64</td><td class="num">~12</td></tr>
          <tr><td class="num">2</td><td>UnlockLocal</td><td><code>BabyAI-UnlockLocal-v0</code></td><td class="num">128</td><td class="num">~25</td></tr>
          <tr><td class="num">3</td><td>GoTo</td><td><code>BabyAI-GoTo-v0</code></td><td class="num">128</td><td class="num">~30</td></tr>
          <tr><td class="num">3</td><td>PutNextLocal</td><td><code>BabyAI-PutNextLocal-v0</code></td><td class="num">128</td><td class="num">~20</td></tr>
          <tr><td class="num">4</td><td>Synth</td><td><code>BabyAI-Synth-v0</code></td><td class="num">128</td><td class="num">~40</td></tr>
          <tr><td class="num">4</td><td><strong>BossLevel</strong></td><td><code>BabyAI-BossLevel-v0</code></td><td class="num">128</td><td class="num">~80</td></tr>
        </tbody>
      </table>
    </div>
    <p>A single Docker container serves every stage. <code>env.reset(level=&quot;BossLevel&quot;)</code> switches the underlying gym env per-reset. A fix replaced the original <code>del kwargs</code> in <code>reset()</code> with a <code>kwargs.pop(&quot;level&quot;, None)</code>, which is what unlocked single-server curriculum training. Per-level <code>max_steps</code> are defined in our <code>LevelConfig</code> registry (<code>env/levels.py</code>); Synth and BossLevel are <strong>capped at 128</strong> steps in this repo so episode length (and vLLM server-mode padding budgets) stay bounded for training.</p>
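    <p>Schematically, the per-reset switch looks like this (method and helper names are illustrative, not the server's actual code):</p>
    <pre><code>def reset(self, **kwargs):
    level = kwargs.pop("level", None)             <span class="c"># the fix: pop, never `del kwargs`</span>
    if level is not None and level != self.level_name:
        self.gym_env = make_babyai_env(level)     <span class="c"># hypothetical factory over env/levels.py</span>
        self.level_name = level
    raw_obs, info = self.gym_env.reset(**kwargs)  <span class="c"># remaining kwargs (e.g. seed) pass through</span>
    return self.to_text_observation(raw_obs)      <span class="c"># grid_to_text + metadata</span></code></pre>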

    <h3>Reward</h3>
    <p>Default: <strong>binary</strong>. <code>+1.0</code> on completion, <code>0.0</code> otherwise. GRPO works best with clean sparse signals. <code>RewardConfig</code> also supports <code>shaped</code> (step penalty + invalid-action penalty) and <code>efficiency</code> (bonus scaled to <code>optimal_steps/steps_taken</code>) modes if a stage stalls.</p>
    <p>Let $r_t$ denote the per-step environment reward (binary default). With horizon $T$ (our capped <code>max_steps</code>), mission success at termination gives a single $+1$ spike:</p>
    $$r_t = \begin{cases} +1 & \text{if the BabyAI mission is satisfied when the episode ends at step } t \\ 0 & \text{otherwise} \end{cases}$$
    <p>In the default mode, $r_t = 0$ for all $t &lt; T$ unless the mission completes early; shaping modes spread signal across steps via <code>RewardConfig</code> in <code>env/reward.py</code>.</p>
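    <p>A side-by-side sketch of the three modes (penalty and bonus constants are illustrative; <code>RewardConfig</code> in <code>env/reward.py</code> holds the shipped values):</p>
    <pre><code>def episode_reward(mode: str, completed: bool, steps_taken: int,
                   optimal_steps: int, invalid_actions: int) -> float:
    base = 1.0 if completed else 0.0
    if mode == "binary":            <span class="c"># default: clean sparse signal for GRPO</span>
        return base
    if mode == "shaped":            <span class="c"># small step / invalid-action penalties</span>
        return base - 0.01 * steps_taken - 0.02 * invalid_actions
    if mode == "efficiency":        <span class="c"># bonus scaled to optimal_steps / steps_taken</span>
        return base * (optimal_steps / max(steps_taken, 1))
    raise ValueError(f"unknown reward mode: {mode}")</code></pre>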
  </section>

  <!-- 6. WHY OPENENV -->
  <section id="openenv">
    <h2>Why OpenEnv</h2>
    <p>OpenEnv gives us three things that matter for this submission:</p>
    <ol>
      <li>A <strong>standard WebSocket environment contract</strong> consumable by TRL's <code>rollout_func</code> with typed Pydantic payloads and Gym-style <code>reset</code>/<code>step</code> semantics.</li>
      <li><strong>Per-session state with <code>SUPPORTS_CONCURRENT_SESSIONS=True</code></strong> and <code>max_concurrent_envs=256</code>. DDP ranks can hammer the same Space without cross-talk because each WebSocket session gets a fresh <code>gym.Env</code> instance (MiniGrid is not thread-safe; factory mode is mandatory).</li>
      <li><strong>Uniform deployment</strong>. Identical env code runs in-process for tests, as a Docker container for development (<code>server/Dockerfile</code>, <code>openenv-base</code>, port 8000), and as a Hugging Face Space during training and evaluation.</li>
    </ol>
    <p>No new abstractions were invented. Base types only: <code>EnvClient</code>, <code>Environment</code>, Pydantic <code>Action</code> / <code>Observation</code> / <code>State</code>. Curriculum level, history, and per-episode metrics ride on <code>metadata</code> and <code>state</code>. The environment ships with <code>openenv.yaml</code>, a <code>Dockerfile</code>, and an HF Space.</p>
    <p>Critically, <strong>MiniGridPT does not <code>import MiniGridEnv</code></strong>. Everything crosses the wire. A <code>MiniGridClient(EnvClient)</code> in <code>MiniGridPT/training/openenv_runtime.py</code> sends plain dicts. This is the architectural lynchpin that lets the training node be pure-GPU and the environment node be CPU-only.</p>
  </section>

  <!-- 7. CROSS-EPISODIC MEMORY (THE NOVELTY) -->
  <section id="memory">
    <h2>Cross-episodic memory</h2>
    <p>This is the research contribution. The base MiniGrid/BabyAI world is stateless between episodes: each <code>reset</code> gives the agent a fresh procedurally generated room with no persistent side-channel. We add one:</p>
    <pre><code>from dataclasses import dataclass
from pathlib import Path

@dataclass
class MemoryConfig:
    enabled: bool = False
    max_lines: int = 100            <span class="c"># line-limit, not token-limit</span>
    memory_dir: str = "./memory"
    agent_id: str = "default"
    branch_stable_memory: bool = False  <span class="c"># see below</span>

    @property
    def memory_path(self) -> Path:
        return Path(self.memory_dir) / f"{self.agent_id}.md"</code></pre>

    <p>Four deliberate design choices, each rejecting a plausible alternative:</p>
    <ol>
      <li><strong>Line limit, not token limit.</strong> Lines are visible and countable <em>by the model</em> in the prompt (<code>(42/100 lines)</code>). The model gets a concrete budget it can reason about.</li>
      <li><strong>Full replacement, not append.</strong> At each episode end the LLM rewrites the file from scratch. This forces the agent to decide what to keep vs. evict (the interesting half of curation).</li>
      <li><strong>Unstructured markdown, not schema.</strong> No bullets required, no JSON. The research question is whether the model will <em>self-organize</em> useful knowledge, not whether it can fill in a template.</li>
      <li><strong>Truncation from the top.</strong> Safety net only; if the model overshoots <code>max_lines</code>, keep the most-recently-written lines.</li>
    </ol>

    <h3>Post-episode rewrite via <code>_temporary_vllm_max_tokens</code></h3>
    <p>Action turns need ~128 tokens (<code>Thought: &hellip;\nAction: go forward</code>). The memory rewrite needs ~512 (100 lines at ~5 tokens/line worst case). One global <code>max_completion_length</code> cannot satisfy both. The fix is a context manager:</p>
    <pre><code>@contextmanager
def _temporary_vllm_max_tokens(trainer, max_tokens: int):
    vg = trainer.vllm_generation
    prev = vg.max_completion_length
    vg.max_completion_length = max_tokens
    try:
        yield
    finally:
        vg.max_completion_length = prev

<span class="c"># Used both for the 512-token memory rewrite and for the 1-token</span>
<span class="c"># NCCL-padding dummy generates described in the Engineering section.</span></code></pre>

    <h3>Branch-stable file naming (per-chain compaction)</h3>
    <p>GRPO runs <em>G</em> parallel completions per prompt, each with its own advantage and gradient contribution. If every slot writes to a uniquely-named file, there's no continuity across optimizer steps, so each memory chain is one episode long. If every slot writes to one shared file, writes race and the signal is mush.</p>
    <p>The solution: <strong>branch-stable naming</strong> <code>rank{R}_br{k}_{base}.md</code> with <code>k = slot_idx % num_generations</code>. The <em>k</em>-th parallel generation maps to a <strong>stable file across optimizer steps</strong>, so branch <em>k</em> after prompt group P1 is the same file used by branch <em>k</em> after prompt group P2. Each of the <em>G</em> GRPO branches builds its own evolving notebook, which is what gives the model a training signal to <em>compact and summarize</em> episode-to-episode.</p>
    <p>Requires <code>per_device_train_batch_size == num_generations</code> (otherwise multiple groups in one step hit the same <em>k</em> and a one-time <code>UserWarning</code> fires). A third scheme (a single shared file across all slots and ranks) is sketched but not landed; it needs a decision about concurrent-writer races.</p>

    <p>Let $M_e \in \mathcal{M}$ denote the memory file (markdown string) at the start of episode $e$, let $\tau_e$ be the trajectory (observations, parsed actions, outcomes), and let $\pi_\theta^{\mathrm{mem}}$ be the same LLM invoked on the post-episode memory-update prompt. The write is a full rewrite followed by a line-budget projection $\Pi_L(\cdot)$ that keeps the last $L$ lines (here $L = 100$):</p>
    $$M_{e+1} = \Pi_L\!\left( \pi_\theta^{\mathrm{mem}}(M_e,\, \tau_e,\, \mathrm{outcome}_e) \right).$$
    <p>Branch-stable filenames tie each GRPO branch index $k = s \bmod G$ to a stable path across optimizer steps, for DDP rank $R$, slot index $s$, group size $G = \texttt{num\_generations}$, and basename <code>base</code> (e.g. <code>default</code>):</p>
    $$\mathrm{path}(R,s,\mathrm{base}) \;=\; \texttt{memory/rank}R\texttt{\_br}_{\,k}\texttt{\_}\mathrm{base}\texttt{.md}\,,\quad k = s \bmod G.$$
    <p>This is exactly the <strong>Remember</strong> panel in Figure&nbsp;0: the file card is $M_e$ at read time; the post-episode LLM box is $\pi_\theta^{\mathrm{mem}}$; the curved arrow is the next-episode read of $M_{e+1}$.</p>
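    <p>Both pieces fit in a few lines (a sketch in the notation above; the shipped logic lives in MiniGridPT):</p>
    <pre><code>from pathlib import Path

def memory_path(rank: int, slot_idx: int, num_generations: int, base: str = "default") -> Path:
    k = slot_idx % num_generations  <span class="c"># branch index k = s mod G, stable across optimizer steps</span>
    return Path("memory") / f"rank{rank}_br{k}_{base}.md"

def project_lines(memory: str, max_lines: int = 100) -> str:
    lines = memory.splitlines()     <span class="c"># Pi_L: keep the most recently written L lines</span>
    return "\n".join(lines[-max_lines:])</code></pre>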

    <blockquote>Can an LLM learn to curate its own persistent, line-budgeted notebook such that cross-episodic memory measurably improves completion rate, and the memory content evolves from random notes into structured strategies as training progresses?</blockquote>
  </section>

  <!-- 8. SCORING -->
  <section id="scoring">
    <h2>Scoring &amp; reward shaping</h2>
    <p>The environment reward is terminal and sparse. Everything else is a <strong>small shaping bonus</strong> designed to rule out pathological regimes without dominating the signal.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Component</th><th>Range</th><th>Source</th><th>What it rewards</th></tr></thead>
        <tbody>
          <tr><td><strong>Env reward (binary)</strong></td><td class="num">0 or +1</td><td><code>env/reward.py</code></td><td>Mission completed (BabyAI ground-truth success)</td></tr>
          <tr><td><strong>Format reward</strong></td><td class="num">[&minus;0.1, +0.1]</td><td><code>reward_funcs.reward_format</code></td><td>Both <code>Thought:</code> and <code>Action:</code> present (1.0), one (0.5), neither (0.0), rescaled</td></tr>
          <tr><td><strong>Memory: in-budget</strong></td><td class="num">+0.05</td><td><code>compute_memory_quality_flags</code></td><td>Memory rewrite stayed within <code>max_lines</code> (no truncation)</td></tr>
          <tr><td><strong>Memory: non-empty</strong></td><td class="num">+0.05</td><td><code>compute_memory_quality_flags</code></td><td>Agent is actually writing something</td></tr>
          <tr><td><strong>Memory: not-a-dump</strong></td><td class="num">&minus;0.05</td><td><code>memory_looks_like_observation_dump</code></td><td>Penalty if memory is just a copy of the last observation</td></tr>
        </tbody>
      </table>
    </div>
    <p>Design principle: <strong>env reward dominates</strong>. Format and memory-quality bonuses are at &plusmn;0.1&ndash;0.15 scale, intended as training wheels, removable once the model reliably emits structured output (&gt;90% validity) and writes substantive memory.</p>
    <p>Let $\tau$ denote an episode trajectory and $M_e, M_{e+1}$ memory before/after the episode. Write $R_{\mathrm{env}} = \sum_t r_t \in \{0,1\}$ for the binary BabyAI success signal, $R_{\mathrm{fmt}}(\tau)$ for the rescaled format score in $[-1,1]$ (mapped to $[-0.1,0.1]$ via $\alpha_{\mathrm{fmt}} = 0.1$ in code), and $R_{\mathrm{mem}}(M_{e+1})$ for the memory-quality shaping used by the trainer. The scalar logged to TRL as <code>env_reward</code> is:</p>
    $$R(\tau, M_e, M_{e+1}) \;=\; \underbrace{R_{\mathrm{env}}}_{{\in \{0,1\}}} \;+\; \underbrace{\alpha_{\mathrm{fmt}}\, R_{\mathrm{fmt}}(\tau)}_{{\in [-0.1,\,0.1]}} \;+\; \underbrace{R_{\mathrm{mem}}(M_{e+1})}_{{\in [-0.05,\,0.10]}}.$$
    <p>With indicator $\mathbf{1}[\cdot]$, line budget $L$, and dump detector $\mathrm{dump}(M)$ (true when memory is effectively a copy of the last observation):</p>
    $$R_{\mathrm{mem}}(M) = \beta_{\mathrm{budget}}\,\mathbf{1}\big[\mathrm{lines}(M) \le L\big] + \beta_{\mathrm{ne}}\,\mathbf{1}\big[M \neq \varnothing\big] - \beta_{\mathrm{dump}}\,\mathbf{1}\big[\mathrm{dump}(M)\big],$$
    <p>with $\beta_{\mathrm{budget}} = \beta_{\mathrm{ne}} = \beta_{\mathrm{dump}} = 0.05$ as implemented in <code>MiniGridPT</code> (names may differ slightly in code; the ranges in the table above match the shipped constants).</p>
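    <p>The whole shaping term reduces to a few indicator checks (a sketch; the substring test stands in for the real <code>memory_looks_like_observation_dump</code> heuristic):</p>
    <pre><code>def memory_quality_bonus(memory: str, last_obs: str, max_lines: int = 100) -> float:
    bonus = 0.0
    if len(memory.splitlines()) &lt;= max_lines:
        bonus += 0.05   <span class="c"># in budget: the rewrite survived without truncation</span>
    if memory.strip():
        bonus += 0.05   <span class="c"># non-empty: the agent is actually writing</span>
    if last_obs.strip() and last_obs.strip() in memory:
        bonus -= 0.05   <span class="c"># dump penalty: memory is just the last observation</span>
    return bonus</code></pre>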
    <blockquote><strong>Why memory quality has a negative flag.</strong> Without the <code>memory_looks_like_observation_dump</code> penalty, the shortest-path way to collect the non-empty bonus is to paste the last observation into memory. That gives zero cross-episodic signal. The penalty forces the memory to be <em>compressed / abstracted</em>, which is the interesting behavior.</blockquote>
  </section>

  <!-- 9. ARCHITECTURE & TRAINING PIPELINE -->
  <section id="architecture">
    <h2>Architecture &amp; training pipeline</h2>
    <p>Two strictly separated packages. <strong>MiniGridEnv</strong> (OpenEnv environment) and <strong>MiniGridPT</strong> (GRPO training client) communicate exclusively over WebSocket (no in-process imports).</p>

    <div class="mermaid-wrap">
      <pre class="mermaid">
flowchart LR
    subgraph PT ["MiniGridPT (Training)"]
        GRPO["GRPOTrainer<br/>TRL 1.0.0"]
        RF["rollout_func<br/>(per-episode loop)"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["parse_action<br/>(NL -> Discrete(7))"]
        MEM["memory/rank{R}_br{k}_default.md<br/>(branch-stable)"]
    end
    subgraph ENV ["MiniGridEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        GYM["MiniGrid gym env<br/>(BabyAI level)"]
        TEXT["grid_to_text<br/>(7x7 -> NL)"]
        REW["Reward<br/>binary +1.0"]
    end

    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate Thought/Action"| PARSE
    PARSE -->|"{command, thought}"| WS
    WS --> GYM
    GYM --> TEXT
    TEXT -->|"observation.text"| WS
    WS -->|"obs + reward + done"| RF
    RF -->|"post-episode rewrite"| MEM
    MEM -->|"read at t=0 next episode"| RF
    REW --> WS
      </pre>
      <p class="mermaid-caption">Figure 1. System architecture. PT never imports env-side types. Memory is a per-branch markdown file owned by the training client, rewritten by the LLM at each episode end.</p>
    </div>

    <p>Training uses <strong>GRPO</strong> (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's <code>rollout_func</code> contract for explicit control over the <code>generate &rarr; parse &rarr; env.step</code> loop.</p>

    <h3>The rollout function</h3>
    <p>Per slot, <code>_rollout_one_episode</code> (<code>MiniGridPT/training/rollout_func.py</code>) runs a complete episode inside one training step:</p>
    <ol>
      <li>Build initial chat messages (system + first user observation block, with the current memory folded in if enabled).</li>
      <li>Open a WebSocket session via <code>MiniGridClient(base_url=ENV_BASE_URL).sync()</code> and call <code>env.reset(level=LEVEL, seed=&hellip;)</code>.</li>
      <li><strong>Episode loop</strong> until <code>done</code> or the turn cap: generate with vLLM (<code>max_completion_length=128</code>), append tokens to <code>completion_ids</code> with <code>env_mask=1</code>, parse Thought/Action, call <code>env.step({&quot;command&quot;: canonical, &quot;thought&quot;: thought})</code>, append the rendered next-observation tokens with <code>env_mask=0</code>.</li>
      <li><strong>Post-episode memory rewrite</strong> (memory mode only): build <code>MEMORY_UPDATE_PROMPT</code> with outcome / steps / current memory / line count / budget, call <code>generate()</code> wrapped in <code>_temporary_vllm_max_tokens(trainer, 512)</code>, write to the branch-stable file, append tokens with <code>env_mask=0</code>.</li>
    </ol>

    <p>The return dict is the shape TRL's <code>GRPOTrainer</code> consumes:</p>
    <pre><code>{
    "prompt_ids":     list[list[int]],   <span class="c"># one per slot (fixed initial prompt)</span>
    "completion_ids": list[list[int]],   <span class="c"># full episode (LLM + env user turns)</span>
    "logprobs":       list[list[float]], <span class="c"># from vLLM; zero-filled for env_mask=0</span>
    "env_mask":       list[list[int]],   <span class="c"># 1 = LLM token, 0 = env/context token</span>
    "env_reward":     list[float],       <span class="c"># binary env reward + memory-quality bonus</span>
}</code></pre>

    <p>The <code>env_mask</code> is what lets us mix <strong>LLM-authored tokens</strong> (eligible for the env-reward term in the GRPO objective) with <strong>env-rendered context tokens</strong> (visible for KL-to-reference but excluded from the advantage weighting). Without it, the model would be &quot;rewarded&quot; for tokens it didn't generate.</p>
    <p>At episode boundaries, the rollout reads memory $M_e$ into the prompt, rolls out $\tau$ with environment observations rendered as tokens with mask $0$, then applies $\pi_\theta^{\mathrm{mem}}$ to obtain $M_{e+1}$ as in Section&nbsp;7, the same loop sketched in Figure&nbsp;0 (Remember).</p>
    <p>Let $y_{i,1:T_i}$ be the token sequence for completion $i$ (including env turns), $m_{i,t} \in \{0,1\}$ the env mask, and $\rho_{i,t}(\theta) = \pi_\theta(y_{i,t}\mid y_{i,1:t-1}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid y_{i,1:t-1})$ the importance ratio on LLM-authored tokens. With clipping threshold $\epsilon$, KL coefficient $\beta_{\mathrm{KL}}$, and group-relative advantage $A_i = (R_i - \mu_R)/\sigma_R$ over $G$ parallel completions sharing a prompt, the masked GRPO-style surrogate we target is:</p>
    $$\mathcal{L}_{\mathrm{GRPO}}(\theta) \;=\; -\,\mathbb{E}\left[ \sum_{i=1}^{G} \sum_{t : m_{i,t}=1} \min\!\Big( \rho_{i,t}(\theta)\, A_i,\; \mathrm{clip}\big(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\big)\, A_i \Big) \right] + \beta_{\mathrm{KL}}\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}).$$
    <p>Here $R_i \equiv R(\tau_i, M_e, M_{e+1})$ is the scalar from Section&nbsp;8 (environment return plus shaping). Tokens with $m_{i,t}=0$ contribute to the KL / context loss path in TRL but not to the clipped policy-gradient term above.</p>
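    <p>For concreteness, the group-relative advantage is a per-group standardization (TRL's <code>GRPOTrainer</code> computes this internally; the sketch below only grounds the $A_i$ above):</p>
    <pre><code>import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=float)  <span class="c"># the G env_reward scalars sharing one prompt</span>
    return (r - r.mean()) / (r.std() + eps)</code></pre>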
  </section>

  <!-- 10. MEMORY-EVOLUTION GALLERY -->
  <section id="memory-evolution">
    <h2>Memory-evolution gallery (illustrative)</h2>
    <p>We do not include step-by-step episode traces here: the training logs for the current run are not immediately accessible, and the priority for this submission is the <strong>mechanism</strong> (Figure&nbsp;0, Sections&nbsp;7&ndash;9) rather than cherry-picked rollouts. <strong>Compute for this project was exhausted</strong> before we could finish a converged memory ablation; the author is also concurrently submitting <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so GPU budget is shared across multiple codebases.</p>
    <p>The cards below are <strong>not</strong> verbatim snapshots from a finished training run. They are <em>category placeholders</em> for what we expect to extract once additional compute is available to run memory-structure experiments and to save real <code>memory/rank{R}_br{k}_*.md</code> files at checkpoints. Future work will test alternative memory organizations (Section&nbsp;11) under that budget.</p>

    <h3>Qualitative memory-file evolution</h3>
    <p>When a long run exists, we will snapshot <code>memory/rank0_br0_default.md</code> (or branch-stable peers) and categorize content. For now, each panel illustrates a <em>type</em> of content we expect to see at different training phases:</p>
    <div class="memory-gallery">
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~500 (noise)</span></div>
        <pre><code>ball is red
i saw a door
step 3 turn left
step 4 go forward</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~5000 (action patterns)</span></div>
        <pre><code>- if the Mission says &quot;go to X&quot;, first face X
- turn left/right before go forward if object is
  to the side
- on GoToRedBall the ball is usually 1-3 steps away</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~15000 (level-specific notes)</span></div>
        <pre><code>- UnlockLocal: keys are the same color as doors
- OpenDoor: &quot;toggle&quot; opens closed and locked doors
  (if carrying correct key)
- Synth: mission has multiple clauses -&gt; do them
  left-to-right as written</code></pre>
      </div>
      <div class="memory-card">
        <div class="mem-header"><span>memory/rank0_br0_default.md</span><span class="mem-meta">step ~25000 (failure notes)</span></div>
        <pre><code>- if action_success=False on go forward, there is a
  wall/door -&gt; rotate before next step
- pickup with no adjacent object always fails; read
  &quot;carrying: nothing&quot; before attempting</code></pre>
      </div>
    </div>
    <blockquote>Verbatim memory snapshots from a converged or long partial run will replace these placeholders when further compute is available. Until then, the gallery documents the <strong>hypothesis space</strong> for how $M$ should evolve, not empirical outcomes from the current submission.</blockquote>
  </section>

  <!-- 11. RESULTS -->
  <section id="results">
    <h2>Results: what we found</h2>

    <h3>Status (honest scope)</h3>
    <p>The <strong>MiniGridPT</strong> training package was exercised for <strong>correctness</strong> (short runs, parser parity, WebSocket stepping, memory file I/O, vLLM colocate and multi-GPU server mode with NCCL-safe padding). We do <strong>not</strong> report converged learning curves or final completion rates: the policy did not converge on the full curriculum under the available budget, and structured experiments on alternative memory formats are <strong>deferred to a follow-up compute cycle</strong>. <strong>Compute for this line of work is exhausted</strong> for the current submission window; the author is concurrently shipping <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so GPU time is shared across multiple submissions.</p>

    <h3>What we validated (engineering, not leaderboard numbers)</h3>
    <ul>
      <li><strong>Stable multi-turn rollouts</strong> against the live OpenEnv WebSocket from TRL's <code>rollout_func</code>, with <code>env_mask</code> partitioning LLM-authored vs. env-rendered tokens and per-episode logs persisting to the <code>--output_dir</code>.</li>
      <li><strong>Single-A100 colocate smoke runs</strong> on Qwen3-8B and Qwen2.5-1.5B-Instruct (hundreds of optimizer steps, not a converged curriculum); <code>MGPT_VLLM_GPU_UTIL</code> tuned to ~0.45&ndash;0.65 on 40&nbsp;GB (see Engineering lessons).</li>
      <li><strong>Cross-episodic memory I/O</strong> through <code>_temporary_vllm_max_tokens(trainer, 512)</code> (sketched after this list); branch-stable filenames <code>rank{R}_br{k}_default.md</code> persisting across optimizer steps.</li>
      <li><strong>Multi-GPU server-mode training</strong> with fixed-count generate padding (<code>DIST_SERVER_GENERATES_PER_EPISODE</code>) eliminating NCCL desync under variable-length episodes (now bounded by our capped <code>max_steps</code> &le; 128 per level in <code>env/levels.py</code>).</li>
      <li><strong>Lambda runbook</strong>: <code>bootstrap_lambda.sh</code> &rarr; <code>preflight_lambda.sh</code> &rarr; <code>run_grpo_lambda.sh</code>, with <code>MGPT_*</code> env vars and cadence / metrics callbacks writing <code>metrics_scalars.csv</code>, <code>metrics_events.jsonl</code>, <code>cadence.log</code>, <code>diagnostics_cadence.jsonl</code>.</li>
      <li><strong>36 env-side tests and a PT action-parser parity test</strong> enforcing the NL &rarr; <code>Discrete(7)</code> contract across both packages.</li>
    </ul>
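    <p>The <code>_temporary_vllm_max_tokens</code> pattern above is, in spirit, a save/restore context manager. A minimal sketch, assuming the generation budget lives on the trainer's <code>max_completion_length</code> config field (the exact attribute path depends on the TRL version):</p>
    <pre><code>from contextlib import contextmanager

@contextmanager
def _temporary_vllm_max_tokens(trainer, n):
    <span class="c"># Swap the completion budget, restore on exit even if generation raises.</span>
    old = trainer.args.max_completion_length
    trainer.args.max_completion_length = n
    try:
        yield
    finally:
        trainer.args.max_completion_length = old

<span class="c"># usage: give the post-episode memory rewrite a 512-token budget, then restore:</span>
<span class="c">#   with _temporary_vllm_max_tokens(trainer, 512):</span>
<span class="c">#       new_memory = rewrite_memory(...)   # hypothetical call</span></code></pre>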

    <h3>Baseline harness (wired, not benchmarked)</h3>
    <p>The environment bundles three baselines (Random, BabyAI <code>BotAgent</code>, and a caller-provided zero-shot <code>completion_fn</code>), all runnable in-process without a GPU. They are <strong>not</strong> executed at scale in this submission; longer baseline sweeps and GRPO comparisons are explicitly scoped for the next compute allocation.</p>
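    <p>The zero-shot baseline is just a text-in/text-out callable. A deliberately trivial example of the expected shape (the constant-action body is ours, purely for illustration):</p>
    <pre><code>def completion_fn(prompt: str) -&gt; str:
    <span class="c"># Any callable works: an API client, a local pipeline, or a heuristic.</span>
    <span class="c"># This stub always advances; it exercises the harness without a GPU.</span>
    return "go forward"</code></pre>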

    <h3>Memory design space: planned experiments</h3>
    <p>Because the current run did not converge and memory-structure ablations are outstanding, the table below is the <strong>forward-looking experiment matrix</strong> for how $M$ might be organized once additional GPU budget is available. Each row states a hypothesis; all rows require <strong>additional compute</strong>.</p>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Variant</th><th>Hypothesis tested</th><th>Notes</th></tr></thead>
        <tbody>
          <tr><td><strong>Structured schema</strong> (JSON / YAML / fixed markdown sections)</td><td>Schema &gt; free-form markdown for stable curation</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Append + periodic compaction</strong></td><td>Full-episode rewrite cost limits the learning signal</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Hierarchical</strong> (in-episode scratchpad + cross-episode long-term)</td><td>Conflating short- and long-term in one file hurts</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Retrieval-indexed</strong> (embed notes, top-<em>k</em> by observation)</td><td>Linear-file recall fails at scale</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Shared single-file</strong> across branches / ranks</td><td>Collective memory beats per-branch curation</td><td>Shared-memory design TBD; requires additional compute + concurrency design</td></tr>
          <tr><td><strong>Success-gated writes</strong></td><td>Failure episodes poison $M$</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Variable line budget</strong> by level difficulty</td><td>Uniform $L$ is too tight for hardest stages</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Dual-memory</strong> (policy vs. world knowledge)</td><td>Unified $M$ conflates two knowledge types</td><td>Requires additional compute</td></tr>
          <tr><td><strong>Token budget</strong> instead of line budget</td><td>Line-count is the wrong self-budgeting unit for LLMs</td><td>Requires additional compute</td></tr>
        </tbody>
      </table>
    </div>

    <p>The research question in Section&nbsp;7 remains the scientific target; this submission's empirical contribution is the <strong>validated pipeline and semantics</strong> for $M$, not yet a table of win-rates.</p>
  </section>

  <!-- 12. ENGINEERING LESSONS -->
  <section id="engineering">
    <h2>Engineering lessons</h2>
    <p>Running GRPO + OpenEnv + vLLM on a multi-turn, memory-augmented environment surfaced three categories of structural issues. We document the ones that are <em>general</em>; the next OpenEnv submission is likely to hit each.</p>

    <h3 id="nccl">NCCL desync under variable-length episodes</h3>
    <p>In <code>vllm_mode=server</code>, each <code>trainer.vllm_generation.generate()</code> call performs <code>gather_object &rarr; all_gather_object &rarr; broadcast_object_list</code>. Our rollout runs <code>while not session.done</code>, so different DDP ranks make different numbers of <code>generate()</code> calls per episode: a short run (few turns) vs. the per-level <code>max_steps</code> cap (64 on early BabyAI stages, up to <strong>128</strong> after our registry cap for Synth and BossLevel). NCCL collectives are sequence-numbered: <em>different call counts per rank = permanent desync</em>.</p>
    <p><strong>Symptoms:</strong> training tqdm stuck, GPU 0&ndash;N pinned, vLLM GPU idle, NCCL watchdog firing after its timeout, <code>UnpicklingError</code> as ranks deserialize off-by-one collective buffers.</p>
    <p><strong>Fix:</strong> fixed-count padding. Every rank performs <em>exactly</em> <code>DIST_SERVER_GENERATES_PER_EPISODE</code> generates per episode, where the count is <code>max_episode_turns + (1 if memory_enabled else 0)</code>. After the real loop terminates, <code>_pad_vllm_server_generates_to_target</code> issues dummy 1-token generates under <code>_temporary_vllm_max_tokens(trainer, 1)</code>, outputs discarded, guarded with <code>try/finally</code>. Active only when <code>vllm_mode == &quot;server&quot;</code> and <code>world_size &gt; 1</code>; reward, logprobs, and credit assignment are byte-identical to the unpadded case.</p>
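    <p>Stripped to its core, the padding loop looks like this (simplified sketch; the real helper also threads request metadata, and the <code>generate()</code> argument shape here is an assumption):</p>
    <pre><code>def _pad_vllm_server_generates_to_target(trainer, calls_made, target):
    <span class="c"># Every rank must reach exactly `target` generate() calls per episode so the</span>
    <span class="c"># NCCL collective sequence stays aligned across DDP ranks.</span>
    for _ in range(target - calls_made):
        with _temporary_vllm_max_tokens(trainer, 1):   <span class="c"># 1-token dummy generate</span>
            trainer.vllm_generation.generate(["pad"])  <span class="c"># output discarded</span></code></pre>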
    <blockquote>This pattern is general. <strong>Any TRL <code>rollout_func</code> user running variable-length rollouts in server mode has this bug latent.</strong> LotteryElicitationEnv/PT (sibling project) hit it first; the same fix ported cleanly here.</blockquote>

    <h3 id="vllm-util">vLLM colocate GPU-memory utilization is a total-VRAM fraction</h3>
    <p><code>MGPT_VLLM_GPU_UTIL</code> is passed to TRL &rarr; vLLM as <code>--vllm_gpu_memory_utilization</code>. vLLM interprets it as the fraction of <strong>total device VRAM</strong> the engine may reserve (weights + KV budget). <em>Not</em> &quot;fraction of what PyTorch left free.&quot;</p>
    <p>In colocate mode, the policy model loads first, then vLLM tries to grab its share <em>on the same GPU</em>. Too high &rarr; vLLM startup <code>ValueError</code> or later <code>torch.OutOfMemoryError</code> on logprob / <code>lm_head</code>. The shipped TRL default of 0.9 is too aggressive on 40&nbsp;GB A100 colocate. <strong>Safe range: 0.45&ndash;0.65.</strong></p>
    <p>Server mode needs &ge;2 GPUs (splits vLLM vs. training devices); <code>MGPT_VLLM_MODE=auto</code> picks <code>server</code> on &ge;2 GPUs else <code>colocate</code>.</p>
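    <p>The <code>auto</code> rule is a one-liner in spirit (sketch of the documented behavior, not the shipped launcher code):</p>
    <pre><code>import os
import torch

mode = os.environ.get("MGPT_VLLM_MODE", "auto")
if mode == "auto":
    <span class="c"># server mode splits vLLM onto its own device(s), so it needs &gt;= 2 GPUs</span>
    mode = "server" if torch.cuda.device_count() &gt;= 2 else "colocate"</code></pre>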

    <h3 id="hygiene">Hygiene table (four smaller issues, their fixes)</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Issue</th><th>Root cause</th><th>Fix</th></tr></thead>
        <tbody>
          <tr>
            <td><strong>Single-server curriculum blocked</strong></td>
            <td>Scaffold <code>reset()</code> did <code>del kwargs</code> before forwarding to the gym env, dropping the <code>level</code> kwarg</td>
            <td><code>level = kwargs.pop(&quot;level&quot;, None)</code> (and <code>level_name</code>) before clearing; now one Docker container serves every stage</td>
          </tr>
          <tr>
            <td><strong>Branch-stable memory races</strong></td>
            <td>When <code>per_device_train_batch_size &gt; num_generations</code>, multiple prompt groups in one step map to the same branch index <em>k</em> and race on the file</td>
            <td>Asserted at startup; a one-time <code>UserWarning</code> if the invariant is ever broken; recommended configuration <code>batch == num_generations</code></td>
          </tr>
          <tr>
            <td><strong>Action-parser drift</strong></td>
            <td>PT package ships its own <code>parse_action</code> (so it doesn't import the env); env-side changes can silently diverge</td>
            <td>Parity test <code>tests/test_action_parser_parity.py</code> in MiniGridPT cross-compares canonical actions + aliases against the env's parser</td>
          </tr>
          <tr>
            <td><strong>&quot;go forward&quot; fallback on unparseable text</strong></td>
            <td>Early-training LLMs emit malformed text; mapping to <code>done</code> kills episodes instantly (zero signal)</td>
            <td>Fallback = <code>go forward</code>, not <code>done</code>; every invalid parse increments <code>invalid_actions</code> so the parse-rate climb is a visible training-progress curve</td>
          </tr>
        </tbody>
      </table>
    </div>
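    <p>Two of those fixes reduce to a few lines each. A simplified sketch (helper names and the substring matching rule are illustrative; the real parser also handles aliases and is pinned by the parity test):</p>
    <pre><code><span class="c"># Fix 1: pop curriculum kwargs *before* the scaffold clears the rest, so the</span>
<span class="c"># level request survives to the gym env (the original bug dropped it).</span>
def extract_level(kwargs):
    level = kwargs.pop("level", None) or kwargs.pop("level_name", None)
    kwargs.clear()
    return level

<span class="c"># Fix 2: unparseable text falls back to "go forward", never "done".</span>
CANONICAL = ["turn left", "turn right", "go forward", "pickup", "drop", "toggle", "done"]

def parse_action(text, counters):
    cleaned = text.strip().lower()
    for name in CANONICAL:
        if name in cleaned:                 <span class="c"># first canonical action mentioned wins</span>
            return name
    counters["invalid_actions"] = counters.get("invalid_actions", 0) + 1
    return "go forward"                     <span class="c"># keeps the episode alive, logs the miss</span></code></pre>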
  </section>

  <!-- 13. POSITIONING QUADRANT -->
  <section id="positioning">
    <h2>Where this submission sits</h2>
    <div class="mermaid-wrap">
      <pre class="mermaid">
quadrantChart
    title Grounded navigation: memory x OpenEnv/RL
    x-axis "Stateless" --> "Memory-augmented"
    y-axis "Gym only" --> "OpenEnv + RL stack"
    quadrant-1 "Our target"
    quadrant-2 "Untouched"
    quadrant-3 "Classical RL"
    quadrant-4 "Prompt-only"
    "BabyAI / MiniGrid": [0.10, 0.14]
    "Lottery (sibling env)": [0.22, 0.90]
    "GRPO, no memory": [0.42, 0.62]
    "Voyager": [0.90, 0.34]
    "Reflexion": [0.74, 0.24]
    "GenAgents": [0.88, 0.14]
    "MiniGridEnv + MiniGridPT": [0.90, 0.88]
      </pre>
      <p class="mermaid-caption">Figure 2. MiniGridEnv + MiniGridPT occupy the memory-augmented + OpenEnv + post-training quadrant that prior work leaves untouched. Voyager / Reflexion / Generative Agents are memory-rich but prompt-only; BabyAI itself is a gym env without an OpenEnv or RL-post-training story; sibling LotteryElicitationEnv is OpenEnv + RL but stateless.</p>
    </div>
  </section>

  <!-- 14. GIGPO UPGRADE -->
  <section id="gigpo">
    <h2>Next step: GiGPO</h2>
    <p>GRPO is good enough to <em>ship</em> this submission: it is critic-free, works out of the box in TRL, and its scalar per-episode advantage is adequate for short-horizon BabyAI stages. But it underweights step-level credit assignment, which is exactly what hurts on 30+ turn episodes and what memory mode needs most (memory episodes are ~2&times; longer).</p>
    <p><strong>GiGPO = GRPO + anchor-state step-level advantages.</strong> Episode-level macro advantage (same group-relative signal as GRPO over $G$ completions):</p>
    $$A^{E}_i \;=\; \frac{R_i - \mu_R}{\sigma_R}.$$
    <p>Step-level micro advantage within anchor-state group $S_k$ (all $(\tau,t')$ pairs whose observation text hashes match step $t$):</p>
    $$A^{S}(a_t) \;=\; \frac{Q(a_t) - \mu_{Q(S_k)}}{\sigma_{Q(S_k)}}\,,\quad S_k = \big\{ (\tau, t') : \mathrm{hash}(o_{t'}) = \mathrm{hash}(o_t) \big\}.$$
    <p>Combined per-token advantage with mixing weight $\omega \ge 0$:</p>
    $$A_t \;=\; A^{E}_i + \omega\, A^{S}(a_t).$$
    <p>When no anchors are found, $A^{S} = 0$ and GiGPO reduces to GRPO (equivalently $\omega = 0$).</p>
    <p>Why this fits MiniGrid: all <em>G</em> rollouts share the same initial observation for a given prompt/seed (guaranteed anchor); corridor navigation revisits the same 7&times;7 egocentric view; BabyAI per-seed determinism creates exact hash matches. The full step-level design is deferred to the GiGPO follow-up (trainer subclass + rollout fields).</p>
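    <p>The grouping itself is cheap. A minimal sketch of anchor-state bucketing and the micro advantage (data layout is assumed; the discounted-return bookkeeping and trainer wiring are the deferred follow-up):</p>
    <pre><code>import hashlib
from collections import defaultdict
from statistics import mean, pstdev

def anchor_groups(episodes):
    <span class="c"># episodes: list of episodes, each a list of (obs_text, Q) pairs, where Q is</span>
    <span class="c"># the discounted return from that step; steps sharing an observation hash</span>
    <span class="c"># form one anchor-state group S_k.</span>
    groups = defaultdict(list)
    for ep in episodes:
        for obs, q in ep:
            groups[hashlib.sha1(obs.encode()).hexdigest()].append(q)
    return groups

def micro_advantage(q, group_qs):
    <span class="c"># A^S = (Q - mu) / sigma within the group; zero for singletons or flat groups,</span>
    <span class="c"># which is also what makes GiGPO reduce to GRPO when no anchors repeat.</span>
    if len(group_qs) &lt; 2:
        return 0.0
    mu, sd = mean(group_qs), pstdev(group_qs)
    return 0.0 if sd == 0 else (q - mu) / sd

<span class="c"># combined per-token advantage: A_t = A_E[i] + omega * micro_advantage(q, S_k)</span></code></pre>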

    <h3>Experimental matrix for the follow-up</h3>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Config</th><th>Algorithm</th><th>Memory</th><th>Flags</th></tr></thead>
        <tbody>
          <tr><td>A</td><td>GRPO</td><td>Off</td><td><code>--loss_type dapo</code></td></tr>
          <tr><td>B</td><td>GRPO</td><td>On (branch-stable)</td><td><code>--loss_type dapo --memory --memory-branch-stable</code></td></tr>
          <tr><td>C</td><td>GiGPO</td><td>Off</td><td><code>--use_gigpo</code></td></tr>
          <tr><td>D</td><td>GiGPO</td><td>On (branch-stable)</td><td><code>--use_gigpo --memory --memory-branch-stable</code></td></tr>
        </tbody>
      </table>
    </div>
    <p><strong>Hypothesis:</strong> D dominates. Step-level anchor-state credit and cross-episodic strategy accumulation are complementary: GiGPO assigns credit <em>within</em> an episode; memory propagates credit <em>across</em> episodes.</p>
  </section>

  <!-- 15. FOUNDATIONS & CITATIONS -->
  <section id="foundations">
    <h2>Foundations &amp; citations</h2>
    <div class="table-wrap">
      <table>
        <thead><tr><th>Foundation</th><th>Role in this project</th><th>Citation</th></tr></thead>
        <tbody>
          <tr><td><strong>MiniGrid &amp; BabyAI</strong></td><td>Base gym environment, 10-stage curriculum, reference <code>BotAgent</code> upper bound, procedural level generation</td><td>Chevalier-Boisvert et al., <a href="https://arxiv.org/abs/1810.08272" target="_blank" style="color:var(--accent2)">arXiv:1810.08272</a> (ICLR 2019); <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank" style="color:var(--accent2)">Farama-Foundation/Minigrid</a></td></tr>
          <tr><td><strong>GRPO / DeepSeekMath</strong></td><td>Critic-free group-relative policy optimization; our default trainer via TRL's <code>GRPOTrainer</code></td><td>Shao et al., <a href="https://arxiv.org/abs/2402.03300" target="_blank" style="color:var(--accent2)">arXiv:2402.03300</a></td></tr>
          <tr><td><strong>TRL &times; OpenEnv</strong></td><td><code>rollout_func</code> contract, vLLM colocate/server, <code>loss_type=dapo</code> length-bias handling</td><td><a href="https://huggingface.co/docs/trl/en/openenv" target="_blank" style="color:var(--accent2)">TRL OpenEnv docs</a></td></tr>
          <tr><td><strong>OpenEnv</strong></td><td>Standard WebSocket env contract, per-session state, <code>create_app</code>, HF Space deploy</td><td><a href="https://huggingface.co/blog/openenv" target="_blank" style="color:var(--accent2)">HF Blog: Introducing OpenEnv</a></td></tr>
          <tr><td><strong>Voyager</strong></td><td>Skill-library / cross-episode knowledge accumulation (closest memory-system analog; ours is RL-trained where Voyager is prompt-engineered)</td><td>Wang et al., <a href="https://arxiv.org/abs/2305.16291" target="_blank" style="color:var(--accent2)">arXiv:2305.16291</a></td></tr>
          <tr><td><strong>Reflexion</strong></td><td>Verbal reflection after episodes; motivates a post-episode LLM rewrite pass over a persistent buffer</td><td>Shinn et al., <a href="https://arxiv.org/abs/2303.11366" target="_blank" style="color:var(--accent2)">arXiv:2303.11366</a></td></tr>
          <tr><td><strong>Generative Agents</strong></td><td>Long-term memory stream with relevance / recency weighting; our line-budgeted rewrite is a deliberately simpler alternative</td><td>Park et al., <a href="https://arxiv.org/abs/2304.03442" target="_blank" style="color:var(--accent2)">arXiv:2304.03442</a></td></tr>
          <tr><td><strong>LotteryElicitationEnv / PT</strong></td><td>Sibling OpenEnv submission; shared structural template for two-repo split, <code>rollout_func</code>, NCCL generate-count padding</td><td>Same monorepo &middot; <a href="https://huggingface.co/spaces/yashu2000/LotteryElicitationEnv" target="_blank" style="color:var(--accent2)">LotteryElicitationEnv HF Space</a></td></tr>
          <tr><td><strong>ReasoningEconomicsEnv / PT</strong></td><td>Structural template for <code>_temporary_vllm_max_tokens</code> pattern</td><td>Same monorepo</td></tr>
        </tbody>
      </table>
    </div>
  </section>

  <!-- 16. QUICK START -->
  <section id="quickstart">
    <h2>Quick start</h2>
    <p>Single-A100 Lambda recipe (use MiniGridEnv Docker + MiniGridPT <code>scripts/</code> as the source of truth for env vars and launch order):</p>
    <pre><code><span class="c"># 0. Clone both packages (sibling directories)</span>
git clone https://github.com/sharma-yash01/MiniGridEnv.git
git clone https://github.com/sharma-yash01/MiniGridPT.git

<span class="c"># 1. Build + start MiniGridEnv (Docker on port 8000)</span>
cd MiniGridEnv
sudo docker build -t minigrid-env:latest -f server/Dockerfile .
sudo docker run -d --name minigrid-env -p 8000:8000 \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    minigrid-env:latest
curl -sS "http://127.0.0.1:8000/health"

<span class="c"># 2. Configure MGPT_* (single A100, colocate vLLM)</span>
export ENV_BASE_URL="http://127.0.0.1:8000"
export MGPT_ROOT=$(pwd)/../MiniGridPT
export MGPT_VENV=$HOME/.venvs/minigridpt-lambda
export PYTORCH_WHEEL_INDEX=https://download.pytorch.org/whl/cu121
export MGPT_MODEL=Qwen/Qwen3-8B
export MGPT_LEVEL=GoToRedBall
export MGPT_VLLM_MODE=colocate
export MGPT_VLLM_GPU_UTIL=0.45           <span class="c"># colocate-safe on A100 40GB</span>

<span class="c"># 3. Bootstrap + preflight + train</span>
bash "$MGPT_ROOT/scripts/bootstrap_lambda.sh"
source "$MGPT_VENV/bin/activate"
bash "$MGPT_ROOT/scripts/preflight_lambda.sh"
cd "$MGPT_ROOT" && nohup bash scripts/run_grpo_lambda.sh > train.log 2>&1 &
tail -f train.log

<span class="c"># 4. Memory-mode variant (branch-stable, batch == num_generations)</span>
export MGPT_MEMORY=1
export MGPT_MEMORY_MAX_LINES=100
export MGPT_MEMORY_BRANCH_STABLE=1
export MGPT_NUM_GENERATIONS=8
export MGPT_BATCH_SIZE=8
bash "$MGPT_ROOT/scripts/run_grpo_lambda.sh"

<span class="c"># 5. Full curriculum (GoToRedBall -&gt; BossLevel)</span>
export ENV_URL="${ENV_BASE_URL}"
export MODEL="${MGPT_MODEL}"
export BASE_OUT="${MGPT_OUTPUT_DIR}/curriculum"
export USE_MEMORY=1
bash "$MGPT_ROOT/scripts/launch_curriculum.sh"</code></pre>
    <p>All 36 env-side tests pass with <code>cd MiniGridEnv &amp;&amp; uv run --with pytest pytest tests</code>. The OpenEnv contract is validated with <code>openenv validate</code>.</p>
  </section>

  <div class="callout">
    <div class="q">Compute budget exhausted for this submission</div>
    <div class="sub">The training package is validated for correctness; converged runs, baseline tables, and memory-structure ablations require <strong>more GPU time</strong>. The author is concurrently submitting <strong>LotteryElicitationEnv</strong> and <strong>ReasoningEconomicsEnv</strong> to the same OpenEnv track, so resources are shared across all three. The open scientific question remains in Section&nbsp;7; what ships now is the pipeline and the formal semantics for $M$.</div>
  </div>

  <!-- 17. FUTURE WORK -->
  <section id="future">
    <h2>Future work</h2>
    <ul>
      <li><strong>Run the full A/B/C/D experimental matrix</strong> to publish the memory-vs-stateless and GRPO-vs-GiGPO comparison across the BabyAI curriculum once additional compute is available (measured numbers to be filled in after those runs).</li>
      <li><strong>Land GiGPO</strong> as a <code>GiGPOTrainer(GRPOTrainer)</code> subclass. Minimum diff: add <code>obs_texts</code> / <code>step_boundaries</code> to the rollout return, compute anchor-state groups, expand step advantages to tokens.</li>
      <li><strong>Close the inference-time gap</strong>: <code>inference/run_episode.py</code> reads memory during play but does not yet mirror training's post-episode LLM memory rewrite. Evaluation should match training end-to-end; add a <strong>post-episode-memory-rewrite eval variant</strong> when more compute is available.</li>
      <li><strong>Baseline harness at scale</strong>: run Random, BabyAI <code>BotAgent</code>, and zero-shot LLM baselines with enough seeds to report completion rates and calibration vs. GRPO / GRPO+memory (deferred for lack of compute).</li>
      <li><strong>Port the NCCL generate-count padding upstream into TRL</strong>: the bug is general, the fix is simple.</li>
      <li><strong>Harder curricula</strong>: extend beyond BabyAI (MiniHack, TextWorld) with the same OpenEnv wrap + memory template.</li>
      <li><strong>Human transfer pilot</strong>: does a memory-trained agent generalize to unseen BabyAI seeds better than stateless, and how much of the memory is environment-specific versus transferable strategy?</li>
    </ul>
  </section>

  <!-- 18. CONCLUSION -->
  <section id="conclusion">
    <h2>Conclusion</h2>
    <p><strong>MiniGridEnv + MiniGridPT</strong> takes the gym-native MiniGrid/BabyAI curriculum and turns it into a complete OpenEnv + GRPO + memory pipeline. The environment is a faithful wrap: text observation, NL action, BabyAI's ten stages. The training package is the extension: branch-stable markdown memory, a post-episode LLM rewrite shaped by <code>_temporary_vllm_max_tokens</code>, and an env-mask-aware rollout loop that makes variable-length multi-turn episodes play nicely with vLLM server mode.</p>
    <p>The infrastructure contributions (NCCL generate-count padding for variable-length rollouts, branch-stable per-chain memory files, the <code>max_completion_length</code> context manager for mixed action/memory generation budgets, per-reset curriculum via <code>reset()</code> kwargs) are lessons the next OpenEnv + TRL 1.0 + multi-turn + memory submission will need.</p>
    <p>Empirical completion tables and memory ablations await the next compute cycle (Section&nbsp;7 for the open question; Section&nbsp;11 for the planned experiment matrix). What ships with this post is the <strong>validated pipeline and the formal semantics for $M$</strong>.</p>
  </section>

  <div class="footer">
    <p>MiniGridEnv &middot; AgentX OpenEnv Track &middot; UC Berkeley RDI</p>
    <p style="margin-top:.5rem;">
      <a href="https://github.com/sharma-yash01/MiniGridEnv" target="_blank" rel="noopener noreferrer">MiniGridEnv</a> &middot;
      <a href="https://github.com/sharma-yash01/MiniGridPT" target="_blank" rel="noopener noreferrer">MiniGridPT</a> &middot;
      <a href="https://huggingface.co/spaces/yashu2000/MiniGridEnv" target="_blank">MiniGridEnv HF Space</a> &middot;
      <a href="https://github.com/meta-pytorch/OpenEnv" target="_blank">OpenEnv Framework</a> &middot;
      <a href="https://huggingface.co/docs/trl/en/openenv" target="_blank">TRL x OpenEnv</a> &middot;
      <a href="https://github.com/Farama-Foundation/Minigrid" target="_blank">MiniGrid</a>
    </p>
  </div>

</div>
</body>
</html>