
Session Handoff — ARC-AGI Self-Compiling Intelligence

Date: 2026-03-28 (sessions 1+2+3; session 3 COMPLETE)
Status: AGI-1 100%, AGI-2 38.3% eval, AGI-3 12.6% (unified)


Session 3 Final Results

| Track     | Before Session 3  | After Session 3             | Change              |
|-----------|-------------------|-----------------------------|---------------------|
| ARC-AGI-1 | 400/400 (100%)    | 400/400 (100%)              | —                   |
| ARC-AGI-2 | 24/120 (20%) eval | 46/120 (38.3%) eval         | +22 solutions       |
| ARC-AGI-3 | 22/182 (12.1%)    | 23/182 (12.6%) unified 150K | +1 level, new agent |

Key Achievements

  • AGI-2 nearly doubled, from 20% to 38.3%, with 10 parallel retry agents
  • 3 new AGI-3 explorer versions built (v2 lazy rebuilds, v3 grid-click, unified)
  • Grid-click breakthrough unlocked cd82 and lf52 (previously stuck at 0 states)
  • Fine-tune data ready: 519 entries in data/arc_finetune_all.jsonl
  • 544 total unique solutions across all tracks

What To Do Next (Priority Order)

1. Push AGI-3 Higher (13% → 15%+)

  • Run unified agent at 200K+ actions — more budget = more levels
  • Optimize per-action speed (currently ~150ms, 3rd place ~31ms)
  • Try grid-click on all 14 zero-level games, not just the stuck 3
  • Focus on sk48 (316 states at 20K with grid-click, depth=125 — nearly cracking)

2. Push AGI-2 Higher (38% → 50%+)

  • 74 unsolved eval tasks remain (was 100, solved 26 unique)
  • Retry solution files: data/arc2_solutions_retry{4-13}.json (28 raw, 26 unique after dedup)
  • Re-run on remaining 74 with different prompting (deeper analysis, print grids, 10+ hypotheses)
  • Hardest tasks identified by agents: spirals, path tracing, multi-step folding, nested frames

3. Fine-Tune 3B Model for Kaggle

  • Data ready: data/arc_finetune_all.jsonl (519 entries, 221 with Python code)
  • Spin up RunPod pod with GPU (~$2-5)
  • Train Qwen-3B or SmolLM-3B on solve functions
  • Submit to AGI-1 and AGI-2 Kaggle competitions
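
Before training, the fine-tune data can be sanity-checked by loading the JSONL and counting entries with code. A minimal sketch, assuming one JSON object per line; the field name `code` is an assumption about the schema and should be adjusted to whatever `arc_finetune_all.jsonl` actually uses:

```python
import json

def load_finetune_entries(path):
    """Read one JSON object per line from a .jsonl fine-tune file."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                entries.append(json.loads(line))
    return entries

def entries_with_code(entries, key="code"):
    """Keep entries carrying a non-empty code field.
    The field name 'code' is a guess; adapt to the real schema."""
    return [e for e in entries if e.get(key)]
```

Run against `data/arc_finetune_all.jsonl`, this should report 519 total entries and 221 with Python code if the counts above are current.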

4. Competition Submissions

  • AGI-1: Fine-tuned model → Kaggle (runs offline)
  • AGI-2: Fine-tuned model → Kaggle (runs offline)
  • AGI-3: Unified explorer → API submission

Final Scores

| Track     | Score               | Details                                             |
|-----------|---------------------|-----------------------------------------------------|
| ARC-AGI-1 | 400/400 (100%)      | Perfect score, all verified                         |
| ARC-AGI-2 | 46/120 (38.3%) eval | 20 original + 26 unique retries (10 parallel agents)|
| ARC-AGI-3 | 23/182 (12.6%)      | Unified@150K, grid-click unlocks cd82+lf52          |

Session 3: AGI-3 Explorer Improvements (THIS SESSION)

Three Explorer Versions

| Version | File                                  | Key Innovation                                            | Best For                           |
|---------|---------------------------------------|-----------------------------------------------------------|------------------------------------|
| v1      | olympus/arc3/explorer.py              | Priority-group BFS, proven on lp85/vc33                   | Games already solving levels       |
| v2      | olympus/arc3/explorer_v2.py           | Lazy rebuilds, winning path replay, depth-biased frontier | Speed + multi-level games          |
| v3      | olympus/arc3/explorer_v3_gridclick.py | Click every grid cell, not just segment centroids         | Games with <20 states (stuck games)|

Grid-Click Breakthrough (v3)

Games that had 0 levels with 3-9 states explored were STUCK because segmentation missed interactive elements. Grid-click tries every cell:

  • lf52: 0 → 1/10 levels (grid_step=2, 1024 click points, 20K actions)
  • cd82: 0 → 1/6 levels (grid_step=4, 256 click points, only 5K actions!)
  • sk48: 0 → 316 states explored (was 5), depth=125. Needs more budget to crack.
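
The click-point counts above fall out of the stride arithmetic. A minimal sketch of grid-click target enumeration, assuming a 64x64 frame (the frame size and cell-center offset are assumptions, not confirmed by the explorer source):

```python
def grid_click_points(width, height, grid_step):
    """Enumerate a click target at every grid cell (stride = grid_step),
    rather than only at segment centroids. On an assumed 64x64 frame,
    grid_step=2 yields 32*32 = 1024 points and grid_step=4 yields
    16*16 = 256, matching the lf52 and cd82 runs."""
    return [(x, y)
            for y in range(grid_step // 2, height, grid_step)
            for x in range(grid_step // 2, width, grid_step)]
```

Each point would then be issued as an ACTION6 click; exhaustive coverage is what finds interactive elements that segmentation misses.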

v2 100K Benchmark Results (COMPLETE)

TOTAL: 23/182 levels (12.6%) — +1 over v1's 22 levels at 91K

ar25: 2/8   bp35: 1/9   cd82: 0/6   cn04: 1/5*  dc22: 0/6
ft09: 2/6*  g50t: 0/7   ka59: 0/7   lf52: 0/10  lp85: 5/8
ls20: 1/7   m0r0: 2/6   r11l: 0/6   re86: 0/8   s5i5: 1/8
sb26: 0/8   sc25: 0/6   sk48: 0/8   sp80: 2/6   su15: 1/9*
tn36: 1/7   tr87: 0/6   tu93: 1/9*  vc33: 3/7   wa30: 0/9
(* = newly solved vs v1)

Unified Agent (v2+v3 merged) — 150K COMPLETE

olympus/arc3/explorer_unified.py — auto-switches to grid-click when stuck. Score: 23/182 (12.6%) at 150K actions, 5588 seconds total.

ar25:2/8  bp35:1/9  cd82:1/6*  cn04:0/5  dc22:0/6
ft09:0/6  g50t:0/7  ka59:0/7  lf52:1/10*  lp85:5/8
ls20:1/7  m0r0:2/6  r11l:0/6  re86:0/8  s5i5:1/8
sb26:0/8  sc25:0/6  sk48:0/8  sp80:2/6  su15:1/9
tn36:1/7  tr87:1/6  tu93:1/9  vc33:3/7  wa30:0/9
(* = solved by grid-click auto-switch)

Game randomization means individual game scores vary by ±2 levels per run. Grid-click auto-switch successfully unlocked cd82 and lf52.


Key v2@100K wins vs v1@91K:

  • cn04: 0 → 1/5 (was stuck, now solving)
  • ft09: back to 2/6 (was 0 at 50K, needed more budget)
  • lp85: 5/8 (matching v1's best)
  • ls20: 1/7 (back to matching v1)

v2 50K Full Benchmark (completed)

TOTAL: 17/182 levels (9.3%) at 50K actions
ar25:2/8 bp35:1/9 cn04:0/5 lp85:4/8 m0r0:2/6 sp80:2/6
s5i5:1/8 su15:1/9 tr87:1/6 vc33:3/7

4 Critical Bugs Found & Fixed in v2

  1. Winning path replay was dead code — winning_paths was populated but never read. Fixed: efficient replay of state-changing-only actions.
  2. Death actions never recorded — the GAME_OVER transition was lost. Fixed: informational tracking (not blocking — hard step limits make death-blocking harmful).
  3. BFS + short episodes = shallow-only — games like re86 (100-action episodes, 5 actions) can't find solutions at depth >60. Depth-biased frontier selection helps somewhat.
  4. Winning path too long — saved the full exploration history, not the minimum path. Fixed: only save state-changing actions (effective_history).
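
Fixes 1 and 4 share one idea: record only state-changing actions, then replay exactly that list. A minimal sketch under assumed names (`record_transition`, `replay_winning_path`, and the string states are illustrative, not the explorer's actual API):

```python
def record_transition(effective_history, state_before, action, state_after):
    """Fixes for bugs 1 and 4: append only actions that changed state,
    so the saved winning path is a minimal replayable sequence rather
    than the full exploration history."""
    if state_after != state_before:
        effective_history.append(action)

def replay_winning_path(step_fn, effective_history):
    """Replay a recorded win by re-issuing only the effective actions.
    step_fn stands in for whatever callable advances the environment."""
    for action in effective_history:
        step_fn(action)
```

The payoff is on multi-level games: after a win, replaying the short effective path re-reaches the next level in far fewer actions than re-exploring.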

Diagnostic Results (14 zero-level games at 10K actions)

Category 1 — Large state space, productive exploration:

re86: 5807 states, 99% change rate, arrows only, 100-action episodes
cn04: 5614 states, 67% change rate, clicks+arrows, 150-action episodes
sb26: 2191 states, 82% change rate, click+space+undo
wa30: 1444 states, 62% change rate, arrows only
ka59: 920 states, 23% change rate, clicks+arrows

Category 2 — Moderate state space:

sc25: 154 states, 62% change rate
g50t: 118 states, 53% change rate
r11l: 60 states, 100% change rate, click-only
tu93: 50 states, 44% change rate
dc22: 48 states, 2.3% change rate

Category 3 — Tiny state space (STUCK, needs grid-click):

cd82: 37 states → 1/6 with grid-click ✓
sk48: 5 states → 316 states with grid-click, needs more budget
lf52: 3 states → 1/10 with grid-click ✓

Performance Optimization: Lazy Rebuilds

_rebuild_distances() was an O(V+E) pass called on every node add/close. With 20K+ nodes (re86), this dominated runtime. Fixed with lazy rebuilds — only rebuild when choose_action needs routing to the frontier. Result: ~30% speedup on large-state games.
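
The lazy-rebuild pattern is a dirty flag: mutations are O(1), and the expensive pass runs at most once per query. A minimal sketch, with `LazyDistanceIndex` and `rebuild_fn` as stand-in names for the explorer's actual `_rebuild_distances()` machinery:

```python
class LazyDistanceIndex:
    """Dirty-flag wrapper: node adds/closes only mark the index stale
    (O(1)); the expensive O(V+E) rebuild runs at most once per routing
    query instead of once per graph mutation."""
    def __init__(self, rebuild_fn):
        self._rebuild_fn = rebuild_fn  # the expensive O(V+E) pass
        self._dirty = True
        self._distances = None
        self.rebuilds = 0              # instrumentation only

    def invalidate(self):
        """Call on every node add/close."""
        self._dirty = True

    def distances(self):
        """Call only when choose_action needs to route to the frontier."""
        if self._dirty:
            self._distances = self._rebuild_fn()
            self.rebuilds += 1
            self._dirty = False
        return self._distances
```

On a game like re86 that adds thousands of nodes between routing decisions, this collapses thousands of rebuilds into one.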


UNIFIED AGENT PLAN (executed this session; 150K results above)

The Plan

Merge all three innovations into one agent:

  1. v1 priority-group exploration — proven on lp85 (5/8), vc33 (4/7)
  2. v2 lazy rebuilds + winning path replay — speed + multi-level efficiency
  3. v3 grid-click fallback — auto-detect when <50 states after 5K actions, switch to grid-click

Implementation Steps

  1. Copy explorer_v2.py as base
  2. Add grid-click fallback: if explorer.num_states < 50 after 5000 actions, switch to GridClickAgent
  3. Run unified agent at 150K actions on all 25 games
  4. Target: 25+ levels (13.7%+), beating 3rd place's 12.58%
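
The fallback trigger in step 2 reduces to one predicate. A minimal sketch, with the function name assumed (the thresholds 50 states / 5000 actions come from the plan above):

```python
def should_switch_to_gridclick(num_states, actions_used,
                               state_threshold=50, action_budget=5000):
    """Step 2's trigger: once the probe budget is spent, a near-empty
    state graph means segmentation is missing the interactive elements,
    so the unified agent falls back to grid-click."""
    return actions_used >= action_budget and num_states < state_threshold
```

The diagnostic categories above motivate the threshold: stuck games sit at 3-37 states, while productively exploring games clear 50 states long before 5K actions.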

Quick Wins Still Available

  • sk48: Grid-click gets 316 states at 20K. Run at 100K — likely solves level 1+
  • cn04: Solved level 1 at 100K. More budget → more levels
  • cd82: Solved level 1 quickly. More budget for levels 2+
  • lf52: Solved level 1. More budget for levels 2+

ARC-AGI-1: 400/400 (100%)

Solution Files

data/arc_python_solutions_b{0-34}.json     # Main batches (362 solutions)
data/arc_python_solutions_retry_{a,b,c}.json  # Retry waves (12)
data/arc_python_solutions_final6.json       # Final push (4)
data/arc_python_solutions_recovery.json     # Recovered lost solutions (38)
data/arc_python_solutions_last4.json        # Last 4 (3)
data/arc_python_solutions.json              # Original batch (10)
solve_234bbc79.py                           # Standalone: cyclic crossing shifts mod 3
solve_3631a71a.py                           # Standalone: transpose symmetry chain

Verification

py -c "
import json, glob, os
solved = set()
for f in glob.glob('data/arc_python_solutions*.json'):
    with open(f) as fh: solved.update(json.load(fh).keys())
solved.add('234bbc79'); solved.add('3631a71a')
arc1 = set(f.replace('.json','') for f in os.listdir('data/arc1/') if f.endswith('.json'))
print(f'{len(solved & arc1)}/{len(arc1)}')
"
# Expected output: 400/400

ARC-AGI-2: 24/120 eval (20%) pre-retry baseline (session 3 final: 46/120, 38.3%)

Solution Files

data/arc2_solutions_eval{0-3}.json          # First pass: 18 solutions
data/arc2_solutions_retry{0-3}.json         # Retries: +6 solutions
data/arc2_solutions_train_{aa-af}.json      # New training: 112 solutions

ARC-AGI-3: Agent Code

Explorer Files

olympus/arc3/explorer.py        # v1: Priority-group BFS (ORIGINAL)
olympus/arc3/explorer_v2.py     # v2: Lazy rebuilds, replay, depth-bias
olympus/arc3/explorer_v3_gridclick.py  # v3: Grid-cell click for stuck games
olympus/arc3/__init__.py

Diagnostic/Test Files (session 3)

diagnose_zero_games.py          # Diagnostic: 14 zero-level games at 10K
diagnose_zero_results.json      # Diagnostic results (JSON)
run_top3_verbose.py             # Verbose runner for top 3 games
test_v2_quick.py                # Quick v2 test script
benchmark_v2.py                 # Full benchmark runner

How to Run

# Activate venv
source .venv-arc3/Scripts/activate    # Windows

# v1 (original) — all games
ARC_API_KEY="58b421be-5980-4ee8-8e57-0f18dc9369f3" py olympus/arc3/explorer.py

# v2 (improved) — all games at 100K
PYTHONIOENCODING=utf-8 ARC_API_KEY="58b421be-5980-4ee8-8e57-0f18dc9369f3" py benchmark_v2.py

# v3 grid-click — stuck games
PYTHONPATH=. PYTHONIOENCODING=utf-8 ARC_API_KEY="58b421be-5980-4ee8-8e57-0f18dc9369f3" py olympus/arc3/explorer_v3_gridclick.py

# Single game (any version)
ARC_API_KEY="58b421be-5980-4ee8-8e57-0f18dc9369f3" py olympus/arc3/explorer_v2.py GAME_ID MAX_ACTIONS

25 Game IDs

ar25-e3c63847  bp35-0a0ad940  cd82-fb555c5d  cn04-65d47d14  dc22-4c9bff3e
ft09-0d8bbf25  g50t-5849a774  ka59-9f096b4a  lf52-271a04aa  lp85-305b61c3
ls20-9607627b  m0r0-dadda488  r11l-aa269680  re86-4e57566e  s5i5-a48e4b1d
sb26-7fbdac44  sc25-f9b21a2f  sk48-41055498  sp80-0ee2d095  su15-4c352900
tn36-ab4f63cc  tr87-cd924810  tu93-2b534c15  vc33-9851e02b  wa30-ee6fef47

Per-Game Episode Budgets (actions before GAME_OVER)

bp35:~32  su15:~23  tu93:~50  r11l:~60  sc25:~79  sp80:~59
tn36:~61  lf52:~64  sk48:no_limit  cd82:~100  re86:~100
ka59:~100  dc22:~127  g50t:~130  ls20:~130  ft09:~145
cn04:~150  s5i5:~150  m0r0:~151  wa30:~200  ar25:~236
sb26:~265  lp85:~344

Reference Repos (cloned locally)

ARC-AGI-3-Agents/              # Official agent template
arc-agi-3-just-explore/        # 3rd place solution (our reference)
ARC3-solution/                 # DriesSmit CNN approach

API Keys & Credentials

ARC-AGI-3

  • API Key: 58b421be-5980-4ee8-8e57-0f18dc9369f3
  • SDK: arc-agi + arcengine in .venv-arc3/ (Python 3.12)
  • 25 games available, 182 total levels

GitHub

  • Repo: grapheneaffiliate/h4-polytopic-attention (note: no 's')
  • Branch: main

HuggingFace

  • Account: grapheneaffiliates (with 's')
  • Repo: grapheneaffiliates/h4-polytopic-attention

Environment Setup

Python Environments

System Python 3.12: py (Windows launcher)
ARC-AGI-3 venv: .venv-arc3/ (Python 3.12, arc-agi + arcengine + numpy)

Key Paths (Windows)

Project root:    C:\Users\atchi\h4-polytopic-attention
Agent v1:        olympus/arc3/explorer.py
Agent v2:        olympus/arc3/explorer_v2.py
Agent v3:        olympus/arc3/explorer_v3_gridclick.py
Venv:            .venv-arc3/Scripts/activate

Critical Notes

  • DO NOT read game source code for ARC-AGI-3 (environment_files/ = answer key)
  • ACTION6 requires explicit data: env.step(action, data=action.action_data.model_dump())
  • PYTHONIOENCODING=utf-8 required on Windows for unicode output
  • PYTHONPATH=. required when running v3 grid-click from project root
  • Games are RANDOMIZED per scorecard — the same game can give different results across runs
  • GitHub username: grapheneaffiliate (no s)
  • HuggingFace username: grapheneaffiliates (with s)