---
library_name: pytorch
license: mit
pipeline_tag: other
tags:
  - arc-prize-2025
  - program-synthesis
  - tiny-recursive-models
  - recursive-reasoning
  - resume-training
  - act
  - reproducibility
datasets:
  - arc-prize-2025
model-index:
  - name: Tiny Recursive Models - ARC-AGI-2 (Resume Step 119432)
    results:
      - task:
          type: program-synthesis
          name: ARC Prize 2025 (legacy evaluation mapping)
        dataset:
          name: ARC Prize 2025 Public Evaluation
          type: arc-prize-2025
          split: evaluation
        metrics:
          - type: accuracy
            name: ARC Task Solve Rate (pass@1)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@2)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@10)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@100)
            value: 0.0083
---

# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

Overview
- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.

About “119 434” vs “119 432”
- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.

Contents
- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary are included for convenience.

Kaggle Status
- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
- A third‑party Kaggle implementation demonstrating TRM inference is available on Kaggle; interested users can locate it by searching the platform. That implementation is independent of this release.

Current Kaggle Limitations
- Unique‑candidate enforcement: A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
- Sampler configuration propagation: Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature did not reliably reach the evaluator under our overlays.
- Sampling degeneracy: Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
- Limited attempt‑level telemetry: Evaluators emitted aggregate metrics only; per‑attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
- Identifier mapping: The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.
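
The uniqueness filter that the overlay bypassed can be sketched as a simple dedup pass over ranked candidates. This is an illustrative sketch, not the release's evaluator code; `unique_candidates` is a hypothetical helper name.

```python
def unique_candidates(candidates, limit):
    """Keep the first occurrence of each distinct grid, preserving rank order.

    Hypothetical sketch of the uniqueness guard described above; grids are
    lists of lists of ints, as in ARC outputs.
    """
    seen = set()
    kept = []
    for grid in candidates:
        key = tuple(tuple(row) for row in grid)  # hashable form of a 2-D grid
        if key not in seen:
            seen.add(key)
            kept.append(grid)
        if len(kept) == limit:
            break
    return kept


# Two duplicate attempts collapse to one; a distinct grid fills the second slot.
attempts = [[[1, 2]], [[1, 2]], [[3, 4]]]
print(unique_candidates(attempts, 2))  # [[[1, 2]], [[3, 4]]]
```

With a guard like this in place, `ARC_SAMPLING_COUNT > 1` only helps if the sampler actually produces distinct grids, which is why the propagation and degeneracy items above matter independently.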

Attempted Mitigations
- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume‑guard logs; duplicates persisted.
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
- Adjusted temperature and top‑k settings; no material improvement under the broken overlay path.
- Prepared Kaggle datasets/notebooks and executed end‑to‑end; duplicate attempts persisted and no leaderboard submission was made.
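
A minimal version of the diversity signal those debug evaluators reported might look like the following; `candidate_diversity` is an illustrative name, not the actual script's API.

```python
def candidate_diversity(attempts):
    """Fraction of distinct candidates among the attempts for one task.

    Illustrative metric only: 1.0 means every attempt differs, 1/N means
    all attempts are identical.
    """
    if not attempts:
        return 0.0
    keys = {tuple(tuple(row) for row in grid) for grid in attempts}
    return len(keys) / len(attempts)


# Eight identical attempts score 1/8, matching the "near zero" symptom above.
print(candidate_diversity([[[0]]] * 8))  # 0.125
```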

Evaluation Guidance
- Enforce a strict unique‑candidate guard prior to scoring.
- Validate that sampling environment variables propagate into the inference process.
- Capture per‑attempt outputs and, where possible, logits to diagnose diversity.
- Match the identifier mapping to the checkpoint (legacy vs sorted) and use the corresponding dataset builder.
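
The second guidance item can be enforced with a fail-fast check inside the inference process. The variable names below come from this card; the checker itself is a hypothetical sketch, not part of the release.

```python
import os

# Expected sampling configuration for this scenario (per the card: multiple
# attempts with ARC_SAMPLING_MODE=sample). Illustrative, not upstream code.
REQUIRED = {
    "ARC_SAMPLING_COUNT": lambda v: v.isdigit() and int(v) >= 1,
    "ARC_SAMPLING_MODE": lambda v: v == "sample",
}


def check_sampling_env(env=None):
    """Raise if any required sampling variable is missing or invalid."""
    env = os.environ if env is None else env
    bad = [k for k, ok in REQUIRED.items() if k not in env or not ok(env[k])]
    if bad:
        raise RuntimeError(f"sampling configuration did not propagate: {bad}")


# Passes silently when the overlay delivered the variables correctly.
check_sampling_env({"ARC_SAMPLING_COUNT": "8", "ARC_SAMPLING_MODE": "sample"})
```

Running a check like this at evaluator startup would have surfaced the overlay propagation failure before any candidates were scored.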

Reproduction
```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```

```bash
# CoreWeave resume (reference)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```

Resume Guard Signals
- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain: `[resume] initializing train_state.step to 115815` before training proceeds.
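
An automated version of this log check might look like the following; `verify_resume_log` is a hypothetical helper, not upstream TinyRecursiveModels code.

```python
import re

# Matches the resume-guard line quoted above and captures the step it reports.
RESUME_LINE = re.compile(r"\[resume\] initializing train_state\.step to (\d+)")


def verify_resume_log(log_text, expected_step):
    """Return True iff the logs show the guard initializing at expected_step."""
    match = RESUME_LINE.search(log_text)
    return match is not None and int(match.group(1)) == expected_step


assert verify_resume_log("[resume] initializing train_state.step to 115815", 115815)
```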

Known Issues and Next Steps
- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re‑verify multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; sorted‑mapping evaluations require remapping or fine‑tuning.
- Observability: add candidate‑level telemetry and visualizations to support reliable evaluation claims.

Ethics, License, and Intended Use
- MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity prior to reporting results.

Acknowledgements
- Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community‑maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.