seconds-0 committed
Commit ea34ded · verified · 1 Parent(s): b011c40

Update model card: candid Kaggle status and troubleshooting narrative

Files changed (1)
  1. README.md +55 -42
README.md CHANGED
@@ -39,63 +39,76 @@ model-index:
 
  # Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)
 
- **What’s new (Nov 2025).** This refresh publishes the best-performing checkpoint from the CoreWeave resume campaign—`trm_arc2_8gpu_resume_step115815_plus100k_v2` at global step **119 432**. The job resumed from TinyRecursiveModels commit `e7b68717` with the full resume guard stack (`trm-common-script` + `trm-pyshim`) and legacy ARC identifier mapping. This is the same checkpoint we attempted to ship to Kaggle; the submission stalled at 0.83 % pass@1 because every task duplicated attempts, so we are documenting the shortfall here instead of claiming leaderboard progress.
 
- **Why the name mentions 119 434.** Internal tracking labelled this snapshot “step 119 434”, but the persisted shard on the CoreWeave PVC is `step_119432`. The W&B records for the run confirm that the resume guard initialized at the expected `115 815` step and advanced to the 119k block; no 119 434 shard survived the routine pruning. When downstream tooling expects the 119 434 identifier, point it at this artifact and note the two-step discrepancy.
 
- ## Checkpoint Snapshot
- - **Run name**: `trm_arc2_8gpu_resume_step115815_plus100k_v2`
- - **Global step**: 119 432 (3 617 optimizer updates after the 115 815 resume point)
- - **Architecture**: Tiny Recursive Model ACT V1 (`L_layers=2`, `H_cycles=3`, `L_cycles=4`, hidden size 512, 8 heads, RoPE, bfloat16 activations)
- - **Optimizer**: Adam-atan2 (`beta1=0.9`, `beta2=0.95`, `weight_decay=0.1`, EMA 0.999, global batch size 768)
- - **Dataset builder**: Legacy identifier order (`dataset/build_arc_dataset_legacy.py`) targeting `arc2concept-aug-1000`
- - **Resume provenance**:
-   - `RESUME_CHECKPOINT_PATH` → `/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
-   - `RESUME_EXPECTED_STEP` → `115815`
-   - `[resume] initializing train_state.step to 115815` appears in pod logs before training continues
- - **PVC retention**: Latest PVC shards now extend to `step_662428`; earlier 119k shards were pruned after packaging this export.
 
- ## Files Included
- | Path | Description |
- | --- | --- |
- | `model.ckpt` | Consolidated PyTorch checkpoint (optimizer, EMA, and weights) containing `step_119432/*` tensors. SHA-256: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`. |
- | `COMMANDS.txt` / `COMMANDS_resumed.txt` | Torch distributed launch (8 × H200) showing the resume flags and dataset path. |
- | `ENVIRONMENT.txt` | Hydra-resolved configuration captured on CoreWeave after overlays. |
- | `MANIFEST.txt` | Packaging metadata (checkpoint step, source path, timestamp, sha256). |
- | `TRM_COMMIT.txt` | Upstream TinyRecursiveModels Git SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`). |
- | `all_config.yaml` | Structured config snapshot exported alongside the checkpoint. |
- | `dataset-metadata.json` | Kaggle dataset manifest (kept for parity with previous releases). |
 
- ## Evaluation Status
- - **Validation (CoreWeave pod evaluator, legacy mapping)**: `pass@1 = 0.83 %`, identical scores for pass@2/5/10/100 because samples were duplicates. Mean token accuracy ≈ 70.1 %, `train/lm_loss` ≈ 0.134 at resume, `all/lm_loss` ≈ 1.56.
- - **Kaggle inference notebook (test split)**: Also produced 259/259 duplicate attempts, yielding 0.83 % pass@1 and no leaderboard improvement. The issue remains unresolved; do not submit this checkpoint to Kaggle until the sampler divergence is fixed.
- - **Copy-mode diagnostics** (`scripts/debug_eval_cpu.py` in legacy mode): 0/120 grid matches (consistent with earlier baselines).
 
- The metrics bundled here are sufficient to reproduce our internal dashboards without requiring live W&B access. If you have Weights & Biases credentials, the run is listed under `trm_arc2_8gpu_resume_step115815_plus100k_v2` in project `trm-arc2`; the first logged step after resume exceeds 115 815, confirming the guard executed.
 
- ## Inference & Reproduction
  ```python
  from huggingface_hub import hf_hub_download
  import torch
 
- ckpt_path = hf_hub_download("seconds0/trm-arc2-8gpu", "model.ckpt")
  state = torch.load(ckpt_path, map_location="cpu")
  print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
  ```
 
- To recreate the CoreWeave launch:
- ```bash
  kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
- # Ensure ConfigMaps trm-common-script, trm-pyshim-cm, and trm-eval-overlay are applied first.
  ```
- Before submitting jobs, verify:
- 1. `RESUME_CHECKPOINT_PATH` points to the 115 815 shard.
- 2. `[resume] initializing train_state.step to 115815` appears once training boots.
- 3. The first W&B point is ≥115 815 with `train/lm_loss` ≈ 0.13.
 
- ## Known Gaps & Next Steps
- 1. **Sampler instability** – Deduplicate sampler outputs before retrying Kaggle submissions.
- 2. **Identifier remapping** – Remains legacy-only; switching to sorted identifiers requires remapping or finetuning.
- 3. **W&B rehydration** – Set `WANDB_API_KEY` locally if you need fresh metrics; the release ships cached configs only.
 
- Please cite the Tiny Recursive Models paper and ARC Prize 2025 when using this checkpoint. Contributions, bug reports, and sampler fixes are welcome via the repository issues.
 
 
  # Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)
 
+ ## Summary
+ - This is an 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
+ - It packages one consolidated `model.ckpt` plus the exact configs used on CoreWeave. Integrity hash (SHA-256): `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
+ - Evaluation quality is currently poor (ARC public evaluation pass@1 ≈ 0.83%) due to duplicate candidate generation after the resume. Until we address the sampler issues, this artifact is provided for transparency and reproduction, not for leaderboard use.
 
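A quick way to confirm a download matches the published integrity hash, using only the standard library (a minimal sketch; the `model.ckpt` path is assumed to be the file fetched from this repo):

```python
import hashlib

EXPECTED_SHA256 = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so a multi-GB checkpoint never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# After downloading, expect: sha256_of("model.ckpt") == EXPECTED_SHA256
```

If the digests differ, re-download before debugging anything else; a truncated transfer looks identical to a corrupt checkpoint at load time.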
+ ## Why the “119 434” label sometimes appears
+ - Internal tracking briefly referred to a “step 119 434” snapshot. The durable shard that survived PVC pruning is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block; no distinct 119 434 shard remains.
 
+ ## What’s included
+ - `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) containing tensors from `step_119432/*`.
+ - `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume flags.
+ - `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configs captured on the pod.
+ - `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
+ - `TRM_COMMIT.txt` — Upstream TinyRecursiveModels SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
+ - `dataset-metadata.json` — Kaggle manifest retained for parity (legacy identifier mapping), plus cached W&B CSV/summary.
 
+ ## Kaggle status (candid)
+ - We did not achieve a working Kaggle submission from this checkpoint. Despite multiple attempts, both internal evaluator pods and Kaggle notebooks produced duplicate attempts for every ARC task, yielding ≈0.83% pass@1 with no diversity across samples.
+ - A separate, community-maintained Kaggle implementation exists and may be discoverable on Kaggle by searching for TRM ARC-AGI-2 inference. That work is independent; this release neither mirrors their kernel nor inherits its guarantees.
+ - Please treat this model card as a transparent record of what did and did not work in our pipeline rather than as a polished Kaggle solution.
 
+ ## Why our Kaggle pipeline is broken right now
+ - Unique-candidate guard disabled: a production overlay path bypassed the uniqueness filter, allowing identical candidates to pass through even when `ARC_SAMPLING_COUNT>1`.
+ - Sampler settings not propagating: environment variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and the sampling temperature sometimes failed to reach the evaluator process under our overlays.
+ - Multinomial fallback may be degenerate: with the above misconfigurations, sampling often collapsed to identical outputs across K attempts.
+ - Limited attempt-level telemetry: our evaluator emitted only aggregated summary stats; we lacked per-attempt logits/strings on GPU, and Kaggle kernels restricted richer logging, so we could not visualize candidate selection or diagnose the collapse directly inside the kernels.
+ - Identifier alignment: this checkpoint uses the legacy ARC identifier mapping (a training-time choice), while Kaggle evaluation notebooks increasingly assume the sorted mapping. Without remapping or dedicated compatibility code, comparisons are brittle.
 
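For concreteness, the uniqueness filter the overlay bypassed amounts to deduplicating the K sampled grids before scoring. A minimal sketch (illustrative names, not our evaluator's actual API):

```python
def unique_candidates(attempts, k):
    """Deduplicate sampled ARC grids while preserving rank order.

    `attempts` is a list of candidate grids (lists of lists of ints); at most
    `k` distinct grids are returned. When the sampler has collapsed, this
    returns a single grid no matter how large K was.
    """
    seen = {}  # dicts preserve insertion order in Python 3.7+
    for grid in attempts:
        key = tuple(tuple(row) for row in grid)
        seen.setdefault(key, grid)
    return list(seen.values())[:k]
```

Scoring K identical copies as if they were K attempts is exactly the duplicate-attempt failure mode described above; with this guard in place, a collapsed sampler yields one candidate and the collapse becomes visible.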
+ ## What we tried (and why it wasn’t enough)
+ - Relaunched controlled eval pods with explicit sampling vars (`ARC_SAMPLING_COUNT>1`, `ARC_SAMPLING_MODE=sample`) and verified resume guard messages in logs — duplicates persisted.
+ - Instrumented a CPU debugger (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`) and ran on GPU-backed pods where possible — candidate diversity remained near zero.
+ - Raised temperatures and experimented with multinomial/top-k settings — diversity did not improve under the broken overlay path.
+ - Prepared a Kaggle dataset and notebooks for this checkpoint — kernels executed end-to-end but emitted duplicate attempts and stalled at ≈0.83% pass@1.
 
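“Candidate diversity remained near zero” can be made quantitative with a per-task metric logged alongside each debug run (a hypothetical helper for illustration; it is not part of `scripts/debug_eval_cpu.py`):

```python
def attempt_diversity(attempts):
    """Fraction of distinct grids among the K attempts for one task.

    1.0 means every attempt differed; 1/K means total sampler collapse.
    """
    if not attempts:
        return 0.0
    distinct = {tuple(tuple(row) for row in grid) for grid in attempts}
    return len(distinct) / len(attempts)
```

Tracking this number per task would have surfaced the collapse immediately: a healthy multinomial sampler sits well above 1/K, while our broken overlay pinned it at exactly 1/K everywhere.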
+ ## Guidance if you still want to evaluate
+ - This artifact is best used to replicate our internal numbers or to help debug sampler behavior — not for immediate leaderboard submissions.
+ - If you build your own evaluator:
+   - Re-enable a strict unique-candidate guard before scoring.
+   - Ensure sampling environment variables propagate into the inference process.
+   - Capture per-attempt outputs and logits for diversity diagnostics.
+   - Be explicit about identifier mapping (legacy vs sorted) and ensure your dataset builder matches the checkpoint’s training-time choice.
+
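Checking propagation can be as simple as failing fast at evaluator startup. A sketch using the two variable names mentioned above (the temperature variable's exact name is not standardized here, so it is omitted):

```python
import os

REQUIRED_SAMPLING_VARS = ("ARC_SAMPLING_COUNT", "ARC_SAMPLING_MODE")

def check_sampling_env():
    """Raise before any inference runs if the sampling knobs never reached this process."""
    values = {name: os.environ.get(name) for name in REQUIRED_SAMPLING_VARS}
    missing = [name for name, value in values.items() if value is None]
    if missing:
        raise RuntimeError(f"sampling env vars not propagated: {missing}")
    if int(values["ARC_SAMPLING_COUNT"]) > 1 and values["ARC_SAMPLING_MODE"] != "sample":
        raise RuntimeError("multiple attempts requested but sampling mode is not 'sample'")
    return values
```

Calling this at the top of the evaluator entrypoint turns the silent overlay failure we hit into an immediate, loggable crash.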
+ ## Reproduction snippets
  ```python
  from huggingface_hub import hf_hub_download
  import torch
 
+ ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
  state = torch.load(ckpt_path, map_location="cpu")
  print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
  ```
 
+ ```bash
+ # CoreWeave resume (reference only)
  kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
+ # Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
  ```
 
+ ## Resume guard signals to expect
+ - `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
+ - `RESUME_EXPECTED_STEP=115815`
+ - Pod logs contain `[resume] initializing train_state.step to 115815` before training proceeds.
+
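When scanning pod logs mechanically, the guard line can be asserted with a small helper (a sketch; the log format is exactly the string quoted above):

```python
import re
from typing import Optional

RESUME_PATTERN = re.compile(r"\[resume\] initializing train_state\.step to (\d+)")

def resume_step_from_log(log_text: str) -> Optional[int]:
    """Return the step the resume guard reported, or None if the guard never ran."""
    match = RESUME_PATTERN.search(log_text)
    return int(match.group(1)) if match else None
```

For this run, expect `resume_step_from_log(logs) == 115815`; `None` means the job started from scratch and the resulting checkpoint should be discarded.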
+ ## Known issues and next steps
+ - Sampler instability and duplicate attempts — we must restore the uniqueness guard, verify env propagation, and re-check multinomial sampling.
+ - Identifier mapping mismatch — this release remains “legacy”; remapping or finetuning would be needed for sorted-mapping evaluations.
+ - Limited visibility — add candidate-level telemetry and visualizations before making claims about ARC performance.
+
+ ## Ethics, license, and intended use
+ - MIT license. Research and educational use are encouraged. Do not claim leaderboard results from this artifact without independently validating the evaluation protocol and candidate diversity.
 
+ ## Acknowledgements
+ - Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. Thanks to the community contributors who have shared working Kaggle kernels; while we do not link to third-party work here, those resources can be found on Kaggle.