Update model card: candid Kaggle status and troubleshooting narrative
README.md
# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

## Summary

- This is an 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- It packages one consolidated `model.ckpt` plus the exact configs used on CoreWeave. Integrity hash (sha256): `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`; a verification sketch follows this list.
- Evaluation quality is currently poor (ARC public evaluation pass@1 ≈ 0.83%) due to duplicate candidate generation after the resume. Until the sampler issues are addressed, this artifact is provided for transparency and reproduction, not for leaderboard use.
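
A minimal integrity check, assuming `model.ckpt` has already been downloaded into the working directory; the expected digest is the sha256 published above:

```python
import hashlib

EXPECTED = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

# Stream the checkpoint in 1 MiB chunks so the file need not fit in RAM.
h = hashlib.sha256()
with open("model.ckpt", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

assert h.hexdigest() == EXPECTED, "checkpoint hash mismatch"
print("sha256 OK")
```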

## Why the “119 434” label sometimes appears

- Internal tracking briefly referred to a “step 119 434” snapshot. The durable shard that survived PVC pruning is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block; no distinct 119 434 shard remains.

## What’s included

- `model.ckpt`: consolidated PyTorch checkpoint (weights, optimizer, EMA) containing tensors from `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt`: exact `torchrun` invocations (8× H200) and resume flags.
- `ENVIRONMENT.txt`, `all_config.yaml`: Hydra-resolved configs captured on the pod.
- `MANIFEST.txt`: packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt`: upstream TinyRecursiveModels SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json`: Kaggle manifest retained for parity (legacy identifier mapping), plus cached W&B CSV/summary.

## Kaggle status (candid)

- We did not achieve a working Kaggle submission from this checkpoint. Despite multiple attempts, both internal evaluator pods and Kaggle notebooks produced duplicate attempts for every ARC task, yielding ≈0.83% pass@1 with no diversity across samples.
- A separate, community-maintained Kaggle implementation exists and may be discoverable on Kaggle by searching for TRM ARC-AGI-2 inference. That work is independent; this release neither mirrors its kernel nor inherits its guarantees.
- Please treat this model card as a transparent record of what did and did not work in our pipeline, rather than as a polished Kaggle solution.

## Why our Kaggle pipeline is broken right now

- Unique-candidate guard disabled: a production overlay path bypassed the uniqueness filter, allowing identical candidates to pass through even when `ARC_SAMPLING_COUNT>1`. A minimal guard sketch follows this list.
- Sampler settings not propagating: environment variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and the sampling temperature sometimes failed to reach the evaluator process under our overlays.
- Multinomial fallback may be degenerate: combined with the misconfigurations above, sampling often collapsed to identical outputs across all K attempts.
- Limited attempt-level telemetry: our evaluator builds recorded only aggregated summary stats; we lacked per-attempt logits/strings on GPU, and Kaggle kernels restricted richer logging. As a result we could not visualize candidate selection or diagnose the collapse directly inside the kernels.
- Identifier alignment: this checkpoint uses the legacy ARC identifier mapping (a training-time choice), while Kaggle evaluation notebooks increasingly assume the sorted mapping. Without remapping or dedicated compatibility code, comparisons are brittle.
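
The guard we mean is a strict dedupe over candidate outputs before scoring. A minimal sketch, assuming ARC attempts are represented as integer grids; `dedupe_attempts` is a hypothetical helper, not the evaluator's actual API:

```python
from typing import List

Grid = List[List[int]]

def dedupe_attempts(attempts: List[Grid]) -> List[Grid]:
    """Keep only the first occurrence of each distinct candidate grid."""
    seen = set()
    unique: List[Grid] = []
    for grid in attempts:
        key = tuple(tuple(row) for row in grid)  # hashable form of the grid
        if key not in seen:
            seen.add(key)
            unique.append(grid)
    return unique

# A healthy sampler should rarely collapse like this across K=3 attempts:
attempts = [[[1, 2], [3, 4]], [[1, 2], [3, 4]], [[1, 2], [3, 4]]]
print(len(dedupe_attempts(attempts)))  # 1 (the collapse we observed)
```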

## What we tried (and why it wasn’t enough)

- Relaunched controlled eval pods with explicit sampling variables (`ARC_SAMPLING_COUNT>1`, `ARC_SAMPLING_MODE=sample`) and verified the resume guard messages in the logs; the duplicates persisted.
- Instrumented a CPU debugger (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`) and ran it on GPU-backed pods where possible; candidate diversity remained near zero.
- Raised temperatures and experimented with multinomial/top-k settings; diversity did not improve under the broken overlay path. (A toy sampling sketch follows this list.)
- Prepared a Kaggle dataset and notebooks for this checkpoint; the kernels executed end-to-end but emitted duplicate attempts and stalled at ≈0.83% pass@1.
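
For context on those experiments, a self-contained sketch of temperature-scaled multinomial sampling in plain PyTorch (not the evaluator's actual decode loop): with a healthy sampler, repeated draws at a nontrivial temperature should differ, so K identical attempts point at configuration rather than the model.

```python
import torch

def sample_tokens(logits: torch.Tensor, k: int, temperature: float) -> torch.Tensor:
    """Draw k token samples from temperature-scaled logits for one position."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=k, replacement=True)

logits = torch.randn(16)                             # toy 16-token vocabulary
print(sample_tokens(logits, k=8, temperature=1.0))   # usually varied token ids
print(sample_tokens(logits, k=8, temperature=0.05))  # near-greedy, repetitive
```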

## Guidance if you still want to evaluate

- This artifact is best used to replicate our internal numbers or to help debug sampler behavior, not for immediate leaderboard submissions.
- If you build your own evaluator (an environment-check sketch follows this list):
  - Re-enable a strict unique-candidate guard before scoring.
  - Ensure the sampling environment variables propagate into the inference process.
  - Capture per-attempt outputs and logits for diversity diagnostics.
  - Be explicit about identifier mapping (legacy vs sorted) and ensure your dataset builder matches the checkpoint’s training-time choice.
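
A minimal propagation check, meant to run inside the inference process itself. `ARC_SAMPLING_COUNT` and `ARC_SAMPLING_MODE` are the variables named in this card; `ARC_SAMPLING_TEMPERATURE` and the defaults are illustrative assumptions:

```python
import os

# Fail fast if the sampling configuration never reached this process.
count = int(os.environ.get("ARC_SAMPLING_COUNT", "1"))
mode = os.environ.get("ARC_SAMPLING_MODE", "greedy")
temperature = float(os.environ.get("ARC_SAMPLING_TEMPERATURE", "1.0"))  # assumed name

print(f"[sampling-check] count={count} mode={mode} temperature={temperature}")
if count > 1 and mode != "sample":
    raise RuntimeError("K>1 requested but mode is not 'sample'; all attempts would be identical")
```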

## Reproduction snippets

```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```

```bash
# CoreWeave resume (reference only)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```

## Resume guard signals to expect

- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain `[resume] initializing train_state.step to 115815` before training proceeds; a quick log check is sketched after this list.
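
A small sketch of that log check; the expected line is exactly the one above, while the pod name is a placeholder:

```python
import subprocess

POD = "trm-train-8gpu-resume-0"  # placeholder; substitute your actual pod name
EXPECTED = "[resume] initializing train_state.step to 115815"

logs = subprocess.run(["kubectl", "logs", POD],
                      capture_output=True, text=True, check=True).stdout
if EXPECTED not in logs:
    raise SystemExit("resume guard line not found; checkpoint may not have loaded")
print("resume guard OK")
```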

## Known issues and next steps

- Sampler instability and duplicate attempts: we must restore the uniqueness guards, verify environment propagation, and re-check multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; remapping or finetuning would be needed for sorted-mapping evaluations. (A toy illustration of the two mappings follows this list.)
- Limited visibility: add candidate-level telemetry and visualizations before making further claims about ARC performance.
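
To make the legacy-vs-sorted distinction concrete, a toy sketch; the task ids and mapping rules here are illustrative, and only the existence of the two mappings comes from this card:

```python
# Hypothetical task ids; the real builders operate on ARC task files.
task_ids = ["c3f564a4", "007bbfb7", "9aec4887"]

# "Legacy": ids numbered in the order the dataset builder encountered them.
legacy = {tid: i for i, tid in enumerate(task_ids)}
# "Sorted": ids numbered after lexicographic sorting.
sorted_map = {tid: i for i, tid in enumerate(sorted(task_ids))}

print(legacy)      # {'c3f564a4': 0, '007bbfb7': 1, '9aec4887': 2}
print(sorted_map)  # {'007bbfb7': 0, '9aec4887': 1, 'c3f564a4': 2}
# A checkpoint trained under one mapping scores incorrectly under the other
# unless the evaluator remaps identifiers to match.
```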

## Ethics, license, and intended use

- MIT license. Research and educational use encouraged. Do not claim leaderboard results from this artifact without independently validating the evaluation protocol and candidate diversity.

## Acknowledgements

- Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. Thanks to the community contributors who have shared working Kaggle kernels; while we do not link to third-party work here, those resources can be found on Kaggle.
|