seconds-0 committed on
Commit eab8db3 · verified · 1 Parent(s): ea34ded

Polish model card prose: professional Kaggle status, limitations, guidance

Files changed (1)
  1. README.md +41 -44
README.md CHANGED
@@ -39,49 +39,46 @@ model-index:

  # Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

- Summary
- - This is an 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- - It packages one consolidated `model.ckpt` plus the exact configs used on CoreWeave. Integrity hash: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- - Evaluation quality is currently poor (ARC public evaluation pass@1 0.83%) due to duplicate candidate generation after the resume. Until we address sampler issues, this artifact is provided for transparency and reproduction, not for leaderboard use.

- Why the “119 434” label sometimes appears
- - Internal tracking briefly referred to a “step 119 434” snapshot. The durable shard that survived PVC pruning is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block; no distinct 119 434 shard remains.

- What’s included
- - `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) containing tensors from `step_119432/*`.
- - `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume flags.
- - `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configs captured on the pod.
  - `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- - `TRM_COMMIT.txt` — Upstream TinyRecursiveModels SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- - `dataset-metadata.json` — Kaggle manifest retained for parity (legacy identifier mapping), plus cached W&B CSV/summary.

- Kaggle status (candid)
- - We did not achieve a working Kaggle submission from this checkpoint. Despite multiple attempts, both internal evaluator pods and Kaggle notebooks produced duplicate attempts for every ARC task, yielding ≈0.83% pass@1 with no diversity across samples.
- - A separate, community-maintained Kaggle implementation exists and may be discoverable on Kaggle by searching for TRM ARC-AGI-2 inference. That work is independent; this release does not mirror their kernel or guarantees.
- - Please treat this model card as a transparent record of what did and did not work in our pipeline rather than a polished Kaggle solution.

- Why our Kaggle pipeline is broken right now
- - Unique-candidate guard disabled: A production overlay path bypassed the uniqueness filter, allowing identical candidates to pass through even when `ARC_SAMPLING_COUNT>1`.
- - Sampler settings not propagating: Environment variables like `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature sometimes failed to reach the evaluator process under our overlays.
- - Multinomial fallback may be degenerate: With the above misconfigurations, sampling often collapsed to identical outputs across K attempts.
- - Limited attempt-level telemetry: Our evaluator builds only aggregated summary stats; we lacked per-attempt logits/strings on GPU, and Kaggle kernels restricted richer logging. As a result we could not visualize candidate selection or diagnose collapse directly inside the kernels.
- - Identifier alignment: This checkpoint uses the legacy ARC identifier mapping (training-time choice). Kaggle evaluation notebooks increasingly assume the sorted mapping. Without remapping or dedicated compatibility code, comparisons are brittle.

- What we tried (and why it wasn’t enough)
- - Relaunched controlled eval pods with explicit sampling vars (`ARC_SAMPLING_COUNT>1`, `ARC_SAMPLING_MODE=sample`) and verified resume guard messages in logs; duplicates persisted.
- - Instrumented a CPU debugger (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`) and ran on GPU-backed pods where possible — candidate diversity remained near zero.
- - Raised temperatures and experimented with multinomial/top-k settings; diversity did not improve under the broken overlay path.
- - Prepared a Kaggle dataset and notebooks for this checkpoint — kernels executed end-to-end but emitted duplicate attempts and stalled at ≈0.83% pass@1.

- Guidance if you still want to evaluate
- - This artifact is best used to replicate our internal numbers or to help debug sampler behavior — not for immediate leaderboard submissions.
- - If you build your own evaluator:
- - Re-enable a strict unique-candidate guard before scoring.
- - Ensure sampling environment variables propagate into the inference process.
- - Capture per-attempt outputs and logits for diversity diagnostics.
- - Be explicit about identifier mapping (legacy vs sorted) and ensure your dataset builder matches the checkpoint’s training-time choice.

- Reproduction snippets
  ```python
  from huggingface_hub import hf_hub_download
  import torch
@@ -92,23 +89,23 @@ print(state["hyperparameters"]["arch"]["hidden_size"]) # 512
  ```

  ```
- # CoreWeave resume (reference only)
  kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
  # Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
  ```

- Resume guard signals to expect
  - `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
  - `RESUME_EXPECTED_STEP=115815`
  - Pod logs contain: `[resume] initializing train_state.step to 115815` before training proceeds.

- Known issues and next steps
- - Sampler instability and duplicate attempts — must restore uniqueness guards, verify env propagation, and re-check multinomial sampling.
- - Identifier mapping mismatch — this release remains “legacy”; remapping or finetuning would be needed for sorted-mapping evaluations.
- - Limited visibility — add candidate-level telemetry and visualizations before making claims about ARC performance.

- Ethics, license, and intended use
- - MIT license. Research and educational use encouraged. Do not claim leaderboard results from this artifact without independently validating evaluation protocol and candidate diversity.

  Acknowledgements
- - Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. Thanks to community contributors who have shared working Kaggle kernels; while we do not link to third-party work here, those resources can be found on Kaggle.


  # Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

+ Overview
+ - 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
+ - Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
+ - Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.

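The integrity hash above makes the download verifiable before any loading. A minimal check, assuming the checkpoint was saved locally as `model.ckpt` (any local path works):

```python
import hashlib

EXPECTED_SHA256 = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so multi-GB checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Uncomment after downloading:
# assert sha256_of("model.ckpt") == EXPECTED_SHA256
```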
+ About “119 434” vs “119 432”
+ - Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.

+ Contents
+ - `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
+ - `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
+ - `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
  - `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
+ - `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
+ - `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary are included for convenience.

+ Kaggle Status
+ - This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
+ - A third-party Kaggle implementation demonstrating TRM inference is available on Kaggle; interested users can locate it by searching the platform. That implementation is independent of this release.

+ Current Kaggle Limitations
+ - Unique-candidate enforcement: A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
+ - Sampler configuration propagation: Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature did not reliably reach the evaluator under our overlays.
+ - Sampling degeneracy: Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
+ - Limited attempt-level telemetry: Evaluators emitted aggregate metrics only; per-attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
+ - Identifier mapping: The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.

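The collapse described above is straightforward to detect once per-attempt outputs are retained. A minimal diagnostic sketch; the grid representation (a list of rows) and the 0.5 threshold are illustrative assumptions, not our evaluator's actual format:

```python
def attempt_diversity(attempts):
    """Fraction of distinct candidate grids among the attempts for one task."""
    keys = [tuple(tuple(row) for row in grid) for grid in attempts]
    return len(set(keys)) / len(keys)

def flag_collapsed_tasks(results, threshold=0.5):
    """Return task ids whose attempts are mostly duplicates."""
    return [task_id for task_id, attempts in results.items()
            if attempt_diversity(attempts) < threshold]

results = {
    "task_a": [[[1, 2], [3, 4]]] * 4,        # one grid repeated: collapsed
    "task_b": [[[0]], [[1]], [[2]], [[3]]],  # four distinct grids: healthy
}
print(flag_collapsed_tasks(results))  # ['task_a']
```

A healthy sampler with `ARC_SAMPLING_COUNT > 1` should keep diversity well above zero; our runs sat at exactly one unique candidate per task.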
+ Attempted Mitigations
+ - Relaunched controlled evaluator pods with explicit sampling parameters and verified resume-guard logs; duplicates persisted.
+ - Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
+ - Adjusted temperature and top-k settings; no material improvement under the broken overlay path.
+ - Prepared Kaggle datasets/notebooks and executed end-to-end; duplicate attempts persisted and no leaderboard submission was made.

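Because variable propagation was a recurring failure mode, logging what actually reached the inference process is cheap insurance. A sketch: `ARC_SAMPLING_COUNT` and `ARC_SAMPLING_MODE` appear elsewhere in this card, while `ARC_SAMPLING_TEMPERATURE` and the default values are illustrative assumptions:

```python
import os

def read_sampling_config(env=None):
    """Report the sampling settings actually visible to this process."""
    env = os.environ if env is None else env
    cfg = {
        "count": int(env.get("ARC_SAMPLING_COUNT", "1")),
        "mode": env.get("ARC_SAMPLING_MODE", "greedy"),
        "temperature": float(env.get("ARC_SAMPLING_TEMPERATURE", "1.0")),
    }
    if cfg["count"] > 1 and cfg["mode"] != "sample":
        # This is exactly the misconfiguration that yields K identical attempts.
        print("warning: count=%d but mode=%r" % (cfg["count"], cfg["mode"]))
    return cfg

cfg = read_sampling_config({"ARC_SAMPLING_COUNT": "8", "ARC_SAMPLING_MODE": "sample"})
print(cfg["count"], cfg["mode"])  # 8 sample
```

Logging this dict at evaluator start-up would have surfaced the dropped variables immediately.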
+ Evaluation Guidance
+ - Enforce a strict unique-candidate guard prior to scoring.
+ - Validate that sampling environment variables propagate into the inference process.
+ - Capture per-attempt outputs and, where possible, logits to diagnose diversity.
+ - Match the identifier mapping to the checkpoint (legacy vs sorted) and use the corresponding dataset builder.

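The first bullet amounts to order-preserving deduplication before scoring. A minimal sketch, with candidate grids represented as lists of rows (an illustrative format; a strict guard would additionally fail the run when too few unique candidates survive):

```python
def unique_candidates(attempts):
    """Keep only the first occurrence of each distinct candidate grid."""
    seen, kept = set(), []
    for grid in attempts:
        key = tuple(tuple(row) for row in grid)  # rows -> hashable tuples
        if key not in seen:
            seen.add(key)
            kept.append(grid)
    return kept

attempts = [[[1, 1]], [[1, 1]], [[2, 2]]]  # two duplicates, one distinct
print(len(unique_candidates(attempts)))  # 2
```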

+ Reproduction
  ```python
  from huggingface_hub import hf_hub_download
  import torch
  # …
  print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
  ```

  ```
+ # CoreWeave resume (reference)
  kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
  # Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
  ```

+ Resume Guard Signals
  - `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
  - `RESUME_EXPECTED_STEP=115815`
  - Pod logs contain: `[resume] initializing train_state.step to 115815` before training proceeds.

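These signals can be checked mechanically rather than by eye. A small sketch that scans captured pod-log text for the guard line (obtaining the text, e.g. via `kubectl logs`, is left to the caller):

```python
import re

RESUME_PATTERN = re.compile(r"\[resume\] initializing train_state\.step to (\d+)")

def check_resume_guard(log_text, expected_step=115815):
    """True when the resume guard initialized at the expected step."""
    match = RESUME_PATTERN.search(log_text)
    return match is not None and int(match.group(1)) == expected_step

log = "... [resume] initializing train_state.step to 115815 ..."
print(check_resume_guard(log))  # True
```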
+ Known Issues and Next Steps
+ - Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re-verify multinomial sampling.
+ - Identifier mapping mismatch: this release remains “legacy”; sorted-mapping evaluations require remapping or fine-tuning.
+ - Observability: add candidate-level telemetry and visualizations to support reliable evaluation claims.

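For the mapping mismatch, the core of any remapping is a permutation between the legacy ordering and the sorted ordering. A sketch over hypothetical identifiers (the real identifier lists live in the dataset builder, which this sketch does not read):

```python
def legacy_to_sorted_permutation(legacy_ids):
    """For each legacy-order position, give its position in sorted order."""
    position = {tid: i for i, tid in enumerate(sorted(legacy_ids))}
    return [position[tid] for tid in legacy_ids]

legacy = ["t3", "t1", "t2"]  # hypothetical identifiers
print(legacy_to_sorted_permutation(legacy))  # [2, 0, 1]
```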
+ Ethics, License, and Intended Use
+ - MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity prior to reporting results.

  Acknowledgements
+ - Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community-maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.