Polish model card prose: professional Kaggle status, limitations, guidance

# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

## Overview
- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.
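
For clarity, pass@1 here counts a task as solved only if the first submitted candidate exactly matches the target grid. A minimal sketch of that metric (function name and data shapes are illustrative, not part of this release):

```python
# Hedged sketch: pass@1 over ARC tasks. A task counts as solved only if the
# first attempt in its candidate list equals the target grid exactly.
def pass_at_1(predictions, targets):
    """predictions: {task_id: [grid, ...]}, targets: {task_id: grid}."""
    solved = sum(
        1
        for task_id, target in targets.items()
        if predictions.get(task_id) and predictions[task_id][0] == target
    )
    return solved / len(targets)
```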

## About “119 434” vs “119 432”
- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.

## Contents
- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary files are included for convenience.
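
The sha256 recorded in `MANIFEST.txt` can be checked against a local download with a short script; this is a minimal sketch, with the expected digest being the integrity hash quoted in the overview:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so a multi-GB checkpoint never has to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"
# Uncomment after downloading:
# assert sha256_of("model.ckpt") == EXPECTED, "checkpoint hash mismatch"
```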

## Kaggle Status
- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
- A third‑party Kaggle implementation demonstrating TRM inference is available on Kaggle; interested users can locate it by searching the platform. That implementation is independent of this release.

## Current Kaggle Limitations
- Unique‑candidate enforcement: A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
- Sampler configuration propagation: Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature did not reliably reach the evaluator under our overlays.
- Sampling degeneracy: Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
- Limited attempt‑level telemetry: Evaluators emitted aggregate metrics only; per‑attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
- Identifier mapping: The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.
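
One way to guard against the propagation failure described above is to read the sampling variables inside the inference process and fail fast when they are inconsistent. A sketch under assumptions: only `ARC_SAMPLING_COUNT` and `ARC_SAMPLING_MODE` appear in this card; the temperature variable name and all defaults are illustrative.

```python
import os

def read_sampling_config(env=os.environ):
    # Read sampler settings inside the inference process, not the launcher,
    # so misconfigured overlays surface immediately instead of silently
    # degenerating to duplicate candidates.
    count = int(env.get("ARC_SAMPLING_COUNT", "1"))
    mode = env.get("ARC_SAMPLING_MODE", "greedy")
    temperature = float(env.get("ARC_SAMPLING_TEMPERATURE", "1.0"))  # name assumed
    if count > 1 and mode != "sample":
        raise ValueError("ARC_SAMPLING_COUNT > 1 requires ARC_SAMPLING_MODE=sample")
    return count, mode, temperature
```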

## Attempted Mitigations
- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume‑guard logs; duplicates persisted.
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
- Adjusted temperature and top‑k settings; no material improvement under the broken overlay path.
- Prepared Kaggle datasets/notebooks and executed end‑to‑end; duplicate attempts persisted and no leaderboard submission was made.
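
For reference, the temperature and top‑k knobs mentioned above act on the logits as in this pure‑Python sketch (illustrative only; the real evaluator operates on tensors, and the function name is ours):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    # Keep only the top_k highest logits (ties at the cutoff are kept).
    if top_k is not None:
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= cutoff else float("-inf") for l in logits]
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Multinomial draw over the remaining candidates.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Low temperature sharpens the distribution toward the argmax; a broken overlay that forces greedy decoding has the same visible effect, which is why duplicate attempts alone cannot distinguish the two failure modes.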

## Evaluation Guidance
- Enforce a strict unique‑candidate guard prior to scoring.
- Validate that sampling environment variables propagate into the inference process.
- Capture per‑attempt outputs and, where possible, logits to diagnose diversity.
- Match the identifier mapping to the checkpoint (legacy vs sorted) and use the corresponding dataset builder.
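
The unique‑candidate guard recommended above can be sketched as a simple order‑preserving deduplication over attempt grids (a minimal sketch; the function name is ours):

```python
def unique_candidates(attempts):
    # Deduplicate attempts by grid content, preserving first-seen order,
    # so duplicate generations never consume scored attempt slots.
    seen = set()
    unique = []
    for grid in attempts:
        key = tuple(tuple(row) for row in grid)  # hashable form of the grid
        if key not in seen:
            seen.add(key)
            unique.append(grid)
    return unique
```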

## Reproduction
```python
from huggingface_hub import hf_hub_download
import torch
# ...
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```

```bash
# CoreWeave resume (reference)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```

## Resume Guard Signals
- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain: `[resume] initializing train_state.step to 115815` before training proceeds.
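
The guard implied by these signals can be sketched as follows; the function and error message are illustrative, and only the quoted log line comes from the pod logs:

```python
import os

def check_resume_step(checkpoint_path: str, expected_step: int) -> int:
    # Parse the step number out of a shard directory name like ".../step_115815"
    # and refuse to train if it disagrees with the expected resume step.
    leaf = os.path.basename(checkpoint_path.rstrip("/"))
    step = int(leaf.split("_")[-1])
    if step != expected_step:
        raise RuntimeError(f"resume mismatch: found step {step}, expected {expected_step}")
    print(f"[resume] initializing train_state.step to {step}")
    return step
```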

## Known Issues and Next Steps
- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re‑verify multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; sorted‑mapping evaluations require remapping or fine‑tuning.
- Observability: add candidate‑level telemetry and visualizations to support reliable evaluation claims.

## Ethics, License, and Intended Use
- MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity prior to reporting results.

## Acknowledgements
- Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community‑maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.