Fix model card: highlight real ARC task solve rate (2.92% pass@2)
README.md CHANGED

@@ -24,19 +24,19 @@ model-index:
       split: evaluation
     metrics:
     - type: accuracy
-      name:
-      value: 0.
-    - type: loss
-      name: LM Loss
-      value: 2.0186
+      name: ARC Task Solve Rate (pass@2)
+      value: 0.0292
     - type: accuracy
-      name:
-      value: 0.
+      name: ARC Task Solve Rate (pass@100)
+      value: 0.0819
+    - type: accuracy
+      name: pass@1
+      value: 0.0167
 ---
 
 # Tiny Recursive Models — ARC-AGI-2 (8×GPU)
 
-**Abstract.** This release packages the complete paper-faithful Tiny Recursive Models (TRM) checkpoint
+**Abstract.** This release packages the complete paper-faithful Tiny Recursive Models (TRM) checkpoint achieving a **2.92% task solve rate (pass@2)** on ARC-AGI-2, the official ARC Prize 2025 competition metric. The model was trained for the full 100,000 steps (the step counter displays 72,385 due to training restarts). With increased sampling, the model reaches 8.19% at pass@100. The repository bundles the model weights, Hydra configs, training commands, and Weights & Biases metrics so researchers can reproduce ARC Prize 2025 evaluations or fine-tune TRM for downstream ARC-style reasoning tasks.
 
 **Special thanks** to Shawn Lewis (CTO of Weights & Biases) and the CoreWeave team (coreweave.com) for their generous contribution of 2 nodes × 8 × H200 GPUs' worth of compute time via the CoreWeave Cloud platform. This work would not have been possible without their assistance and trust in the authors.
 
@@ -87,12 +87,24 @@ This release reproduces the ARC-AGI-2 configuration described in the TRM paper u
 - `ARC/pass@1000`: **13.75 %**
 
 ## Evaluation
-
-
-
-
--
--
+
+### ARC-AGI-2 Task Solve Rates
+**These are the real puzzle-solving performance metrics:**
+- **pass@1**: 1.67% (single attempt per task)
+- **pass@2**: **2.92%** (official ARC Prize 2025 competition metric)
+- **pass@10**: 5.83%
+- **pass@100**: 8.19%
+- **pass@1000**: 13.75%
+
+### Model-Level Metrics
+**These measure internal model behavior, not task success:**
+- Token-level accuracy: 62.83% (not indicative of puzzle solving)
+- LM Loss: 2.0186
+- Halt accuracy: 90.7% (ACT controller stopping mechanism)
+
+### Evaluation Details
+- Evaluator script: `TinyRecursiveModels/evaluators/arc.py` with the default two-attempt submission writer
+- Submission artifact: `/kaggle/working/trm_eval_outputs/evaluator_ARC_step_72385/submission.json`
 
 ## How to Use
 Install TinyRecursiveModels (commit above) and load the checkpoint via PyTorch: