arc-codet5-660m / README.md
metadata
license: mit
language:
  - en
base_model:
  - Salesforce/codet5-large
tags:
  - ARC-AGI
  - ARC
  - code
datasets:
  - WizardLMTeam/WizardLM_evol_instruct_V2_196k
  - Open-Orca/SlimOrca
  - camel-ai/math
  - skeskinen/TinyStories-GPT4
  - rajpurkar/squad_v2
  - garage-bAInd/Open-Platypus
  - Sharathhebbar24/arxiv-math-instruct-50k
  - AlgorithmicResearchGroup/arxiv-physics-instruct-tune-30k
  - TIGER-Lab/MathInstruct
  - neoneye/histogram-comparisons-small-v1
  - ise-uiuc/Magicoder-Evol-Instruct-110K
  - PrimeIntellect/INTELLECT-MATH-SFT-Data
  - PrimeIntellect/verifiable-math-problems
  - sethapun/arithmetic_2md_1to1000
  - EleutherAI/proof-pile-2
  - MMInstruction/M3IT
  - stingning/ultrachat
  - timdettmers/openassistant-guanaco
  - Dahoas/instruct-synthetic-prompt-responses
  - pankajmathur/WizardLM_Orca

This checkpoint is the primary CodeT5-based solver we used for the MindsAI @ Tufa Labs entry in the ARC Prize 2025 competition. It shares the same architecture as mindware/arc-codet5-660m-scr (a 16-layer decoder variant of Salesforce/codet5-large), but does not include the Span-Corruption Refinement (SCR) auxiliary training stage. Instead, it represents the best non-refinement checkpoint obtained during long-horizon pretraining on TPU-v4 systems.

  • No SCR stage: this model was trained purely with the original span-corruption + instruction fine-tuning curriculum, followed by ARC fine-tuning.
  • Decoder-only pruning: the original decoder depth (24) was reduced to 16 layers after experiments showed that encoder pruning harmed sample efficiency, while the performance lost to decoder pruning could be recovered through extended training.
  • Long-run TPU training: training spanned roughly two years on a V4-64 TPU, made possible by Google’s TPU Research Cloud program.

📚 ARC-Related Datasets & Frameworks

  • RE-ARC. Link: https://github.com/michaelhodel/re-arc Note: This is the repository from Michael Hodel, which procedurally generates examples for the 400 ARC training tasks. We also include RE-ARC eval and ARC 1.5 (also by Michael Hodel).
  • ConceptARC. Link: https://github.com/victorvikram/ConceptARC
  • 1D-ARC (likely "ID ARC"). Link: https://khalil-research.github.io/LLM4ARC/
  • ARC_gym
  • Sort-of-ARC
  • Andreas Koepf: Generated many tasks based upon the RE-ARC methodology using various foundation models. Additionally generated from a generator Andreas wrote based on the icecuber solution. It also includes extra tasks like predicting the solution graph.
  • Jack Cole: Wrote generators for 60-80 tasks. Many were inspired by ARC items. Others were large concept datasets (cellular automata, math equation derived boards).

The training mix also contains many ARC-related tasks that do not solve for the board directly (such as generating code or predicting various parameters or features of the task), as well as tasks unrelated to ARC.

ARC Data Formatting

  • ARC tasks ship as JSON where each task_id contains train pairs and test inputs; every grid is a rectangular list of lists with integers 0-9. Dimensions follow the original 1×1–30×30 spec, though the evaluator accepts up to 50×50.
  • Example task payload:
    {
      "task_id": {
        "train": [
          {"input": [[0,0],[1,1]], "output": [[1,1],[1,1]]}
        ],
        "test": [
          {"input": [[0,0,0],[0,1,0],[0,0,0]]}
        ]
      }
    }
    
  • Model prompts (prompt column during training/TTT/inference) are serialized text strings: solve: train input1 <train_input> output1 <prefix><train_output>. … test tinput1 <test_input> toutput1 . Each grid token <train_input> / <train_output> / <test_input> is produced by grid_to_string, so rows are concatenated digits separated by spaces. Multiple train examples increment the index (input2, output2, etc.).
  • Prompt example:
    solve: train input1 000 010 000 output1 11 3 3 10 111 101 111. input2 00 02 output2 5 2 2 20 22 20. test tinput1 0000 0300 0000 0000 toutput1 
    
  • Model targets (correct_answer column and expected decoder output before post-processing) follow output_prefix semantics: {total_chars} {height} {width} {symbols} {row_strings}. Here total_chars = height*width + (height - 1) and symbols is the deduplicated sequence of colors as they are first encountered when scanning the board row-major; that rule applies to every output grid we emit (training outputs inside the prompt and the predicted test toutput). Example target string for a 3×3 donut:
     11 3 3 10 111 101 111.
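
The serialization rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: grid_to_string and output_prefix are named in the text, but the bodies here are reconstructed from the description, and encode_output, decode_output, and build_prompt are hypothetical helper names. The sketch also assumes a single test input per prompt and single-character cell values (0-9).

```python
from typing import Dict, List

Grid = List[List[int]]


def grid_to_string(grid: Grid) -> str:
    """Concatenate the digits of each row and separate rows with spaces."""
    return " ".join("".join(str(cell) for cell in row) for row in grid)


def output_prefix(grid: Grid) -> str:
    """Return '{total_chars} {height} {width} {symbols}' for an output grid."""
    height, width = len(grid), len(grid[0])
    total_chars = height * width + (height - 1)
    # Colors in first-seen order under a row-major scan
    # (dict keys preserve insertion order).
    first_seen = dict.fromkeys(cell for row in grid for cell in row)
    symbols = "".join(str(color) for color in first_seen)
    return f"{total_chars} {height} {width} {symbols}"


def encode_output(grid: Grid) -> str:
    """Hypothetical helper: prefix + row strings + trailing period."""
    return f"{output_prefix(grid)} {grid_to_string(grid)}."


def decode_output(text: str) -> Grid:
    """Hypothetical helper: parse a target string back into a grid."""
    fields = text.strip().rstrip(".").split()
    height, width = int(fields[1]), int(fields[2])
    grid = [[int(ch) for ch in row] for row in fields[4:]]
    if len(grid) != height or any(len(row) != width for row in grid):
        raise ValueError("grid does not match the declared height/width")
    return grid


def build_prompt(task: Dict) -> str:
    """Hypothetical helper: serialize one task (first test input only)."""
    parts = ["solve: train"]
    for i, pair in enumerate(task["train"], start=1):
        parts.append(
            f"input{i} {grid_to_string(pair['input'])} "
            f"output{i} {encode_output(pair['output'])}"
        )
    test_input = grid_to_string(task["test"][0]["input"])
    parts.append(f"test tinput1 {test_input} toutput1 ")
    return " ".join(parts)


donut = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
print(encode_output(donut))  # 11 3 3 10 111 101 111.
```

Validating the decoded height and width against the prefix gives a cheap sanity check on model output before any further post-processing.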