recursive-sat-qwen2.5-1.5b

This is a paper model: the REC-3 release artifact from a paper-aligned replication of recursive SAT reasoning at 1.5B scale.

It is a supervised fine-tune of Qwen/Qwen2.5-1.5B-Instruct trained on recursive SAT traces derived from SATBench with explicit <call> / <return> structure. The goal is research replication and analysis, not general-purpose production use.

What This Model Is

  • Base model: Qwen/Qwen2.5-1.5B-Instruct
  • Release artifact: results/runs/REC-3/published_model
  • Training run: REC-3
  • Seed: 303
  • Config: configs/rec_seed303.yaml
  • Dataset source: LLM4Code/SATBench
  • Task: SAT / UNSAT classification via recursive trace supervision

Why REC-3

REC-1 and REC-3 tie on mean accuracy, but REC-3 is the cleaner release candidate on end-to-end behavior:

  • Mean accuracy: 45.33%
  • Easy: 39.0%
  • Medium: 54.0%
  • Hard: 43.0%
  • Parse failure rate: 7.0%
  • Valid trace rate: 99.0%

Compared with REC-1, REC-3 keeps the same mean accuracy while reducing parse failure (7.0% vs 8.33%), improving hard accuracy (43.0% vs 42.0%), and slightly improving valid trace rate (99.0% vs 98.33%).

Important Caveat

This is a paper model, not a claim of robust general recursive reasoning.

The underlying paper draft treats the result as a qualified replication:

  • recursive SFT improves end-to-end SATBench accuracy over raw direct prompting
  • the strongest gain is on medium-difficulty SAT instances
  • absolute performance remains far below the 3B source-paper result
  • recursion behavior is still shallow overall

Use this release as a research artifact tied to the experiment, metrics, and discussion in the paper repo.

Training Summary

  • Objective: recursive_sft
  • Train examples: 74,827
  • Validation examples: 619
  • Global step: 46,770
  • Best checkpoint: checkpoint-9354
  • Accelerator used for the main run: cuda

Evaluation Summary

Main held-out evaluation uses 100 examples each from SATBench easy, medium, and hard buckets.

Baseline vs released model:

  • Base direct prompt mean accuracy: 37.33%
  • REC-3 mean accuracy: 45.33%
  • Absolute gain: +8.0 points
  • Base parse failure rate: 28.67%
  • REC-3 parse failure rate: 7.0%

Prompt Format

The model was trained on recursive traces using:

  • <call> ... </call> for subproblem decomposition
  • <return> ... </return> for compact returned answers

It is best treated as a specialized research model for this protocolized SAT setting.

Files In This Release

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja
  • export_metadata.json

Intended Use

  • paper artifact release
  • replication reference
  • SAT recursive-trace evaluation
  • qualitative inspection of recursive protocol behavior

Out Of Scope

  • production reasoning system
  • general mathematical reasoning benchmark model
  • safety-critical use
  • claims beyond the SATBench replication setting
Downloads last month
7
Safetensors
Model size
2B params
Tensor type
BF16
·
Inference Providers NEW
Input a message to start chatting with groc/recursive-sat-qwen2.5-1.5b.

Model tree for groc/recursive-sat-qwen2.5-1.5b

Finetuned
(1594)
this model

Dataset used to train groc/recursive-sat-qwen2.5-1.5b