CoT Oracle GRPO Checkpoint: Step 500

This repo contains the step-500 GRPO checkpoint derived from the final no-DPO CoT Oracle model.

What This Checkpoint Is

  • Base model family: Qwen/Qwen3-8B
  • Starting checkpoint: ceselder/cot-oracle-qwen3-8b-final-sprint-checkpoint-no-DPO
  • Adapter format: PEFT LoRA
  • Injection layer: 1
  • Activation readout layers: [9, 18, 27]
  • Checkpoint step: 500
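
Given the adapter format listed above, loading the checkpoint can be sketched roughly as follows. This is a minimal sketch, not the authors' loading code: it assumes the LoRA weights apply directly on top of Qwen/Qwen3-8B, that the adapter repo id is ceselder/cot-oracle-grpo-step-500, and that the `transformers` and `peft` packages are installed. The helper name `load_oracle` is hypothetical. The activation injection at layer 1 and the readout from layers [9, 18, 27] are handled by the project's own code and are not reproduced here.

```python
# Identifiers taken from this model card; the pairing of base and adapter
# is an assumption, not something confirmed by the training code.
BASE_MODEL = "Qwen/Qwen3-8B"
ADAPTER_REPO = "ceselder/cot-oracle-grpo-step-500"

# Values copied from the list above, kept only for reference.
INJECTION_LAYER = 1
READOUT_LAYERS = [9, 18, 27]


def load_oracle(base_model=BASE_MODEL, adapter_repo=ADAPTER_REPO):
    """Hypothetical helper: load the base model and attach the LoRA adapter."""
    # Imported lazily so the constants above can be inspected without
    # pulling in the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_repo)
    model.eval()  # inference only; no gradient updates intended here
    return tokenizer, model
```

Calling `tokenizer, model = load_oracle()` downloads both repos. Plain text generation with the returned model does not exercise the activation-injection path, so it should not be expected to reproduce the oracle's behaviour on its own.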

Exact GRPO Recipe

From calibration_grpo/config.yaml:

  • Corpus: ceselder/cot-oracle-corpus-v5
  • Stride: 5
  • Question inclusion rate: 1.0
  • Rollouts per prompt: 8
  • Rollout temperature: 1.0
  • Max new tokens: 250
  • Rollout batch size: 16
  • Repetition penalty: 1.1
  • Judge model: google/gemini-3-flash-preview
  • Eval judge model: anthropic/claude-sonnet-4-6
  • Reward weights: passes_swap_test=1.0, specific_and_falsifiable=1.0, adds_insight=1.0, not_provably_wrong=3.0, follows_instructions=1.0
  • GRPO clip epsilon: 0.2
  • Learning rate: 3e-6
  • Warmup steps: 20
  • Batch size: 4
  • Gradient accumulation steps: 1
  • Max steps: 1000
  • Save every: 100
  • Temperature floor: 0.6
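
To make the reward weights, the 8 rollouts per prompt, and the clip epsilon above more concrete, here is a minimal sketch of how such values are typically combined in a GRPO-style update. It is an illustration under stated assumptions, not the code in calibration_grpo: it assumes the judge returns a score in [0, 1] per criterion, that the scalar reward is a weighted average, and that advantages are standardized within each prompt's rollout group. The function names are hypothetical.

```python
from statistics import mean, pstdev

# Weights and clip epsilon copied from the recipe above.
REWARD_WEIGHTS = {
    "passes_swap_test": 1.0,
    "specific_and_falsifiable": 1.0,
    "adds_insight": 1.0,
    "not_provably_wrong": 3.0,
    "follows_instructions": 1.0,
}
CLIP_EPS = 0.2


def scalar_reward(judge_scores):
    """Assumed aggregation: weighted average of per-criterion scores."""
    total_weight = sum(REWARD_WEIGHTS.values())
    weighted = sum(w * judge_scores[name] for name, w in REWARD_WEIGHTS.items())
    return weighted / total_weight


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=CLIP_EPS):
    """PPO-style clipped surrogate for one token (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Under these assumptions, a rollout judged perfect on everything except not_provably_wrong receives 4/7 of the maximum reward, which shows how heavily that criterion is weighted relative to the others.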

Notes

  • This repo is a standalone re-upload of the step_500/ subfolder from ceselder/cot-oracle-grpo-grpo-0320-1849.
  • It is included in the paper collection as the best GRPO checkpoint.