Bidirectional Outcome Reward Model (Dream-7B, GSM8K)

A LoRA adapter plus reward head trained on final (mask=0) states, used as an outcome reward model (ORM) to rerank Best-of-N samples from Dream-7B. Reported as ORM-Rerank@N in the paper's compute-matched Pareto analysis.
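Assuming a standard PEFT adapter layout on the Hub (the repo id below matches this card; the exact model class and any custom code requirements of Dream-7B are assumptions), loading would look roughly like:

```python
from transformers import AutoModel
from peft import PeftModel

# Frozen base model; Dream-7B ships custom modeling code, hence trust_remote_code.
base = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True
)

# Attach the LoRA adapter weights from this repo on top of the frozen base.
model = PeftModel.from_pretrained(base, "AnonyRepo/bidir-orm-dream7b-gsm8k")
```

This is a sketch of the generic PEFT loading pattern, not a verified recipe for this repo; the reward head may require additional custom code beyond `PeftModel`.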

Details

  • Base: Dream-org/Dream-v0-Instruct-7B (frozen)
  • Attention: bidirectional
  • Training data: final states only (mask=0)
  • No step embedding
  • Training: 8,407 steps (~6 epochs on 42,208 final-state samples), seed 42
  • Final-state accuracy at mask=0: 0.918 (mean over seeds 42 and 43)
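The reranking step itself is simple: score each of the N sampled completions with the reward head and keep the argmax. A minimal sketch, where `score` stands in for a forward pass through the frozen base plus reward head (all names here are illustrative, not this repo's actual API):

```python
def rerank_best_of_n(samples, score):
    """Return the candidate completion with the highest ORM score.

    samples: list of candidate completions for one GSM8K problem.
    score:   callable mapping a completion to a scalar reward
             (a stand-in for the base model + reward head forward pass).
    """
    # Score every candidate, then select the one with the maximal reward.
    scored = [(score(s), s) for s in samples]
    _, best_sample = max(scored, key=lambda t: t[0])
    return best_sample


# Toy usage with a dummy scorer (string length as the "reward"):
candidates = ["ans A", "answer BB", "the long answer CCC"]
print(rerank_best_of_n(candidates, score=len))  # → "the long answer CCC"
```

In the compute-matched setting, N forward passes of the sampler plus N scoring passes are charged against the same budget as longer single-sample decoding.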
Repository: AnonyRepo/bidir-orm-dream7b-gsm8k (adapter for Dream-org/Dream-v0-Instruct-7B)