Bidirectional Outcome Reward Model (Dream-7B, GSM8K)

A LoRA adapter plus reward head trained on final (mask=0) states, used as an outcome reward model (ORM) to rerank Best-of-N samples from Dream-7B. Reported as ORM-Rerank@N in the paper's compute-matched Pareto analysis.
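Assuming a standard PEFT adapter layout on the Hub (the repo id below matches this card; the exact model class and any custom code requirements of Dream-7B are assumptions), loading would look roughly like:

```python
from transformers import AutoModel
from peft import PeftModel

# Frozen base model; Dream-7B ships custom modeling code, hence trust_remote_code.
base = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True
)

# Attach the LoRA adapter weights from this repo on top of the frozen base.
model = PeftModel.from_pretrained(base, "AnonyRepo/bidir-orm-dream7b-gsm8k")
```

This is a sketch of the generic PEFT loading pattern, not a verified recipe for this repo; the reward head may require additional custom code beyond `PeftModel`.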

Details

  • Base: Dream-org/Dream-v0-Instruct-7B (frozen)
  • Attention: bidirectional
  • Training data: final states only (mask=0)
  • No step embedding
  • Training: 8,407 steps (~6 epochs on 42,208 final-state samples), seed 42
  • Final-state accuracy at mask=0: 0.918 (mean over seeds 42 and 43)
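The reranking step itself is simple: score each of the N sampled completions with the reward head and keep the argmax. A minimal sketch, where `score` stands in for a forward pass through the frozen base plus reward head (all names here are illustrative, not this repo's actual API):

```python
def rerank_best_of_n(samples, score):
    """Return the candidate completion with the highest ORM score.

    samples: list of candidate completions for one GSM8K problem.
    score:   callable mapping a completion to a scalar reward
             (a stand-in for the base model + reward head forward pass).
    """
    # Score every candidate, then select the one with the maximal reward.
    scored = [(score(s), s) for s in samples]
    _, best_sample = max(scored, key=lambda t: t[0])
    return best_sample


# Toy usage with a dummy scorer (string length as the "reward"):
candidates = ["ans A", "answer BB", "the long answer CCC"]
print(rerank_best_of_n(candidates, score=len))  # → "the long answer CCC"
```

In the compute-matched setting, N forward passes of the sampler plus N scoring passes are charged against the same budget as longer single-sample decoding.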
Repository: AnonyRepo/bidir-orm-dream7b-gsm8k (adapter for Dream-org/Dream-v0-Instruct-7B)