Bidirectional Outcome Reward Model (Dream-7B, GSM8K)
LoRA adapter plus a scalar reward head, trained on final (mask=0) states and used as an outcome reward model (ORM) to rerank Dream-7B Best-of-N samples. Reported as ORM-Rerank@N in the paper's compute-matched Pareto frontier.
Details
- Base: Dream-org/Dream-v0-Instruct-7B (frozen)
- Attention: bidirectional
- Training data: final states only (mask=0)
- No step embedding
- Training: 8,407 steps (~6 epochs on 42,208 final-state samples), seed 42
- Final-state accuracy at mask=0: 0.918 (mean over seeds 42 and 43)
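The reranking step itself is simple: sample N candidates from the base model, score each fully unmasked final state with the ORM, and keep the highest-scoring one. A minimal sketch, with a stand-in scoring function (in practice this would be the LoRA adapter + reward head applied to the mask=0 sequence; the function names here are illustrative, not from this repo):

```python
# Best-of-N reranking with an outcome reward model (ORM).
# `score` stands in for the real ORM: LoRA adapter + reward head
# evaluated on the final (mask=0) state of each candidate.
from typing import Callable, List


def orm_rerank(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest ORM score."""
    return max(candidates, key=score)


# Toy demonstration with a dummy length-based scorer.
candidates = ["The answer is 42.", "42", "Let x = 6*7, so the answer is 42."]
best = orm_rerank(candidates, score=len)
```

With the real model, `score` would run a forward pass over the completed sequence (bidirectional attention, no step embedding) and read off the scalar from the reward head.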