Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper: arXiv:2305.18290
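As a quick reference for the objective the paper above introduces, here is a minimal, hypothetical sketch of the per-example DPO loss in plain Python. The function name and argument names are illustrative (not the paper's reference implementation); it only shows the standard form `-log σ(β · ((log-prob margin of the policy) − (log-prob margin of the reference)))`.

```python
import math

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# When policy and reference agree exactly, the loss is -log(0.5) = log(2).
```

The loss shrinks as the policy assigns a larger log-probability margin to the chosen response (relative to the reference model) and grows otherwise; `beta` controls how sharply deviations from the reference are penalized.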
| Task | 0-shot | 5-shot |
|---|---|---|
| kobest_boolq | 0.950142 | 0.944444 |
| kobest_copa | 0.751 | 0.835 |
| kobest_hellaswag | 0.474 | 0.508 |
| kobest_sentineg | 0.811083 | 0.972292 |
| Average | 0.746556 | 0.814934 |
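The Average row is the plain arithmetic mean over the four KoBEST tasks; a quick sanity check (scores copied from the table above, matching it up to rounding):

```python
# Recompute the Average row of the KoBEST table.
# Task order: kobest_boolq, kobest_copa, kobest_hellaswag, kobest_sentineg.
scores_0shot = [0.950142, 0.751, 0.474, 0.811083]
scores_5shot = [0.944444, 0.835, 0.508, 0.972292]

avg_0shot = sum(scores_0shot) / len(scores_0shot)
avg_5shot = sum(scores_5shot) / len(scores_5shot)
print(round(avg_0shot, 6), round(avg_5shot, 6))  # 0.746556 0.814934
```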
| Average | Ko-ARC | Ko-HellaSwag | Ko-MMLU | Ko-TruthfulQA | Ko-CommonGen V2 |
|---|---|---|---|---|---|
| 57.97 | 57.51 | 67.01 | 56.3 | 54.86 | 54.19 |
Base model: upstage/SOLAR-10.7B-v1.0