Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper • 2305.18290
| kobest_boolq | kobest_copa | kobest_hellaswag | kobest_sentineg |
|---|---|---|---|
| 0.931613 | 0.740751 | 0.468602 | 0.488465 |