Update README.md
README.md
CHANGED
@@ -29,8 +29,9 @@ The model is trained on a mixture of the following datasets. We also provide the
 - [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
 - [Orca](argilla/distilabel-intel-orca-dpo-pairs)
 
-Difference between this mixture and
+Difference between this mixture and the original dataset
 
+- HH-RLHF: we only use the helpful subset and we delete the noisy samples where chosen_response == rejected_response;
 - SHP: we only use the samples with score ratio > 2, for each prompt, we take 5 comparison at most, leading to 109526;
 - Ultrafeedback: similar to UltraFeedback-Binarized, we use the fine-grained score instead of the overall one to rank samples. Meanwhile, for each prompt, we take all possible 6 pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 267416.
 - HelpSteer: we use the mean of helpfulness and correctness to rank samples. Meanwhile, we take all possible 6 pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 21576;
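
The filtering rules listed in this change can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing script: the function names, the dictionary keys (`prompt`, `chosen`, `rejected`, `score_ratio`), and the choice to keep the first five qualifying SHP comparisons per prompt are all assumptions made for the example. The "6 pairs" figure matches the number of unordered pairs among four responses, C(4,2) = 6.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(prompt, responses, scores):
    """Build every (chosen, rejected) pair from scored responses, skipping ties.

    With four responses this enumerates the 6 possible comparisons.
    """
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        if scores[i] == scores[j]:
            continue  # equal scores give no preference, so the pair is dropped
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append(
            {"prompt": prompt, "chosen": responses[hi], "rejected": responses[lo]}
        )
    return pairs


def helpsteer_score(helpfulness, correctness):
    """Rank HelpSteer responses by the mean of helpfulness and correctness."""
    return (helpfulness + correctness) / 2


def drop_identical(pairs):
    """HH-RLHF style cleanup: remove pairs whose chosen and rejected texts match."""
    return [p for p in pairs if p["chosen"] != p["rejected"]]


def filter_shp(rows, min_ratio=2.0, max_per_prompt=5):
    """Keep SHP rows with score ratio above min_ratio, at most max_per_prompt each.

    Assumption: rows are kept in input order; the README does not say which
    five comparisons are selected when a prompt has more.
    """
    kept = []
    seen = defaultdict(int)
    for row in rows:
        if row["score_ratio"] > min_ratio and seen[row["prompt"]] < max_per_prompt:
            kept.append(row)
            seen[row["prompt"]] += 1
    return kept
```

For UltraFeedback, `scores` would hold the fine-grained scores of each response; for HelpSteer, the values returned by `helpsteer_score`.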