This model is a **ReMax (Reinforcement Learning with Maximization)** fine-tune o
It was aligned using the **[HelpSteer2](https://huggingface.co/datasets/Jennny/helpsteer2-helpfulness-preference)** dataset to improve helpfulness and instruction-following capabilities. Unlike standard PPO, ReMax eliminates the need for a value model (critic) and uses a greedy-decoding baseline to reduce variance, making it highly efficient for alignment.

The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured by the HelpSteer2 dataset, while also enabling controlled model-diffing experiments as part of the AIPlans research workflow.

Developed by: AIPlans

LoRA was not used in this fine-tuning.

Funded by: AIPlans

Shared by: AIPlans
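To illustrate the idea behind ReMax's critic-free baseline, here is a minimal toy sketch (not the actual training code for this model): instead of a learned value model, the reward of the greedy rollout is subtracted from the reward of the sampled rollout before the REINFORCE update. The single-token "response", the `reward` function, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def remax_step(logits, reward, rng, lr=0.1):
    """One toy ReMax update over a tiny discrete action space.

    Assumption: a single-token response; real ReMax scores full
    sequences with a reward model.
    """
    probs = softmax(logits)
    sampled = rng.choice(len(logits), p=probs)    # stochastic rollout
    greedy = int(np.argmax(logits))               # greedy rollout -> baseline (no critic)
    advantage = reward(sampled) - reward(greedy)  # variance-reduced learning signal
    # REINFORCE gradient of log pi(sampled): one-hot(sampled) - probs
    grad_logp = -probs
    grad_logp[sampled] += 1.0
    return logits + lr * advantage * grad_logp

rng = np.random.default_rng(0)
logits = np.zeros(4)
reward = lambda a: float(a == 2)  # pretend action 2 is the "helpful" response
for _ in range(200):
    logits = remax_step(logits, reward, rng)
print(int(np.argmax(logits)))  # the policy concentrates on the rewarded action
```

Note how the greedy baseline zeroes out the update once the greedy action already achieves the reward, which is the variance-reduction trick that lets ReMax drop PPO's value model.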