This model is a **ReMax (Reinforcement Learning with Maximization)** fine-tune o
It was aligned using the **[HelpSteer2](https://huggingface.co/datasets/Jennny/helpsteer2-helpfulness-preference)** dataset to improve helpfulness and instruction-following capabilities. Unlike standard PPO, ReMax eliminates the need for a value model (critic) and uses a greedy-decoding baseline to reduce variance, making it highly efficient for alignment.

The goal of the fine-tuning was to improve helpfulness/harmlessness behavior as measured by the HelpSteer2 dataset, while also enabling controlled model-diffing experiments as part of the AIPlans research workflow.

Developed by: AIPlans

LoRA was not used in this fine-tuning.

Funded by: AIPlans

Shared by: AIPlans
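To illustrate the idea behind ReMax's critic-free baseline, here is a minimal toy sketch (not the actual training code for this model): instead of a learned value model, the reward of the greedy rollout is subtracted from the reward of the sampled rollout before the REINFORCE update. The single-token "response", the `reward` function, and all hyperparameters below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def remax_step(logits, reward, rng, lr=0.1):
    """One toy ReMax update over a tiny discrete action space.

    Assumption: a single-token response; real ReMax scores full
    sequences with a reward model.
    """
    probs = softmax(logits)
    sampled = rng.choice(len(logits), p=probs)    # stochastic rollout
    greedy = int(np.argmax(logits))               # greedy rollout -> baseline (no critic)
    advantage = reward(sampled) - reward(greedy)  # variance-reduced learning signal
    # REINFORCE gradient of log pi(sampled): one-hot(sampled) - probs
    grad_logp = -probs
    grad_logp[sampled] += 1.0
    return logits + lr * advantage * grad_logp

rng = np.random.default_rng(0)
logits = np.zeros(4)
reward = lambda a: float(a == 2)  # pretend action 2 is the "helpful" response
for _ in range(200):
    logits = remax_step(logits, reward, rng)
print(int(np.argmax(logits)))  # the policy concentrates on the rewarded action
```

Note how the greedy baseline zeroes out the update once the greedy action already achieves the reward, which is the variance-reduction trick that lets ReMax drop PPO's value model.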