Feng Luo committed
Commit: 6daa7d4 · Parent(s): 060c6fd
update readme
README.md
CHANGED
```diff
@@ -16,7 +16,7 @@ This is the official model repository for AutoL2S-Plus-7B, a model fine-tuned fo
   In this stage, long and short chains of thought (CoT) are paired and trained jointly, using a special `<EASY>` token to enable automatic switching between CoT modes. The resulting SFT model is released as [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).
 
 - **Stage 2: Off-Policy RL with Length-Aware Objective**
 
-  In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length.
+  In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length, yielding AutoL2S-Plus. This model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages by leveraging long- and short-form outputs from the SFT-based AutoL2S model, which serves as the reference policy.
 
   This repository contains:
```
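The `<EASY>` token mentioned in the diff enables automatic switching between CoT modes. As a rough illustration of that idea (the token name comes from the README, but the routing logic below is an assumption, not the repository's actual inference code), routing could look like:

```python
# Hypothetical sketch of <EASY>-token mode switching. The model is
# trained so that, for easy questions, it emits the special <EASY>
# token first; the decoding loop can then continue with a short chain
# of thought instead of a long one.

EASY_TOKEN = "<EASY>"

def route_cot(first_generated_token: str) -> str:
    """Pick a reasoning mode from the model's first emitted token."""
    if first_generated_token == EASY_TOKEN:
        return "short-cot"   # easy question: answer with a short CoT
    return "long-cot"        # otherwise: full long-form reasoning

print(route_cot("<EASY>"))  # short-cot
print(route_cot("First,"))  # long-cot
```

In this sketch the switch costs one token of lookahead and requires no separate difficulty classifier, which is the appeal of a control-token design.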
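The Stage 2 description combines a length-aware reward with a PPO-style clipped loss. A minimal numeric sketch of those two pieces follows; the reward shape and the baseline built from the reference policy's long/short outputs are illustrative assumptions, not the repository's actual training code.

```python
def length_aware_reward(correct: bool, length: int, ref_len: int) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a penalty
    proportional to how much longer the output is than a reference length.
    Incorrect answers get 0, so brevity never trades away correctness."""
    if not correct:
        return 0.0
    return 1.0 - max(0.0, (length - ref_len) / ref_len)

def advantage_vs_reference(sample_r: float, long_r: float, short_r: float) -> float:
    """Per-sample advantage against a baseline formed from the reference
    policy's long- and short-form outputs (an assumed baseline choice)."""
    return sample_r - 0.5 * (long_r + short_r)

def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A correct answer at the reference length keeps the full reward...
print(length_aware_reward(True, 200, 200))   # 1.0
# ...while one twice as long loses it entirely under this toy penalty.
print(length_aware_reward(True, 400, 200))   # 0.0
# Clipping caps how far a single sample can push the policy.
print(ppo_clipped_term(1.5, 1.0))            # 1.2
```

Because the reward is computed on completed samples rather than backpropagated through length directly, the clipped surrogate lets a non-differentiable length objective still shape the policy update.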