Feng Luo committed on
Commit 6daa7d4 · 1 Parent(s): 060c6fd

update readme

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -16,7 +16,7 @@ This is the official model repository for AutoL2S-Plus-7B, a model fine-tuned fo
  In this stage, long and short chains of thought (CoT) are paired and trained jointly, using a special `<EASY>` token to enable automatic switching between CoT modes. The resulting SFT model is released as [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).

  - **Stage 2: Off-Policy RL with Length-Aware Objective**
- In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length. The model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages by leveraging long- and short-form outputs from the SFT-based AutoL2S model, which serves as the reference policy.
+ In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length, yielding AutoL2S-Plus. The model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages by leveraging long- and short-form outputs from the SFT-based AutoL2S model, which serves as the reference policy.

  This repository contains:
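
For context on the Stage 1 description in this diff, here is a minimal sketch of how the `<EASY>`-token switching could be probed at inference time, assuming the standard Hugging Face `transformers` API. The prompt, decoding settings, and the string check are illustrative assumptions, not the repository's documented usage.

```python
# Minimal sketch: probing the <EASY> switch of the Stage 1 SFT model.
# The prompt and decoding settings below are illustrative assumptions,
# not the official recipe from this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amandaa/AutoL2S-7b"  # Stage 1 SFT checkpoint named in the diff
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is 17 + 25?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])

# If the model emits the special <EASY> token, it has routed the question
# to the short-CoT mode; otherwise it falls back to long-form reasoning.
print("short-CoT mode" if "<EASY>" in completion else "long-CoT mode")
print(completion)
```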
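The Stage 2 change in this commit describes a length-aware, PPO-style clipped objective with per-sample advantages computed from paired long- and short-form reference outputs. The toy sketch below illustrates that idea; the reward shaping, the mean-reward baseline, and all constants are assumptions for illustration, not the exact AutoL2S-Plus formulation.

```python
# Toy sketch of a length-aware, PPO-style clipped objective as described
# in Stage 2. All names, the reward shaping, and the advantage baseline
# are illustrative assumptions, not this repository's implementation.
import torch

def length_aware_reward(correct: torch.Tensor, length: torch.Tensor,
                        ref_length: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Reward correctness, plus a bonus for being shorter than a reference
    length (e.g., the SFT model's long-form output for the same prompt)."""
    length_bonus = alpha * (ref_length - length) / ref_length.clamp(min=1)
    return correct.float() + correct.float() * length_bonus  # no bonus if wrong

def ppo_clipped_loss(logp_new: torch.Tensor, logp_ref: torch.Tensor,
                     advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate; here the SFT reference model plays
    the role of the behavior policy that produced the off-policy samples."""
    ratio = torch.exp(logp_new - logp_ref)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Per-sample advantages: reward minus a baseline, taken here as the mean
# reward over the paired long/short outputs (an assumed baseline choice).
rewards = length_aware_reward(
    correct=torch.tensor([1, 1]),             # both answers correct
    length=torch.tensor([120.0, 480.0]),      # short vs. long CoT, in tokens
    ref_length=torch.tensor([480.0, 480.0]),  # long-form reference length
)
advantages = rewards - rewards.mean()
loss = ppo_clipped_loss(
    logp_new=torch.tensor([-45.0, -180.0]),   # sequence log-probs (made up)
    logp_ref=torch.tensor([-46.0, -178.0]),
    advantage=advantages,
)
print(float(loss))
```

The clipping keeps the update close to the SFT reference that generated the samples, which is consistent with the off-policy framing in the changed line.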