Feng Luo committed on
Commit 6daa7d4 · 1 Parent(s): 060c6fd

update readme

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -16,7 +16,7 @@ This is the official model repository for AutoL2S-Plus-7B, a model fine-tuned fo
  In this stage, long and short chains of thought (CoT) are paired and trained jointly, using a special `<EASY>` token to enable automatic switching between CoT modes. The resulting SFT model is released as [amandaa/AutoL2S-7b](https://huggingface.co/amandaa/AutoL2S-7b/tree/main).

  - **Stage 2: Off-Policy RL with Length-Aware Objective**
- In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length. The model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages by leveraging long- and short-form outputs from the SFT-based AutoL2S model, which serves as the reference policy.
+ In the second stage, we further refine reasoning efficiency through an RL objective that balances accuracy and length, yielding AutoL2S-Plus. The model is rewarded for generating shorter reasoning paths while maintaining correctness. Because the length objective is non-differentiable, we apply a PPO-style clipped loss and compute per-sample advantages by leveraging long- and short-form outputs from the SFT-based AutoL2S model, which serves as the reference policy.

  This repository contains:
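
For context on the Stage 1 description in this diff, here is a minimal sketch of how the `<EASY>`-token switching could be probed at inference time, assuming the standard Hugging Face `transformers` API. The prompt, decoding settings, and the string check are illustrative assumptions, not the repository's documented usage.

```python
# Minimal sketch: probing the <EASY> switch of the Stage 1 SFT model.
# The prompt and decoding settings below are illustrative assumptions,
# not the official recipe from this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amandaa/AutoL2S-7b"  # Stage 1 SFT checkpoint named in the diff
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is 17 + 25?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])

# If the model emits the special <EASY> token, it has routed the question
# to the short-CoT mode; otherwise it falls back to long-form reasoning.
print("short-CoT mode" if "<EASY>" in completion else "long-CoT mode")
print(completion)
```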
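The Stage 2 change in this commit describes a length-aware, PPO-style clipped objective with per-sample advantages computed from paired long- and short-form reference outputs. The toy sketch below illustrates that idea; the reward shaping, the mean-reward baseline, and all constants are assumptions for illustration, not the exact AutoL2S-Plus formulation.

```python
# Toy sketch of a length-aware, PPO-style clipped objective as described
# in Stage 2. All names, the reward shaping, and the advantage baseline
# are illustrative assumptions, not this repository's implementation.
import torch

def length_aware_reward(correct: torch.Tensor, length: torch.Tensor,
                        ref_length: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Reward correctness, plus a bonus for being shorter than a reference
    length (e.g., the SFT model's long-form output for the same prompt)."""
    length_bonus = alpha * (ref_length - length) / ref_length.clamp(min=1)
    return correct.float() + correct.float() * length_bonus  # no bonus if wrong

def ppo_clipped_loss(logp_new: torch.Tensor, logp_ref: torch.Tensor,
                     advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate; here the SFT reference model plays
    the role of the behavior policy that produced the off-policy samples."""
    ratio = torch.exp(logp_new - logp_ref)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Per-sample advantages: reward minus a baseline, taken here as the mean
# reward over the paired long/short outputs (an assumed baseline choice).
rewards = length_aware_reward(
    correct=torch.tensor([1, 1]),             # both answers correct
    length=torch.tensor([120.0, 480.0]),      # short vs. long CoT, in tokens
    ref_length=torch.tensor([480.0, 480.0]),  # long-form reference length
)
advantages = rewards - rewards.mean()
loss = ppo_clipped_loss(
    logp_new=torch.tensor([-45.0, -180.0]),   # sequence log-probs (made up)
    logp_ref=torch.tensor([-46.0, -178.0]),
    advantage=advantages,
)
print(float(loss))
```

The clipping keeps the update close to the SFT reference that generated the samples, which is consistent with the off-policy framing in the changed line.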