
Add project page link

#1 by nielsr (HF Staff) · opened

Files changed (1)

README.md CHANGED (+13 −9)
````diff
@@ -1,18 +1,16 @@
 ---
-license: apache-2.0
+base_model:
+- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
 datasets:
 - open-thoughts/OpenThoughts2-1M
 - Vinnnf/Hybrid-OpenThoughts2-1M-1.5B
-base_model:
-- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-pipeline_tag: text-generation
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
 ---
 
-
 # Thinkless: LLM Learns When to Think
 
-
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/646a1939c37ca1e12308fe81/SRxJKkSuC0y-oMB7SFeR6.png)
 
 <table>
@@ -47,6 +45,10 @@ library_name: transformers
   <td>📊 <strong>Data for RL</strong></td>
   <td><a href="https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset">agentica-org/DeepScaleR-Preview-Dataset</a></td>
 </tr>
+<tr>
+  <td> 🌐 <strong>Project Page</strong></td>
+  <td><a href="https://sites.google.com/view/eagle-llm">Thinkless Website</a></td>
+</tr>
 </tbody>
 </table>
 
@@ -55,7 +57,7 @@ library_name: transformers
 > [!NOTE]
 > ***Can LLMs learn when to think?***
 
-We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, \<short\> for concise responses and \<think\> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50\% - 90\%, significantly reducing the computational cost of Reasoning Language Models.
+We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, \<short\> for concise responses and \<think\> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly reducing the computational cost of Reasoning Language Models.
 
 
 ## Pipeline
@@ -76,7 +78,8 @@ model = AutoModelForCausalLM.from_pretrained(
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 instruction = "Please reason step by step, and put your final answer within \\boxed{}."
-prompt = f"{instruction}\nThe arithmetic mean of 7, 2, $x$ and 10 is 9. What is the value of $x$?"
+prompt = f"{instruction}
+The arithmetic mean of 7, 2, $x$ and 10 is 9. What is the value of $x$?"
 
 messages = [
     {"role": "user", "content": prompt}
@@ -108,7 +111,8 @@ num_tokens = len(generated_ids[0])
 response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 print(text+response)
-print(f"\nThink Mode: {think_mode}")
+print(f"
+Think Mode: {think_mode}")
 print(f"Number of tokens: {num_tokens}")
 ```
 
````
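
The model card's abstract describes DeGRPO as splitting the hybrid-reasoning objective into a control-token term (mode selection) and a response term (answer accuracy), each weighted independently. A minimal sketch of that decomposition for a single sampled rollout, assuming a scalar group-relative advantage; the names `degrpo_loss` and `alpha` are illustrative, not taken from the paper:

```python
# Toy sketch of the decoupled objective described in the model card.
# All names (degrpo_loss, alpha) are hypothetical, not from the paper.

def degrpo_loss(control_logprob, response_logprobs, advantage, alpha=1.0):
    """Combine the two DeGRPO components for one sampled rollout.

    control_logprob   -- log-prob of the chosen mode token (<short> or <think>)
    response_logprobs -- per-token log-probs of the generated answer
    advantage         -- group-relative advantage of this rollout
    alpha             -- assumed weight balancing mode selection vs. accuracy
    """
    # (1) control token loss: governs the selection of the reasoning mode
    control_loss = -advantage * control_logprob
    # (2) response loss: improves the accuracy of the generated answer
    response_loss = -advantage * sum(response_logprobs) / len(response_logprobs)
    # Keeping the two terms separate lets each be weighted on its own, which
    # the card credits with stabilizing training versus vanilla GRPO.
    return alpha * control_loss + response_loss

# A rollout with positive advantage reinforces both the chosen mode token
# and the answer tokens; the balance between them is explicit via alpha.
loss = degrpo_loss(control_logprob=-0.5, response_logprobs=[-1.0, -2.0], advantage=1.0)
print(loss)  # 2.0
```

This is only a shape-level illustration of the decoupling; the actual DeGRPO update operates over groups of rollouts with clipping and normalization details given in the paper.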