Jackrong committed · verified
Commit fad76af · 1 Parent(s): 8806a50

Update README.md

Files changed (1)
  1. README.md +54 -12
README.md CHANGED
@@ -1,21 +1,63 @@
  ---
- base_model: Jackrong/Llama-3.1-8B-Think-Zero-GRPO
  tags:
- - text-generation-inference
- - transformers
- - unsloth
- - llama
- license: apache-2.0
  language:
  - en
  ---

- # Uploaded finetuned model

- - **Developed by:** Jackrong
- - **License:** apache-2.0
- - **Finetuned from model :** Jackrong/Llama-3.1-8B-Think-Zero-GRPO

- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

  ---
+ license: llama3.1
+ base_model: unsloth/Llama-3.1-8B-Instruct
  tags:
+ - reasoning
+ - thinking
+ - grpo
+ - r1
+ - llama-cpp
+ - gguf
+ datasets:
+ - unsloth/OpenMathReasoning-mini
+ - open-r1/DAPO-Math-17k-Processed
+ - Jackrong/ShareGPT-gpt-oss-120B-reasoning
+ - Jackrong/Chinese-Qwen3-235B-Thinking-Distill
+ - Jackrong/MultiReason-ChatAlpaca
  language:
  - en
+ - zh
+ pipeline_tag: text-generation
  ---

+ # Jackrong/Llama3.1-8B-Thinking-R1

+ ## 1. Model Summary
+ **Jackrong/Llama3.1-8B-Thinking-R1** is a deep reasoning model built upon `Llama-3.1-8B-Instruct`. This model is designed to solve complex logic, mathematics, and programming problems through a structured "Think-and-Answer" paradigm.

+ The core feature of the model is its refined Chain-of-Thought (CoT) capability. Before providing a final answer, the model performs self-correction, logical decomposition, and multi-path exploration within `<think>` tags.

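+ The snippet below is a minimal inference sketch using Hugging Face Transformers; the repo id is taken from this card, while the prompt and generation settings are illustrative assumptions rather than an official example:
+
+ ```python
+ # Minimal inference sketch; assumes the repo id from this card and the
+ # standard Llama 3.1 chat template.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Jackrong/Llama3.1-8B-Thinking-R1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+ messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ output = model.generate(inputs, max_new_tokens=1024)
+ # The reasoning trace is expected inside <think>...</think>, followed by the answer.
+ print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
+ ```
+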
+ ## 2. Training Methodology
+ The model was trained with a three-stage pipeline designed to ensure stability and depth in reasoning:
+
+ ### Stage 1: Cold-start SFT (Supervised Fine-Tuning)
+ Initial fine-tuning is performed using high-quality mathematical reasoning data to help the model acquire the basic reasoning format. During this stage, the model learns how to use `<think>` tags for logical guidance and establishes its initial reasoning framework.
+
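+ For illustration, a cold-start SFT target could look like the following hypothetical sample in the "messages" format (not a record from the actual datasets):
+
+ ```python
+ # Hypothetical SFT sample showing the <think> format the model is taught.
+ sample = {
+     "messages": [
+         {"role": "user", "content": "Solve for x: 2x + 6 = 14"},
+         {
+             "role": "assistant",
+             "content": (
+                 "<think>\n"
+                 "Subtract 6 from both sides: 2x = 8.\n"
+                 "Divide by 2: x = 4. Check: 2*4 + 6 = 14. Correct.\n"
+                 "</think>\n"
+                 "x = 4"
+             ),
+         },
+     ]
+ }
+ ```
+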
+ ### Stage 2: GRPO Reinforcement Learning (Group Relative Policy Optimization)
+ The **GRPO** algorithm is employed to conduct large-scale reinforcement training, guided by **accuracy rewards** and **format rewards**. In this phase, the model not only learns how to reach the correct answer but also optimizes the efficiency of its thought process, reducing logical redundancy.
+
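+ The actual reward code is not published here; as a hedged sketch, format and accuracy rewards in the shape TRL's `GRPOTrainer` expects (string completions in, one float per completion out) could look like this:
+
+ ```python
+ import re
+
+ # Illustrative reward functions; the real training rewards are assumptions.
+ THINK_RE = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)
+
+ def format_reward(completions, **kwargs):
+     """1.0 if the completion reasons inside <think>...</think> before answering."""
+     return [1.0 if THINK_RE.match(c) else 0.0 for c in completions]
+
+ def accuracy_reward(completions, answer, **kwargs):
+     """2.0 if the reference answer appears in the text after </think>."""
+     rewards = []
+     for completion, ref in zip(completions, answer):
+         final = completion.split("</think>")[-1]
+         rewards.append(2.0 if ref.strip() in final else 0.0)
+     return rewards
+ ```
+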
+ ### Stage 3: Final CoT Distillation SFT
+ Building upon the reinforcement learning stage, the model undergoes final instruction fine-tuning using high-quality CoT data distilled from ultra-large-scale models (such as GPT-OSS-120B and Qwen3-235B). This stage significantly enhances the model's expressiveness in complex contexts and improves logical rigor.
+
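+ In TRL, this final pass could look roughly like the sketch below; the hyperparameters, split name, and single-dataset setup are placeholders, not the published recipe:
+
+ ```python
+ # Assumed shape of the Stage 3 SFT pass; not the actual training script.
+ from datasets import load_dataset
+ from trl import SFTConfig, SFTTrainer
+
+ dataset = load_dataset("Jackrong/ShareGPT-gpt-oss-120B-reasoning", split="train")
+
+ trainer = SFTTrainer(
+     model="unsloth/Llama-3.1-8B-Instruct",  # in practice: the Stage 2 GRPO checkpoint
+     train_dataset=dataset,
+     args=SFTConfig(output_dir="llama31-8b-thinking-r1-stage3"),
+ )
+ trainer.train()
+ ```
+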
+ ## 3. Training Features
+ - **Reinforcement Learning Framework**: Utilizes the **GRPO** algorithm, guiding the model to autonomously learn logical decomposition via format and accuracy rewards.
+ - **Cold-start SFT**: Uses datasets like `OpenMathReasoning` for warm-up, ensuring the model masters the fundamental thinking format.
+ - **Multi-stage Distillation**: Incorporates reasoning logic distilled from 120B+ scale models, significantly boosting Chinese logic and multi-turn dialogue reasoning performance.
+ - **Efficient Fine-Tuning**: Built on the **Unsloth** framework using LoRA (rank 64) to maintain reasoning capabilities while mitigating catastrophic forgetting; see the sketch after this list.
+ - **Long Context Support**: Supports a context length of up to **65,536** tokens, capable of handling complex, long-chain reasoning tasks.
+
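+ A configuration along these lines is expressible with Unsloth's public API; the rank and context length below come from this card, while the LoRA alpha and target modules are assumptions:
+
+ ```python
+ # Sketch of the described LoRA setup; alpha and target modules are assumed.
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Llama-3.1-8B-Instruct",
+     max_seq_length=65536,  # long-context figure stated above
+     load_in_4bit=True,
+ )
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=64,  # LoRA rank stated above
+     lora_alpha=64,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+ )
+ ```
+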
+ ## 4. Datasets
+ The model was trained across the three stages described above using a combination of the following datasets; a mixing sketch follows the list:
+
+ - **unsloth/OpenMathReasoning-mini**: Provides core mathematical reasoning logic.
+ - **open-r1/DAPO-Math-17k-Processed**: Used for alignment optimization during the RL phase.
+ - **Jackrong/ShareGPT-gpt-oss-120B-reasoning**: Introduces English reasoning paths distilled from ultra-large models.
+ - **Jackrong/Chinese-Qwen3-235B-Thinking-Distill**: Specifically enhances the depth of Chinese logical thinking.
+ - **Jackrong/MultiReason-ChatAlpaca**: Optimizes complex reasoning performance in multi-turn dialogue scenarios.
+ - **Natural-Reasoning**: Enhances logical deduction for commonsense queries.
+ - **Reasoning-Instruction**: Structured reasoning instruction pairs.
+
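+ One hypothetical way to combine these corpora with the `datasets` library (the normalization step, splits, and uniform mixing are assumptions; a real pipeline would use per-dataset converters and tuned weights):
+
+ ```python
+ # Hypothetical data-mixing sketch, not the actual training recipe.
+ from datasets import concatenate_datasets, load_dataset
+
+ sources = [
+     "unsloth/OpenMathReasoning-mini",
+     "Jackrong/ShareGPT-gpt-oss-120B-reasoning",
+     "Jackrong/Chinese-Qwen3-235B-Thinking-Distill",
+     "Jackrong/MultiReason-ChatAlpaca",
+ ]
+
+ def to_text(example):
+     # Crude normalization to a single shared "text" column, since the
+     # sources have different schemas.
+     return {"text": str(example)}
+
+ parts = []
+ for name in sources:
+     ds = load_dataset(name, split="train")
+     parts.append(ds.map(to_text, remove_columns=ds.column_names))
+
+ mixed = concatenate_datasets(parts).shuffle(seed=42)
+ print(mixed)
+ ```
+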
+ ## 5. References
+ - **Developed by**: Jackrong
+ - **Base Model**: Llama-3.1-8B-Instruct
+ - **Training Framework**: Unsloth / TRL / PyTorch