---
license: llama3.1
base_model: unsloth/Llama-3.1-8B-Instruct
tags:
- reasoning
- thinking
- grpo
- r1
- llama-cpp
- gguf
datasets:
- unsloth/OpenMathReasoning-mini
- open-r1/DAPO-Math-17k-Processed
- Jackrong/ShareGPT-gpt-oss-120B-reasoning
- Jackrong/Chinese-Qwen3-235B-Thinking-Distill
- Jackrong/MultiReason-ChatAlpaca
language:
- en
- zh
pipeline_tag: text-generation
---

# Llama3.1-8B-Thinking-R1

## 1. Model Summary

**Jackrong/Llama3.1-8B-Thinking-R1** is a deep reasoning model built upon `Llama-3.1-8B-Instruct`. It is designed to solve complex logic, mathematics, and programming problems through a structured "think-then-answer" paradigm.

The model's core feature is its refined Chain-of-Thought (CoT) capability: before providing a final answer, the model performs self-correction, logical decomposition, and multi-path exploration within `<think>` tags.
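For illustration, the reasoning block can be separated from the final answer with a short post-processing step. The `<think>` tag convention follows the description above; the helper name and the sample completion string are hypothetical:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a completion into its chain-of-thought and final answer.

    Assumes the model wraps its reasoning in a single <think>...</think>
    block, as described above; anything after the block is the answer.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()  # no reasoning block was emitted
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

# Hypothetical completion, for illustration only:
completion = "<think>2 + 2 = 4, then double it: 8.</think>The answer is 8."
thought, answer = split_reasoning(completion)
```

Keeping the split in post-processing lets an application display or hide the reasoning trace independently of the final answer.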

## 2. Training Methodology

This model uses a three-stage training pipeline designed for stability and depth of reasoning:

### Stage 1: Cold-start SFT (Supervised Fine-Tuning)

Initial fine-tuning is performed on high-quality mathematical reasoning data so the model acquires the basic reasoning format. In this stage the model learns to use `<think>` tags for logical guidance and establishes its initial reasoning framework.
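A minimal sketch of what one cold-start SFT target could look like, assuming the tag layout described above; the helper name and exact whitespace are illustrative assumptions, not the author's actual preprocessing:

```python
def build_sft_target(reasoning: str, answer: str) -> str:
    """Format one cold-start SFT target string: the reasoning trace
    wrapped in <think> tags, followed by the final answer.
    """
    return f"<think>\n{reasoning.strip()}\n</think>\n{answer.strip()}"

# Hypothetical training sample:
target = build_sft_target(
    reasoning="The sum of the first n odd numbers is n^2, so for n=5 it is 25.",
    answer="25",
)
```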

### Stage 2: Reinforcement Learning with GRPO (Group Relative Policy Optimization)

Large-scale reinforcement training is conducted with the **GRPO** algorithm, guided by **accuracy rewards** and **format rewards**. In this phase the model not only learns to reach the correct answer but also optimizes the efficiency of its thought process, reducing logical redundancy.
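A minimal sketch of what such reward functions could look like, assuming string-level checks against the `<think>` format described above; the function names, matching rules, and weights are illustrative assumptions, not the author's actual reward code:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion is one well-formed <think>...</think> block
    followed by a non-empty answer, else 0.0."""
    pattern = r"^<think>.+?</think>.+$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the reasoning block matches the reference
    answer (whitespace-insensitive), else 0.0."""
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    return 1.0 if answer.strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str,
                 w_format: float = 0.2, w_acc: float = 0.8) -> float:
    """Weighted sum of the two signals (weights are illustrative)."""
    return (w_format * format_reward(completion)
            + w_acc * accuracy_reward(completion, reference))
```

In GRPO, scalar rewards like these are computed per completion within a sampled group, and advantages are taken relative to the group mean, avoiding a separate value network.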

### Stage 3: Final CoT Distillation SFT

Building on the reinforcement learning stage, the model undergoes final instruction fine-tuning on high-quality CoT data distilled from ultra-large-scale models (such as GPT-OSS-120B and Qwen3-235B). This stage significantly improves expressiveness in complex contexts and tightens logical rigor.

## 3. Training Features

- **Reinforcement Learning Framework**: Uses the **GRPO** algorithm, guiding the model to learn logical decomposition autonomously via format and accuracy rewards.
- **Cold-start SFT**: Warms up on datasets such as `OpenMathReasoning`, ensuring the model masters the fundamental thinking format.
- **Multi-stage Distillation**: Incorporates reasoning traces distilled from 120B+-scale models, significantly boosting Chinese-language logic and multi-turn dialogue reasoning.
- **Efficient Fine-Tuning**: Built on the **Unsloth** framework with LoRA (rank 64), preserving reasoning capability while mitigating catastrophic forgetting.
- **Long Context Support**: Supports a context length of up to **65,536** tokens, enabling complex, long-chain reasoning tasks.
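The fine-tuning settings above can be summarized in a configuration sketch. Only the LoRA rank (64) and the 65,536-token context length come from this card; the remaining values are common Unsloth/LoRA defaults shown as assumptions, not the author's actual settings:

```python
# Fine-tuning hyperparameters as a plain dict. Only "r" (LoRA rank 64)
# and "max_seq_length" (65,536) are stated in this card; alpha, dropout,
# and target modules are typical defaults shown for illustration.
lora_config = {
    "r": 64,                   # LoRA rank, as stated above
    "lora_alpha": 64,          # assumption; often set equal to r
    "lora_dropout": 0.0,       # assumption
    "target_modules": [        # assumption: standard Llama projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "max_seq_length": 65_536,  # context length stated above
}
```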

## 4. Datasets

The model was trained across the three stages described above using a combination of the following datasets:

- **unsloth/OpenMathReasoning-mini**: Provides core mathematical reasoning logic.
- **open-r1/DAPO-Math-17k-Processed**: Used for alignment optimization during the RL phase.
- **Jackrong/ShareGPT-gpt-oss-120B-reasoning**: Introduces English reasoning-path distillation from ultra-large models.
- **Jackrong/Chinese-Qwen3-235B-Thinking-Distill**: Specifically deepens Chinese logical thinking.
- **Jackrong/MultiReason-ChatAlpaca**: Optimizes complex reasoning in multi-turn dialogue scenarios.
- **Natural-Reasoning**: Enhances logical deduction for commonsense queries.
- **Reasoning-Instruction**: Provides structured reasoning instruction pairs.

## 5. References

- **Developed by**: Jackrong
- **Base Model**: Llama-3.1-8B-Instruct
- **Training Framework**: Unsloth / TRL / PyTorch