DeepSeek-R1-8B-GRPO-Reasoning
This model is a fine-tuned version of unsloth/DeepSeek-R1-Distill-Llama-8B using Group Relative Policy Optimization (GRPO) for enhanced reasoning capabilities on multiple-choice questions.
Model Description
This model has been specifically trained to handle reasoning tasks with a structured format, particularly excelling at Chinese multiple-choice questions. The model generates responses with explicit reasoning steps followed by clear solutions.
Key Features:
- Structured Reasoning: Uses `<start_working_out>` and `<end_working_out>` tags for the reasoning process
- Clear Solutions: Provides answers in `<SOLUTION>answer</SOLUTION>` format
- High Accuracy: Achieved a 100% success rate on the test dataset (900 samples)
- Bilingual: Supports both Chinese and English
Model Details
- Developed by: [Your Name/Organization]
- Model type: Causal Language Model (Fine-tuned)
- Language(s): Chinese (primary), English
- License: Apache 2.0
- Finetuned from model: unsloth/DeepSeek-R1-Distill-Llama-8B
- Training Method: GRPO (Group Relative Policy Optimization)
- Fine-tuning Library: Unsloth + TRL
Model Architecture
- Base Model: DeepSeek-R1-Distill-Llama-8B (8 billion parameters)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 32
- LoRA Alpha: 64
- Trainable Parameters: 83,886,080 (1.05% of total parameters)
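As a quick sanity check, the trainable-parameter fraction quoted above can be reproduced from the counts (assuming a nominal 8 billion total parameters; the exact total differs slightly):

```python
# Trainable-parameter fraction for the LoRA adapter.
# 8_000_000_000 is a nominal total; the true count for the base model differs slightly.
trainable = 83_886_080
total = 8_000_000_000
fraction = trainable / total * 100
print(f"{fraction:.2f}%")  # ~1.05%
```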
Uses
Direct Use
This model is designed for structured reasoning tasks, particularly multiple-choice questions that require step-by-step analysis.
Input Format:

```
Question: [Your question here]
Options:
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
```

Output Format:

```
<start_working_out>
[Reasoning process]
<end_working_out>
<SOLUTION>A</SOLUTION>
```
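Downstream code usually needs to pull the answer letter out of this structured output. A minimal parsing helper might look like the following (the function name is illustrative, not part of the model's API; the tag names follow the format above):

```python
import re

def extract_answer(response: str):
    """Extract the reasoning text and the final answer letter from a model response."""
    reasoning = re.search(r"<start_working_out>(.*?)<end_working_out>", response, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", response)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )

demo = "<start_working_out>\nParis is the capital of France.\n<end_working_out>\n<SOLUTION>B</SOLUTION>"
print(extract_answer(demo))  # ('Paris is the capital of France.', 'B')
```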
Example Usage
```python
from unsloth import FastLanguageModel
import torch

# Load the fine-tuned model in 4-bit to reduce memory usage
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/deepseek-r1-8b-grpo-reasoning",
    max_seq_length=1024,
    dtype=None,
    load_in_4bit=True,
)

# Format your question
messages = [
    {"role": "system", "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION></SOLUTION>"},
    {"role": "user", "content": """Question: Which city is the capital of France?
Options:
A. London
B. Paris
C. Berlin
D. Madrid"""},
]

# Generate response
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        repetition_penalty=1.2,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Training Details
Training Data
- Dataset Size: 3,000 Chinese multiple-choice questions
- Data Format: Questions with 4 options (A, B, C, D) and correct answers
- Data Split: 80% training, 20% validation
- Domain: Various topics including history, social studies, general knowledge
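Each record is converted into the chat format shown in the usage example before training. A minimal sketch of that conversion follows; the field names (`question`, `options`, `answer`) are assumptions about the dataset schema, not the actual training code:

```python
# Sketch: format one multiple-choice record into chat messages for training.
# The record field names here are assumed, not the actual dataset schema.
SYSTEM_PROMPT = (
    "You are given a problem. Think about the problem and provide your working out. "
    "Place it between <start_working_out> and <end_working_out>. "
    "Then, provide your solution between <SOLUTION></SOLUTION>"
)

def format_example(record: dict) -> dict:
    options = "\n".join(f"{k}. {v}" for k, v in sorted(record["options"].items()))
    prompt = f"Question: {record['question']}\nOptions:\n{options}"
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        "answer": record["answer"],  # kept alongside for the correctness reward
    }

sample = {
    "question": "Which city is the capital of France?",
    "options": {"A": "London", "B": "Paris", "C": "Berlin", "D": "Madrid"},
    "answer": "B",
}
formatted = format_example(sample)
print(formatted["prompt"][1]["content"])
```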
Training Procedure
Pre-fine-tuning (SFT):
- Epochs: 2
- Learning Rate: 2e-4
- Batch Size: 1
- Optimizer: AdamW 8-bit
GRPO Training:
- Steps: 20
- Learning Rate: 5e-6
- Batch Size: 4 (adjusted from 1 due to num_generations=4)
- Gradient Accumulation: 1
- Number of Generations: 4
- Max Sequence Length: 512
- Optimizer: AdamW 8-bit
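The GRPO settings above map onto TRL's `GRPOConfig` roughly as sketched below. This is an illustration under recent TRL versions, not the exact training script; argument names may differ in older releases:

```python
# Sketch of the GRPO hyperparameters above expressed as a TRL GRPOConfig.
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # must be divisible by num_generations
    gradient_accumulation_steps=1,
    num_generations=4,
    max_completion_length=512,
    max_steps=20,
    optim="adamw_8bit",
)
# trainer = GRPOTrainer(model=model, args=config,
#                       reward_funcs=[...], train_dataset=train_dataset)
```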
Training Infrastructure
- Hardware: NVIDIA GeForce RTX 3090 (24GB VRAM)
- Training Framework: Unsloth + TRL
- Training Time: ~1.5 hours total
- Memory Usage: ~5.9GB CUDA memory
Reward Functions
The GRPO training used multiple reward functions:
- Format Matching (Exact): +3.0 for perfect format compliance
- Format Matching (Approximate): ±0.5 for partial format compliance
- Answer Correctness: +5.0 for correct answers, -2.5 for incorrect
- Debug Logging: For monitoring training progress
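The reward functions can be sketched roughly as follows. The reward values come from the list above; the function names, the per-tag scoring in the approximate check, and the parsing logic are illustrative assumptions, not the exact training code:

```python
import re

def exact_format_reward(response: str) -> float:
    """+3.0 when the response matches the full required structure exactly (assumed check)."""
    pattern = r"^<start_working_out>.*?<end_working_out>\s*<SOLUTION>[A-D]</SOLUTION>\s*$"
    return 3.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def approximate_format_reward(response: str) -> float:
    """+0.5 per required tag present, -0.5 per tag missing (assumed scoring)."""
    tags = ["<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>"]
    return sum(0.5 if t in response else -0.5 for t in tags)

def correctness_reward(response: str, answer: str) -> float:
    """+5.0 for the correct letter, -2.5 otherwise."""
    m = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", response)
    return 5.0 if m and m.group(1) == answer else -2.5

good = "<start_working_out>reasoning<end_working_out>\n<SOLUTION>B</SOLUTION>"
print(correctness_reward(good, "B"))  # 5.0
```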
Evaluation
Testing Results
- Test Dataset: 900 samples
- Success Rate: 100%
- Answer Distribution:
- A: 324 (36.0%)
- B: 330 (36.7%)
- C: 178 (19.8%)
- D: 68 (7.6%)
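The percentages above can be reproduced directly from the per-letter counts:

```python
# Reproduce the answer-distribution percentages from the raw counts.
counts = {"A": 324, "B": 330, "C": 178, "D": 68}
total = sum(counts.values())  # 900
for letter, n in counts.items():
    print(f"{letter}: {n} ({n / total:.1%})")
```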
Performance Metrics
- Format Compliance: High adherence to required reasoning structure
- Reasoning Quality: Consistent step-by-step analysis
- Answer Accuracy: Perfect performance on evaluation set
Limitations and Considerations
- Domain Specificity: Optimized for multiple-choice questions
- Language Bias: Primarily trained on Chinese content
- Format Dependency: Requires specific input/output format for optimal performance
- Limited Context: Max sequence length of 512 tokens
How to Get Started with the Model
```bash
# Install required packages
pip install unsloth transformers torch
```

Then load and use the model as shown in the example above.
Citation
If you use this model, please cite:
```bibtex
@misc{deepseek-r1-grpo-reasoning,
  title={DeepSeek-R1-8B-GRPO-Reasoning: A Fine-tuned Model for Structured Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/deepseek-r1-8b-grpo-reasoning}}
}
```
Acknowledgments
- Base Model: DeepSeek AI for the original DeepSeek-R1-Distill model
- Fine-tuning Framework: Unsloth team for the efficient training library
- Training Method: TRL library for GRPO implementation
Framework Versions
- PEFT: 0.15.1
- Transformers: 4.52.4
- Unsloth: 2025.6.1
- TRL: Latest version with GRPO support
- PyTorch: 2.6.0+cu124