DeepSeek-R1-8B-GRPO-Reasoning

This model is a fine-tuned version of unsloth/DeepSeek-R1-Distill-Llama-8B, trained with Group Relative Policy Optimization (GRPO) to improve reasoning on multiple-choice questions.

Model Description

This model has been specifically trained to handle reasoning tasks with a structured format, particularly excelling at Chinese multiple-choice questions. The model generates responses with explicit reasoning steps followed by clear solutions.

Key Features:

  • Structured Reasoning: Uses <start_working_out> and <end_working_out> tags for reasoning process
  • Clear Solutions: Provides answers in <SOLUTION>answer</SOLUTION> format
  • High Accuracy: 100% success rate on a 900-sample held-out test set
  • Bilingual: Supports both Chinese and English

Model Details

  • Developed by: [Your Name/Organization]
  • Model type: Causal Language Model (Fine-tuned)
  • Language(s): Chinese (primary), English
  • License: Apache 2.0
  • Finetuned from model: unsloth/DeepSeek-R1-Distill-Llama-8B
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Fine-tuning Library: Unsloth + TRL

Model Architecture

  • Base Model: DeepSeek-R1-Distill-Llama-8B (8 billion parameters)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank: 32
  • LoRA Alpha: 64
  • Trainable Parameters: 83,886,080 (1.05% of total parameters)
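The trainable-parameter count above can be reproduced from the LoRA rank alone, assuming adapters on all attention and MLP projections across the 32 decoder layers. The dimensions below follow the Llama-3.1-8B architecture, and the target-module list is an assumption, since the card does not state it:

```python
# Hedged sketch: reproduce the LoRA trainable-parameter count from rank r=32
# applied to every attention and MLP projection (assumed target modules).
r = 32
hidden = 4096   # model hidden size
inter = 14336   # MLP intermediate size
kv_dim = 1024   # 8 KV heads * 128 head_dim (grouped-query attention)
layers = 32

# Each LoRA adapter pair (A, B) adds r * (d_in + d_out) parameters per projection.
per_layer = (
    r * (hidden + hidden)    # q_proj
    + r * (hidden + kv_dim)  # k_proj
    + r * (hidden + kv_dim)  # v_proj
    + r * (hidden + hidden)  # o_proj
    + r * (hidden + inter)   # gate_proj
    + r * (hidden + inter)   # up_proj
    + r * (inter + hidden)   # down_proj
)
total = per_layer * layers
print(total)  # 83886080 -- matches the count reported above
```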

Uses

Direct Use

This model is designed for structured reasoning tasks, particularly multiple-choice questions that require step-by-step analysis.

Input Format:

Question: [Your question here]
Options:
A. [Option A]
B. [Option B] 
C. [Option C]
D. [Option D]

Output Format:

<start_working_out>
[Reasoning process]
<end_working_out>
<SOLUTION>A</SOLUTION>
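Responses in this format are straightforward to post-process. The helper below is a minimal sketch for extracting the two spans; the function name and regex are illustrative, not part of the model:

```python
import re

def parse_response(text):
    """Split a model response into its reasoning and final answer.

    Returns (reasoning, solution); either may be None if a tag is missing.
    """
    reasoning = re.search(r"<start_working_out>(.*?)<end_working_out>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1).strip() if solution else None,
    )

example = "<start_working_out>Paris is the capital of France.<end_working_out><SOLUTION>B</SOLUTION>"
print(parse_response(example))  # ('Paris is the capital of France.', 'B')
```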

Example Usage

from unsloth import FastLanguageModel
import torch

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/deepseek-r1-8b-grpo-reasoning",
    max_seq_length=1024,
    dtype=None,
    load_in_4bit=True,
)

# Format your question
messages = [
    {"role": "system", "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION></SOLUTION>"},
    {"role": "user", "content": """Question: Which is the capital of France?
Options:
A. London
B. Paris
C. Berlin
D. Madrid"""}
]

# Generate response
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        repetition_penalty=1.2,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Training Details

Training Data

  • Dataset Size: 3,000 Chinese multiple-choice questions
  • Data Format: Questions with 4 options (A, B, C, D) and correct answers
  • Data Split: 80% training, 20% validation
  • Domain: Various topics including history, social studies, general knowledge

Training Procedure

Pre-fine-tuning (SFT):

  • Epochs: 2
  • Learning Rate: 2e-4
  • Batch Size: 1
  • Optimizer: AdamW 8-bit

GRPO Training:

  • Steps: 20
  • Learning Rate: 5e-6
  • Batch Size: 4 (adjusted from 1 due to num_generations=4)
  • Gradient Accumulation: 1
  • Number of Generations: 4
  • Max Sequence Length: 512
  • Optimizer: AdamW 8-bit
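The GRPO hyperparameters above can be collected as they might be passed to TRL's GRPOConfig; field names follow TRL's API, but treat the exact mapping as an assumption and verify it against the TRL version you install:

```python
# Hedged sketch: the GRPO hyperparameters listed above as keyword arguments.
# Field names follow TRL's GRPOConfig; verify against your installed version.
grpo_args = dict(
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # must be divisible by num_generations
    gradient_accumulation_steps=1,
    num_generations=4,               # completions sampled per prompt
    max_steps=20,
    max_completion_length=512,
    optim="adamw_8bit",
)

# GRPO compares generations within a group, so the batch size must be a
# multiple of num_generations -- hence the adjustment from 1 to 4.
assert grpo_args["per_device_train_batch_size"] % grpo_args["num_generations"] == 0
```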

Training Infrastructure

  • Hardware: NVIDIA GeForce RTX 3090 (24GB VRAM)
  • Training Framework: Unsloth + TRL
  • Training Time: ~1.5 hours total
  • Memory Usage: ~5.9GB CUDA memory

Reward Functions

The GRPO training used multiple reward functions:

  1. Format Matching (Exact): +3.0 for perfect format compliance
  2. Format Matching (Approximate): ±0.5 for partial format compliance
  3. Answer Correctness: +5.0 for correct answers, -2.5 for incorrect
  4. Debug Logging: For monitoring training progress
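The card does not include the reward code itself. The sketch below shows the general shape of the answer-correctness reward under TRL's interface of a callable that returns one score per completion, using the +5.0/-2.5 values listed above; the function name and details are assumptions:

```python
import re

def correctness_reward(completions, answers):
    """Score each completion: +5.0 if the <SOLUTION> letter matches the
    reference answer, -2.5 otherwise (values from the list above)."""
    scores = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", completion)
        scores.append(5.0 if match and match.group(1) == answer else -2.5)
    return scores

print(correctness_reward(
    ["<start_working_out>...<end_working_out><SOLUTION>B</SOLUTION>", "no tags"],
    ["B", "A"],
))  # [5.0, -2.5]
```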

Evaluation

Testing Results

  • Test Dataset: 900 samples
  • Success Rate: 100%
  • Answer Distribution:
    • A: 324 (36.0%)
    • B: 330 (36.7%)
    • C: 178 (19.8%)
    • D: 68 (7.6%)

Performance Metrics

  • Format Compliance: High adherence to required reasoning structure
  • Reasoning Quality: Consistent step-by-step analysis
  • Answer Accuracy: Perfect performance on evaluation set

Limitations and Considerations

  • Domain Specificity: Optimized for multiple-choice questions
  • Language Bias: Primarily trained on Chinese content
  • Format Dependency: Requires specific input/output format for optimal performance
  • Limited Context: GRPO training used a max sequence length of 512 tokens

How to Get Started with the Model

# Install required packages
pip install unsloth transformers torch

# Load and use the model (see example above)

Citation

If you use this model, please cite:

@misc{deepseek-r1-grpo-reasoning,
  title={DeepSeek-R1-8B-GRPO-Reasoning: A Fine-tuned Model for Structured Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/deepseek-r1-8b-grpo-reasoning}}
}

Acknowledgments

  • Base Model: DeepSeek AI for the original DeepSeek-R1-Distill model
  • Fine-tuning Framework: Unsloth team for the efficient training library
  • Training Method: TRL library for GRPO implementation

Framework Versions

  • PEFT: 0.15.1
  • Transformers: 4.52.4
  • Unsloth: 2025.6.1
  • TRL: Latest version with GRPO support
  • PyTorch: 2.6.0+cu124