DeepSeek-R1-8B-GRPO-Reasoning

This model is a fine-tuned version of unsloth/DeepSeek-R1-Distill-Llama-8B, trained with Group Relative Policy Optimization (GRPO) to improve reasoning on multiple-choice questions.

Model Description

This model has been specifically trained to handle reasoning tasks with a structured format, particularly excelling at Chinese multiple-choice questions. The model generates responses with explicit reasoning steps followed by clear solutions.

Key Features:

  • Structured Reasoning: Uses <start_working_out> and <end_working_out> tags for reasoning process
  • Clear Solutions: Provides answers in <SOLUTION>answer</SOLUTION> format
  • High Accuracy: 100% success rate on a 900-sample held-out test set
  • Bilingual: Supports both Chinese and English

Model Details

  • Developed by: [Your Name/Organization]
  • Model type: Causal Language Model (Fine-tuned)
  • Language(s): Chinese (primary), English
  • License: Apache 2.0
  • Finetuned from model: unsloth/DeepSeek-R1-Distill-Llama-8B
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Fine-tuning Library: Unsloth + TRL

Model Architecture

  • Base Model: DeepSeek-R1-Distill-Llama-8B (8 billion parameters)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank: 32
  • LoRA Alpha: 64
  • Trainable Parameters: 83,886,080 (1.05% of total parameters)
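The trainable-parameter count above can be reproduced from the LoRA rank alone, assuming adapters on all attention and MLP projections across the 32 decoder layers. The dimensions below follow the Llama-3.1-8B architecture, and the target-module list is an assumption, since the card does not state it:

```python
# Hedged sketch: reproduce the LoRA trainable-parameter count from rank r=32
# applied to every attention and MLP projection (assumed target modules).
r = 32
hidden = 4096   # model hidden size
inter = 14336   # MLP intermediate size
kv_dim = 1024   # 8 KV heads * 128 head_dim (grouped-query attention)
layers = 32

# Each LoRA adapter pair (A, B) adds r * (d_in + d_out) parameters per projection.
per_layer = (
    r * (hidden + hidden)    # q_proj
    + r * (hidden + kv_dim)  # k_proj
    + r * (hidden + kv_dim)  # v_proj
    + r * (hidden + hidden)  # o_proj
    + r * (hidden + inter)   # gate_proj
    + r * (hidden + inter)   # up_proj
    + r * (inter + hidden)   # down_proj
)
total = per_layer * layers
print(total)  # 83886080 -- matches the count reported above
```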

Uses

Direct Use

This model is designed for structured reasoning tasks, particularly multiple-choice questions that require step-by-step analysis.

Input Format:

Question: [Your question here]
Options:
A. [Option A]
B. [Option B] 
C. [Option C]
D. [Option D]

Output Format:

<start_working_out>
[Reasoning process]
<end_working_out>
<SOLUTION>A</SOLUTION>
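Responses in this format are straightforward to post-process. The helper below is a minimal sketch for extracting the two spans; the function name and regex are illustrative, not part of the model:

```python
import re

def parse_response(text):
    """Split a model response into its reasoning and final answer.

    Returns (reasoning, solution); either may be None if a tag is missing.
    """
    reasoning = re.search(r"<start_working_out>(.*?)<end_working_out>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1).strip() if solution else None,
    )

example = "<start_working_out>Paris is the capital of France.<end_working_out><SOLUTION>B</SOLUTION>"
print(parse_response(example))  # ('Paris is the capital of France.', 'B')
```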

Example Usage

from unsloth import FastLanguageModel
import torch

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/deepseek-r1-8b-grpo-reasoning",
    max_seq_length=1024,
    dtype=None,
    load_in_4bit=True,
)

# Format your question
messages = [
    {"role": "system", "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION></SOLUTION>"},
    {"role": "user", "content": """Question: Which is the capital of France?
Options:
A. London
B. Paris
C. Berlin
D. Madrid"""}
]

# Generate response
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        repetition_penalty=1.2,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Training Details

Training Data

  • Dataset Size: 3,000 Chinese multiple-choice questions
  • Data Format: Questions with 4 options (A, B, C, D) and correct answers
  • Data Split: 80% training, 20% validation
  • Domain: Various topics including history, social studies, general knowledge

Training Procedure

Pre-fine-tuning (SFT):

  • Epochs: 2
  • Learning Rate: 2e-4
  • Batch Size: 1
  • Optimizer: AdamW 8-bit

GRPO Training:

  • Steps: 20
  • Learning Rate: 5e-6
  • Batch Size: 4 (adjusted from 1 due to num_generations=4)
  • Gradient Accumulation: 1
  • Number of Generations: 4
  • Max Sequence Length: 512
  • Optimizer: AdamW 8-bit
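The GRPO hyperparameters above can be collected as they might be passed to TRL's GRPOConfig; field names follow TRL's API, but treat the exact mapping as an assumption and verify it against the TRL version you install:

```python
# Hedged sketch: the GRPO hyperparameters listed above as keyword arguments.
# Field names follow TRL's GRPOConfig; verify against your installed version.
grpo_args = dict(
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # must be divisible by num_generations
    gradient_accumulation_steps=1,
    num_generations=4,               # completions sampled per prompt
    max_steps=20,
    max_completion_length=512,
    optim="adamw_8bit",
)

# GRPO compares generations within a group, so the batch size must be a
# multiple of num_generations -- hence the adjustment from 1 to 4.
assert grpo_args["per_device_train_batch_size"] % grpo_args["num_generations"] == 0
```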

Training Infrastructure

  • Hardware: NVIDIA GeForce RTX 3090 (24GB VRAM)
  • Training Framework: Unsloth + TRL
  • Training Time: ~1.5 hours total
  • Memory Usage: ~5.9GB CUDA memory

Reward Functions

The GRPO training used multiple reward functions:

  1. Format Matching (Exact): +3.0 for perfect format compliance
  2. Format Matching (Approximate): ±0.5 for partial format compliance
  3. Answer Correctness: +5.0 for correct answers, -2.5 for incorrect
  4. Debug Logging: For monitoring training progress
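The card does not include the reward code itself. The sketch below shows the general shape of the answer-correctness reward under TRL's interface of a callable that returns one score per completion, using the +5.0/-2.5 values listed above; the function name and details are assumptions:

```python
import re

def correctness_reward(completions, answers):
    """Score each completion: +5.0 if the <SOLUTION> letter matches the
    reference answer, -2.5 otherwise (values from the list above)."""
    scores = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", completion)
        scores.append(5.0 if match and match.group(1) == answer else -2.5)
    return scores

print(correctness_reward(
    ["<start_working_out>...<end_working_out><SOLUTION>B</SOLUTION>", "no tags"],
    ["B", "A"],
))  # [5.0, -2.5]
```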

Evaluation

Testing Results

  • Test Dataset: 900 samples
  • Success Rate: 100%
  • Answer Distribution:
    • A: 324 (36.0%)
    • B: 330 (36.7%)
    • C: 178 (19.8%)
    • D: 68 (7.6%)

Performance Metrics

  • Format Compliance: High adherence to required reasoning structure
  • Reasoning Quality: Consistent step-by-step analysis
  • Answer Accuracy: Perfect performance on evaluation set

Limitations and Considerations

  • Domain Specificity: Optimized for multiple-choice questions
  • Language Bias: Primarily trained on Chinese content
  • Format Dependency: Requires specific input/output format for optimal performance
  • Limited Context: GRPO training used a max sequence length of 512 tokens

How to Get Started with the Model

# Install required packages
pip install unsloth transformers torch

# Load and use the model (see example above)

Citation

If you use this model, please cite:

@misc{deepseek-r1-grpo-reasoning,
  title={DeepSeek-R1-8B-GRPO-Reasoning: A Fine-tuned Model for Structured Reasoning},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/deepseek-r1-8b-grpo-reasoning}}
}

Acknowledgments

  • Base Model: DeepSeek AI for the original DeepSeek-R1-Distill model
  • Fine-tuning Framework: Unsloth team for the efficient training library
  • Training Method: TRL library for GRPO implementation

Framework Versions

  • PEFT: 0.15.1
  • Transformers: 4.52.4
  • Unsloth: 2025.6.1
  • TRL: Latest version with GRPO support
  • PyTorch: 2.6.0+cu124