
Model Card: Qwen2.5-3B Code Reasoning Fine-tuned

Model Details

Model Description

This model is a fine-tuned version of Qwen/Qwen2.5-3B, optimized for competitive programming and code generation tasks with step-by-step reasoning. It was trained in two stages: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

  • Developed by: o5-mini team, Politecnico di Milano
  • Model type: Causal Language Model
  • Language(s): English (primary), Python code
  • Finetuned from model: Qwen/Qwen2.5-3B
  • Model size: 3B parameters + LoRA adapters (rank 32)

Model Sources

  • Base Model: Qwen/Qwen2.5-3B
  • Training Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Framework: Unsloth + TRL (Transformer Reinforcement Learning)

Uses

Direct Use

This model is designed for:

  • Competitive programming problem solving
  • Code generation with step-by-step reasoning
  • Algorithm implementation and explanation
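
A minimal inference sketch with transformers is shown below. The repo id is the one on this page; the prompt and generation settings are illustrative, and the chat template is assumed to be the Qwen2.5 default.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MarioCap/Qwen2.5-3B-OCR-100S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative competitive-programming prompt
problem = "Given an array of integers, output the length of its longest strictly increasing subsequence."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": problem}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```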

Training Details

Training Data

  • Primary Dataset: nvidia/OpenCodeReasoning (split_0)
  • Training Samples:
    • SFT: 80 samples
    • GRPO: 100 samples
  • Data Filtering: Samples were filtered by reasoning-token length; a sketch of this step follows below.
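
The card does not state the exact threshold or column names, so the sketch below uses a hypothetical `MAX_REASONING_TOKENS` and an assumed `output` column:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_REASONING_TOKENS = 4096  # hypothetical threshold; the card does not state one

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
ds = load_dataset("nvidia/OpenCodeReasoning", "split_0", split="split_0")

# "output" (the reasoning + solution text) is an assumed column name
ds = ds.filter(lambda ex: len(tokenizer(ex["output"])["input_ids"]) <= MAX_REASONING_TOKENS)
```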

Training Procedure

Stage 1: Supervised Fine-Tuning (SFT)

  • Training objective: Next token prediction on formatted reasoning + code pairs
  • Batch size: 1 (with gradient accumulation steps: 2)
  • Learning rate: 2e-4
  • Epochs: 2
  • Optimizer: AdamW 8-bit
  • Weight decay: 0.01
  • Warmup steps: 5
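
A minimal sketch of this stage with TRL, using the hyperparameters listed above; the Unsloth model setup and dataset formatting are omitted, and `model`/`train_ds` are assumed to be defined elsewhere:

```python
from trl import SFTConfig, SFTTrainer

sft_args = SFTConfig(
    output_dir="sft-checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=2,
    optim="adamw_8bit",   # 8-bit AdamW via bitsandbytes
    weight_decay=0.01,
    warmup_steps=5,
)

trainer = SFTTrainer(
    model=model,            # the LoRA-wrapped Qwen2.5-3B (set up elsewhere)
    train_dataset=train_ds, # formatted reasoning + code pairs
    args=sft_args,
)
trainer.train()
```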

Stage 2: Group Relative Policy Optimization (GRPO)

  • Training objective: Policy optimization using multiple reward functions
  • Reward functions:
    • Format matching (exact and approximate)
    • Solution correctness evaluation (using Gemini-2.0-flash as reward model)
  • Learning rate: 5e-5
  • Max steps: 100
  • Temperature: 0.6
  • Generations per step: 4
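
A sketch of this stage with TRL's GRPOTrainer. The format reward below is an illustrative stand-in for the card's format-matching rewards; the Gemini-2.0-flash correctness reward is omitted:

```python
import re

from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion matches the <think>/fenced-code format."""
    pattern = re.compile(r"<think>.*?</think>\s*```python.*?```", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

grpo_args = GRPOConfig(
    output_dir="grpo-checkpoints",
    learning_rate=5e-5,
    max_steps=100,
    temperature=0.6,
    num_generations=4,  # generations per step
)

trainer = GRPOTrainer(
    model=model,                  # the SFT checkpoint from stage 1
    reward_funcs=[format_reward], # correctness reward (Gemini-2.0-flash) omitted
    args=grpo_args,
    train_dataset=train_ds,       # dataset with a "prompt" column
)
trainer.train()
```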

Technical Specifications

  • Maximum sequence length: 32768 tokens
  • LoRA configuration:
    • Rank: 32
    • Alpha: 64 (2 × rank)
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Precision: 16-bit training
  • Hardware: NVIDIA A100 40 GB (the model also fits comfortably on a T4)
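
The LoRA configuration above, expressed with PEFT for reference; the training itself used Unsloth's wrapper, but the parameters map one-to-one:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,  # 2 × rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```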

Evaluation

Testing Data, Factors & Metrics

LiveCodeBench Evaluation

The model was evaluated on LiveCodeBench problem set v1, focusing on code generation tasks.

Performance Comparison:

| Model            | Pass@1 | Pass@5 | Easy Pass@1 | Medium Pass@1 | Hard Pass@1 |
|------------------|--------|--------|-------------|---------------|-------------|
| Fine-tuned model | 18.85% | 28.00% | 42.39%      | 9.05%         | 0.00%       |
| Base Qwen2.5-3B  | 15.85% | 21.75% | 31.27%      | 11.31%        | 0.00%       |
| Difference (pp)  | +3.00  | +6.25  | +11.12      | −2.26         | ±0.00       |

Differences are in percentage points; the fine-tuned model improves on overall and Easy metrics but regresses slightly on Medium problems.
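
For reference, pass@k scores like those above are typically computed with the unbiased estimator from Chen et al. (2021), given n samples per problem of which c pass; a minimal implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=2, k=1))  # 0.2
print(pass_at_k(n=10, c=2, k=5))  # ~0.778
```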

Model Architecture & Reasoning Format

The model generates responses in a structured format:

````
<think>
[Step-by-step reasoning and problem analysis]
</think>
```python
[Python code solution]
```
````

This format encourages the model to:

  1. Think through the problem systematically
  2. Provide clear reasoning steps
  3. Generate clean, executable code solutions
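
A small helper, assuming the output format above, to split a response into its reasoning and code parts:

```python
import re

def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, code) per the format above."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    code = re.search(r"```python\s*(.*?)```", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        code.group(1).strip() if code else "",
    )
```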

Technical Limitations and Biases

Biases

  • Dataset Bias: Inherits biases from the nvidia/OpenCodeReasoning dataset
  • Problem Type Bias: Optimized for competitive programming style problems
  • Language Bias: Strongly biased toward Python implementations

Additional Information

Not Recommended For

  • Production code generation without review
  • Complex software architecture decisions
  • Security-critical code implementation
  • Problems requiring extensive domain knowledge beyond basic algorithms

Model Access

  • Inference: Compatible with vLLM for fast inference
  • Format: LoRA adapters can be merged with base model or used separately
  • Hardware Requirements: Supports both CPU and GPU inference
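
A sketch of the two access paths above: merging the LoRA adapters into the base model with PEFT, then serving the merged checkpoint with vLLM. The adapter path and output directory are illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Merge the LoRA adapters into the base model (adapter repo/path is illustrative)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")
merged = PeftModel.from_pretrained(base, "MarioCap/Qwen2.5-3B-OCR-100S").merge_and_unload()
merged.save_pretrained("qwen2.5-3b-code-reasoning-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B").save_pretrained("qwen2.5-3b-code-reasoning-merged")

# Serve the merged checkpoint with vLLM for fast inference
from vllm import LLM, SamplingParams

llm = LLM(model="qwen2.5-3b-code-reasoning-merged")
params = SamplingParams(temperature=0.6, max_tokens=1024)
print(llm.generate(["<problem statement here>"], params)[0].outputs[0].text)
```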

Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen25-3b-code-reasoning,
  title={Qwen2.5-3B Fine-tuned for Code Reasoning},
  author={o5-mini team, Politecnico di Milano},
  year={2025},
}
```