# Qwen3-4B Code Fine-Tuned

Qwen3-4B fine-tuned on 10K verified reasoning traces from rStar-Coder (1 epoch of SFT). Optimized for algorithmic and competitive-programming tasks.
## 📊 Performance (EvalPlus Framework)
| Benchmark | Base tests | Plus tests | vs base model |
|---|---|---|---|
| HumanEval | 68.9% | 64.0% | +6.9% ✅ |
| MBPP | 58.2% | 50.8% | -8.8% ⚠️ |
_Evaluated using EvalPlus with greedy decoding._
### Performance Trade-off
- ✅ Improved on complex algorithmic tasks (HumanEval: 62% → 68.9%)
- ⚠️ Regression on simple practical tasks (MBPP: 67% → 58.2%)
**Why?** The model was trained on competition-style problems (LeetCode, Codeforces), which emphasize algorithmic reasoning over simple utility functions.

**Use this model if** you need help with algorithms, data structures, or competitive programming.
**Use the base model if** you need simple utility functions or basic string/list operations.
## 🚀 Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "prometheus04/qwen3-4b-code-finetuned",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "prometheus04/qwen3-4b-code-finetuned", trust_remote_code=True
)

# Complete a function
messages = [
    {"role": "system", "content": "You are a programming expert."},
    {"role": "user", "content": "def fibonacci(n):\n    "},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding, matching the evaluation setup. Note: transformers rejects
# temperature=0.0 when sampling, so use do_sample=False for greedy decoding.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📚 Training Details
- Base Model: Qwen/Qwen3-4B (4B parameters)
- Dataset: microsoft/rStar-Coder synthetic_sft (10K samples)
- Competition problems from LeetCode, Codeforces, etc.
- Execution-verified solutions with reasoning traces
- Method: LoRA fine-tuning
- Rank: 32
- Alpha: 64
- Target modules: All linear layers (q,k,v,o,gate,up,down)
- rsLoRA: Enabled
- Training:
- Epochs: 1
  - Batch size: 2 × 8 gradient accumulation steps = 16 effective
- Learning rate: 2e-4 (cosine schedule)
- Optimizer: AdamW 8-bit
- Max seq length: 4096
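The LoRA hyperparameters above could be expressed as a `peft` configuration roughly like the following. This is a sketch reconstructed from the listed values, not the published training script; module names assume Qwen3's standard projection naming.

```python
# Hypothetical peft config matching the hyperparameters listed above
# (rank 32, alpha 64, rsLoRA, all linear projections). Illustrative only.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=64,        # scaling alpha
    use_rslora=True,      # rank-stabilized LoRA: scale = alpha / sqrt(r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```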
## 💡 Key Features
- ✅ Trained on execution-verified competition solutions
- ✅ Curriculum learning (easy → hard)
- ✅ Decontaminated from HumanEval/MBPP
- ✅ Efficient LoRA (1.62% trainable params)
- ✅ Production-ready merged weights
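The small trainable fraction follows from LoRA's structure: adapting a `d_out × d_in` linear layer at rank `r` adds only `r·(d_in + d_out)` trainable weights while the base matrix stays frozen. A minimal sketch, using hypothetical layer shapes rather than Qwen3-4B's actual dimensions:

```python
def lora_trainable_fraction(layer_shapes, r):
    """Fraction of parameters LoRA trains when adapting the given
    (d_out, d_in) linear layers at rank r; base weights stay frozen."""
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    lora = sum(r * (d_out + d_in) for d_out, d_in in layer_shapes)
    return lora / base

# Hypothetical shapes for illustration only (not the real model's).
shapes = [(2048, 2048), (8192, 2048), (2048, 8192)]
print(f"{lora_trainable_fraction(shapes, r=32):.2%}")  # → 2.08%
```

The exact percentage depends on the model's true layer dimensions and on which parameters (embeddings, norms) are counted in the denominator.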
## 📈 Comparison
| Model | HumanEval | MBPP | Specialization |
|---|---|---|---|
| Qwen3-4B Base | 62% | 67% | General |
| This Model | 68.9% | 58.2% | Algorithms |
| GPT-3.5-turbo | ~75% | ~70% | General |
## 🎯 Strengths
- Binary search, dynamic programming, graph algorithms
- Recursion, backtracking, tree traversal
- Complex data structure manipulation
- Competitive programming patterns
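As a concrete example of the kind of task the model targets, here is a classic binary search, written as a hypothetical reference solution (not actual model output):

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2   # midpoint of the current search window
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1       # target is in the right half
        else:
            hi = mid - 1       # target is in the left half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # → 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # → -1
```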
## ⚠️ Limitations
- Not recommended for simple utility functions (use base model instead)
- Trained on Python-only data
- May overthink simple problems
- Best for algorithmic/competitive programming tasks
- Optimal for functions <4K tokens
## 🔧 Recommended Use Cases
- ✅ LeetCode/HackerRank-style problems
- ✅ Algorithm implementation
- ✅ Data structure coding
- ✅ Competitive programming practice
- ✅ Technical interview preparation
- ❌ Simple string manipulation
- ❌ Basic list operations
- ❌ Trivial utility functions
## 📖 Citation
```bibtex
@misc{qwen3-4b-code-finetuned,
  author = {prometheus04},
  title = {Qwen3-4B Code Fine-Tuned on rStar-Coder},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/prometheus04/qwen3-4b-code-finetuned}},
}
```
## 📄 License

Apache 2.0 (inherited from the base model).