# Qwen3-4B Code Fine-Tuned

Qwen3-4B fine-tuned on 10K verified reasoning traces from rStar-Coder (1 epoch of SFT). Optimized for algorithmic and competitive-programming tasks.
## 📊 Performance (EvalPlus Framework)
| Benchmark | Base tests | Plus tests | vs base model |
|---|---|---|---|
| HumanEval | 68.9% | 64.0% | +6.9% ✅ |
| MBPP | 58.2% | 50.8% | -8.8% ⚠️ |
_Evaluated using EvalPlus with greedy decoding._
### Performance Trade-off
- ✅ Improved on complex algorithmic tasks (HumanEval: 62% → 68.9%)
- ⚠️ Regression on simple practical tasks (MBPP: 67% → 58.2%)
**Why?** The model was trained on competition-style problems (LeetCode, Codeforces), which emphasize algorithmic reasoning over simple utility functions.

**Use this model if** you need help with algorithms, data structures, or competitive programming.
**Use the base model if** you need simple utility functions or basic string/list operations.
## 🚀 Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "prometheus04/qwen3-4b-code-finetuned",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "prometheus04/qwen3-4b-code-finetuned", trust_remote_code=True
)

# Complete a function
messages = [
    {"role": "system", "content": "You are a programming expert."},
    {"role": "user", "content": "def fibonacci(n):\n    "},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding, matching the evaluation setup. Note: transformers rejects
# temperature=0.0 when sampling, so use do_sample=False for greedy decoding.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## 📚 Training Details
- Base Model: Qwen/Qwen3-4B (4B parameters)
- Dataset: microsoft/rStar-Coder synthetic_sft (10K samples)
- Competition problems from LeetCode, Codeforces, etc.
- Execution-verified solutions with reasoning traces
- Method: LoRA fine-tuning
- Rank: 32
- Alpha: 64
- Target modules: All linear layers (q,k,v,o,gate,up,down)
- rsLoRA: Enabled
- Training:
- Epochs: 1
  - Batch size: 2 × 8 gradient accumulation steps = 16 effective
- Learning rate: 2e-4 (cosine schedule)
- Optimizer: AdamW 8-bit
- Max seq length: 4096
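The LoRA hyperparameters above could be expressed as a `peft` configuration roughly like the following. This is a sketch reconstructed from the listed values, not the published training script; module names assume Qwen3's standard projection naming.

```python
# Hypothetical peft config matching the hyperparameters listed above
# (rank 32, alpha 64, rsLoRA, all linear projections). Illustrative only.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=64,        # scaling alpha
    use_rslora=True,      # rank-stabilized LoRA: scale = alpha / sqrt(r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```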
## 💡 Key Features
- ✅ Trained on execution-verified competition solutions
- ✅ Curriculum learning (easy → hard)
- ✅ Decontaminated from HumanEval/MBPP
- ✅ Efficient LoRA (1.62% trainable params)
- ✅ Production-ready merged weights
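The small trainable fraction follows from LoRA's structure: adapting a `d_out × d_in` linear layer at rank `r` adds only `r·(d_in + d_out)` trainable weights while the base matrix stays frozen. A minimal sketch, using hypothetical layer shapes rather than Qwen3-4B's actual dimensions:

```python
def lora_trainable_fraction(layer_shapes, r):
    """Fraction of parameters LoRA trains when adapting the given
    (d_out, d_in) linear layers at rank r; base weights stay frozen."""
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    lora = sum(r * (d_out + d_in) for d_out, d_in in layer_shapes)
    return lora / base

# Hypothetical shapes for illustration only (not the real model's).
shapes = [(2048, 2048), (8192, 2048), (2048, 8192)]
print(f"{lora_trainable_fraction(shapes, r=32):.2%}")  # → 2.08%
```

The exact percentage depends on the model's true layer dimensions and on which parameters (embeddings, norms) are counted in the denominator.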
## 📈 Comparison
| Model | HumanEval | MBPP | Specialization |
|---|---|---|---|
| Qwen3-4B Base | 62% | 67% | General |
| This Model | 68.9% | 58.2% | Algorithms |
| GPT-3.5-turbo | ~75% | ~70% | General |
## 🎯 Strengths
- Binary search, dynamic programming, graph algorithms
- Recursion, backtracking, tree traversal
- Complex data structure manipulation
- Competitive programming patterns
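As a concrete example of the kind of task the model targets, here is a classic binary search, written as a hypothetical reference solution (not actual model output):

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2   # midpoint of the current search window
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1       # target is in the right half
        else:
            hi = mid - 1       # target is in the left half
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # → 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # → -1
```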
## ⚠️ Limitations
- Not recommended for simple utility functions (use base model instead)
- Trained on Python-only data
- May overthink simple problems
- Best for algorithmic/competitive programming tasks
- Optimal for functions <4K tokens
## 🔧 Recommended Use Cases
- ✅ LeetCode/HackerRank-style problems
- ✅ Algorithm implementation
- ✅ Data structure coding
- ✅ Competitive programming practice
- ✅ Technical interview preparation
- ❌ Simple string manipulation
- ❌ Basic list operations
- ❌ Trivial utility functions
## 📖 Citation
```bibtex
@misc{qwen3-4b-code-finetuned,
  author = {prometheus04},
  title = {Qwen3-4B Code Fine-Tuned on rStar-Coder},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/prometheus04/qwen3-4b-code-finetuned}},
}
```
## 📄 License

Apache 2.0 (inherited from the base model).