Project: PrettyBird BCE Basic Coder 8B
Project Overview
This project involves fine-tuning the Qwen/Qwen2.5-Coder-7B-Instruct model using the Unsloth library. The goal was to produce a high-performance coding assistant. Following the fine-tuning process, the model was converted into GGUF format with multiple quantization levels to optimize for various deployment scenarios.
Methodology
- Fine-Tuning Framework: Unsloth (LoRA adapters)
- Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
- Inference Engine: Ollama
- Hardware: NVIDIA A100 GPU
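As a rough illustration of the pipeline listed above, the following sketch attaches LoRA adapters to the base model with Unsloth and exports one GGUF file per quantization level. The sequence length, 4-bit loading, LoRA rank, alpha, target modules, and output directory are all assumptions chosen for illustration; the actual training hyperparameters and dataset for this project are not stated here.

```python
# Quantization tags matching the results table; each is passed to Unsloth's GGUF exporter.
QUANT_METHODS = ["f16", "q8_0", "q5_k_m", "q4_k_m", "q2_k"]

BASE_MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"


def build_lora_model(max_seq_length: int = 2048):
    """Load the base model and wrap it with LoRA adapters (hyperparameters are assumed)."""
    # Imported lazily because Unsloth requires a CUDA-capable GPU at import time.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=BASE_MODEL,
        max_seq_length=max_seq_length,  # assumption: not specified by the project
        load_in_4bit=True,              # assumption: QLoRA-style loading to save memory
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,              # assumption: LoRA rank
        lora_alpha=16,     # assumption: LoRA scaling factor
        lora_dropout=0,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )
    return model, tokenizer


def export_gguf(model, tokenizer, output_dir: str = "gguf_out"):
    """Write one GGUF export per quantization level (output_dir is a placeholder)."""
    for method in QUANT_METHODS:
        model.save_pretrained_gguf(
            f"{output_dir}/{method}", tokenizer, quantization_method=method
        )
```

Training itself (for example with TRL's SFTTrainer) would run between build_lora_model and export_gguf; it is omitted because the dataset and training settings are not documented.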
Quantitative Results
Each quantization level was benchmarked for inference speed in tokens per second (TPS) through Ollama's /api/generate endpoint in raw mode, which skips prompt templating so that every variant receives an identical prompt.
| Model Tag | Quantization | Mean TPS | Speedup vs Baseline |
|---|---|---|---|
| f16 | 16-bit float (unquantized baseline) | ~97.0 | 1.0x |
| q8_0 | 8-bit | ~141.0 | 1.45x |
| q5_k_m | 5-bit | ~151.0 | 1.56x |
| q4_k_m | 4-bit | ~161.0 | 1.66x |
| q2_k | 2-bit | ~140.0 | 1.44x |
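One plausible way to obtain numbers like these is sketched below. It assumes TPS is computed from the eval_count and eval_duration fields that Ollama returns for a non-streaming /api/generate call, and that the server runs at Ollama's default local address. The run count and the helper names are hypothetical; the project's actual benchmark script is not shown.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def tokens_per_second(response: dict) -> float:
    """Decode throughput for one response; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate_raw(model: str, prompt: str) -> dict:
    """Send one non-streaming request in raw mode (no prompt template applied)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "raw": True, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


def mean_tps(model: str, prompt: str, runs: int = 5) -> float:
    """Average TPS over several runs (the run count is an assumption)."""
    samples = [tokens_per_second(generate_raw(model, prompt)) for _ in range(runs)]
    return statistics.mean(samples)


def speedup(tps: float, baseline_tps: float) -> float:
    """Speedup relative to the f16 baseline, rounded as in the table."""
    return round(tps / baseline_tps, 2)
```

For example, speedup(161.0, 97.0) reproduces the 1.66x figure reported for q4_k_m. Calling mean_tps requires a running Ollama server and a model tag that exists locally.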
Performance Analysis
- Sweet Spot: The q4_k_m model demonstrated the highest throughput, achieving approximately 161 TPS, a ~1.66x speedup over the f16 baseline. Among the variants tested, it offers the best balance between speed and precision.
- The 2-bit Anomaly: The q2_k model, despite being the most compressed, ran slower than the 4-bit and 5-bit variants and roughly level with q8_0 (~140 TPS). This counter-intuitive result is attributed to the computational overhead of dequantizing highly compressed weights on the fly, which becomes a bottleneck on high-performance hardware such as the NVIDIA A100.
Recommendations
For production deployment, the q4_k_m model is recommended as the primary candidate because it delivered the highest measured throughput with a small memory footprint. The q5_k_m model serves as a higher-fidelity alternative when slightly better output precision is required, at a cost of roughly 6% in throughput (~151 vs ~161 TPS).