Upload README SIMPLE METRICS.md
# Project: PrettyBird BCE Basic Coder 8B

## Project Overview

This project fine-tunes the `Qwen/Qwen2.5-Coder-7B-Instruct` model with the Unsloth library to produce a high-performance coding assistant. After fine-tuning, the model was converted to GGUF format at several quantization levels to suit different deployment scenarios.

## Methodology
- **Fine-Tuning Framework**: Unsloth (LoRA adapters)
- **Base Model**: Qwen/Qwen2.5-Coder-7B-Instruct
- **Inference Engine**: Ollama
- **Hardware**: NVIDIA A100 GPU

## Quantitative Results
The models were benchmarked for inference speed in tokens per second (TPS) through Ollama's `api/generate` endpoint in raw mode, which bypasses prompt templating so that every variant receives an identical prompt.

| Model Tag | Quantization | Mean TPS | Speedup vs f16 Baseline |
| :--- | :--- | :--- | :--- |
| f16 | 16-bit float (unquantized) | ~97.0 | 1.0x |
| q8_0 | 8-bit | ~141.0 | 1.45x |
| q5_k_m | 5-bit | ~151.0 | 1.56x |
| q4_k_m | 4-bit | ~161.0 | 1.66x |
| q2_k | 2-bit | ~140.0 | 1.44x |
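
The measurement described above can be sketched as follows. This is a minimal illustration, not necessarily the script used for these numbers: it assumes Ollama is listening on its default local address, relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns for a non-streaming `api/generate` call, and uses a hypothetical model name and prompt list supplied by the caller.

```python
import json
import statistics
import urllib.request

# Assumption: Ollama's default local address; adjust for your deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body for one non-streaming generation in raw mode."""
    return {"model": model, "prompt": prompt, "raw": True, "stream": False}


def tokens_per_second(response: dict) -> float:
    """Decode throughput from Ollama's timing fields.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def mean_tps(model: str, prompts: list[str], url: str = OLLAMA_URL) -> float:
    """Send each prompt once and average the per-request TPS."""
    samples = []
    for prompt in prompts:
        body = json.dumps(build_payload(model, prompt)).encode("utf-8")
        request = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as reply:
            samples.append(tokens_per_second(json.load(reply)))
    return statistics.mean(samples)
```

Calling `mean_tps("<model>:q4_k_m", prompts)` with the same prompt list for every tag would give comparable figures. Because `eval_duration` covers token generation only, this definition excludes model load and prompt processing time.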

## Performance Analysis
- **Sweet Spot**: The `q4_k_m` model achieved the highest throughput at approximately 161 TPS, a ~1.66x speedup over the `f16` baseline. Among the variants tested, it offers the best balance between speed and precision.
- **The 2-bit Anomaly**: The `q2_k` model, despite being the most compressed, reached only ~140 TPS, slower than both the 4-bit and 5-bit variants and roughly on par with `q8_0`. This counter-intuitive result is attributed to the computational overhead of dequantizing highly compressed weights on the fly, which becomes a bottleneck on high-performance hardware like the NVIDIA A100.
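
The speedup figures quoted above follow directly from the mean TPS values. A short sketch of the arithmetic, using the approximate means from the results table (the function name is illustrative):

```python
# Approximate mean TPS per model tag, copied from the results table.
MEAN_TPS = {
    "f16": 97.0,
    "q8_0": 141.0,
    "q5_k_m": 151.0,
    "q4_k_m": 161.0,
    "q2_k": 140.0,
}


def speedups(mean_tps: dict, baseline: str = "f16") -> dict:
    """Ratio of each variant's throughput to the baseline, rounded to 2 places."""
    base = mean_tps[baseline]
    return {tag: round(tps / base, 2) for tag, tps in mean_tps.items()}


fastest = max(MEAN_TPS, key=MEAN_TPS.get)  # tag with the highest throughput
```

With these inputs, `speedups(MEAN_TPS)` reproduces the table's last column, and `fastest` is `q4_k_m`.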

## Recommendations
For production deployment, the **`q4_k_m`** model is recommended as the primary candidate because it delivers the highest throughput with a small memory footprint. The **`q5_k_m`** model is a higher-fidelity alternative when slightly better precision is worth roughly 6% lower throughput (~151 vs ~161 TPS).