
Project: PrettyBird BCE Basic Coder 8B

Project Overview

This project fine-tunes the Qwen/Qwen2.5-Coder-7B-Instruct model with the Unsloth library to produce a high-performance coding assistant. After fine-tuning, the model was converted to GGUF format at multiple quantization levels to suit different deployment scenarios; a minimal sketch of this workflow is shown below.
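
The following is a minimal sketch of that workflow, not the exact training script: the dataset, LoRA rank, hyperparameters, and output paths are illustrative assumptions.

```python
# Minimal sketch of the fine-tuning + GGUF export workflow (illustrative; the
# actual training script, dataset, and hyperparameters were not published).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model with Unsloth's optimized loader.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,          # assumption: 4-bit QLoRA-style loading
)

# Attach LoRA adapters (rank and alpha are illustrative values).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical instruction-tuning dataset with a pre-formatted "text" column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()

# Export merged weights to GGUF at one of the quantization levels benchmarked below.
model.save_pretrained_gguf("prettybird_gguf", tokenizer, quantization_method="q4_k_m")
```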

Methodology

  • Fine-Tuning Framework: Unsloth (LoRA adapters)
  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Inference Engine: Ollama
  • Hardware: NVIDIA A100 GPU

Quantitative Results

Each quantized model was benchmarked for inference speed (tokens per second, TPS) through Ollama's /api/generate endpoint in raw mode, which bypasses the chat template and ensures consistent evaluation across runs. A minimal measurement sketch follows the results table.

| Model Tag | Quantization | Mean TPS | Speedup vs Baseline |
|-----------|--------------|----------|---------------------|
| f16 | Full Precision | ~97.0 | 1.0x |
| q8_0 | 8-bit | ~141.0 | 1.45x |
| q5_k_m | 5-bit | ~151.0 | 1.56x |
| q4_k_m | 4-bit | ~161.0 | 1.66x |
| q2_k | 2-bit | ~140.0 | 1.44x |
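
The sketch below shows one way to measure TPS against Ollama's /api/generate endpoint in raw mode; the model tag, prompt, and repetition count are illustrative assumptions, not the exact harness used for the numbers above.

```python
# Minimal TPS measurement sketch against Ollama's /api/generate endpoint in raw
# mode (model tag, prompt, and repetition count are illustrative assumptions).
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "prettybird-bce-basic-coder:q4_k_m"   # hypothetical tag name
PROMPT = "def fibonacci(n):"
RUNS = 5

def measure_tps(model: str, prompt: str) -> float:
    """Run one non-streaming generation and compute tokens per second
    from the eval_count / eval_duration fields Ollama returns."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "raw": True,       # bypass the chat template for consistent evaluation
        "stream": False,
    })
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds.
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    samples = [measure_tps(MODEL_TAG, PROMPT) for _ in range(RUNS)]
    print(f"{MODEL_TAG}: mean TPS = {sum(samples) / len(samples):.1f}")
```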

Performance Analysis

  • Sweet Spot: The q4_k_m model demonstrated the highest throughput, achieving approximately 161 TPS. It represents the optimal balance between speed and precision, offering a ~1.7x speedup over the full-precision baseline.
  • The 2-bit Anomaly: The q2_k model, despite being the most compressed, ran slower than the 4-bit and 5-bit variants (~140 TPS). This counter-intuitive result is attributed to the extra compute needed to dequantize heavily compressed weights on the fly, which becomes the bottleneck on high-throughput hardware such as the NVIDIA A100.

Recommendations

For production deployment, the q4_k_m model is recommended as the primary candidate due to its superior throughput and efficient memory usage. The q5_k_m model serves as a high-fidelity alternative if slightly higher reasoning precision is required.
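
As a usage illustration, the sketch below sends a chat-style request to the recommended q4_k_m build through Ollama's /api/chat endpoint. The model tag is a hypothetical placeholder; production requests would normally go through the templated chat endpoint rather than the raw mode used for benchmarking.

```python
# Example request to the recommended q4_k_m build via Ollama's /api/chat
# endpoint; the model tag below is a hypothetical placeholder.
import requests

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "prettybird-bce-basic-coder:q4_k_m",
    "messages": [
        {"role": "user",
         "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    "stream": False,
})
resp.raise_for_status()
print(resp.json()["message"]["content"])
```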