	# Project: PrettyBird BCE Basic Coder 8B

	## Project Overview
	This project fine-tuned the `Qwen/Qwen2.5-Coder-7B-Instruct` model with the Unsloth library to produce a high-performance coding assistant. After fine-tuning, the model was converted to GGUF format at several quantization levels to suit different deployment scenarios.

	## Methodology
	- Fine-Tuning Framework: Unsloth (LoRA adapters)
	- Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
	- Inference Engine: Ollama
	- Hardware: NVIDIA A100 GPU

	## Quantitative Results
	The models were benchmarked for inference speed in tokens per second (TPS) through Ollama's `/api/generate` endpoint in raw mode, which applies no prompt template, so every model receives the identical prompt.
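
A measurement of this kind could be sketched as follows. This is a minimal illustration, not the script used for the numbers in this README: the server address, the number of runs, and the helper names (`build_payload`, `tokens_per_second`, `benchmark`) are assumptions. It relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns for a non-streaming `/api/generate` request.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Request body for a non-streaming generation with no prompt template applied."""
    return {"model": model, "prompt": prompt, "raw": True, "stream": False}


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, runs: int = 5) -> float:
    """Mean TPS over several runs. Needs a reachable Ollama server."""
    speeds = []
    for _ in range(runs):
        request = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as reply:
            speeds.append(tokens_per_second(json.load(reply)))
    return sum(speeds) / len(speeds)
```

Calling `benchmark("<model>:<tag>", "<prompt>")` once per quantization tag, with the same prompt each time, would give one mean TPS value per row of the table below.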

	| Model Tag | Quantization | Mean TPS (tokens/s) | Speedup vs. `f16` Baseline |
	| :--- | :--- | :--- | :--- |
	| f16 | 16-bit float (unquantized) | ~97.0 | 1.00x |
	| q8_0 | 8-bit | ~141.0 | 1.45x |
	| q5_k_m | 5-bit | ~151.0 | 1.56x |
	| q4_k_m | 4-bit | ~161.0 | 1.66x |
	| q2_k | 2-bit | ~140.0 | 1.44x |
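
The speedup column is each model's mean TPS divided by the `f16` mean TPS. The short sketch below recomputes it from the approximate values in the table; the variable and function names are illustrative only.

```python
# Approximate mean TPS values, copied from the table above.
mean_tps = {
    "f16": 97.0,
    "q8_0": 141.0,
    "q5_k_m": 151.0,
    "q4_k_m": 161.0,
    "q2_k": 140.0,
}


def speedups(tps: dict, baseline: str = "f16") -> dict:
    """Speedup of every model relative to the baseline, rounded to two decimals."""
    base = tps[baseline]
    return {tag: round(value / base, 2) for tag, value in tps.items()}


result = speedups(mean_tps)
fastest = max(mean_tps, key=mean_tps.get)

for tag, factor in result.items():
    print(f"{tag}: {factor:.2f}x")
print(f"fastest: {fastest}")
```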

	## Performance Analysis
	- Sweet Spot: The `q4_k_m` model demonstrated the highest throughput, at approximately 161 TPS. It represents the best balance between speed and precision among the tested variants, offering a ~1.66x speedup over the `f16` baseline.
	- The 2-bit Anomaly: The `q2_k` model (~140 TPS), despite being the most compressed, ran slower than the 4-bit and 5-bit variants and roughly on par with `q8_0`. This counter-intuitive result is attributed to the computational overhead of dequantizing highly compressed weights on the fly, which becomes a bottleneck on high-performance hardware like the NVIDIA A100.

	## Recommendations
	For production deployment, the `q4_k_m` model is recommended as the primary candidate due to its superior throughput and efficient memory usage. The `q5_k_m` model serves as a high-fidelity alternative if slightly higher reasoning precision is required.
