Update benchmark results: multi-codebase (jq, Rails, FastAPI) weighted scoring
README.md
@@ -32,8 +32,8 @@ GGUF conversion of [lightonai/LateOn-Code-edge](https://huggingface.co/lightonai
 | Variant | Size | Quality |
 |---------|------|---------|
 | [f32](https://huggingface.co/embedme/lightonai-lateon-code-edge-f32) | 66 MB | Original precision (lossless) |
-| **f16** (this repo) | 34 MB |
-| [Q8_0](https://huggingface.co/embedme/lightonai-lateon-code-edge-Q8_0) | 19 MB |
+| **f16** (this repo) | 34 MB | Lossless — 100% top-1 agreement, 240/300 weighted |
+| [Q8_0](https://huggingface.co/embedme/lightonai-lateon-code-edge-Q8_0) | 19 MB | 79% weighted score, 96-100% top-1 agreement, 3.5× smaller |

 ## Files

@@ -66,37 +66,41 @@ LIMIT 10;
-##
+## Quantization Quality Benchmark

+Tested across 3 codebases (jq/C, Rails/Ruby, FastAPI/Python) with 150 questions total (15 easy + 20 medium + 15 hard per codebase). Weighted scoring: easy×1, medium×2, hard×3 = 100 points per codebase, 300 total.

-###
+### Aggregate Weighted Scores

-| Variant |
-|---------|-----------
-| f32 |
-| f16 |
-| Q8_0 |
+| Variant | Weighted Score | Percentage |
+|---------|---------------|------------|
+| f32 | 240 / 300 | **80.0%** |
+| f16 | 240 / 300 | **80.0%** |
+| Q8_0 | 237 / 300 | **79.0%** |

-###
+### Per-Corpus Scores

-|---------|--------
+| Corpus | f32 | f16 | Q8_0 |
+|---------|---------|---------|---------|
+| jq (C) | 66/100 | 66/100 | 63/100 |
+| Rails (Ruby) | 79/100 | 79/100 | 79/100 |
+| FastAPI (Python) | 95/100 | 95/100 | 95/100 |

+### Quantization Quality (Top-1 Agreement vs f32)

+| Corpus | f16 | Q8_0 |
+|---------|--------|--------|
+| jq | 100.0% | 96.0% |
+| Rails | 100.0% | 100.0% |
+| FastAPI | 100.0% | 98.0% |

+### Key Findings
+
+- **f16 is lossless** — identical weighted score (240/300) and 100% top-1 agreement across all codebases
+- **Q8_0 loses only 1%** — 237/300 vs 240/300, drops only on hard queries in jq corpus
+- **Q8_0 is fastest** — 2.5s avg query vs 3.4s f32 vs 13.4s f16 (CPU without FP16 hardware)
+- Easy/medium questions show zero quality difference between all variants

-**Key finding**: f16 is lossless — 100% top-1 agreement with f32 even on hard semantic queries, at half the model size. Recommended for best quality/size trade-off.

 ## Conversion

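The weighted-scoring arithmetic in the new benchmark section can be checked with a short script. This is a minimal sketch, not the benchmark harness used for this commit. It assumes each question is scored pass/fail and earns its full difficulty weight (easy 1, medium 2, hard 3) when answered correctly. It also assumes top-1 agreement is the share of queries whose top-ranked result from a variant matches the f32 top-ranked result. The per-corpus numbers are copied from the Per-Corpus Scores table. `max_points`, `aggregate`, and `top1_agreement` are hypothetical helpers, not part of the repository.

```python
"""Sketch: reproduce the README's weighted scores and top-1 agreement.

Assumptions (not confirmed by the README): each question is pass/fail and
earns its full difficulty weight, and top-1 agreement is measured per query
against the f32 run. All helper names here are hypothetical.
"""

# Difficulty weights and per-codebase question counts as stated in the README.
WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}
QUESTIONS_PER_CODEBASE = {"easy": 15, "medium": 20, "hard": 15}

# Per-corpus weighted scores copied from the "Per-Corpus Scores" table.
PER_CORPUS = {
    "f32": {"jq": 66, "rails": 79, "fastapi": 95},
    "f16": {"jq": 66, "rails": 79, "fastapi": 95},
    "Q8_0": {"jq": 63, "rails": 79, "fastapi": 95},
}


def max_points(counts):
    """Maximum weighted score for one codebase: sum of count * weight."""
    return sum(WEIGHTS[level] * n for level, n in counts.items())


def aggregate(scores):
    """Return (total, possible, percentage) summed over all codebases."""
    total = sum(scores.values())
    possible = max_points(QUESTIONS_PER_CODEBASE) * len(scores)
    return total, possible, 100.0 * total / possible


def top1_agreement(reference, candidate):
    """Fraction of queries whose top-1 result matches the reference run."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("both runs must cover the same non-empty query list")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


if __name__ == "__main__":
    print("points per codebase:", max_points(QUESTIONS_PER_CODEBASE))
    for variant, scores in PER_CORPUS.items():
        total, possible, pct = aggregate(scores)
        print(f"{variant}: {total}/{possible} = {pct:.1f}%")

    # Hypothetical jq run: 50 queries, 2 of which return a different top-1 hit.
    f32_top1 = [f"doc{i}" for i in range(50)]
    q8_top1 = f32_top1[:48] + ["other-a", "other-b"]
    print(f"Q8_0 top-1 agreement: {top1_agreement(f32_top1, q8_top1):.1%}")
```

Running it prints 100 points per codebase, 240/300 = 80.0% for f32 and f16, 237/300 = 79.0% for Q8_0, and 96.0% for the made-up agreement example. Under the pass/fail assumption, the 3-point jq gap for Q8_0 equals the weight of a single hard question, which fits the note that Q8_0 only drops on hard jq queries. With 50 questions per codebase, 96.0% agreement would correspond to 2 mismatched top-1 results.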