AI Engineering Lab committed on
Commit a99aedc · 1 Parent(s): bf29ad1
results: RTX 3090 consolidated — 4 runs, 15 measurements, avg -7.5% TPS
- README.md +9 -12
- results/turboquant-3090-all-runs-2026-04.json +81 -0
README.md
CHANGED
@@ -51,21 +51,18 @@ Tested on two consumer GPUs. Results verified across multiple independent runs (

### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M

-*

| | Baseline (f16) | TurboQuant turbo3 | Delta |
|--|:--------------:|:-----------------:|:-----:|
| **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
-| **VRAM** | 15.
-| **Tokens/s** |
| **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |

-> **12× more context. +12% VRAM. −

-
-Run 2 (warm): Baseline 51.2 TPS / 15,695 MB → Turbo3 47.1 TPS / 17,581 MB
-
-Raw data: [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json) · [`results/turboquant-rtx3090-2026-04-01-v2.json`](results/turboquant-rtx3090-2026-04-01-v2.json)

### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M

@@ -83,10 +80,10 @@ Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant

### Cross-GPU Summary

-| GPU | VRAM | Model | Max Context (turbo3) | Speed Loss |
-|-----|------|-------|---------------------|-----------|
-| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −
-| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −

TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.

### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M

+*4 independent benchmark runs, 15 total measurements.*

| | Baseline (f16) | TurboQuant turbo3 | Delta |
|--|:--------------:|:-----------------:|:-----:|
| **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
+| **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB only |
+| **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
| **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |

+> **12× more context. +12% VRAM. −7.5% speed. Same model weights.**

+Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (all 4 runs)

### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
### Cross-GPU Summary

+| GPU | VRAM | Model | Max Context (turbo3) | Speed Loss | Runs |
+|-----|------|-------|---------------------|-----------|------|
+| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −7.5% | 4 |
+| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −4.6% | 2 |

TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.

results/turboquant-3090-all-runs-2026-04.json
ADDED
@@ -0,0 +1,81 @@
+{
+  "date": "2026-04-01 / 2026-04-02",
+  "hardware": {
+    "gpu": "NVIDIA GeForce RTX 3090",
+    "vram_gb": 24,
+    "node": ".90 (win-pc-gpu)"
+  },
+  "model": "Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M",
+  "model_size_gb": 13.3,
+  "total_runs": 4,
+  "total_measurements": {
+    "baseline": 8,
+    "turboquant": 7
+  },
+  "runs": [
+    {
+      "run": 1,
+      "note": "cold",
+      "baseline_tps": 49.2,
+      "baseline_vram_mb": 15408,
+      "turbo3_tps": 45.0,
+      "turbo3_vram_mb": 17224
+    },
+    {
+      "run": 2,
+      "note": "warm",
+      "baseline_tps": 51.2,
+      "baseline_vram_mb": 15695,
+      "turbo3_tps": 47.1,
+      "turbo3_vram_mb": 17581
+    },
+    {
+      "run": 3,
+      "note": "fresh session 2026-04-02",
+      "baseline_tps_runs": [
+        51.39,
+        51.47,
+        51.29
+      ],
+      "baseline_tps_avg": 51.38,
+      "baseline_vram_mb": 15758,
+      "turbo3_tps_runs": [
+        48.44,
+        47.18,
+        48.4
+      ],
+      "turbo3_tps_avg": 48.01,
+      "turbo3_vram_mb": 17644
+    },
+    {
+      "run": 4,
+      "note": "fresh session 2026-04-02",
+      "baseline_tps_runs": [
+        51.2,
+        51.15,
+        50.92
+      ],
+      "baseline_tps_avg": 51.09,
+      "baseline_vram_mb": 15758,
+      "turbo3_tps_runs": [
+        47.13,
+        46.81
+      ],
+      "turbo3_tps_avg": 46.97,
+      "turbo3_vram_mb": 17644
+    }
+  ],
+  "summary": {
+    "ctx": {
+      "baseline": 8192,
+      "turboquant": 100000,
+      "multiplier": 12.2
+    },
+    "baseline_tps_avg": 50.98,
+    "turbo3_tps_avg": 47.15,
+    "tps_delta_pct": -7.5,
+    "baseline_vram_mb_avg": 15655,
+    "turbo3_vram_mb_avg": 17523,
+    "vram_delta_gb": 1.82
+  }
+}
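
The `summary` values in the JSON can be cross-checked against the per-run data. The sketch below is not part of the commit. It assumes the TPS averages are pooled over all individual measurements (8 baseline, 7 turbo3), that VRAM is averaged once per run, and that MB are converted to GB by dividing by 1024. The file does not state its method, but these assumptions happen to reproduce the published figures. The `RUNS` list and the `summarize` helper are hypothetical names introduced for illustration.

```python
from statistics import mean

# Per-run measurements copied from the "runs" array in the JSON file.
# Single-value runs (1 and 2) are stored as one-element lists.
RUNS = [
    {"baseline_tps": [49.2], "turbo3_tps": [45.0],
     "baseline_vram_mb": 15408, "turbo3_vram_mb": 17224},
    {"baseline_tps": [51.2], "turbo3_tps": [47.1],
     "baseline_vram_mb": 15695, "turbo3_vram_mb": 17581},
    {"baseline_tps": [51.39, 51.47, 51.29], "turbo3_tps": [48.44, 47.18, 48.4],
     "baseline_vram_mb": 15758, "turbo3_vram_mb": 17644},
    {"baseline_tps": [51.2, 51.15, 50.92], "turbo3_tps": [47.13, 46.81],
     "baseline_vram_mb": 15758, "turbo3_vram_mb": 17644},
]


def summarize(runs):
    """Recompute the summary block (assumed method: pooled TPS, per-run VRAM)."""
    base = [t for r in runs for t in r["baseline_tps"]]
    turbo = [t for r in runs for t in r["turbo3_tps"]]
    base_avg, turbo_avg = mean(base), mean(turbo)
    base_vram = mean(r["baseline_vram_mb"] for r in runs)
    turbo_vram = mean(r["turbo3_vram_mb"] for r in runs)
    return {
        "measurements": len(base) + len(turbo),
        "baseline_tps_avg": round(base_avg, 2),
        "turbo3_tps_avg": round(turbo_avg, 2),
        "tps_delta_pct": round((turbo_avg - base_avg) / base_avg * 100, 1),
        "baseline_vram_mb_avg": round(base_vram),
        "turbo3_vram_mb_avg": round(turbo_vram),
        # Assumption: 1 GB = 1024 MB.
        "vram_delta_gb": round((turbo_vram - base_vram) / 1024, 2),
        "ctx_multiplier": round(100_000 / 8_192, 1),
    }


if __name__ == "__main__":
    for key, value in summarize(RUNS).items():
        print(f"{key}: {value}")
```

Because runs 3 and 4 contribute more measurements than runs 1 and 2, a pooled average weights the fresh-session runs more heavily than a mean of per-run averages would.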