Spaces: ai-engineering-at/llama-cpp-turboquant-guide

AI Engineering Lab committed on Apr 2

Commit a99aedc (1 parent: bf29ad1)

results: RTX 3090 consolidated — 4 runs, 15 measurements, avg -7.5% TPS

Files changed (2)

README.md +9 -12
results/turboquant-3090-all-runs-2026-04.json +81 -0

README.md CHANGED

@@ -51,21 +51,18 @@ Tested on two consumer GPUs. Results verified across multiple independent runs (
 ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
-*Average of 2 independent benchmark runs.*
+*4 independent benchmark runs, 15 total measurements.*
 | | Baseline (f16) | TurboQuant turbo3 | Delta |
 |--|:--------------:|:-----------------:|:-----:|
 | **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
-| **VRAM** | 15.5 GB | 17.4 GB | +1.9 GB only |
-| **Tokens/s** | 50.2 | 46.0 | **−8.3%** |
+| **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB only |
+| **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
 | **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |
-> **12× more context. +12% VRAM. −8% speed. Same model weights.**
+> **12× more context. +12% VRAM. −7.5% speed. Same model weights.**
-Run 1 (cold): Baseline 49.2 TPS / 15,408 MB → Turbo3 45.0 TPS / 17,224 MB
-Run 2 (warm): Baseline 51.2 TPS / 15,695 MB → Turbo3 47.1 TPS / 17,581 MB
-Raw data: [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json) · [`results/turboquant-rtx3090-2026-04-01-v2.json`](results/turboquant-rtx3090-2026-04-01-v2.json)
+Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (all 4 runs)
 ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
@@ -83,10 +80,10 @@ Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant
 ### Cross-GPU Summary
-| GPU | VRAM | Model | Max Context (turbo3) | Speed Loss |
-|-----|------|-------|---------------------|-----------|
-| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −8.3% |
-| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −3.2% |
+| GPU | VRAM | Model | Max Context (turbo3) | Speed Loss | Runs |
+|-----|------|-------|---------------------|-----------|------|
+| RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −7.5% | 4 |
+| RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −4.6% | 2 |
 TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.
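
Reviewer note: the KV-Cache row compares caches at different context lengths (~1 GB at 8,192 tokens against ~2.8 GB at 100,000 tokens), so the "4.3× compression" figure only makes sense per token. The sketch below is one way to read that row. It assumes the f16 cache grows linearly with context length, which the README does not state, and the function names are illustrative rather than part of the repository.

```python
# Approximate figures as published in the README table.
BASELINE_CTX = 8_192        # tokens, f16 KV cache
TURBO_CTX = 100_000         # tokens, turbo3 KV cache
KV_F16_GB = 1.0             # "~1 GB (f16)" at BASELINE_CTX
KV_TURBO3_GB = 2.8          # "~2.8 GB (3-bit)" at TURBO_CTX


def context_multiplier(baseline_ctx: int, turbo_ctx: int) -> float:
    """How many times longer the turbo3 context is than the baseline."""
    return turbo_ctx / baseline_ctx


def implied_compression(kv_f16_gb: float, baseline_ctx: int,
                        kv_turbo_gb: float, turbo_ctx: int) -> float:
    """Per-token compression ratio implied by the two cache sizes.

    Assumption: an f16 cache scales linearly with token count, so its
    size at turbo_ctx is the baseline size times the context multiplier.
    """
    f16_at_turbo_ctx = kv_f16_gb * context_multiplier(baseline_ctx, turbo_ctx)
    return f16_at_turbo_ctx / kv_turbo_gb


if __name__ == "__main__":
    mult = context_multiplier(BASELINE_CTX, TURBO_CTX)
    ratio = implied_compression(KV_F16_GB, BASELINE_CTX, KV_TURBO3_GB, TURBO_CTX)
    print(f"context multiplier: {mult:.1f}x")
    print(f"implied per-token compression: {ratio:.1f}x")
```

With the rounded inputs from the table this lands close to the published 12.2× and 4.3× figures; the small gap on the compression ratio is within what the "~" on both cache sizes allows.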

results/turboquant-3090-all-runs-2026-04.json ADDED

	@@ -0,0 +1,81 @@

+{
+  "date": "2026-04-01 / 2026-04-02",
+  "hardware": {
+    "gpu": "NVIDIA GeForce RTX 3090",
+    "vram_gb": 24,
+    "node": ".90 (win-pc-gpu)"
+  },
+  "model": "Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M",
+  "model_size_gb": 13.3,
+  "total_runs": 4,
+  "total_measurements": {
+    "baseline": 8,
+    "turboquant": 7
+  },
+  "runs": [
+    {
+      "run": 1,
+      "note": "cold",
+      "baseline_tps": 49.2,
+      "baseline_vram_mb": 15408,
+      "turbo3_tps": 45.0,
+      "turbo3_vram_mb": 17224
+    },
+    {
+      "run": 2,
+      "note": "warm",
+      "baseline_tps": 51.2,
+      "baseline_vram_mb": 15695,
+      "turbo3_tps": 47.1,
+      "turbo3_vram_mb": 17581
+    },
+    {
+      "run": 3,
+      "note": "fresh session 2026-04-02",
+      "baseline_tps_runs": [
+        51.39,
+        51.47,
+        51.29
+      ],
+      "baseline_tps_avg": 51.38,
+      "baseline_vram_mb": 15758,
+      "turbo3_tps_runs": [
+        48.44,
+        47.18,
+        48.4
+      ],
+      "turbo3_tps_avg": 48.01,
+      "turbo3_vram_mb": 17644
+    },
+    {
+      "run": 4,
+      "note": "fresh session 2026-04-02",
+      "baseline_tps_runs": [
+        51.2,
+        51.15,
+        50.92
+      ],
+      "baseline_tps_avg": 51.09,
+      "baseline_vram_mb": 15758,
+      "turbo3_tps_runs": [
+        47.13,
+        46.81
+      ],
+      "turbo3_tps_avg": 46.97,
+      "turbo3_vram_mb": 17644
+    }
+  ],
+  "summary": {
+    "ctx": {
+      "baseline": 8192,
+      "turboquant": 100000,
+      "multiplier": 12.2
+    },
+    "baseline_tps_avg": 50.98,
+    "turbo3_tps_avg": 47.15,
+    "tps_delta_pct": -7.5,
+    "baseline_vram_mb_avg": 15655,
+    "turbo3_vram_mb_avg": 17523,
+    "vram_delta_gb": 1.82
+  }
+}
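
Reviewer note: the `summary` block can be cross-checked against the per-run readings. The sketch below copies the run data from the JSON above and recomputes the averages. The `summarize` helper is hypothetical, and two choices in it are inferred from the published numbers rather than documented: TPS readings are pooled across all runs (8 baseline, 7 turbo3), and MB are converted to GB by dividing by 1024.

```python
from statistics import mean

# Per-run readings copied from results/turboquant-3090-all-runs-2026-04.json.
# Runs 1 and 2 report a single TPS value; runs 3 and 4 report several.
RUNS = [
    {"baseline_tps": [49.2], "baseline_vram_mb": 15408,
     "turbo3_tps": [45.0], "turbo3_vram_mb": 17224},
    {"baseline_tps": [51.2], "baseline_vram_mb": 15695,
     "turbo3_tps": [47.1], "turbo3_vram_mb": 17581},
    {"baseline_tps": [51.39, 51.47, 51.29], "baseline_vram_mb": 15758,
     "turbo3_tps": [48.44, 47.18, 48.4], "turbo3_vram_mb": 17644},
    {"baseline_tps": [51.2, 51.15, 50.92], "baseline_vram_mb": 15758,
     "turbo3_tps": [47.13, 46.81], "turbo3_vram_mb": 17644},
]


def summarize(runs):
    """Hypothetical helper: pool all TPS readings, average VRAM per run."""
    base = [tps for run in runs for tps in run["baseline_tps"]]
    turbo = [tps for run in runs for tps in run["turbo3_tps"]]
    base_avg, turbo_avg = mean(base), mean(turbo)
    base_vram = mean(run["baseline_vram_mb"] for run in runs)
    turbo_vram = mean(run["turbo3_vram_mb"] for run in runs)
    return {
        "measurements": {"baseline": len(base), "turboquant": len(turbo)},
        "baseline_tps_avg": round(base_avg, 2),
        "turbo3_tps_avg": round(turbo_avg, 2),
        "tps_delta_pct": round((turbo_avg - base_avg) / base_avg * 100, 1),
        "baseline_vram_mb_avg": round(base_vram),
        "turbo3_vram_mb_avg": round(turbo_vram),
        # Assumption: GB here means MB / 1024, inferred from the summary.
        "vram_delta_gb": round((turbo_vram - base_vram) / 1024, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(RUNS).items():
        print(f"{key}: {value}")
```

Pooling all 15 readings reproduces the committed summary values. Averaging the four per-run means instead would weight runs 1 and 2 more heavily and give a delta of roughly -7.8%, so it may be worth stating the averaging method in the README.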