AI Engineering Lab committed on
Commit
a99aedc
·
1 Parent(s): bf29ad1

results: RTX 3090 consolidated — 4 runs, 15 measurements, avg -7.5% TPS

README.md CHANGED
@@ -51,21 +51,18 @@ Tested on two consumer GPUs. Results verified across multiple independent runs (
 
  ### RTX 3090 (24 GB) — Mistral-Small-3.2-24B Q4_K_M
 
- *Average of 2 independent benchmark runs.*
+ *4 independent benchmark runs, 15 total measurements.*
 
  | | Baseline (f16) | TurboQuant turbo3 | Delta |
  |--|:--------------:|:-----------------:|:-----:|
  | **Context** | 8,192 tokens | **100,000 tokens** | **+12.2×** |
- | **VRAM** | 15.5 GB | 17.4 GB | +1.9 GB only |
- | **Tokens/s** | 50.2 | 46.0 | **−8.3%** |
+ | **VRAM** | 15.3 GB | 17.1 GB | +1.8 GB only |
+ | **Tokens/s** | 51.0 | 47.2 | **−7.5%** |
  | **KV-Cache size** | ~1 GB (f16) | ~2.8 GB (3-bit) | **4.3× compression** |
 
- > **12× more context. +12% VRAM. −8% speed. Same model weights.**
+ > **12× more context. +12% VRAM. −7.5% speed. Same model weights.**
 
- Run 1 (cold): Baseline 49.2 TPS / 15,408 MB → Turbo3 45.0 TPS / 17,224 MB
- Run 2 (warm): Baseline 51.2 TPS / 15,695 MB → Turbo3 47.1 TPS / 17,581 MB
-
- Raw data: [`results/turboquant-rtx3090-2026-04-01.json`](results/turboquant-rtx3090-2026-04-01.json) · [`results/turboquant-rtx3090-2026-04-01-v2.json`](results/turboquant-rtx3090-2026-04-01-v2.json)
+ Raw data: [`results/turboquant-3090-all-runs-2026-04.json`](results/turboquant-3090-all-runs-2026-04.json) (all 4 runs)
 
  ### RTX 4070 Laptop (8 GB) — Llama-3.1-8B-Instruct Q4_K_M
 
@@ -83,10 +80,10 @@ Raw data: [`results/turboquant-4070-results-2026-04-01.json`](results/turboquant
 
  ### Cross-GPU Summary
 
- | GPU | VRAM | Model | Max Context (turbo3) | Speed Loss |
- |-----|------|-------|---------------------|-----------|
- | RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −8.3% |
- | RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −3.2% |
+ | GPU | VRAM | Model | Max Context (turbo3) | Speed Loss | Runs |
+ |-----|------|-------|---------------------|-----------|------|
+ | RTX 3090 | 24 GB | Mistral-Small-3.2 24B | 100,000 tokens | −7.5% | 4 |
+ | RTX 4070 Laptop | 8 GB | Llama-3.1 8B | 64,000 tokens | −4.6% | 2 |
 
  TurboQuant scales with the GPU: the principle (+7-12× context, minimal speed loss) holds across hardware classes.
 
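The headline figures in the updated RTX 3090 table can be cross-checked with simple arithmetic. The snippet below is a minimal sketch that uses only the rounded values shown in the table; `pct_change` is a hypothetical helper introduced for illustration and is not part of the repository.

```python
# Cross-check the README headline figures from the rounded table values.
# `pct_change` is a hypothetical helper, not something the repository ships.

def pct_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Context window: turbo3 context divided by the f16 baseline context.
context_multiplier = 100_000 / 8_192

# VRAM and throughput deltas, using the GB and tokens/s values from the table.
vram_pct = pct_change(15.3, 17.1)
tps_pct = pct_change(51.0, 47.2)

print(f"context: {context_multiplier:.1f}x")
print(f"VRAM:    {vram_pct:+.1f}%")
print(f"speed:   {tps_pct:+.1f}%")
```

Because the inputs are already rounded, the results can differ in the last digit from figures computed on the raw measurements in the JSON file.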
results/turboquant-3090-all-runs-2026-04.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "date": "2026-04-01 / 2026-04-02",
+   "hardware": {
+     "gpu": "NVIDIA GeForce RTX 3090",
+     "vram_gb": 24,
+     "node": ".90 (win-pc-gpu)"
+   },
+   "model": "Mistral-Small-3.2-24B-Instruct-2506-Q4_K_M",
+   "model_size_gb": 13.3,
+   "total_runs": 4,
+   "total_measurements": {
+     "baseline": 8,
+     "turboquant": 7
+   },
+   "runs": [
+     {
+       "run": 1,
+       "note": "cold",
+       "baseline_tps": 49.2,
+       "baseline_vram_mb": 15408,
+       "turbo3_tps": 45.0,
+       "turbo3_vram_mb": 17224
+     },
+     {
+       "run": 2,
+       "note": "warm",
+       "baseline_tps": 51.2,
+       "baseline_vram_mb": 15695,
+       "turbo3_tps": 47.1,
+       "turbo3_vram_mb": 17581
+     },
+     {
+       "run": 3,
+       "note": "fresh session 2026-04-02",
+       "baseline_tps_runs": [
+         51.39,
+         51.47,
+         51.29
+       ],
+       "baseline_tps_avg": 51.38,
+       "baseline_vram_mb": 15758,
+       "turbo3_tps_runs": [
+         48.44,
+         47.18,
+         48.4
+       ],
+       "turbo3_tps_avg": 48.01,
+       "turbo3_vram_mb": 17644
+     },
+     {
+       "run": 4,
+       "note": "fresh session 2026-04-02",
+       "baseline_tps_runs": [
+         51.2,
+         51.15,
+         50.92
+       ],
+       "baseline_tps_avg": 51.09,
+       "baseline_vram_mb": 15758,
+       "turbo3_tps_runs": [
+         47.13,
+         46.81
+       ],
+       "turbo3_tps_avg": 46.97,
+       "turbo3_vram_mb": 17644
+     }
+   ],
+   "summary": {
+     "ctx": {
+       "baseline": 8192,
+       "turboquant": 100000,
+       "multiplier": 12.2
+     },
+     "baseline_tps_avg": 50.98,
+     "turbo3_tps_avg": 47.15,
+     "tps_delta_pct": -7.5,
+     "baseline_vram_mb_avg": 15655,
+     "turbo3_vram_mb_avg": 17523,
+     "vram_delta_gb": 1.82
+   }
+ }
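The `summary` block appears to be derived from the per-run values in `runs`. The sketch below recomputes it under two assumptions that are not stated in the commit: TPS averages pool every individual measurement (8 baseline, 7 turbo3) rather than averaging per-run means, and the VRAM delta is converted with 1 GB = 1024 MB. The measurements are copied inline from the file above; `summarize` and the normalized list layout are hypothetical and used only for illustration.

```python
from statistics import mean

# Measurements copied from the "runs" array of the results file.
# Runs 1 and 2 report a single TPS value each; runs 3 and 4 report several.
# Single values are wrapped in lists here so all runs share one layout.
runs = [
    {"baseline_tps": [49.2], "turbo3_tps": [45.0],
     "baseline_vram_mb": 15408, "turbo3_vram_mb": 17224},
    {"baseline_tps": [51.2], "turbo3_tps": [47.1],
     "baseline_vram_mb": 15695, "turbo3_vram_mb": 17581},
    {"baseline_tps": [51.39, 51.47, 51.29], "turbo3_tps": [48.44, 47.18, 48.4],
     "baseline_vram_mb": 15758, "turbo3_vram_mb": 17644},
    {"baseline_tps": [51.2, 51.15, 50.92], "turbo3_tps": [47.13, 46.81],
     "baseline_vram_mb": 15758, "turbo3_vram_mb": 17644},
]


def summarize(runs):
    """Hypothetical helper: pool all TPS measurements and average VRAM per run."""
    base = [t for r in runs for t in r["baseline_tps"]]
    turbo = [t for r in runs for t in r["turbo3_tps"]]
    base_avg = mean(base)
    turbo_avg = mean(turbo)
    base_vram = mean(r["baseline_vram_mb"] for r in runs)
    turbo_vram = mean(r["turbo3_vram_mb"] for r in runs)
    return {
        "n_baseline": len(base),
        "n_turbo3": len(turbo),
        "baseline_tps_avg": base_avg,
        "turbo3_tps_avg": turbo_avg,
        "tps_delta_pct": (turbo_avg - base_avg) / base_avg * 100.0,
        "baseline_vram_mb_avg": base_vram,
        "turbo3_vram_mb_avg": turbo_vram,
        # Assumption: 1 GB = 1024 MB.
        "vram_delta_gb": (turbo_vram - base_vram) / 1024.0,
    }


summary = summarize(runs)
for key, value in summary.items():
    print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

Under these assumptions the recomputed values agree with the committed summary after rounding. Averaging the four per-run means instead gives a lower baseline figure (about 50.72 TPS), which suggests the pooled average is what the file reports.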