Fix model size: hosted model is arch2_4_8k_16l (~31B/16L), not 8.34B (that is the 4-layer scale point)
README.md
CHANGED
|
@@ -29,10 +29,10 @@ model-index:
|
|
| 29 |
- name: Top-1 Accuracy (114M, 10K steps)
|
| 30 |
type: accuracy
|
| 31 |
value: 56.8
|
| 32 |
-
- name: Val Perplexity (8.34B, 10K steps)
|
| 33 |
type: perplexity
|
| 34 |
value: 4.52
|
| 35 |
-
- name: Top-1 Accuracy (8.34B, 10K steps)
|
| 36 |
type: accuracy
|
| 37 |
value: 61.4
|
| 38 |
---
|
|
@@ -44,27 +44,36 @@ designed so its inter-layer dependency graph permits vertical pipelining on CPU.
|
|
| 44 |
Part of the **cflow** project — a CPU-first streaming inference engine written in
|
| 45 |
Rust.
|
| 46 |
|
| 47 |
## Key Results
|
| 48 |
|
| 49 |
| Metric | Value |
|
| 50 |
|---|---|
|
| 51 |
-
| CPU decode throughput (
|
| 52 |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
|
| 53 |
| Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
|
| 54 |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
|
| 55 |
-
|
|
| 56 |
|
| 57 |
### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
|
| 58 |
|
| 59 |
| Engine | Model | Quant | tok/s |
|
| 60 |
|---|---|---|---|
|
| 61 |
-
| **cflow** | arch2_4_8k_16l (
|
| 62 |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
|
| 63 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
|
| 64 |
|
| 65 |
-
> **Note:** cflow and the baselines run different models — cflow's
|
| 66 |
-
> ~
|
| 67 |
-
>
|
|
| 68 |
|
| 69 |
## Model Description
|
| 70 |
|
|
@@ -82,7 +91,7 @@ maintaining competitive perplexity.
|
|
| 82 |
|
| 83 |
### Architecture Details
|
| 84 |
|
| 85 |
-
| Parameter | 114M (screening) |
|
| 86 |
|---|---|---|
|
| 87 |
| Hidden dim | 512 | 8,192 |
|
| 88 |
| Layers | 6 | 16 |
|
|
@@ -135,7 +144,12 @@ expert_out = moe(ffn_norm(x)) # router sees CURRENT residual
|
|
| 135 |
| Precision | float32 |
|
| 136 |
| Hardware | RTX 3060 12 GB |
|
| 137 |
|
| 138 |
-
### 8.34B Scale-Up
|
|
| 139 |
|
| 140 |
| | |
|
| 141 |
|---|---|
|
|
|
|
| 29 |
- name: Top-1 Accuracy (114M, 10K steps)
|
| 30 |
type: accuracy
|
| 31 |
value: 56.8
|
| 32 |
+
- name: Val Perplexity (8.34B / 4-layer, 10K steps)
|
| 33 |
type: perplexity
|
| 34 |
value: 4.52
|
| 35 |
+
- name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
|
| 36 |
type: accuracy
|
| 37 |
value: 61.4
|
| 38 |
---
|
|
|
|
| 44 |
Part of the **cflow** project — a CPU-first streaming inference engine written in
|
| 45 |
Rust.
|
| 46 |
|
| 47 |
+
> **Hosted weights:** this repository hosts `model.cflow` (17.39 GB) — the
|
| 48 |
+
> **arch2_4_8k_16l** model: 16 layers, hidden 8192, **~31B parameters**
|
| 49 |
+
> (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
|
| 50 |
+
> 5.94 tok/s below. The **8.34B** figures in this card refer to a *smaller
|
| 51 |
+
> 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
|
| 52 |
+
> validation (val ppl 4.52); that checkpoint is not hosted here.
|
| 53 |
+
|
| 54 |
## Key Results
|
| 55 |
|
| 56 |
| Metric | Value |
|
| 57 |
|---|---|
|
| 58 |
+
| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
|
| 59 |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
|
| 60 |
| Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
|
| 61 |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
|
| 62 |
+
| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
|
| 63 |
|
| 64 |
### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
|
| 65 |
|
| 66 |
| Engine | Model | Quant | tok/s |
|
| 67 |
|---|---|---|---|
|
| 68 |
+
| **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
|
| 69 |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
|
| 70 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
|
| 71 |
|
| 72 |
+
> **Note:** cflow and the baselines run different models — cflow's ~31B MoE has
|
| 73 |
+
> ~20B active params per token vs 32B dense. The total parameter counts are
|
| 74 |
+
> comparable (31B vs 32B), but the architectures and training differ, so the
|
| 75 |
+
> cflow number shows what a co-designed architecture + streaming runtime achieves,
|
| 76 |
+
> not a quality-matched result.
|
| 77 |
|
| 78 |
## Model Description
|
| 79 |
|
|
|
|
| 91 |
|
| 92 |
### Architecture Details
|
| 93 |
|
| 94 |
+
| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
|
| 95 |
|---|---|---|
|
| 96 |
| Hidden dim | 512 | 8,192 |
|
| 97 |
| Layers | 6 | 16 |
|
|
|
|
| 144 |
| Precision | float32 |
|
| 145 |
| Hardware | RTX 3060 12 GB |
|
| 146 |
|
| 147 |
+
### 8.34B Scale-Up (4-layer — quality & cache validation)
|
| 148 |
+
|
| 149 |
+
This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
|
| 150 |
+
provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
|
| 151 |
+
result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
|
| 152 |
+
per-layer geometry but has 16 layers.
|
| 153 |
|
| 154 |
| | |
|
| 155 |
|---|---|
|
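
The figures this change introduces for the hosted model can be cross-checked with a little arithmetic. The sketch below is not part of the commit; it only reuses numbers quoted in the updated README (17.39 GB file, ~31B total and ~20B active parameters, 5.94 tok/s, 61 GB/s of 204.8 GB/s, 8.34B at 4 layers). It assumes that GB means 10^9 bytes, that parameter count grows linearly with layer count when per-layer geometry is shared, and that effective bandwidth equals bytes streamed per token times tokens per second. The README does not state any of these, and the function names are hypothetical.

```python
# Sanity checks on the figures quoted in the updated README.
# Assumptions (not stated in the README): GB = 10**9 bytes, parameter count
# is linear in layer count, and effective bandwidth = bytes/token * tok/s.

FILE_SIZE_GB = 17.39        # model.cflow
TOTAL_PARAMS = 31e9         # "~31B", rounded in the README
PARAMS_4_LAYER = 8.34e9     # arch2_4_8k_4l
TOK_PER_S = 5.94
EFFECTIVE_BW_GBPS = 61.0
PEAK_BW_GBPS = 204.8


def bits_per_weight(file_size_gb: float, n_params: float) -> float:
    """Average stored bits per parameter, including any quantization metadata."""
    return file_size_gb * 1e9 * 8 / n_params


def bandwidth_utilization(effective_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak memory bandwidth actually achieved."""
    return effective_gbps / peak_gbps


def implied_bytes_per_token(effective_gbps: float, tok_per_s: float) -> float:
    """Bytes read per decoded token, if bandwidth is spent only on decoding."""
    return effective_gbps * 1e9 / tok_per_s


def per_layer_params(p_small: float, l_small: int, p_large: float, l_large: int):
    """Fit params = fixed + layers * per_layer through two scale points."""
    per_layer = (p_large - p_small) / (l_large - l_small)
    fixed = p_small - l_small * per_layer
    return per_layer, fixed


if __name__ == "__main__":
    bpw = bits_per_weight(FILE_SIZE_GB, TOTAL_PARAMS)
    util = bandwidth_utilization(EFFECTIVE_BW_GBPS, PEAK_BW_GBPS)
    bpt = implied_bytes_per_token(EFFECTIVE_BW_GBPS, TOK_PER_S)
    per_layer, fixed = per_layer_params(PARAMS_4_LAYER, 4, TOTAL_PARAMS, 16)

    print(f"bits per weight:       {bpw:.2f}")
    print(f"bandwidth utilization: {util:.1%}")
    print(f"bytes per token:       {bpt / 1e9:.2f} GB")
    print(f"params per layer:      {per_layer / 1e9:.2f} B")
    print(f"non-layer params:      {fixed / 1e9:.2f} B")
```

Under these assumptions the file size works out to roughly 4.5 bits per parameter, which is in the range expected for a 4-bit format with per-block scales, and the implied traffic of about 10 GB per token is close to what ~20B active parameters at 4 bits would occupy. Because "~31B" is a rounded figure, the per-layer split is only a rough estimate.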