tompoper/cflow

@@ -29,10 +29,10 @@ model-index:
           - name: Top-1 Accuracy (114M, 10K steps)
             type: accuracy
             value: 56.8
-          - name: Val Perplexity (8.34B, 10K steps)
             type: perplexity
             value: 4.52
-          - name: Top-1 Accuracy (8.34B, 10K steps)
             type: accuracy
             value: 61.4
 ---
@@ -44,27 +44,36 @@ designed so its inter-layer dependency graph permits vertical pipelining on CPU.
 Part of the **cflow** project — a CPU-first streaming inference engine written in
 Rust.
 ## Key Results
 | Metric | Value |
 |---|---|
-| CPU decode throughput (8.34B, Q4, 32 threads) | **5.94 tok/s** |
 | Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
 | Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
 | Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
-| Test perplexity (8.34B, TinyStories, 10K steps) | 4.52 |
 ### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
 | Engine | Model | Quant | tok/s |
 |---|---|---|---|
-| **cflow** | arch2_4_8k_16l (8.34B MoE, ~3–4B active) | Q4 | **5.94** |
 | Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
 | vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
-> **Note:** cflow and the baselines run different models — cflow's 8.34B MoE has
-> ~3–4B active params per token vs 32B dense. The cflow number shows what a
-> co-designed architecture + streaming runtime achieves.
 ## Model Description
@@ -82,7 +91,7 @@ maintaining competitive perplexity.
 ### Architecture Details
-| Parameter | 114M (screening) | 8.34B (scaled) |
 |---|---|---|
 | Hidden dim | 512 | 8,192 |
 | Layers | 6 | 16 |
@@ -135,7 +144,12 @@ expert_out = moe(ffn_norm(x))                  # router sees CURRENT residual
 | Precision | float32 |
 | Hardware | RTX 3060 12 GB |
-### 8.34B Scale-Up
 | | |
 |---|---|

           - name: Top-1 Accuracy (114M, 10K steps)
             type: accuracy
             value: 56.8
+          - name: Val Perplexity (8.34B / 4-layer, 10K steps)
             type: perplexity
             value: 4.52
+          - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
             type: accuracy
             value: 61.4
 ---
 Part of the **cflow** project — a CPU-first streaming inference engine written in
 Rust.
+> **Hosted weights:** this repository hosts `model.cflow` (17.39 GB), the
+> **arch2_4_8k_16l** model: 16 layers, hidden dim 8,192, **~31B parameters**
+> (top-2-of-8 MoE, ~20B active per token), quantized to Q4. This is the model
+> benchmarked at 5.94 tok/s below. The **8.34B** figures in this card refer to a
+> *smaller 4-layer scale point* (`arch2_4_8k_4l`) used for quality and
+> cache-locality validation (val ppl 4.52); that checkpoint is not hosted here.
 ## Key Results
 | Metric | Value |
 |---|---|
+| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
 | Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
 | Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
 | Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
+| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
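
The bandwidth and throughput rows above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 61 GB/s and 5.94 tok/s figures come from the same decode run, which the card does not state explicitly.

```python
# Figures copied from the Key Results table; helper names are illustrative.
PEAK_BW_GBPS = 204.8       # stated peak memory bandwidth (GB/s)
EFFECTIVE_BW_GBPS = 61.0   # stated effective memory bandwidth (GB/s)
DECODE_TOK_PER_S = 5.94    # stated CPU decode throughput (tok/s)


def utilization(effective_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak memory bandwidth actually used."""
    return effective_gbps / peak_gbps


def implied_gb_per_token(effective_gbps: float, tok_per_s: float) -> float:
    """Memory traffic per generated token implied by the two rates.

    Assumption: both rates were measured in the same run.
    """
    return effective_gbps / tok_per_s


if __name__ == "__main__":
    util = utilization(EFFECTIVE_BW_GBPS, PEAK_BW_GBPS)
    per_tok = implied_gb_per_token(EFFECTIVE_BW_GBPS, DECODE_TOK_PER_S)
    print(f"bandwidth utilization: {util:.1%}")
    print(f"implied traffic per token: {per_tok:.2f} GB")
```

Under that assumption, the utilization matches the roughly 30% quoted in the table, and the implied per-token traffic (about 10 GB) is the same order of magnitude as the Q4 footprint of the ~20B active parameters.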
 ### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
 | Engine | Model | Quant | tok/s |
 |---|---|---|---|
+| **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
 | Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
 | vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
+> **Note:** cflow and the baselines run different models. cflow's ~31B MoE
+> activates ~20B params per token, while the baselines are 32B dense. Total
+> parameter counts are comparable (31B vs 32B), but the architectures and
+> training differ, so the cflow number shows what a co-designed architecture and
+> streaming runtime achieve together. It is not a quality-matched comparison.
 ## Model Description
 ### Architecture Details
+| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
 |---|---|---|
 | Hidden dim | 512 | 8,192 |
 | Layers | 6 | 16 |
 | Precision | float32 |
 | Hardware | RTX 3060 12 GB |
+### 8.34B Scale-Up (4-layer — quality & cache validation)
+This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
+provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
+result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
+per-layer geometry but has 16 layers.
 | | |
 |---|---|