tompoper committed on
Commit 3f24d2b · verified · 1 Parent(s): d189fdd

Fix model size: hosted model is arch2_4_8k_16l (~31B/16L), not 8.34B (that is the 4-layer scale point)

Files changed (1)
  1. README.md +24 -10
README.md CHANGED
@@ -29,10 +29,10 @@ model-index:
29
  - name: Top-1 Accuracy (114M, 10K steps)
30
  type: accuracy
31
  value: 56.8
32
- - name: Val Perplexity (8.34B, 10K steps)
33
  type: perplexity
34
  value: 4.52
35
- - name: Top-1 Accuracy (8.34B, 10K steps)
36
  type: accuracy
37
  value: 61.4
38
  ---
@@ -44,27 +44,36 @@ designed so its inter-layer dependency graph permits vertical pipelining on CPU.
44
  Part of the **cflow** project — a CPU-first streaming inference engine written in
45
  Rust.
46
 
47
  ## Key Results
48
 
49
  | Metric | Value |
50
  |---|---|
51
- | CPU decode throughput (8.34B, Q4, 32 threads) | **5.94 tok/s** |
52
  | Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
53
  | Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
54
  | Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
55
- | Test perplexity (8.34B, TinyStories, 10K steps) | 4.52 |
56
 
57
  ### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
58
 
59
  | Engine | Model | Quant | tok/s |
60
  |---|---|---|---|
61
- | **cflow** | arch2_4_8k_16l (8.34B MoE, ~3–4B active) | Q4 | **5.94** |
62
  | Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
63
  | vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
64
 
65
- > **Note:** cflow and the baselines run different models — cflow's 8.34B MoE has
66
- > ~3–4B active params per token vs 32B dense. The cflow number shows what a
67
- > co-designed architecture + streaming runtime achieves.
68
 
69
  ## Model Description
70
 
@@ -82,7 +91,7 @@ maintaining competitive perplexity.
82
 
83
  ### Architecture Details
84
 
85
- | Parameter | 114M (screening) | 8.34B (scaled) |
86
  |---|---|---|
87
  | Hidden dim | 512 | 8,192 |
88
  | Layers | 6 | 16 |
@@ -135,7 +144,12 @@ expert_out = moe(ffn_norm(x)) # router sees CURRENT residual
135
  | Precision | float32 |
136
  | Hardware | RTX 3060 12 GB |
137
 
138
- ### 8.34B Scale-Up
139
 
140
  | | |
141
  |---|---|
 
29
  - name: Top-1 Accuracy (114M, 10K steps)
30
  type: accuracy
31
  value: 56.8
32
+ - name: Val Perplexity (8.34B / 4-layer, 10K steps)
33
  type: perplexity
34
  value: 4.52
35
+ - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
36
  type: accuracy
37
  value: 61.4
38
  ---
 
44
  Part of the **cflow** project — a CPU-first streaming inference engine written in
45
  Rust.
46
 
47
+ > **Hosted weights:** this repository hosts `model.cflow` (17.39 GB) — the
48
+ > **arch2_4_8k_16l** model: 16 layers, hidden 8192, **~31B parameters**
49
+ > (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
50
+ > 5.94 tok/s below. The **8.34B** figures in this card refer to a *smaller
51
+ > 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
52
+ > validation (val ppl 4.52); that checkpoint is not hosted here.
53
+
54
  ## Key Results
55
 
56
  | Metric | Value |
57
  |---|---|
58
+ | CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
59
  | Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
60
  | Bandwidth reduction from pipelining | **2.00x** (16.50 → 4.50 MB/token) |
61
  | Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
62
+ | Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
63
 
64
  ### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)
65
 
66
  | Engine | Model | Quant | tok/s |
67
  |---|---|---|---|
68
+ | **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
69
  | Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
70
  | vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |
71
 
72
+ > **Note:** cflow and the baselines run different models — cflow's ~31B MoE has
73
+ > ~20B active params per token vs 32B dense. The total parameter counts are
74
+ > comparable (31B vs 32B), but the architectures and training differ, so the
75
+ > cflow number shows what a co-designed architecture + streaming runtime achieves,
76
+ > not a quality-matched result.
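
The headline figures in the revised card can be cross-checked with simple arithmetic. The sketch below is not part of the cflow project. It only recomputes ratios from numbers quoted in this README, and it assumes that GB means 10^9 bytes, that `model.cflow` is dominated by weights, and that the 61 GB/s bandwidth figure was measured on the same run as the 5.94 tok/s figure. The function names are hypothetical.

```python
# Figures quoted in the model card. All are approximate ("~31B", etc.).
FILE_SIZE_GB = 17.39       # size of model.cflow
TOTAL_PARAMS_B = 31.0      # "~31B parameters" for arch2_4_8k_16l
TOK_PER_S = 5.94           # cflow CPU decode throughput
EFFECTIVE_BW_GBPS = 61.0   # effective memory bandwidth
PEAK_BW_GBPS = 204.8       # quoted peak memory bandwidth


def bits_per_weight(file_gb: float, params_b: float) -> float:
    """Average storage cost per parameter, assuming the file is all weights."""
    return file_gb * 8.0 / params_b


def bandwidth_utilization(effective_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak memory bandwidth actually achieved."""
    return effective_gbps / peak_gbps


def gb_read_per_token(effective_gbps: float, tok_per_s: float) -> float:
    """Memory traffic per decoded token implied by bandwidth and throughput."""
    return effective_gbps / tok_per_s


if __name__ == "__main__":
    print(f"bits/weight : {bits_per_weight(FILE_SIZE_GB, TOTAL_PARAMS_B):.2f}")
    print(f"utilization : {bandwidth_utilization(EFFECTIVE_BW_GBPS, PEAK_BW_GBPS):.1%}")
    print(f"GB/token    : {gb_read_per_token(EFFECTIVE_BW_GBPS, TOK_PER_S):.2f}")
```

Under these assumptions the file size works out to roughly 4.5 bits per weight, which is plausible for a Q4 format that also stores per-block scales, and the bandwidth ratio matches the 30% quoted in the Key Results table. This supports the ~31B reading of the hosted checkpoint over the earlier 8.34B label.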
77
 
78
  ## Model Description
79
 
 
91
 
92
  ### Architecture Details
93
 
94
+ | Parameter | 114M (screening) | ~31B (16-layer, hosted) |
95
  |---|---|---|
96
  | Hidden dim | 512 | 8,192 |
97
  | Layers | 6 | 16 |
 
144
  | Precision | float32 |
145
  | Hardware | RTX 3060 12 GB |
146
 
147
+ ### 8.34B Scale-Up (4-layer — quality & cache validation)
148
+
149
+ This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
150
+ provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
151
+ result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
152
+ per-layer geometry but has 16 layers.
153
 
154
  | | |
155
  |---|---|