---
license: mit
language:
- en
library_name: cflow
tags:
- moe
- cpu-inference
- rust
- custom-architecture
- pipeline-native
- avx-512
datasets:
- roneneldan/TinyStories
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
model-index:
- name: arch2_4_combined
  results:
  - task:
      type: text-generation
    dataset:
      name: TinyStories
      type: roneneldan/TinyStories
    metrics:
    - name: Test Perplexity (114M, 10K steps)
      type: perplexity
      value: 6.50
    - name: Top-1 Accuracy (114M, 10K steps)
      type: accuracy
      value: 56.8
    - name: Val Perplexity (8.34B / 4-layer, 10K steps)
      type: perplexity
      value: 4.52
    - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
      type: accuracy
      value: 61.4
---

# arch2_4_combined: Pipeline-Native MoE for CPU Inference

A custom decoder-only transformer with delayed dense FFN + delayed MoE experts,
designed so its inter-layer dependency graph permits vertical pipelining on CPU.
Part of the **cflow** project, a CPU-first streaming inference engine written in
Rust.

> **Hosted weights:** this repository hosts `model.cflow` (17.39 GB), the
> **arch2_4_8k_16l** model: 16 layers, hidden 8192, **~31B parameters**
> (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
> 5.94 tok/s below. The **8.34B** figures in this card refer to a *smaller
> 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
> validation (val ppl 4.52); that checkpoint is not hosted here.

## Key Results

| Metric | Value |
|---|---|
| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
| Bandwidth reduction from pipelining | **2.00x** (9.00 → 4.50 MB/token) |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |

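As a quick sanity check, the derived figures in the Key Results table can be recomputed from the reported ones. The sketch below uses only numbers from this card. The variable names are illustrative, and the per-token traffic is simply effective bandwidth divided by throughput, not an independently measured value.

```python
# Reported figures from the Key Results table.
tok_per_s = 5.94
effective_bw_gbs = 61.0
peak_bw_gbs = 204.8
baseline_mb, pipelined_mb = 9.00, 4.50

utilization = effective_bw_gbs / peak_bw_gbs   # fraction of peak DRAM bandwidth
gb_per_token = effective_bw_gbs / tok_per_s    # implied memory traffic per decoded token
reduction = baseline_mb / pipelined_mb         # pipelining ratio quoted in the table

print(f"bandwidth utilization: {utilization:.1%}")
print(f"implied traffic per token: {gb_per_token:.2f} GB")
print(f"pipelining reduction: {reduction:.2f}x")
```

The utilization comes out at about 29.8%, which the table rounds to 30%, and the implied traffic is roughly 10.27 GB per decoded token.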
### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

| Engine | Model | Quant | tok/s |
|---|---|---|---|
| **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |

> **Note:** cflow and the baselines run different models: cflow's ~31B MoE has
> ~20B active params per token vs 32B dense. The total parameter counts are
> comparable (31B vs 32B), but the architectures and training differ, so the
> cflow number shows what a co-designed architecture + streaming runtime achieves,
> not a quality-matched result.

## Model Description

**arch2_4_combined** is a pre-norm decoder-only transformer with a parallel dense
FFN + sparse MoE block per layer, using delayed residual injection:

- The **dense FFN** reads from a delayed residual (1 layer behind)
- The **MoE experts** are routed on the current residual but injected 2 layers later
- This creates a dependency DAG where dense and expert weight reads for layer N
  can overlap with compute for layer N-1, reducing critical-path memory bandwidth

The architecture was selected from a screen of 5 pipeline-native candidates. It
is the only one with a measured bandwidth reduction (2.00x), and its perplexity
remains competitive (test PPL 6.50 vs. 6.26 for the best candidate).

### Architecture Details

| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
|---|---|---|
| Hidden dim | 512 | 8,192 |
| Layers | 6 | 16 |
| Attention heads | 8 | 128 |
| Head dim | 64 | 64 |
| Dense FFN hidden | 2,048 | 32,768 |
| Expert FFN hidden | 512 | 4,096 |
| Experts / top-k | 8 / 2 | 8 / 2 |
| Dense delay | 1 | 1 |
| Expert delay | 2 | 2 |
| Vocab | 50,257 (GPT-2 BPE) | 50,257 (GPT-2 BPE) |
| Max seq len | 512 | 2,048 |

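The headline parameter counts can be reproduced from the table above. The sketch below is a rough tally under assumptions the card does not state: untied input and output embeddings, no biases anywhere, and one weight vector per RMSNorm. The function name and the choice to exclude embeddings from the "active" count are illustrative.

```python
def count_params(hidden, layers, dense_ffn, expert_ffn, n_experts, top_k,
                 vocab=50_257):
    """Return (total, active_per_token) parameter counts.

    ASSUMPTIONS (not stated in the card): untied embeddings, no biases,
    one RMSNorm weight vector per norm. "Active" excludes embedding tables.
    """
    attn = 4 * hidden * hidden          # Q, K, V, O projections
    dense = 3 * hidden * dense_ffn      # GeGLU: gate, up, down
    expert = 3 * hidden * expert_ffn    # one GeGLU expert
    router = hidden * n_experts         # linear router
    norms = 2 * hidden                  # attn_norm + ffn_norm
    shared = attn + dense + router + norms
    embed = vocab * hidden
    total = layers * (shared + n_experts * expert) + 2 * embed + hidden
    active = layers * (shared + top_k * expert)
    return total, active

small, _ = count_params(512, 6, 2_048, 512, 8, 2)
mid, _ = count_params(8_192, 4, 32_768, 4_096, 8, 2)
big, big_active = count_params(8_192, 16, 32_768, 4_096, 8, 2)
print(f"{small / 1e6:.1f}M  {mid / 1e9:.2f}B  "
      f"{big / 1e9:.2f}B total, {big_active / 1e9:.2f}B active")
```

Under these assumptions the totals come out at about 114.4M, 8.34B and 30.89B, with about 20.40B active per token for the 16-layer model, in line with the figures quoted in this card.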
### Per-Layer Forward Pass

```
attn_out = attention(attn_norm(x))
x = x + attn_out                              # residual connection
x = x + dense_ffn(ffn_norm(delayed_x))        # dense reads DELAYED residual
if queued_expert: x = x + queued_expert       # inject expert from 2 layers ago
expert_out = moe(ffn_norm(x))                 # router sees CURRENT residual
# expert_out queued for injection at layer + expert_delay
```

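To make the delay schedule concrete, here is a minimal runnable sketch of the per-layer pass using toy scalar sub-blocks. It is not the reference implementation: `run_layers`, `snapshots`, and `pending` are hypothetical names, norms are omitted, and two behaviours are assumptions rather than facts from this card. Those are that the delayed residual is the snapshot taken at layer entry (falling back to the current residual in the first layer), and that expert outputs still queued after the last layer are left unused.

```python
from collections import deque

def run_layers(x, n_layers, attention, dense_ffn, moe,
               dense_delay=1, expert_delay=2):
    """Toy scalar walk through the delayed schedule (norms omitted)."""
    # Residual snapshots taken at layer entry; once enough layers have run,
    # the oldest retained snapshot is `dense_delay` layers old.
    # ASSUMPTION: layers with no history yet read the current residual.
    snapshots = deque(maxlen=dense_delay + 1)
    pending = {}   # target layer -> (source layer, expert output)
    trace = []     # (source layer, layer where its expert output was injected)
    for layer in range(n_layers):
        snapshots.append(x)
        delayed_x = snapshots[0]
        x = x + attention(x)
        x = x + dense_ffn(delayed_x)            # dense reads the delayed residual
        if layer in pending:
            source, queued_expert = pending.pop(layer)
            x = x + queued_expert               # inject experts routed earlier
            trace.append((source, layer))
        pending[layer + expert_delay] = (layer, moe(x))  # routed on current x
    # ASSUMPTION: outputs still queued after the last layer are unused here.
    return x, trace

seen = []
def recording_dense(v):
    seen.append(v)   # remember which residual value the dense FFN read
    return 0.0

_, trace = run_layers(0.0, 4, attention=lambda v: 1.0,
                      dense_ffn=recording_dense, moe=lambda v: 0.0)
print(trace)   # which layer's expert output landed at which layer
print(seen)    # residual value each layer's dense FFN read
```

With four layers, the trace shows layer 0's expert output arriving at layer 2 and layer 1's at layer 3, while each dense FFN after the first reads the residual from one layer earlier.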
### Components

- **Attention:** Multi-head (not GQA), Q/K/V/O projections (no bias), standard
  RoPE (base=10000, half-interleave), causal masking, KV cache
- **Dense FFN:** GeGLU, `down(gelu(gate(x)) * up(x))`
- **MoE:** Linear router → top-k selection → softmax over selected → per-expert
  GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
- **Normalization:** RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
- **Combine style:** `DelayedSum`: dense and router share `ffn_norm` but read
  different residual snapshots

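The MoE and GeGLU bullets above can be sketched in a few lines of plain Python. This is an illustration rather than the project's code: the function names are made up, vectors are plain lists, and the exact (erf-based) GELU is an assumption, since the card does not say which GELU variant is used.

```python
import math

def gelu(v):
    # ASSUMPTION: exact erf-based GELU; the card does not name the variant.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def geglu(x, gate, up, down):
    """down(gelu(gate(x)) * up(x)) with weight matrices given as row lists."""
    h = [gelu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return matvec(down, h)

def route_top_k(logits, k=2):
    """Pick the k largest router logits, then softmax over only those."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]

def moe_forward(x, logits, experts, k=2):
    """Weighted sum of the selected experts' outputs."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy usage: expert c just scales its input by c.
experts = [lambda v, c=c: [c * t for t in v] for c in range(4)]
routes = route_top_k([0.0, 2.0, 1.0, -1.0])
mixed = moe_forward([1.0], [0.0, 2.0, 1.0, -1.0], experts)
print(routes, mixed)
```

Because the softmax runs over the selected logits only, the two routing weights always sum to 1 regardless of how the unselected experts scored.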
## Training

### 114M Screening (5 architectures)

| Setting | Value |
|---|---|
| Dataset | TinyStories (431M train tokens, 24M test tokens) |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
| Sequence length | 512 |
| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1) |
| Learning rate | 3e-4 with linear warmup (200 steps) + cosine decay to 1e-5 |
| Gradient clipping | Global norm 1.0 |
| Batch size | 8 |
| Steps | 10,000 |
| Precision | float32 |
| Hardware | RTX 3060 12 GB |

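The learning-rate row above describes a linear warmup followed by cosine decay. A minimal sketch of such a schedule is shown below. The function name `lr_at` is hypothetical, and two details are assumptions: the warmup starts from zero, and the cosine phase spans all remaining steps up to the final one.

```python
import math

def lr_at(step, peak=3e-4, floor=1e-5, warmup=200, total=10_000):
    """Linear warmup to `peak`, then cosine decay to `floor`.

    ASSUMPTIONS: warmup starts at 0 and the cosine phase covers
    steps `warmup`..`total`; the card only gives the endpoints.
    """
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

for s in (0, 100, 200, 5_100, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

The 8.34B run's schedule has the same shape and would correspond to `lr_at(step, peak=1e-4, floor=1e-6, warmup=500)` under the same assumptions.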
### 8.34B Scale-Up (4-layer, quality & cache validation)

This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
per-layer geometry but has 16 layers.

| Setting | Value |
|---|---|
| Dataset | TinyStories (same splits) |
| Optimizer | 8-bit AdamW (bitsandbytes) |
| Learning rate | 1e-4 with linear warmup (500 steps) + cosine decay to 1e-6 |
| Batch size | 4 per GPU (global 32) |
| Steps | 10,000 |
| Precision | bf16 |
| Parallelism | FSDP (FULL_SHARD / ZeRO-3) |
| Gradient checkpointing | Per `DelayedMoELayer`, non-reentrant |
| Hardware | 8x A100 SXM4 80 GB (Lambda Cloud) |

### Architecture Comparison (114M, TinyStories, 10K steps)

| Architecture | dense_delay | expert_delay | Test PPL | Top-1 Acc | BW Reduction |
|---|---|---|---|---|---|
| arch1_decoupled_streams | 0 | 0 | 7.21 | 54.9% | 1.00x |
| **arch2_4_combined** | **1** | **2** | **6.50** | **56.8%** | **2.00x** |
| arch3_pipeline_registers | 0 | 0 | 7.24 | 55.1% | 1.00x |
| arch4_async_experts | 0 | 2 | **6.26** | **57.6%** | 1.00x |
| arch5_fixed_point | 0 | 0 | 6.77 | 56.2% | 1.00x |

**Key insight:** Dense delay is the bandwidth knob; expert delay is the quality
knob. arch4_async_experts gets the best perplexity by routing off pre-dense
activations (cleaner router signal) but sacrifices the bandwidth win that
arch2_4 achieves by also delaying the dense read.

## Inference with cflow

cflow is a Rust inference engine that reads `.cflow` (per-layer streaming) or
`.vflow` (vertical pipeline) weight files. Weights are stored as pre-tiled Q4
(128x256 tiles, ~18 KB each, sized to fit L2 cache).

```bash
# Build
cargo build --release --bin cflow-run

# Convert safetensors → .cflow
cargo run --release --bin cflow-convert -- \
  --input checkpoint.safetensors \
  --output model.cflow \
  --model arch2_4

# Run inference
CFLOW_THREADS=32 ./target/release/cflow-run \
  model.cflow 32 \
  --prompt "Once upon a time" \
  --tokenizer tokenizer.json \
  --temperature 0.8
```

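The ~18 KB tile size quoted for the pre-tiled Q4 layout can be checked with a little arithmetic. The sketch below assumes a block-quantized format with one 2-byte scale per 32 weights. That block size and scale width are guesses chosen because they reproduce the quoted size; the card itself only gives the tile shape and the approximate total.

```python
def q4_tile_bytes(rows=128, cols=256, block=32, scale_bytes=2):
    """Return (packed_weight_bytes, scale_bytes_total) for one Q4 tile.

    ASSUMPTION: `block` and `scale_bytes` are illustrative; the card only
    states the 128x256 tile shape and the ~18 KB total.
    """
    n = rows * cols
    packed = n // 2                       # two 4-bit weights per byte
    scales = (n // block) * scale_bytes   # one scale per quantization block
    return packed, scales

packed, scales = q4_tile_bytes()
total_kib = (packed + scales) / 1024
bits_per_weight = 8 * (packed + scales) / (128 * 256)
print(packed, scales, total_kib, bits_per_weight)
```

Under that assumption a tile is 16 KiB of packed weights plus 2 KiB of scales, i.e. 18 KiB at 4.5 bits per weight. Applied to roughly 31B parameters, the same rate gives about 17.4 GB, which is close to the 17.39 GB size of the hosted `model.cflow`.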
### SIMD Support

The runtime auto-detects and dispatches to the best available instruction set:

| ISA | Kernel | Notes |
|---|---|---|
| AVX-512 + VNNI | Q4×Q8 `vpdpbusd` | Best path (Ice Lake+) |
| AVX-512F | Q4×f32 FMA | Skylake-X+ |
| AVX2 + FMA | Q4×f32 FMA | Haswell+ |
| AVX + SSE4.1 | Q4×f32 | Sandy Bridge+ |
| Scalar | Q4×f32 | Fallback |

## Limitations

- **Not a general-purpose LLM.** Trained on TinyStories / FineWeb-Edu subsets at
  10K steps. This is an architecture and runtime research artifact, not a
  production language model.
- **Custom architecture.** Cannot be loaded in Hugging Face Transformers, vLLM,
  or llama.cpp without adaptation. Requires the cflow Rust runtime or the
  PyTorch reference in `pipeline_native/`.
- **CPU-only.** The runtime targets x86-64 CPUs and is tuned for AVX2 and
  AVX-512; the older AVX and scalar paths exist only as fallbacks. No GPU
  backend.
- **Single-token decode optimized.** Batch/prefill throughput is not the focus.

## Thesis Scorecard

The cflow project tests 8 claims about CPU inference optimization:

| # | Claim | Result |
|---|---|---|
| 1 | Conditional expert reading (top-k only) | **Proven** |
| 2 | Tile-streaming L1/L2 cache locality | **Proven** (7.29x fewer L1-d misses, PMU-measured) |
| 3 | AVX2/AVX-512 Q4 SIMD kernels | **Proven** |
| 4 | Fused QKV and gate+up projections | **Proven** |
| 5 | Compute-order file layout | **Proven** |
| 6 | Software prefetch (`_mm_prefetch`) | **Disproven** (no benefit; slightly harmful) |
| 7 | Vertical pipeline via delayed dependencies | **Validated** (2.00x bandwidth reduction) |
| 8 | Stage-major disk layout readahead | **Disproven** (no isolated benefit) |

## Citation

```bibtex
@software{poperszky2026cflow,
  author = {Poperszky, Tom},
  title = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
  year = {2026}
}
```

## License

MIT