---
license: mit
language:
- en
library_name: cflow
tags:
- moe
- cpu-inference
- rust
- custom-architecture
- pipeline-native
- avx-512
datasets:
- roneneldan/TinyStories
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
model-index:
- name: arch2_4_combined
  results:
  - task:
      type: text-generation
    dataset:
      name: TinyStories
      type: roneneldan/TinyStories
    metrics:
    - name: Test Perplexity (114M, 10K steps)
      type: perplexity
      value: 6.50
    - name: Top-1 Accuracy (114M, 10K steps)
      type: accuracy
      value: 56.8
    - name: Val Perplexity (8.34B / 4-layer, 10K steps)
      type: perplexity
      value: 4.52
    - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
      type: accuracy
      value: 61.4
---

# arch2_4_combined: Pipeline-Native MoE for CPU Inference

A custom decoder-only transformer with delayed dense FFN + delayed MoE experts,
designed so its inter-layer dependency graph permits vertical pipelining on CPU.
Part of the **cflow** project, a CPU-first streaming inference engine written in Rust.

> **Hosted weights:** this repository hosts `model.cflow` (17.39 GB), the
> **arch2_4_8k_16l** model: 16 layers, hidden 8192, **~31B parameters**
> (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
> 5.94 tok/s below. The **8.34B** figures in this card refer to a *smaller
> 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
> validation (val ppl 4.52); that checkpoint is not hosted here.

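As a rough sanity check on the parameter counts quoted above, the sketch below recomputes them from the dimensions listed in the Architecture Details table. It is an estimate under stated assumptions, not the project's own accounting: `count_params` is a hypothetical helper, the input embedding and `lm_head` are assumed to be untied, the router is assumed to be bias-free, and each RMSNorm is assumed to hold one weight vector.

```python
def count_params(hidden, layers, dense_ffn, expert_ffn, n_experts, top_k, vocab):
    """Estimate (total, active-in-layers) parameter counts for one configuration."""
    attn = 4 * hidden * hidden        # Q, K, V, O projections, no bias
    dense = 3 * hidden * dense_ffn    # GeGLU: gate, up, down
    expert = 3 * hidden * expert_ffn  # one expert's GeGLU FFN
    router = hidden * n_experts       # ASSUMPTION: bias-free linear router
    norms = 2 * hidden                # attn_norm + ffn_norm weight vectors
    shared = attn + dense + router + norms
    # ASSUMPTION: untied input embedding and lm_head, plus the final norm.
    embed = 2 * vocab * hidden + hidden
    total = layers * (shared + n_experts * expert) + embed
    active_in_layers = layers * (shared + top_k * expert)  # excludes embeddings
    return total, active_in_layers


total_16l, active_16l = count_params(8192, 16, 32768, 4096, 8, 2, 50257)
total_4l, _ = count_params(8192, 4, 32768, 4096, 8, 2, 50257)
total_114m, _ = count_params(512, 6, 2048, 512, 8, 2, 50257)

print(f"16-layer: {total_16l / 1e9:.2f}B total, {active_16l / 1e9:.2f}B active in layers")
print(f"4-layer:  {total_4l / 1e9:.2f}B total")
print(f"114M:     {total_114m / 1e6:.1f}M total")
```

Under these assumptions the totals come to roughly 30.9B, 8.34B, and 114M, which lines up with the ~31B, 8.34B, and 114M figures in this card. The per-token active count inside the layers comes to about 20.4B, in line with the ~20B quoted.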
## Key Results

| Metric | Value |
|---|---|
| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
| Bandwidth reduction from pipelining | **2.00x** (9.00 → 4.50 MB/token) |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |

### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

| Engine | Model | Quant | tok/s |
|---|---|---|---|
| **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |

> **Note:** cflow and the baselines run different models: cflow's ~31B MoE has
> ~20B active params per token vs 32B dense. The total parameter counts are
> comparable (31B vs 32B), but the architectures and training differ, so the
> cflow number shows what a co-designed architecture + streaming runtime achieves,
> not a quality-matched result.

## Model Description

**arch2_4_combined** is a pre-norm decoder-only transformer with a parallel
dense FFN + sparse MoE block per layer, using delayed residual injection:

- The **dense FFN** reads from a delayed residual (1 layer behind)
- The **MoE experts** are routed on the current residual but injected 2 layers later
- This creates a dependency DAG where dense and expert weight reads for layer N
  can overlap with compute for layer N-1, reducing critical-path memory bandwidth

The architecture was selected from a screen of 5 pipeline-native candidates.
It is the only design that achieves a measured bandwidth reduction (2.00x)
while maintaining competitive perplexity.

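The delayed-injection scheme described above can be illustrated with a toy forward loop. This is a minimal sketch, not the cflow or `pipeline_native/` implementation: `delayed_forward` is a hypothetical name, the sub-blocks are passed in as plain callables with their norms folded in, and it assumes that "1 layer behind" means the residual at the input of the previous layer (clamped to the first layer's input). How expert outputs still in flight after the last layer are handled is not stated in this card, so the sketch simply returns them.

```python
from collections import deque


def delayed_forward(x, n_layers, attention, dense_ffn, moe,
                    dense_delay=1, expert_delay=2):
    """Toy forward pass with a delayed dense read and delayed expert injection."""
    assert dense_delay >= 0 and expert_delay >= 1
    # Rolling window of layer inputs; history[0] is the input of
    # layer (layer - dense_delay), clamped to the first layer's input (ASSUMPTION).
    history = deque([x] * (dense_delay + 1), maxlen=dense_delay + 1)
    pending = {}  # due layer index -> expert output waiting to be injected
    for layer in range(n_layers):
        history.append(x)
        delayed_x = history[0]
        x = x + attention(x)          # attention on the current residual
        x = x + dense_ffn(delayed_x)  # dense FFN reads the delayed residual
        if layer in pending:
            x = x + pending.pop(layer)  # inject expert output routed earlier
        pending[layer + expert_delay] = moe(x)  # router sees the current residual
    return x, pending


# Scalar stand-ins make the schedule easy to follow by hand.
out, in_flight = delayed_forward(
    1.0, 3,
    attention=lambda v: 0.0,
    dense_ffn=lambda v: 0.5 * v,
    moe=lambda v: 1.0,
)
print(out, in_flight)
```

Because `dense_ffn` and `moe` at layer N only need values that were already available at earlier layers, their weights can be fetched while layer N-1 is still computing, which is the overlap the dependency DAG is designed to expose.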
### Architecture Details

| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
|---|---|---|
| Hidden dim | 512 | 8,192 |
| Layers | 6 | 16 |
| Attention heads | 8 | 128 |
| Head dim | 64 | 64 |
| Dense FFN hidden | 2,048 | 32,768 |
| Expert FFN hidden | 512 | 4,096 |
| Experts / top-k | 8 / 2 | 8 / 2 |
| Dense delay | 1 | 1 |
| Expert delay | 2 | 2 |
| Vocab | 50,257 (GPT-2 BPE) | 50,257 (GPT-2 BPE) |
| Max seq len | 512 | 2,048 |

### Per-Layer Forward Pass

```
attn_out = attention(attn_norm(x))
x = x + attn_out                          # residual connection
x = x + dense_ffn(ffn_norm(delayed_x))    # dense reads DELAYED residual
if queued_expert:
    x = x + queued_expert                 # inject expert from 2 layers ago
expert_out = moe(ffn_norm(x))             # router sees CURRENT residual
# expert_out queued for injection at layer + expert_delay
```

### Components

- **Attention:** Multi-head (not GQA), Q/K/V/O projections (no bias), standard RoPE (base=10000, half-interleave), causal masking, KV cache
- **Dense FFN:** GeGLU, computed as `down(gelu(gate(x)) * up(x))`
- **MoE:** Linear router → top-k selection → softmax over selected → per-expert GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
- **Normalization:** RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
- **Combine style:** `DelayedSum`, in which dense and router share `ffn_norm` but read different residual snapshots

## Training

### 114M Screening (5 architectures)

| | |
|---|---|
| Dataset | TinyStories (431M train tokens, 24M test tokens) |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
| Sequence length | 512 |
| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1) |
| Learning rate | 3e-4 with linear warmup (200 steps) + cosine decay to 1e-5 |
| Gradient clipping | Global norm 1.0 |
| Batch size | 8 |
| Steps | 10,000 |
| Precision | float32 |
| Hardware | RTX 3060 12 GB |

### 8.34B Scale-Up (4-layer: quality & cache validation)

This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params.
It provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU
cache-locality result. The hosted decode-benchmark model (`arch2_4_8k_16l`,
~31B) shares this per-layer geometry but has 16 layers.

| | |
|---|---|
| Dataset | TinyStories (same splits) |
| Optimizer | 8-bit AdamW (bitsandbytes) |
| Learning rate | 1e-4 with linear warmup (500 steps) + cosine decay to 1e-6 |
| Batch size | 4 per GPU (global 32) |
| Steps | 10,000 |
| Precision | bf16 |
| Parallelism | FSDP (FULL_SHARD / ZeRO-3) |
| Gradient checkpointing | Per `DelayedMoELayer`, non-reentrant |
| Hardware | 8x A100 SXM4 80 GB (Lambda Cloud) |

### Architecture Comparison (114M, TinyStories, 10K steps)

| Architecture | dense_delay | expert_delay | Test PPL | Top-1 Acc | BW Reduction |
|---|---|---|---|---|---|
| arch1_decoupled_streams | 0 | 0 | 7.21 | 54.9% | 1.00x |
| **arch2_4_combined** | **1** | **2** | **6.50** | **56.8%** | **2.00x** |
| arch3_pipeline_registers | 0 | 0 | 7.24 | 55.1% | 1.00x |
| arch4_async_experts | 0 | 2 | **6.26** | **57.6%** | 1.00x |
| arch5_fixed_point | 0 | 0 | 6.77 | 56.2% | 1.00x |

**Key insight:** Dense delay is the bandwidth knob; expert delay is the quality
knob. arch4_async_experts gets the best perplexity by routing off pre-dense
activations (cleaner router signal) but sacrifices the bandwidth win that
arch2_4 achieves by also delaying the dense read.

## Inference with cflow

cflow is a Rust inference engine that reads `.cflow` (per-layer streaming) or
`.vflow` (vertical pipeline) weight files. Weights are stored as pre-tiled Q4
(128x256 tiles, ~18 KB each, sized to fit L2 cache).

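The ~18 KB tile size can be reproduced with a short calculation. The sketch below is illustrative only: `q4_tile_bytes` is a hypothetical helper, and the block layout (32 weights sharing one 16-bit scale, as in common Q4 block formats) is an assumption, since this card does not specify how cflow stores its scales.

```python
def q4_tile_bytes(rows=128, cols=256, group_size=32, scale_bytes=2):
    """Return (packed weight bytes, scale bytes, total bytes) for one Q4 tile."""
    weights = rows * cols
    packed = weights // 2  # two 4-bit weights per byte
    # ASSUMPTION: one scale of `scale_bytes` bytes per `group_size` weights.
    scales = (weights // group_size) * scale_bytes
    return packed, scales, packed + scales


packed, scales, total = q4_tile_bytes()
print(f"{packed} B weights + {scales} B scales = {total} B ({total / 1024:.0f} KiB)")
```

Under that assumed layout a 128x256 tile takes 16 KiB of packed weights plus 2 KiB of scales, 18 KiB in total, which matches the quoted figure and leaves room in a typical L2 cache for activations alongside the tile.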
```bash
# Build
cargo build --release --bin cflow-run

# Convert safetensors → .cflow
cargo run --release --bin cflow-convert -- \
  --input checkpoint.safetensors \
  --output model.cflow \
  --model arch2_4

# Run inference
CFLOW_THREADS=32 ./target/release/cflow-run \
  model.cflow 32 \
  --prompt "Once upon a time" \
  --tokenizer tokenizer.json \
  --temperature 0.8
```

### SIMD Support

The runtime auto-detects and dispatches to the best available instruction set:

| ISA | Kernel | Notes |
|---|---|---|
| AVX-512 + VNNI | Q4×Q8 `vpdpbusd` | Best path (Ice Lake+) |
| AVX-512F | Q4×f32 FMA | Skylake-X+ |
| AVX2 + FMA | Q4×f32 FMA | Haswell+ |
| AVX + SSE4.1 | Q4×f32 | Sandy Bridge+ |
| Scalar | Q4×f32 | Fallback |

## Limitations

- **Not a general-purpose LLM.** Trained on TinyStories / FineWeb-Edu subsets
  at 10K steps; this is an architecture and runtime research artifact, not a
  production language model.
- **Custom architecture.** Cannot be loaded in Hugging Face Transformers, vLLM,
  or llama.cpp without adaptation. Requires the cflow Rust runtime or the
  PyTorch reference in `pipeline_native/`.
- **CPU-only.** The runtime targets x86-64 CPUs with AVX2 or AVX-512. No GPU backend.
- **Single-token decode optimized.** Batch/prefill throughput is not the focus.

## Thesis Scorecard

The cflow project tests 8 claims about CPU inference optimization:

| # | Claim | Result |
|---|---|---|
| 1 | Conditional expert reading (top-k only) | **Proven** |
| 2 | Tile-streaming L1/L2 cache locality | **Proven** (7.29x fewer L1-d misses, PMU-measured) |
| 3 | AVX2/AVX-512 Q4 SIMD kernels | **Proven** |
| 4 | Fused QKV and gate+up projections | **Proven** |
| 5 | Compute-order file layout | **Proven** |
| 6 | Software prefetch (`_mm_prefetch`) | **Disproven** (no benefit; slightly harmful) |
| 7 | Vertical pipeline via delayed dependencies | **Validated** (2.00x bandwidth reduction) |
| 8 | Stage-major disk layout readahead | **Disproven** (no isolated benefit) |

## Citation

```bibtex
@software{poperszky2026cflow,
  author = {Poperszky, Tom},
  title  = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
  year   = {2026}
}
```

## License

MIT