	---
	license: mit
	language:
	- en
	library_name: cflow
	tags:
	- moe
	- cpu-inference
	- rust
	- custom-architecture
	- pipeline-native
	- avx-512
	datasets:
	- roneneldan/TinyStories
	- HuggingFaceFW/fineweb-edu
	pipeline_tag: text-generation
	model-index:
	- name: arch2_4_combined
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      name: TinyStories
	      type: roneneldan/TinyStories
	    metrics:
	    - name: Test Perplexity (114M, 10K steps)
	      type: perplexity
	      value: 6.50
	    - name: Top-1 Accuracy (114M, 10K steps)
	      type: accuracy
	      value: 56.8
	    - name: Val Perplexity (8.34B / 4-layer, 10K steps)
	      type: perplexity
	      value: 4.52
	    - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
	      type: accuracy
	      value: 61.4
	---

	# arch2_4_combined — Pipeline-Native MoE for CPU Inference

	A custom decoder-only transformer with delayed dense FFN + delayed MoE experts,
	designed so its inter-layer dependency graph permits vertical pipelining on CPU.
	Part of the cflow project — a CPU-first streaming inference engine written in
	Rust.

	> Hosted weights: this repository hosts `model.cflow` (17.39 GB) — the
	> arch2_4_8k_16l model: 16 layers, hidden 8192, ~31B parameters
	> (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
	> 5.94 tok/s below. The 8.34B figures in this card refer to a *smaller
	> 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
	> validation (val ppl 4.52); that checkpoint is not hosted here.

	## Key Results

	| Metric | Value |
	|---|---|
	| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | 5.94 tok/s |
	| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
	| Bandwidth reduction from pipelining | 2.00x (9.00 → 4.50 MB/token) |
	| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
	| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
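The bandwidth rows in the table above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 204.8 GB/s peak corresponds to eight DDR4-3200 channels with a 64-bit bus each; that channel configuration is an assumption made for illustration and is not stated in this card.

```python
# Assumed memory configuration (not stated in this card): 8 channels of DDR4-3200.
CHANNELS = 8
TRANSFERS_PER_SEC = 3200e6   # DDR4-3200: 3200 mega-transfers per second
BYTES_PER_TRANSFER = 8       # 64-bit channel

peak_gb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
effective_gb_s = 61.0        # measured value from the Key Results table
utilization = effective_gb_s / peak_gb_s

# Per-token traffic before and after pipelining, from the Key Results table.
reduction = 9.00 / 4.50

print(f"peak: {peak_gb_s:.1f} GB/s")
print(f"utilization: {utilization:.1%}")
print(f"bandwidth reduction: {reduction:.2f}x")
```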

	### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

	| Engine | Model | Quant | tok/s |
	|---|---|---|---|
	| cflow | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | 5.94 |
	| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
	| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |

	> Note: cflow and the baselines run different models. cflow's ~31B MoE
	> activates ~20B parameters per token, whereas the baselines run all 32B
	> parameters of a dense model. Total parameter counts are comparable (31B vs
	> 32B), but the architectures and training differ, so the cflow number shows
	> what a co-designed architecture and streaming runtime can achieve; it is not
	> a quality-matched comparison.

	## Model Description

	arch2_4_combined is a pre-norm decoder-only transformer with a parallel dense
	FFN + sparse MoE block per layer, using delayed residual injection:

	- The dense FFN reads from a delayed residual (1 layer behind)
	- The MoE experts are routed on the current residual but injected 2 layers later
	- This creates a dependency DAG where dense and expert weight reads for layer N
	can overlap with compute for layer N-1, reducing critical-path memory bandwidth

	The architecture was selected from a screen of 5 pipeline-native candidates. It
	is the only candidate with a measured bandwidth reduction (2.00x), and its test
	perplexity (6.50) is second only to arch4_async_experts (6.26).

	### Architecture Details

	| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
	|---|---|---|
	| Hidden dim | 512 | 8,192 |
	| Layers | 6 | 16 |
	| Attention heads | 8 | 128 |
	| Head dim | 64 | 64 |
	| Dense FFN hidden | 2,048 | 32,768 |
	| Expert FFN hidden | 512 | 4,096 |
	| Experts / top-k | 8 / 2 | 8 / 2 |
	| Dense delay | 1 | 1 |
	| Expert delay | 2 | 2 |
	| Vocab | 50,257 (GPT-2 BPE) | 50,257 (GPT-2 BPE) |
	| Max seq len | 512 | 2,048 |
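The headline parameter counts can be reproduced from the table. The sketch below is a back-of-the-envelope estimate, not the project's own accounting: it assumes untied input embedding and `lm_head` matrices, ignores the small RMSNorm weights, and counts "active" parameters as the per-layer weights a token touches (excluding embeddings). The function name `param_counts` is hypothetical.

```python
VOCAB = 50_257
NUM_EXPERTS = 8
TOP_K = 2

def param_counts(hidden, layers, dense_ffn, expert_ffn):
    """Return (total, active_per_token) weight counts under the stated assumptions."""
    attn = 4 * hidden * hidden        # Q, K, V, O projections, no bias
    dense = 3 * hidden * dense_ffn    # GeGLU: gate, up, down
    expert = 3 * hidden * expert_ffn  # one GeGLU expert
    router = hidden * NUM_EXPERTS     # linear router
    embeddings = 2 * VOCAB * hidden   # ASSUMPTION: untied embedding + lm_head
    total = layers * (attn + dense + NUM_EXPERTS * expert + router) + embeddings
    # Per token: attention, dense FFN, router, and only TOP_K of the experts.
    active = layers * (attn + dense + TOP_K * expert + router)
    return total, active

for name, cfg in {
    "114M screening": (512, 6, 2_048, 512),
    "arch2_4_8k_4l": (8_192, 4, 32_768, 4_096),
    "arch2_4_8k_16l": (8_192, 16, 32_768, 4_096),
}.items():
    total, active = param_counts(*cfg)
    print(f"{name}: total {total / 1e9:.2f}B, active layer weights {active / 1e9:.2f}B")
```

Under these assumptions the totals come out at roughly 114M, 8.34B and 30.9B, in line with the figures quoted in this card, and the 16-layer model activates about 20.4B layer weights per token.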

	### Per-Layer Forward Pass

	```
	attn_out = attention(attn_norm(x))
	x = x + attn_out # residual connection
	x = x + dense_ffn(ffn_norm(delayed_x)) # dense reads DELAYED residual
	if queued_expert: x = x + queued_expert # inject expert from 2 layers ago
	expert_out = moe(ffn_norm(x)) # router sees CURRENT residual
	# expert_out queued for injection at layer + expert_delay
	```
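The pseudocode above leaves the bookkeeping for `delayed_x` and `queued_expert` implicit. A minimal runnable sketch of that bookkeeping follows, using toy scalar stand-ins instead of tensors. Several details are assumptions rather than facts from this card: the "delayed residual" is taken to be the layer-input residual from `dense_delay` layers earlier, the first layers fall back to the model input, and expert outputs still queued after the last layer are simply returned rather than injected. The sketch requires both delays to be at least 1.

```python
from collections import deque

# Toy scalar stand-ins so the control flow runs without a tensor library.
def attention(x): return 0.5 * x
def dense_ffn(x): return 0.25 * x
def moe(x): return 0.125 * x
def norm(x): return x  # identity placeholder for RMSNorm

def forward(x, num_layers, dense_delay=1, expert_delay=2):
    """Run the delayed-injection layer stack; assumes both delays are >= 1."""
    # Layer-input snapshots; index 0 is the one from `dense_delay` layers back.
    snapshots = deque([x] * (dense_delay + 1), maxlen=dense_delay + 1)
    # Expert outputs waiting to be injected; None marks "nothing queued yet".
    expert_queue = deque([None] * expert_delay)
    for _ in range(num_layers):
        snapshots.append(x)
        delayed_x = snapshots[0]
        x = x + attention(norm(x))          # residual connection
        x = x + dense_ffn(norm(delayed_x))  # dense reads the delayed residual
        queued = expert_queue.popleft()
        if queued is not None:
            x = x + queued                  # inject expert from expert_delay layers ago
        expert_queue.append(moe(norm(x)))   # router sees the current residual
    return x, list(expert_queue)

out, pending = forward(1.0, num_layers=3)
print(out, pending)
```

Because the dense FFN input is already available before the current layer's attention finishes, its weight reads need not wait on that attention, which is the property the pipelining relies on.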

	### Components

	- Attention: Multi-head (not GQA), Q/K/V/O projections (no bias), standard
	RoPE (base=10000, half-interleave), causal masking, KV cache
	- Dense FFN: GeGLU — `down(gelu(gate(x)) * up(x))`
	- MoE: Linear router → top-k selection → softmax over selected → per-expert
	GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
	- Normalization: RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
	- Combine style: `DelayedSum` — dense and router share `ffn_norm` but read
	different residual snapshots
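The MoE routing steps listed above (top-k selection, softmax over the selected scores only, weighted sum) can be sketched in plain Python. This is an illustration of the technique with toy scalar experts, not the project's implementation; `route_top_k` and `moe_output` are hypothetical names, and the real experts are GeGLU FFNs.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over only those scores."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]

def moe_output(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs (x is a scalar in this toy)."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy experts: expert i multiplies its input by i.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 1.0]

print(route_top_k(logits))
print(moe_output(1.0, logits, experts))
```

Only the selected experts are evaluated, so the weights of the other six never need to be read for this token.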

	## Training

	### 114M Screening (5 architectures)

	| Setting | Value |
	|---|---|
	| Dataset | TinyStories (431M train tokens, 24M test tokens) |
	| Tokenizer | GPT-2 BPE (50,257 vocab) |
	| Sequence length | 512 |
	| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1) |
	| Learning rate | 3e-4 with linear warmup (200 steps) + cosine decay to 1e-5 |
	| Gradient clipping | Global norm 1.0 |
	| Batch size | 8 |
	| Steps | 10,000 |
	| Precision | float32 |
	| Hardware | RTX 3060 12 GB |
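The learning-rate row above describes a warmup-plus-cosine schedule. A minimal sketch of one common way to write it follows; the exact warmup shape (starting from 0 and reaching the peak at step 200) and the function name `lr_at` are assumptions, since the card gives only the endpoints.

```python
import math

def lr_at(step, peak=3e-4, floor=1e-5, warmup=200, total=10_000):
    """Linear warmup to `peak`, then cosine decay to `floor` at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 200, 5_100, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

The 8.34B run would use the same shape with `peak=1e-4`, `floor=1e-6` and `warmup=500`.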

	### 8.34B Scale-Up (4-layer — quality & cache validation)

	This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
	provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
	result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
	per-layer geometry but has 16 layers.

	| Setting | Value |
	|---|---|
	| Dataset | TinyStories (same splits) |
	| Optimizer | 8-bit AdamW (bitsandbytes) |
	| Learning rate | 1e-4 with linear warmup (500 steps) + cosine decay to 1e-6 |
	| Batch size | 4 per GPU (global 32) |
	| Steps | 10,000 |
	| Precision | bf16 |
	| Parallelism | FSDP (FULL_SHARD / ZeRO-3) |
	| Gradient checkpointing | Per `DelayedMoELayer`, non-reentrant |
	| Hardware | 8x A100 SXM4 80 GB (Lambda Cloud) |

	### Architecture Comparison (114M, TinyStories, 10K steps)

	| Architecture | dense_delay | expert_delay | Test PPL | Top-1 Acc | BW Reduction |
	|---|---|---|---|---|---|
	| arch1_decoupled_streams | 0 | 0 | 7.21 | 54.9% | 1.00x |
	| arch2_4_combined | 1 | 2 | 6.50 | 56.8% | 2.00x |
	| arch3_pipeline_registers | 0 | 0 | 7.24 | 55.1% | 1.00x |
	| arch4_async_experts | 0 | 2 | 6.26 | 57.6% | 1.00x |
	| arch5_fixed_point | 0 | 0 | 6.77 | 56.2% | 1.00x |

	Key insight: Dense delay is the bandwidth knob; expert delay is the quality
	knob. arch4_async_experts gets the best perplexity by routing off pre-dense
	activations (cleaner router signal) but sacrifices the bandwidth win that
	arch2_4 achieves by also delaying the dense read.

	## Inference with cflow

	cflow is a Rust inference engine that reads `.cflow` (per-layer streaming) or
	`.vflow` (vertical pipeline) weight files. Weights are stored as pre-tiled Q4
	(128x256 tiles, ~18 KB each, sized to fit L2 cache).

	```bash
	# Build
	cargo build --release --bin cflow-run

	# Convert safetensors → .cflow
	cargo run --release --bin cflow-convert -- \
	--input checkpoint.safetensors \
	--output model.cflow \
	--model arch2_4

	# Run inference
	CFLOW_THREADS=32 ./target/release/cflow-run \
	model.cflow 32 \
	--prompt "Once upon a time" \
	--tokenizer tokenizer.json \
	--temperature 0.8
	```
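The tile size quoted above (128x256 Q4, ~18 KB) can be checked with simple arithmetic. The packed 4-bit weights account for 16 KiB; the remaining ~2 KiB would be quantization metadata. The metadata layout below (one 2-byte scale per group of 32 weights) is purely hypothetical, chosen because it reproduces the ~18 KB figure, and is not taken from the `.cflow` format.

```python
TILE_ROWS, TILE_COLS = 128, 256
BITS_PER_WEIGHT = 4

# HYPOTHETICAL metadata layout: one 2-byte scale per group of 32 weights.
GROUP_SIZE = 32
SCALE_BYTES = 2

weights = TILE_ROWS * TILE_COLS
packed_bytes = weights * BITS_PER_WEIGHT // 8
scale_bytes = (weights // GROUP_SIZE) * SCALE_BYTES
tile_bytes = packed_bytes + scale_bytes

print(f"packed weights: {packed_bytes / 1024:.0f} KiB")
print(f"scales (assumed): {scale_bytes / 1024:.0f} KiB")
print(f"tile total: {tile_bytes / 1024:.0f} KiB")
```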

	### SIMD Support

	The runtime auto-detects and dispatches to the best available instruction set:

	| ISA | Kernel | Notes |
	|---|---|---|
	| AVX-512 + VNNI | Q4×Q8 `vpdpbusd` | Best path (Ice Lake+) |
	| AVX-512F | Q4×f32 FMA | Skylake-X+ |
	| AVX2 + FMA | Q4×f32 FMA | Haswell+ |
	| AVX + SSE4.1 | Q4×f32 | Sandy Bridge+ |
	| Scalar | Q4×f32 | Fallback |
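The priority order in the table amounts to a first-match dispatch over the CPU's feature flags. The sketch below illustrates that selection logic in Python for readability; it is not the runtime's Rust code, and the feature-flag spellings, kernel labels and `pick_kernel` name are illustrative assumptions.

```python
# Priority order mirrors the SIMD table; flag spellings are illustrative.
KERNEL_TABLE = [
    ({"avx512f", "avx512vnni"}, "Q4xQ8 vpdpbusd (AVX-512 + VNNI)"),
    ({"avx512f"}, "Q4xf32 FMA (AVX-512F)"),
    ({"avx2", "fma"}, "Q4xf32 FMA (AVX2)"),
    ({"avx", "sse4.1"}, "Q4xf32 (AVX + SSE4.1)"),
]

def pick_kernel(cpu_features):
    """Return the first kernel whose required features are all present."""
    for required, kernel in KERNEL_TABLE:
        if required <= cpu_features:  # subset test
            return kernel
    return "Q4xf32 scalar"

print(pick_kernel({"sse4.1", "avx", "avx2", "fma"}))
print(pick_kernel(set()))
```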

	## Limitations

	- Not a general-purpose LLM. Trained on TinyStories / FineWeb-Edu subsets at
	10K steps — this is an architecture and runtime research artifact, not a
	production language model.
	- Custom architecture. Cannot be loaded in Hugging Face Transformers, vLLM,
	or llama.cpp without adaptation. Requires the cflow Rust runtime or the
	PyTorch reference in `pipeline_native/`.
	- CPU-only. The runtime targets x86-64 CPUs and is tuned for AVX2 and AVX-512;
	CPUs without them fall back to the AVX + SSE4.1 or scalar kernels. No GPU
	backend.
	- Optimized for single-token decode. Batch and prefill throughput are not the focus.

	## Thesis Scorecard

	The cflow project tests 8 claims about CPU inference optimization:

	| # | Claim | Result |
	|---|---|---|
	| 1 | Conditional expert reading (top-k only) | Proven |
	| 2 | Tile-streaming L1/L2 cache locality | Proven (7.29x fewer L1-d misses, PMU-measured) |
	| 3 | AVX2/AVX-512 Q4 SIMD kernels | Proven |
	| 4 | Fused QKV and gate+up projections | Proven |
	| 5 | Compute-order file layout | Proven |
	| 6 | Software prefetch (`_mm_prefetch`) | Disproven (no benefit; slightly harmful) |
	| 7 | Vertical pipeline via delayed dependencies | Validated (2.00x bandwidth reduction) |
	| 8 | Stage-major disk layout readahead | Disproven (no isolated benefit) |

	## Citation

	```bibtex
	@software{poperszky2026cflow,
	author = {Poperszky, Tom},
	title = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
	year = {2026}
	}
	```

	## License

	MIT