---
license: mit
language:
  - en
library_name: cflow
tags:
  - moe
  - cpu-inference
  - rust
  - custom-architecture
  - pipeline-native
  - avx-512
datasets:
  - roneneldan/TinyStories
  - HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
model-index:
  - name: arch2_4_combined
    results:
      - task:
          type: text-generation
        dataset:
          name: TinyStories
          type: roneneldan/TinyStories
        metrics:
          - name: Test Perplexity (114M, 10K steps)
            type: perplexity
            value: 6.50
          - name: Top-1 Accuracy (114M, 10K steps)
            type: accuracy
            value: 56.8
          - name: Val Perplexity (8.34B / 4-layer, 10K steps)
            type: perplexity
            value: 4.52
          - name: Top-1 Accuracy (8.34B / 4-layer, 10K steps)
            type: accuracy
            value: 61.4
---

# arch2_4_combined — Pipeline-Native MoE for CPU Inference

A custom decoder-only transformer with delayed dense FFN + delayed MoE experts,
designed so its inter-layer dependency graph permits vertical pipelining on CPU.
Part of the **cflow** project — a CPU-first streaming inference engine written in
Rust.

> **Hosted weights:** this repository hosts `model.cflow` (17.39 GB) — the
> **arch2_4_8k_16l** model: 16 layers, hidden 8192, **~31B parameters**
> (top-2-of-8 MoE, ~20B active/token), Q4. This is the model benchmarked at
> 5.94 tok/s below. The **8.34B** figures in this card refer to a *smaller
> 4-layer scale point* (`arch2_4_8k_4l`) used for quality and cache-locality
> validation (val ppl 4.52); that checkpoint is not hosted here.

## Key Results

| Metric | Value |
|---|---|
| CPU decode throughput (~31B / 16-layer, Q4, 32 threads) | **5.94 tok/s** |
| Effective memory bandwidth | 61 GB/s (30% of 204.8 GB/s peak) |
| Bandwidth reduction from pipelining | **2.00x** (9.00 → 4.50 MB/token) |
| Test perplexity (114M, TinyStories, 10K steps) | 6.50 |
| Val perplexity (8.34B / 4-layer, TinyStories, 10K steps) | 4.52 |
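
The bandwidth figures above can be cross-checked with a few lines of arithmetic. This is a sketch over the numbers quoted in the table, not output from the cflow tooling. The 8-channel DDR4-3200 configuration used to reproduce the 204.8 GB/s peak is an assumption (the card does not state it), and the "GB read per token" value is only what the reported bandwidth and throughput imply.

```python
def bandwidth_summary(effective_gbps=61.0, peak_gbps=204.8, tok_per_s=5.94):
    """Derive utilization and implied traffic per token from the reported figures."""
    return {
        "utilization": effective_gbps / peak_gbps,
        "gb_read_per_token": effective_gbps / tok_per_s,
    }


summary = bandwidth_summary()

# Assumption: 8 memory channels x 3200 MT/s x 8 bytes per transfer.
assumed_peak_gbps = 8 * 3200e6 * 8 / 1e9

# Reported per-token traffic without and with pipelining (MB/token).
pipelining_reduction = 9.00 / 4.50

print(f"utilization: {summary['utilization']:.1%}")                   # 29.8%
print(f"implied GB read per token: {summary['gb_read_per_token']:.2f}")  # 10.27
print(f"assumed peak: {assumed_peak_gbps:.1f} GB/s")                  # 204.8 GB/s
print(f"pipelining reduction: {pipelining_reduction:.2f}x")           # 2.00x
```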

### CPU Decode Benchmark (AWS r6i.8xlarge, Ice Lake Xeon, 256 GB DDR4)

| Engine | Model | Quant | tok/s |
|---|---|---|---|
| **cflow** | arch2_4_8k_16l (~31B MoE, ~20B active) | Q4 | **5.94** |
| Ollama (llama.cpp) | Qwen2.5-32B (32B dense) | Q4 GGUF | 4.75 |
| vLLM CPU | Qwen2.5-32B-Instruct (32B dense) | GPTQ-Int4 | 1.65 |

> **Note:** cflow and the baselines run different models. cflow's ~31B MoE
> activates ~20B parameters per token, while the baselines run a 32B dense
> model. Total parameter counts are comparable (31B vs 32B), but architecture
> and training differ, so the cflow figure shows what a co-designed architecture
> and streaming runtime achieve; it is not a quality-matched comparison.

## Model Description

**arch2_4_combined** is a pre-norm decoder-only transformer with a parallel dense
FFN + sparse MoE block per layer, using delayed residual injection:

- The **dense FFN** reads from a delayed residual (1 layer behind)
- The **MoE experts** are routed on the current residual but injected 2 layers later
- This creates a dependency DAG where dense and expert weight reads for layer N
  can overlap with compute for layer N-1, reducing critical-path memory bandwidth

The architecture was selected from a screen of 5 pipeline-native candidates. It
is the only one of the five with a measured bandwidth reduction (2.00x), and its
test perplexity (6.50) is second only to arch4_async_experts (6.26).

### Architecture Details

| Parameter | 114M (screening) | ~31B (16-layer, hosted) |
|---|---|---|
| Hidden dim | 512 | 8,192 |
| Layers | 6 | 16 |
| Attention heads | 8 | 128 |
| Head dim | 64 | 64 |
| Dense FFN hidden | 2,048 | 32,768 |
| Expert FFN hidden | 512 | 4,096 |
| Experts / top-k | 8 / 2 | 8 / 2 |
| Dense delay | 1 | 1 |
| Expert delay | 2 | 2 |
| Vocab | 50,257 (GPT-2 BPE) | 50,257 (GPT-2 BPE) |
| Max seq len | 512 | 2,048 |
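
The parameter counts quoted in this card can be approximately reproduced from the table. The sketch below is a back-of-the-envelope count, not code from the repository. It assumes untied input embedding and `lm_head` matrices (an assumption that happens to match the 114M and 8.34B figures) and ignores the small RMSNorm weight vectors. `param_count` is a hypothetical helper name.

```python
def param_count(hidden, layers, dense_ffn, expert_ffn, experts=8, vocab=50_257):
    """Approximate total parameters for an arch2_4-style model."""
    attn = 4 * hidden * hidden               # Q, K, V, O projections, no bias
    dense = 3 * hidden * dense_ffn           # GeGLU: gate, up, down
    moe = experts * 3 * hidden * expert_ffn  # one GeGLU FFN per expert
    router = hidden * experts                # linear router
    # Assumption: separate (untied) input embedding and lm_head matrices.
    embeddings = 2 * vocab * hidden
    return layers * (attn + dense + moe + router) + embeddings


print(param_count(512, 6, 2_048, 512))        # 114402304   (~114M)
print(param_count(8_192, 4, 32_768, 4_096))   # 8339865600  (~8.34B)
print(param_count(8_192, 16, 32_768, 4_096))  # 30889230336 (~31B)
```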

### Per-Layer Forward Pass

```
attn_out = attention(attn_norm(x))
x = x + attn_out                              # residual connection
x = x + dense_ffn(ffn_norm(delayed_x))        # dense reads DELAYED residual
if queued_expert: x = x + queued_expert        # inject expert from 2 layers ago
expert_out = moe(ffn_norm(x))                  # router sees CURRENT residual
# expert_out queued for injection at layer + expert_delay
```
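
The delay bookkeeping in the pseudocode can be sketched as runnable Python. This is a toy illustration, not the cflow runtime or the `pipeline_native/` reference: scalar callables stand in for attention, the dense FFN and the MoE block, and the norms are omitted. Three details are assumptions not confirmed by this card: the delayed residual is taken to be the previous layer's post-attention residual, early layers read the oldest snapshot available, and expert outputs still queued after the last layer are added to the final residual.

```python
from collections import deque


def forward(x, layers, dense_delay=1, expert_delay=2):
    """Toy forward pass with delayed dense reads and delayed expert injection.

    Assumes expert_delay >= 1.
    """
    # Most recent post-attention residuals; index 0 is up to `dense_delay` layers old.
    snapshots = deque(maxlen=dense_delay + 1)
    pending = {}  # target layer index -> queued expert output

    for i, layer in enumerate(layers):
        x = x + layer["attn"](x)
        snapshots.append(x)
        delayed_x = snapshots[0]           # assumption: oldest snapshot available
        x = x + layer["dense"](delayed_x)  # dense reads the DELAYED residual
        if i in pending:
            x = x + pending.pop(i)         # inject expert output queued earlier
        pending[i + expert_delay] = layer["moe"](x)  # router sees CURRENT residual

    # Assumption: outputs queued past the last layer are flushed at the end.
    for i in sorted(pending):
        x = x + pending[i]
    return x


dense_reads = []


def make_layer():
    """Hypothetical toy layer that records what the dense FFN reads."""
    def dense(delayed_x):
        dense_reads.append(delayed_x)
        return 0.0
    return {"attn": lambda x: 1.0, "dense": dense, "moe": lambda x: 0.0}


out = forward(0.0, [make_layer() for _ in range(4)])
print(dense_reads)  # [1.0, 1.0, 2.0, 3.0]
```

From layer 1 onward the dense FFN reads a value that was already available one layer earlier, which is what lets its weight reads overlap with the previous layer's compute.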

### Components

- **Attention:** Multi-head (not GQA), Q/K/V/O projections (no bias), standard
  RoPE (base=10000, half-interleave), causal masking, KV cache
- **Dense FFN:** GeGLU — `down(gelu(gate(x)) * up(x))`
- **MoE:** Linear router → top-k selection → softmax over selected → per-expert
  GeGLU FFN → weighted sum. No auxiliary/load-balancing loss.
- **Normalization:** RMSNorm (eps=1e-6) at attn input, FFN input, and pre-lm_head
- **Combine style:** `DelayedSum` — dense and router share `ffn_norm` but read
  different residual snapshots
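
The MoE routing step described above (top-k selection, then softmax over only the selected logits, then a weighted sum) can be sketched in plain Python. This is an illustrative sketch rather than the project's implementation: `route_top_k` and `moe_output` are hypothetical names, inputs are scalars instead of tensors, and the experts are arbitrary callables standing in for the per-expert GeGLU FFNs.

```python
import math


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k largest logits; weights sum to 1."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, shifted by the max for stability.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never evaluated."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


logits = [0.1, 2.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # 8 experts, toy values
selected = route_top_k(logits, k=2)
print([i for i, _ in selected])  # [1, 3]
```

Because only the selected experts are called, their weights are the only expert weights that need to be read for a given token.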

## Training

### 114M Screening (5 architectures)

| | |
|---|---|
| Dataset | TinyStories (431M train tokens, 24M test tokens) |
| Tokenizer | GPT-2 BPE (50,257 vocab) |
| Sequence length | 512 |
| Optimizer | AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=0.1) |
| Learning rate | 3e-4 with linear warmup (200 steps) + cosine decay to 1e-5 |
| Gradient clipping | Global norm 1.0 |
| Batch size | 8 |
| Steps | 10,000 |
| Precision | float32 |
| Hardware | RTX 3060 12 GB |
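
The learning-rate row can be read as the following schedule. This is a sketch of one common formulation of linear warmup plus cosine decay using the values from the table; the exact warmup formula and whether the decay spans the remaining 9,800 steps are assumptions, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, peak=3e-4, floor=1e-5, warmup=200, total=10_000):
    """Linear warmup to `peak`, then cosine decay to `floor` at `total` steps."""
    if step < warmup:
        # Assumption: ramp linearly so the last warmup step reaches the peak.
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for step in (0, 199, 200, 5_100, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

For the 8.34B run, the same function would be called with `peak=1e-4`, `floor=1e-6` and `warmup=500`.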

### 8.34B Scale-Up (4-layer — quality & cache validation)

This is the smaller scale point: `arch2_4_8k_4l`, 4 layers, 8.34B params. It
provides the quality numbers (val ppl 4.52, top-1 61.4%) and the PMU cache-locality
result. The hosted decode-benchmark model (`arch2_4_8k_16l`, ~31B) shares this
per-layer geometry but has 16 layers.

| | |
|---|---|
| Dataset | TinyStories (same splits) |
| Optimizer | 8-bit AdamW (bitsandbytes) |
| Learning rate | 1e-4 with linear warmup (500 steps) + cosine decay to 1e-6 |
| Batch size | 4 per GPU (global 32) |
| Steps | 10,000 |
| Precision | bf16 |
| Parallelism | FSDP (FULL_SHARD / ZeRO-3) |
| Gradient checkpointing | Per `DelayedMoELayer`, non-reentrant |
| Hardware | 8x A100 SXM4 80 GB (Lambda Cloud) |

### Architecture Comparison (114M, TinyStories, 10K steps)

| Architecture | dense_delay | expert_delay | Test PPL | Top-1 Acc | BW Reduction |
|---|---|---|---|---|---|
| arch1_decoupled_streams | 0 | 0 | 7.21 | 54.9% | 1.00x |
| **arch2_4_combined** | **1** | **2** | **6.50** | **56.8%** | **2.00x** |
| arch3_pipeline_registers | 0 | 0 | 7.24 | 55.1% | 1.00x |
| arch4_async_experts | 0 | 2 | **6.26** | **57.6%** | 1.00x |
| arch5_fixed_point | 0 | 0 | 6.77 | 56.2% | 1.00x |

**Key insight:** Dense delay is the bandwidth knob; expert delay is the quality
knob. arch4_async_experts gets the best perplexity by routing off pre-dense
activations (cleaner router signal) but sacrifices the bandwidth win that
arch2_4 achieves by also delaying the dense read.

## Inference with cflow

cflow is a Rust inference engine that reads `.cflow` (per-layer streaming) or
`.vflow` (vertical pipeline) weight files. Weights are stored as pre-tiled Q4
(128x256 tiles, ~18 KB each, sized to fit L2 cache).
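
The ~18 KB tile size is consistent with a block-quantized Q4 layout. The sketch below assumes one 2-byte scale per block of 32 weights; that block layout is an assumption for illustration, and the actual `.cflow` format may differ.

```python
def q4_tile_bytes(rows=128, cols=256, block=32, scale_bytes=2):
    """Bytes for one Q4 tile: packed 4-bit weights plus per-block scales."""
    weights = rows * cols
    packed = weights // 2                      # two 4-bit weights per byte
    scales = (weights // block) * scale_bytes  # assumption: one scale per 32 weights
    return packed + scales


tile = q4_tile_bytes()
bits_per_weight = tile * 8 / (128 * 256)
print(tile, bits_per_weight)  # 18432 4.5
```

At 4.5 bits per weight, ~31B parameters come to roughly 17.4 GB, in line with the 17.39 GB file size stated above.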

```bash
# Build
cargo build --release --bin cflow-run

# Convert safetensors → .cflow
cargo run --release --bin cflow-convert -- \
  --input checkpoint.safetensors \
  --output model.cflow \
  --model arch2_4

# Run inference
CFLOW_THREADS=32 ./target/release/cflow-run \
  model.cflow 32 \
  --prompt "Once upon a time" \
  --tokenizer tokenizer.json \
  --temperature 0.8
```

### SIMD Support

The runtime auto-detects and dispatches to the best available instruction set:

| ISA | Kernel | Notes |
|---|---|---|
| AVX-512 + VNNI | Q4×Q8 `vpdpbusd` | Best path (Cascade Lake and later, including Ice Lake) |
| AVX-512F | Q4×f32 FMA | Skylake-X+ |
| AVX2 + FMA | Q4×f32 FMA | Haswell+ |
| AVX + SSE4.1 | Q4×f32 | Sandy Bridge+ |
| Scalar | Q4×f32 | Fallback |
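
The priority order in the table amounts to "pick the first tier whose required features are all present". The sketch below illustrates that selection logic only; it is not the runtime's dispatch code (which is written in Rust), `pick_kernel_tier` is a hypothetical name, and the flag strings follow Linux `/proc/cpuinfo` naming.

```python
# Tiers in priority order: (required CPU flags, ISA label from the table).
KERNEL_TIERS = [
    ({"avx512f", "avx512_vnni"}, "AVX-512 + VNNI"),
    ({"avx512f"}, "AVX-512F"),
    ({"avx2", "fma"}, "AVX2 + FMA"),
    ({"avx", "sse4_1"}, "AVX + SSE4.1"),
]


def pick_kernel_tier(cpu_flags):
    """Return the highest-priority tier whose required flags are all available."""
    flags = set(cpu_flags)
    for required, name in KERNEL_TIERS:
        if required <= flags:  # subset test: every required flag is present
            return name
    return "Scalar"


print(pick_kernel_tier({"sse4_1", "avx", "avx2", "fma"}))  # AVX2 + FMA
```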

## Limitations

- **Not a general-purpose LLM.** Trained on TinyStories / FineWeb-Edu subsets at
  10K steps — this is an architecture and runtime research artifact, not a
  production language model.
- **Custom architecture.** Cannot be loaded in Hugging Face Transformers, vLLM,
  or llama.cpp without adaptation. Requires the cflow Rust runtime or the
  PyTorch reference in `pipeline_native/`.
- **CPU-only.** The runtime targets x86-64 CPUs and is optimized for AVX2 and
  AVX-512; older CPUs fall back to the AVX/SSE4.1 or scalar kernels. No GPU
  backend.
- **Single-token decode optimized.** Batch/prefill throughput is not the focus.

## Thesis Scorecard

The cflow project tests 8 claims about CPU inference optimization:

| # | Claim | Result |
|---|---|---|
| 1 | Conditional expert reading (top-k only) | **Proven** |
| 2 | Tile-streaming L1/L2 cache locality | **Proven** (7.29x fewer L1-d misses, PMU-measured) |
| 3 | AVX2/AVX-512 Q4 SIMD kernels | **Proven** |
| 4 | Fused QKV and gate+up projections | **Proven** |
| 5 | Compute-order file layout | **Proven** |
| 6 | Software prefetch (`_mm_prefetch`) | **Disproven** (no benefit; slightly harmful) |
| 7 | Vertical pipeline via delayed dependencies | **Validated** (2.00x bandwidth reduction) |
| 8 | Stage-major disk layout readahead | **Disproven** (no isolated benefit) |

## Citation

```bibtex
@software{poperszky2026cflow,
  author = {Poperszky, Tom},
  title = {cflow: CPU-First Streaming Inference for Pipeline-Native Transformers},
  year = {2026}
}
```

## License

MIT