---
license: apache-2.0
tags:
  - llm-inference
  - cpu-inference
  - memory-bandwidth
  - transformer
  - quantization
  - research
---

# AIOS: A CPU-Native Inference Architecture for Large Language Models

**This is not a model.** This is the framework paper and specification for AIOS — a memory residency controller for CPU-native LLM inference.

## Paper

- **Title:** AIOS: A CPU-Native Inference Architecture for Large Language Models
- **Author:** Anand Casavaraju
- **Published:** March 2026
- **SSRN:** https://ssrn.com/abstract=6467298
- **GitHub:** https://github.com/acasavaraju/AIOS

## What AIOS Is

AIOS is a memory residency controller that sits between inference engines (llama.cpp, Ollama, vLLM) and the hardware, managing how weight data moves from DRAM to the CPU. It addresses four resource dimensions:

- **Weight reads** — aliasing + sparsity maps
- **KV cache reads** — MQA/GQA + tiered residency
- **Activation spill** — chunked prefill
- **Attention compute** — sparsity map

## Current State

The framework and specification are published. The runtime is not yet implemented, and all performance projections are analytical. Empirical validation is tracked at https://github.com/acasavaraju/AIOS/issues.

## Citation

```bibtex
@misc{casavaraju2026aios,
  title  = {AIOS: A CPU-Native Inference Architecture for Large Language Models},
  author = {Casavaraju, Anand},
  year   = {2026},
  url    = {https://ssrn.com/abstract=6467298}
}
```
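
## Illustrative Sketch

To make the "memory residency controller" idea concrete, here is a minimal, purely hypothetical Python sketch of what such a controller's bookkeeping interface might look like. None of these names (`ResidencyController`, `Tier`, `admit`, `evict`) come from the AIOS specification — the runtime is not yet implemented, so this is an assumption-laden illustration of the concept, not the actual design.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Tier(Enum):
    """Residency tiers a block of weight or KV-cache data might occupy.
    Hypothetical: the real spec may define tiers differently."""
    DRAM = auto()
    LLC = auto()      # last-level cache residency is advisory on commodity CPUs
    EVICTED = auto()


@dataclass
class ResidencyController:
    """Toy bookkeeping layer: tracks which blocks are resident and where.
    A real controller would also drive prefetch and eviction policy."""
    residency: dict = field(default_factory=dict)  # block_id -> Tier

    def admit(self, block_id: str, tier: Tier = Tier.DRAM) -> None:
        """Mark a block as resident in the given tier."""
        self.residency[block_id] = tier

    def evict(self, block_id: str) -> None:
        """Mark a block as no longer resident."""
        self.residency[block_id] = Tier.EVICTED

    def resident_blocks(self) -> list:
        """Return the IDs of all blocks currently resident in any tier."""
        return [b for b, t in self.residency.items() if t is not Tier.EVICTED]


# Example usage with invented block IDs:
ctrl = ResidencyController()
ctrl.admit("layer0.attn.wq")                  # weight block in DRAM
ctrl.admit("layer0.kv_cache", Tier.LLC)       # KV block pinned hotter
ctrl.evict("layer0.attn.wq")                  # weight block paged out
print(ctrl.resident_blocks())                 # ['layer0.kv_cache']
```

The sketch only models the bookkeeping; the interesting parts of AIOS (aliasing, sparsity maps, chunked prefill) would sit behind decisions about *when* to call `admit` and `evict`.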