---
license: apache-2.0
tags:
- llm-inference
- cpu-inference
- memory-bandwidth
- transformer
- quantization
- research
---
# AIOS: A CPU-Native Inference Architecture for Large Language Models

This is not a model. It is the framework paper and specification for AIOS, a memory residency controller for CPU-native LLM inference.
## Paper

- **Title:** AIOS: A CPU-Native Inference Architecture for Large Language Models
- **Author:** Anand Casavaraju
- **Published:** March 2026
- **SSRN:** https://ssrn.com/abstract=6467298
- **GitHub:** https://github.com/acasavaraju/AIOS
## What AIOS Is

AIOS is a memory residency controller that sits between inference engines (llama.cpp, Ollama, vLLM) and the hardware, managing how weight data moves from DRAM to the CPU. It addresses four resource dimensions:
- Weight reads — aliasing + sparsity maps
- KV cache reads — MQA/GQA + tiered residency
- Activation spill — chunked prefill
- Attention compute — sparsity map
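
To make the tiered-residency idea concrete, here is a minimal, hypothetical sketch of a DRAM-budgeted residency controller that keeps recently touched weight tiles resident and evicts the coldest tiles when the budget is exceeded. This is an illustration only, not code from the AIOS specification; the names `ResidencyController` and `touch` are invented for this example.

```python
from collections import OrderedDict

class ResidencyController:
    """Illustrative sketch (not the AIOS implementation): keep the
    hottest weight tiles resident within a fixed DRAM budget,
    evicting least-recently-used tiles when space runs out."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.resident = OrderedDict()  # tile_id -> size in bytes
        self.used = 0

    def touch(self, tile_id, size_bytes):
        """Mark a tile as needed for the current decode step.
        Returns True if it was already resident (a hit)."""
        if tile_id in self.resident:
            self.resident.move_to_end(tile_id)  # refresh recency
            return True
        # Evict cold tiles until the new tile fits in the budget.
        while self.used + size_bytes > self.budget and self.resident:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
        self.resident[tile_id] = size_bytes
        self.used += size_bytes
        return False
```

A real controller would also track KV-cache tiers and sparsity maps per the four dimensions above; this sketch covers only the weight-read dimension.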
## Current State

The framework and specification are published; the runtime is not yet implemented. All performance projections are analytical. Empirical validation is tracked at https://github.com/acasavaraju/AIOS/issues.
## Citation

```bibtex
@misc{casavaraju2026aios,
  title  = {AIOS: A CPU-Native Inference Architecture for Large Language Models},
  author = {Casavaraju, Anand},
  year   = {2026},
  url    = {https://ssrn.com/abstract=6467298}
}
```