---
title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine with paged KV + SSE
---
tiny_vllm
A minimal continuous-batching LLM engine built to be read end-to-end. It re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of Python:
- Paged KV cache with logical block tables: physical blocks are a flat pool; per-sequence block tables map logical positions → physical slots.
- Automatic prefix caching via content-addressed hashes: two requests with the same prompt prefix share KV blocks.
- Continuous batching with chunked prefill: each scheduling step packs a budget of tokens from any mix of new prefills and ongoing decodes; long prompts are sliced so they don't starve the decoders.
- Recompute-style preemption: when the pool runs dry, the youngest running sequence is evicted and re-enqueued.
- SSE streaming over a thin FastAPI layer: both token deltas (/generate, OpenAI-compatible /v1/completions) and a parallel engine event stream (/engine/events) the demo page subscribes to.
- A visualization demo page that renders the block pool, scheduler queues, per-sequence block tables, and live tokens as the engine runs.
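The first two ideas in the list can be illustrated with a toy model. This is a minimal sketch, not the code in block_manager.py: the names ToyBlockPool, chain_hash, physical_slot and the block size of 4 are assumptions made for the example. Each full block is keyed by a hash of its tokens chained with the hash of the block before it, so two prompts with the same prefix resolve to the same physical blocks.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per block (hypothetical value chosen for the example)


def chain_hash(parent: Optional[str], tokens: Tuple[int, ...]) -> str:
    """Hash of a full block: covers its tokens and, via parent, the whole prefix."""
    h = hashlib.sha256()
    h.update((parent or "").encode())
    h.update(",".join(map(str, tokens)).encode())
    return h.hexdigest()


class ToyBlockPool:
    """Hypothetical stand-in for a block manager: flat pool + refcounts + hash index."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}
        self.by_hash: Dict[str, int] = {}

    def allocate(self, token_ids: List[int]) -> List[int]:
        """Return the block table for a prompt, reusing cached full blocks."""
        table: List[int] = []
        parent: Optional[str] = None
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            full = len(chunk) == BLOCK_SIZE
            key = chain_hash(parent, chunk) if full else None
            if key is not None and key in self.by_hash:
                block = self.by_hash[key]      # prefix-cache hit: share the block
                self.refcount[block] += 1
            else:
                block = self.free.pop(0)       # miss: take a fresh physical block
                self.refcount[block] = 1
                if key is not None:
                    self.by_hash[key] = block  # only full blocks are cacheable
            table.append(block)
            parent = key
        return table


def physical_slot(table: List[int], position: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset inside it)."""
    return table[position // BLOCK_SIZE], position % BLOCK_SIZE


pool = ToyBlockPool(num_blocks=8)
a = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])    # 2 full blocks + 1 partial
b = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 42])   # same 8-token prefix
print(a, b)  # the first two entries of both tables are the same physical blocks
```

Only the last, partial block differs between the two tables; it stays private to each sequence because an unfinished block has no stable content hash yet.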
It is not vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop), there are no fused or paged-attention kernels, and CPU is the default device. This is a learning artifact, not a serving stack.
Quick start
pip install -r requirements.txt
# or: pip install -e .
python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu
Open http://localhost:8000 for the live visualization, or hit the API directly:
# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
-H 'content-type: application/json' \
-d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'
# A simpler endpoint
curl -N http://localhost:8000/generate \
-H 'content-type: application/json' \
-d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
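The same stream can be consumed from Python with only the standard library. This is a sketch under assumptions: it expects the usual OpenAI-style framing (one `data: {json}` line per chunk, a final `data: [DONE]`, and the text delta in `choices[0].text`), which the server's OpenAI-compatible endpoint may or may not follow exactly. The helper names iter_sse_data and stream_completion are made up for the example.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each `data:` line; stop at the [DONE] sentinel."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators and ": comment" keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield payload


def stream_completion(prompt: str,
                      base_url: str = "http://localhost:8000") -> Iterator[str]:
    """POST a streaming completion request and yield text deltas as they arrive."""
    body = json.dumps({"prompt": prompt, "max_tokens": 48, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for payload in iter_sse_data(resp):
            chunk = json.loads(payload)
            # Assumed response shape; adjust if the server's JSON differs.
            yield chunk["choices"][0]["text"]
```

With the server running, `for piece in stream_completion("haiku about KV caches"): print(piece, end="", flush=True)` prints tokens as they are generated.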
Smoke test with concurrent requests:
python examples/smoke_client.py # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo # show prefix-cache speedup
The pieces
| File | What |
|---|---|
| tiny_vllm/config.py | EngineConfig, SamplingParams |
| tiny_vllm/request.py | Sequence, status enum, KV bookkeeping fields |
| tiny_vllm/block_manager.py | Physical block pool, refcounts, prefix-cache (hash-chain) |
| tiny_vllm/scheduler.py | Continuous batching + chunked prefill + preemption |
| tiny_vllm/paged_kv.py | The actual KV tensors that block ids point into |
| tiny_vllm/model_runner.py | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
| tiny_vllm/sampler.py | Greedy / top-k / top-p |
| tiny_vllm/engine.py | Orchestrator: scheduler ▶ model ▶ sampler ▶ outputs + events |
| tiny_vllm/server.py | FastAPI: /generate, /v1/completions, /engine/events, / |
| web/ | Static demo page (vanilla HTML/CSS/JS, no framework) |
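To make the sampler row concrete, here is a dependency-free sketch of greedy, top-k and top-p selection over a list of logits. It is not the code in sampler.py (which presumably works on tensors); the function name, argument names and the convention that temperature 0 means greedy are assumptions for the example.

```python
import math
import random
from typing import List, Optional


def sample(logits: List[float], temperature: float = 1.0, top_k: int = 0,
           top_p: float = 1.0, rng: Optional[random.Random] = None) -> int:
    """Pick a token id from raw logits with greedy / top-k / top-p filtering."""
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy

    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    if top_k > 0:
        ranked = ranked[:top_k]              # keep the k most likely tokens
    if top_p < 1.0:
        kept, mass = [], 0.0
        for p, i in ranked:                  # smallest set whose mass reaches top_p
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        ranked = kept

    rng = rng or random.Random()
    ids = [i for _, i in ranked]
    weights = [p for p, _ in ranked]         # choices() renormalises the weights
    return rng.choices(ids, weights=weights, k=1)[0]
```

Both filters only shrink the candidate set; the final draw is still proportional to the surviving probabilities.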
The model-free parts (block manager, scheduler) have unit tests:
pip install pytest
python -m pytest tests/
Hugging Face Space: live demo
For a live (not recorded) demo you can talk to from any browser, deploy this repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM, 2 vCPU) fits Qwen2.5-0.5B comfortably.
One-time setup:
1. Create the Space. Go to huggingface.co/new-space:
   - Owner: your HF username
   - Space name: e.g. tiny-vllm (must match HF_SPACE_NAME below)
   - SDK: Docker
   - License: MIT
2. Generate a write-access token at huggingface.co/settings/tokens → New token → role Write.
3. Add three secrets to this GitHub repo (Settings → Secrets and variables → Actions → New repository secret):
   - HF_TOKEN: the token from step 2
   - HF_USERNAME: your HF username
   - HF_SPACE_NAME: e.g. tiny-vllm
On the next push to main, the Sync to Hugging Face Space workflow mirrors
the repo to the Space. HF then builds the Docker image (~3-5 min on first
build because of the pre-fetched model) and the Space goes live at:
https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space
(HF normalises subdomains to lowercase, so enCoder/tiny-vllm becomes
encoder-tiny-vllm.hf.space.)
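The URL rule above is small enough to write down as a helper, for instance to fill in data-hf-space from the same values as the GitHub secrets. The function name space_url is made up for the example, and the sketch only applies the lowercasing described here; any other character rewriting HF may perform is not handled.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the public URL of a Space from its owner and name (lowercased)."""
    return f"https://{username.lower()}-{space_name.lower()}.hf.space"


print(space_url("enCoder", "tiny-vllm"))  # https://encoder-tiny-vllm.hf.space
```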
The GH Pages page links to this URL as a "try live →" pill in the
topbar; update data-hf-space on <body> in web/index.html if your
Space URL differs.
HF Spaces cost: free. Cold-start (after ~48 h of inactivity) takes ~30 s while the container wakes; subsequent requests are warm.
Files involved:
- Dockerfile: CPU-only torch, pre-downloads the model at build time.
- README.md frontmatter: HF reads sdk: docker, app_port: 7860, etc.
- .github/workflows/sync-huggingface.yml: mirrors GitHub → HF Spaces.
- CORS is enabled on the server so the GH Pages frontend can call the HF backend cross-origin (?mode=live&backend=https://...hf.space is a potential future addition).
GitHub Pages demo (replay mode)
The visualization can run as a static page on GitHub Pages with no
backend. It plays back a recorded session from web/events.jsonl:
- The repo ships a fabricated web/events.jsonl so the page works on first deploy (run python scripts/make_demo_recording.py > web/events.jsonl to regenerate).
- To use a real recording instead, run the server with --record:

      python -m tiny_vllm.server --record web/events.jsonl
      # …submit some prompts via the UI or smoke_client…
      # Ctrl-C the server. events.jsonl now contains the full session.
      git add web/events.jsonl && git commit -m "fresh demo recording" && git push

- Enable Pages once: repo → Settings → Pages → Source: "GitHub Actions". The workflow in .github/workflows/deploy-pages.yml then publishes web/ on every push to main that touches it.
The page auto-detects mode:
- Tries /engine/events SSE first; if it responds within 2s it's live.
- Otherwise falls back to replay, fetching events.jsonl from the same directory and playing it back with original timing (speed control / pause / restart in the controls row).
- Force a mode with ?mode=replay or ?mode=live; point at a different recording with ?session=URL.
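Replay with original timing comes down to sleeping for the gap between consecutive recorded timestamps, divided by the playback speed. The demo page does this in the browser; the sketch below shows the same idea in Python. It assumes each line of the recording is a JSON object carrying a timestamp in seconds under a key named "ts", which is a guess about the file format, not something the repo documents here.

```python
import json
import time
from typing import Callable, Iterable, Iterator, Tuple


def schedule_replay(lines: Iterable[str], speed: float = 1.0,
                    ts_key: str = "ts") -> Iterator[Tuple[float, dict]]:
    """Yield (delay_seconds, event) pairs that preserve the recorded pacing."""
    prev = None
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        event = json.loads(line)
        ts = float(event[ts_key])  # "ts" is an assumed field name
        delay = 0.0 if prev is None else max(0.0, ts - prev) / speed
        prev = ts
        yield delay, event


def replay(lines: Iterable[str], emit: Callable[[dict], None],
           speed: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> None:
    """Feed recorded events to `emit`, waiting between them like the live engine did."""
    for delay, event in schedule_replay(lines, speed):
        sleep(delay)
        emit(event)
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting; `replay(open("web/events.jsonl"), print, speed=2.0)` would play a recording at double speed if the format assumption holds.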
What the demo page shows
| Panel | What you're looking at |
|---|---|
| Block pool | One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache. |
| Scheduler | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below. |
| Sequences | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |
Click Send ×2 to fire the same prompt twice. The second request should find the entire prompt in the prefix cache and start decoding almost immediately.
Reading order
If you want to learn the system:
1. request.py: what a request becomes.
2. block_manager.py: read admit() and _take_free_block(); the prefix cache lives here.
3. scheduler.py: read schedule(); the two-phase loop is the heart of continuous batching.
4. model_runner.py → Qwen2Attention.forward: see how Q/K/V get written into and read out of the paged cache.
5. engine.py::_run_loop: how everything is wired step-by-step.
6. server.py: the SSE surface.
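Before opening scheduler.py, it may help to see the shape of a token-budgeted step in isolation. The sketch below is an assumption about what "two-phase" means (decodes first, then prefill chunks with the leftover budget), not a copy of schedule(); Seq, schedule_step and the budget of 8 are invented for the example, and preemption is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Seq:
    name: str
    prompt_len: int
    num_computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def in_prefill(self) -> bool:
        return self.num_computed < self.prompt_len


def schedule_step(seqs: List[Seq], token_budget: int) -> List[Tuple[str, int]]:
    """Return (sequence name, tokens to run) pairs for one engine step."""
    plan: List[Tuple[str, int]] = []
    budget = token_budget

    # Phase 1: every decoding sequence gets exactly one token.
    for s in seqs:
        if not s.in_prefill and budget > 0:
            plan.append((s.name, 1))
            budget -= 1

    # Phase 2: spend what is left on prefill, slicing long prompts into chunks.
    for s in seqs:
        if s.in_prefill and budget > 0:
            chunk = min(s.prompt_len - s.num_computed, budget)
            plan.append((s.name, chunk))
            s.num_computed += chunk
            budget -= chunk
    return plan


seqs = [Seq("decoding", prompt_len=4, num_computed=4), Seq("long", prompt_len=20)]
steps = [schedule_step(seqs, token_budget=8) for _ in range(4)]
for plan in steps:
    print(plan)
```

The 20-token prompt is prefilled over three steps (7, 7, 6 tokens) while the already-running sequence keeps decoding one token per step; in the fourth step both sequences are decoding.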
Known limitations
- CPU-friendly defaults; no custom CUDA / Triton kernels.
- Per-sequence attention loop inside each layer (not packed/varlen-fused).
- Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
- Single-prompt completions (n=1); no beam search.
- No tensor parallel, no quantization.
- Prefix-cache eviction is LRU on the free list, not a reference-counted radix tree of the kind SGLang's RadixAttention uses.
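"LRU on the free list" can be read as follows: a block whose refcount drops to zero is parked at the tail of the free list but keeps its content hash, so a later request can still find it; allocation takes from the head, which evicts the block that has been idle longest. The sketch below illustrates that reading with an OrderedDict. The class and method names are hypothetical and refcounts are left out; it is not the repository's _take_free_block().

```python
from collections import OrderedDict
from typing import Dict, Optional


class LruFreeList:
    """Toy free list where insertion order doubles as recency (oldest first)."""

    def __init__(self, num_blocks: int) -> None:
        self.free: "OrderedDict[int, Optional[str]]" = OrderedDict(
            (b, None) for b in range(num_blocks))
        self.by_hash: Dict[str, int] = {}

    def release(self, block: int, content_hash: Optional[str]) -> None:
        """Last user is gone: park the block at the tail, still discoverable."""
        self.free[block] = content_hash
        if content_hash is not None:
            self.by_hash[content_hash] = block

    def lookup(self, content_hash: str) -> Optional[int]:
        """Prefix-cache hit: if the block is parked, pull it out of the free list."""
        block = self.by_hash.get(content_hash)
        if block is not None and block in self.free:
            del self.free[block]
        return block

    def take_free_block(self) -> int:
        """Evict the least recently freed block and forget its cached content."""
        block, old_hash = self.free.popitem(last=False)
        if old_hash is not None and self.by_hash.get(old_hash) == block:
            del self.by_hash[old_hash]
        return block


fl = LruFreeList(num_blocks=2)
first, second = fl.take_free_block(), fl.take_free_block()
fl.release(first, "hash-A")
fl.release(second, "hash-B")
hit = fl.lookup("hash-A")       # revived from the free list, no recompute needed
evicted = fl.take_free_block()  # only the "hash-B" block is left to evict
```

After the eviction, a lookup for "hash-B" misses, so that prefix would have to be recomputed.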
License
MIT.