---
title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine with paged KV + SSE
---
# tiny_vllm
A **minimal continuous-batching LLM engine** built to be read end-to-end. It
re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of
Python:
- **Paged KV cache** with logical block tables: physical blocks form a flat
pool, and per-sequence block tables map logical positions → physical slots.
- **Automatic prefix caching** via content-addressed hashes: two requests
with the same prompt prefix share KV blocks.
- **Continuous batching with chunked prefill**: each scheduling step packs a
budget of tokens from any mix of new prefills and ongoing decodes; long
prompts are sliced so they don't starve the decoders.
- **Recompute-style preemption**: when the pool runs dry, the youngest
running sequence is evicted and re-enqueued.
- **SSE streaming** over a thin FastAPI layer: both token deltas
(`/generate`, OpenAI-compatible `/v1/completions`) and a parallel engine
event stream (`/engine/events`) that the demo page subscribes to.
- A **visualization demo page** that renders the block pool, scheduler
queues, per-sequence block tables, and live tokens as the engine runs.

It is **not** vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop),
there are no fused or paged-attention kernels, and CPU is the default device.
This is a learning artifact, not a serving stack.
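As a rough illustration of the block-table indirection described above (not the code in `paged_kv.py`), the sketch below maps a token position to a flat slot in the physical pool. `BLOCK_SIZE`, `BlockTable`, and `slot_for` are names made up for this example, and the block size of 4 is an arbitrary assumption.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (assumed value, for illustration only)


@dataclass
class BlockTable:
    """Maps a sequence's logical token positions to physical KV slots."""

    # Index = logical block number, value = physical block id in the pool.
    block_ids: list[int] = field(default_factory=list)

    def slot_for(self, position: int) -> int:
        """Return the flat slot index in the physical pool for a token position."""
        logical_block, offset = divmod(position, BLOCK_SIZE)
        physical_block = self.block_ids[logical_block]
        return physical_block * BLOCK_SIZE + offset


# A 10-token sequence whose three logical blocks live in scattered physical blocks.
table = BlockTable(block_ids=[7, 2, 9])
slots = [table.slot_for(p) for p in range(10)]
print(slots)
```

Positions 0 to 3 land in physical block 7, positions 4 to 7 in block 2, and the rest in block 9, so a sequence's KV never has to be contiguous in memory.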
## Quick start
```bash
pip install -r requirements.txt
# or: pip install -e .
python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu
```
Open [http://localhost:8000](http://localhost:8000) for the live
visualization, or hit the API directly:
```bash
# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
-H 'content-type: application/json' \
-d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'
# A simpler endpoint
curl -N http://localhost:8000/generate \
-H 'content-type: application/json' \
-d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
```
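The same stream can be read from Python with only the standard library. This is a sketch, not a client shipped with the repo: `iter_sse_data` and `stream_generate` are hypothetical helpers, and the payload of each `data:` line is yielded as raw text because its exact shape is not specified here.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of every `data:` field in a raw SSE byte-line stream."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):]
        # The SSE format strips a single leading space after the colon.
        yield payload[1:] if payload.startswith(" ") else payload


def stream_generate(prompt: str, base_url: str = "http://localhost:8000") -> Iterator[str]:
    """POST to /generate and yield raw SSE payloads as they arrive."""
    body = json.dumps({"prompt": prompt, "max_tokens": 48, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object iterates line by line.
        yield from iter_sse_data(response)


# Usage (needs a running server):
#     for payload in stream_generate("haiku about KV caches"):
#         print(payload)
```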
Smoke test with concurrent requests:
```bash
python examples/smoke_client.py # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo # show prefix-cache speedup
```
## The pieces
| File | What |
|---|---|
| `tiny_vllm/config.py` | `EngineConfig`, `SamplingParams` |
| `tiny_vllm/request.py` | `Sequence`, status enum, KV bookkeeping fields |
| `tiny_vllm/block_manager.py` | Physical block pool, refcounts, prefix-cache (hash-chain) |
| `tiny_vllm/scheduler.py` | Continuous batching + chunked prefill + preemption |
| `tiny_vllm/paged_kv.py` | The actual KV tensors that block ids point into |
| `tiny_vllm/model_runner.py` | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
| `tiny_vllm/sampler.py` | Greedy / top-k / top-p |
| `tiny_vllm/engine.py` | Orchestrator: scheduler ⟶ model ⟶ sampler ⟶ outputs + events |
| `tiny_vllm/server.py` | FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/` |
| `web/` | Static demo page (vanilla HTML/CSS/JS, no framework) |

The model-free parts (block manager, scheduler) have unit tests:
```bash
pip install pytest
python -m pytest tests/
```
## Hugging Face Space (live demo)
For a *live* (not recorded) demo you can talk to from any browser, deploy this
repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM,
2 vCPU) fits Qwen2.5-0.5B comfortably.

**One-time setup:**

1. **Create the Space.** Go to [huggingface.co/new-space](https://huggingface.co/new-space):
   - Owner: your HF username
   - Space name: e.g. `tiny-vllm` (must match `HF_SPACE_NAME` below)
   - SDK: **Docker**
   - License: MIT
2. **Generate a write-access token** at
   [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) → New
   token → role **Write**.
3. **Add three secrets** to this GitHub repo (Settings → Secrets and variables
   → Actions → New repository secret):
   - `HF_TOKEN`: the token from step 2
   - `HF_USERNAME`: your HF username
   - `HF_SPACE_NAME`: e.g. `tiny-vllm`

On the next push to `main`, the `Sync to Hugging Face Space` workflow mirrors
the repo to the Space. HF then builds the Docker image (~3–5 min on first
build because of the pre-fetched model) and the Space goes live at:
```
https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space
```
(HF normalises subdomains to lowercase, so `enCoder/tiny-vllm` becomes
`encoder-tiny-vllm.hf.space`.)
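
In code, the mapping is a one-liner. `space_url` is a hypothetical helper written for this README; it lowercases the whole subdomain and does not attempt any other normalisation HF may apply.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the public URL of a Space from its owner and name."""
    subdomain = f"{username}-{space_name}".lower()
    return f"https://{subdomain}.hf.space"


print(space_url("enCoder", "tiny-vllm"))  # https://encoder-tiny-vllm.hf.space
```
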
The GitHub Pages page links to this URL as a **"try live ↗"** pill in the
topbar; update `data-hf-space` on `<body>` in `web/index.html` if your
Space URL differs.

**HF Spaces cost: free.** A cold start (after ~48 h of inactivity) takes ~30 s
while the container wakes; subsequent requests are warm.

**Files involved:**

- `Dockerfile`: CPU-only torch, pre-downloads the model at build time.
- `README.md` frontmatter: HF reads `sdk: docker`, `app_port: 7860`, etc.
- `.github/workflows/sync-huggingface.yml`: mirrors GitHub → HF Spaces.
- CORS is enabled on the server so the GitHub Pages frontend can call the HF
backend cross-origin (`?mode=live&backend=https://...hf.space` is a
potential future addition).
## GitHub Pages demo (replay mode)
The visualization can run as a **static page** on GitHub Pages with no
backend. It plays back a recorded session from `web/events.jsonl`:
1. The repo ships a fabricated `web/events.jsonl` so the page works on first
deploy (run `python scripts/make_demo_recording.py > web/events.jsonl` to
regenerate).
2. To use a **real** recording instead, run the server with `--record`:
```bash
python -m tiny_vllm.server --record web/events.jsonl
# …submit some prompts via the UI or smoke_client…
# Ctrl-C the server. events.jsonl now contains the full session.
git add web/events.jsonl && git commit -m "fresh demo recording" && git push
```
3. Enable Pages once: **repo → Settings → Pages → Source: "GitHub Actions"**.
The workflow in `.github/workflows/deploy-pages.yml` then publishes
`web/` on every push to `main` that touches it.

The page auto-detects mode:
- Tries `/engine/events` SSE first; if it responds within 2s it's **live**.
- Otherwise falls back to **replay**, fetching `events.jsonl` from the same
directory and playing it back with original timing (speed control / pause
/ restart in the controls row).
- Force a mode with `?mode=replay` or `?mode=live`; point at a different
recording with `?session=URL`.
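
Replay with original timing boils down to sleeping for the gap between consecutive events, divided by the playback speed. The sketch below assumes each line of `events.jsonl` is a JSON object with a numeric timestamp in seconds under a key called `t`; that key name is a guess, so adjust it to the actual recording format. The page itself does this in JavaScript; Python is used here only for illustration, and `replay_schedule` and `replay` are hypothetical names.

```python
import json
import time
from typing import Iterable, Iterator


def replay_schedule(lines: Iterable[str], speed: float = 1.0) -> Iterator[tuple[float, dict]]:
    """Yield (delay_seconds, event) pairs that preserve the recorded gaps."""
    previous = None
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the recording
        event = json.loads(line)
        timestamp = event["t"]  # assumed timestamp field, in seconds
        delay = 0.0 if previous is None else max(0.0, timestamp - previous) / speed
        previous = timestamp
        yield delay, event


def replay(path: str, speed: float = 1.0) -> None:
    """Print the events of a JSONL recording with their original timing."""
    with open(path, encoding="utf-8") as f:
        for delay, event in replay_schedule(f, speed):
            time.sleep(delay)
            print(event)


# Usage: replay("web/events.jsonl", speed=2.0) plays the session at double speed.
```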
## What the demo page shows
| Panel | What you're looking at |
|---|---|
| **Block pool** | One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache. |
| **Scheduler** | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below. |
| **Sequences** | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |

Click **Send ×2** to fire the same prompt twice; the second send should hit
the prefix cache for the entire prompt and start decoding almost immediately.
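
One way to see why the second send can reuse every block: with a hash chain, each block's hash covers its own tokens plus the hash of the block before it, so identical prefixes produce identical hashes. The sketch below assumes SHA-256, a block size of 4, and that only full blocks are hashed; `block_hashes` is an illustrative name, not a function from `block_manager.py`.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (assumed value)


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block, chaining in the hash of the previous block."""
    hashes = []
    parent = ""  # the first block has no parent
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])       # two full blocks + 1 leftover token
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42, 43])  # same first 8 tokens
c = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])          # diverges in the second block

print(a == b[:2])    # shared prefix -> same block hashes
print(a[1] == c[1])  # different second block -> different hash
```

Because each hash depends on its parent, a block only matches when everything before it matches too, which is what makes a plain dictionary lookup safe for prefix reuse.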
## Reading order
If you want to learn the system:
1. `request.py`: what a request becomes.
2. `block_manager.py`: read `admit()` and `_take_free_block()`; the prefix
cache lives here.
3. `scheduler.py`: read `schedule()`; the two-phase loop is the heart of
continuous batching.
4. `model_runner.py` → `Qwen2Attention.forward`: see how K/V get written
into, and read back out of, the paged cache.
5. `engine.py::_run_loop`: how everything is wired step-by-step.
6. `server.py`: the SSE surface.
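
Before opening `scheduler.py`, it may help to see a toy version of a two-phase step. This sketch assumes phase one gives every decoding sequence one token and phase two spends the remaining budget on prefill chunks; `Seq`, `token_budget`, and the exact ordering are assumptions for illustration, and block allocation and preemption are left out.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    num_computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def in_prefill(self) -> bool:
        return self.num_computed < self.prompt_len


def schedule(running: list[Seq], waiting: list[Seq], token_budget: int) -> list[tuple[str, int]]:
    """Return (sequence name, tokens to run this step) pairs within the budget."""
    plan = []
    # Phase 1: one token per decoding sequence, so decodes are served first.
    for seq in running:
        if token_budget == 0:
            break
        if not seq.in_prefill:
            plan.append((seq.name, 1))
            token_budget -= 1
    # Phase 2: spend the rest on prefill, slicing long prompts into chunks.
    candidates = [s for s in running if s.in_prefill] + list(waiting)
    for seq in candidates:
        if token_budget == 0:
            break
        chunk = min(seq.prompt_len - seq.num_computed, token_budget)
        plan.append((seq.name, chunk))
        seq.num_computed += chunk
        token_budget -= chunk
        if seq in waiting:  # newly admitted: move it to the running set
            waiting.remove(seq)
            running.append(seq)
    return plan


running = [Seq("a", prompt_len=5, num_computed=5), Seq("b", prompt_len=3, num_computed=3)]
waiting = [Seq("c", prompt_len=20)]
print(schedule(running, waiting, token_budget=8))
```

With a budget of 8, the two decoders take one token each and the 20-token prompt gets a 6-token chunk; later steps keep slicing it until the prompt is fully prefilled and `c` joins the decoders.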
## Known limitations
- CPU-friendly defaults; no custom CUDA / Triton kernels.
- Per-sequence attention loop inside each layer (not packed/varlen-fused).
- Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
- Single-prompt completions (`n=1`); no beam search.
- No tensor parallel, no quantization.
- Prefix-cache eviction is LRU on the free list, not a full
reference-counted radix tree of the kind SGLang's RadixAttention uses.
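
LRU on the free list can be sketched with an ordered dict: freed blocks keep their prefix hash and stay discoverable until they reach the front of the queue and are handed out again. `FreeList` and its methods are illustrative names only, not the `block_manager.py` API, and refcounts for blocks shared between running sequences are omitted.

```python
from collections import OrderedDict
from typing import Optional


class FreeList:
    """Free blocks in LRU order; freed blocks keep their prefix hash until reused."""

    def __init__(self, num_blocks: int) -> None:
        # block id -> prefix hash (None = never hashed); oldest entries come first.
        self.free = OrderedDict((i, None) for i in range(num_blocks))
        # prefix hash -> block id, only for blocks currently on the free list.
        self.by_hash = {}

    def release(self, block_id: int, prefix_hash: Optional[str]) -> None:
        """Return a block to the pool; it stays discoverable by its hash."""
        self.free[block_id] = prefix_hash  # appended at the most-recent end
        if prefix_hash is not None:
            self.by_hash[prefix_hash] = block_id

    def take(self) -> int:
        """Hand out the least recently freed block, dropping its cache entry."""
        block_id, prefix_hash = self.free.popitem(last=False)
        if prefix_hash is not None and self.by_hash.get(prefix_hash) == block_id:
            del self.by_hash[prefix_hash]
        return block_id

    def reuse(self, prefix_hash: str) -> Optional[int]:
        """Prefix-cache hit: pull the matching block back off the free list."""
        block_id = self.by_hash.pop(prefix_hash, None)
        if block_id is None:
            return None
        del self.free[block_id]
        return block_id
```

Calling `take()` on a pool whose oldest free block is hashed silently evicts that cache entry, which is the whole eviction policy in this sketch.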
## License
MIT.