| --- |
| title: tiny_vllm |
| emoji: 🪶 |
| colorFrom: gray |
| colorTo: green |
| sdk: docker |
| app_port: 7860 |
| pinned: false |
| short_description: Minimal continuous-batching engine with paged KV + SSE |
| --- |
| |
| # tiny_vllm |
| |
| A **minimal continuous-batching LLM engine** built to be read end-to-end. It |
| re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of |
| Python: |
| |
| - **Paged KV cache** with logical block tables: physical blocks are a flat |
| pool; per-sequence block tables map logical positions → physical slots. |
| - **Automatic prefix caching** via content-addressed hashes: two requests |
| with the same prompt prefix share KV blocks. |
| - **Continuous batching with chunked prefill**: each scheduling step packs a |
| budget of tokens from any mix of new prefills and ongoing decodes; long |
| prompts are sliced so they don't starve the decoders. |
| - **Recompute-style preemption**: when the pool runs dry, the youngest |
| running sequence is evicted and re-enqueued. |
| - **SSE streaming** over a thin FastAPI layer: both token deltas |
| (`/generate`, OpenAI-compatible `/v1/completions`) and a parallel engine |
| event stream (`/engine/events`) the demo page subscribes to. |
| - A **visualization demo page** that renders the block pool, scheduler |
| queues, per-sequence block tables, and live tokens as the engine runs. |
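
The first two bullets can be illustrated with a toy model. This is a minimal sketch, not the code in `block_manager.py`: `ToyPool`, `block_hashes`, `physical_slot`, the block size of 4, and the use of SHA-256 are all assumptions made for illustration. Each full block is hashed together with the hash of the block before it, so a hash hit implies the whole prefix up to that block matches.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (illustrative value, not the repo default)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every *full* block, chained with the previous block's hash."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        payload = prev + b"|" + ",".join(map(str, chunk)).encode()
        prev = hashlib.sha256(payload).digest()
        hashes.append(prev.hex())
    return hashes


class ToyPool:
    """Flat pool of physical blocks with refcounts and a hash -> block map."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}
        self.by_hash = {}

    def allocate(self, token_ids):
        """Return a block table: one physical block id per logical block."""
        hashes = block_hashes(token_ids)
        num_blocks = -(-len(token_ids) // BLOCK_SIZE)  # ceiling division
        table = []
        for i in range(num_blocks):
            h = hashes[i] if i < len(hashes) else None  # partial tail block
            if h is not None and h in self.by_hash:
                block = self.by_hash[h]      # prefix-cache hit: share it
            else:
                block = self.free.pop(0)     # miss: take a fresh block
                if h is not None:
                    self.by_hash[h] = block
            self.refcount[block] = self.refcount.get(block, 0) + 1
            table.append(block)
        return table


def physical_slot(table, pos, block_size=BLOCK_SIZE):
    """Map a logical token position to a slot in the flat KV pool."""
    return table[pos // block_size] * block_size + pos % block_size
```

Two prompts that share their first eight tokens end up with the same first two physical blocks, while a prompt that differs in its first token shares nothing, even if later blocks contain identical tokens.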
| |
| It is **not** vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop), |
| there are no fused or paged-attention kernels, and CPU is the default device. |
| This is a learning artifact, not a serving stack. |
| |
| ## Quick start |
| |
| ```bash |
| pip install -r requirements.txt |
| # or: pip install -e . |
| |
| python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu |
| ``` |
| |
| Open [http://localhost:8000](http://localhost:8000) for the live |
| visualization, or hit the API directly: |
| |
| ```bash |
| # OpenAI-style streaming |
| curl -N http://localhost:8000/v1/completions \ |
| -H 'content-type: application/json' \ |
| -d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}' |
| |
| # A simpler endpoint |
| curl -N http://localhost:8000/generate \ |
| -H 'content-type: application/json' \ |
| -d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}' |
| ``` |
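
To consume the same stream from Python, the SSE framing has to be parsed by hand. The sketch below assumes the server follows the usual OpenAI streaming convention (one JSON object per `data:` line, terminated by a `[DONE]` sentinel); neither detail is confirmed by this README, and `iter_sse_payloads` is a hypothetical helper name.

```python
import json


def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Simplification: each `data:` line is treated as one complete JSON
    document; multi-line `data:` fields are not joined.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(data)
```

With the `requests` library, the lines could come from `requests.post(url, json=body, stream=True).iter_lines(decode_unicode=True)`. Whether each payload carries its text under `choices[0]["text"]`, as in OpenAI's completions API, should be checked against `server.py`.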
| |
| Smoke test with concurrent requests: |
| |
| ```bash |
| python examples/smoke_client.py # 4 prompts in parallel |
| python examples/smoke_client.py --prefix-demo # show prefix-cache speedup |
| ``` |
| |
| ## The pieces |
| |
| | File | What | |
| |---|---| |
| | `tiny_vllm/config.py` | `EngineConfig`, `SamplingParams` | |
| | `tiny_vllm/request.py` | `Sequence`, status enum, KV bookkeeping fields | |
| | `tiny_vllm/block_manager.py` | Physical block pool, refcounts, prefix-cache (hash-chain) | |
| | `tiny_vllm/scheduler.py` | Continuous batching + chunked prefill + preemption | |
| | `tiny_vllm/paged_kv.py` | The actual KV tensors that block ids point into | |
| | `tiny_vllm/model_runner.py` | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache | |
| | `tiny_vllm/sampler.py` | Greedy / top-k / top-p | |
| | `tiny_vllm/engine.py` | Orchestrator: scheduler ▶ model ▶ sampler ▶ outputs + events | |
| | `tiny_vllm/server.py` | FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/` | |
| | `web/` | Static demo page (vanilla HTML/CSS/JS, no framework) | |
| |
| The model-free parts (block manager, scheduler) have unit tests: |
| |
| ```bash |
| pip install pytest |
| python -m pytest tests/ |
| ``` |
| |
| ## Hugging Face Space: live demo |
| |
| For a *live* (not recorded) demo you can talk to from any browser, deploy this |
| repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM, |
| 2 vCPU) fits Qwen2.5-0.5B comfortably. |
| |
| **One-time setup:** |
| |
| 1. **Create the Space.** Go to [huggingface.co/new-space](https://huggingface.co/new-space): |
| - Owner: your HF username |
| - Space name: e.g. `tiny-vllm` (must match `HF_SPACE_NAME` below) |
| - SDK: **Docker** |
| - License: MIT |
| 2. **Generate a write-access token** at |
| [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) → New |
| token → role **Write**. |
| 3. **Add three secrets** to this GitHub repo (Settings → Secrets and variables |
| → Actions → New repository secret): |
| - `HF_TOKEN`: the token from step 2 |
| - `HF_USERNAME`: your HF username |
| - `HF_SPACE_NAME`: e.g. `tiny-vllm` |
| |
| On the next push to `main`, the `Sync to Hugging Face Space` workflow mirrors |
| the repo to the Space. HF then builds the Docker image (~3–5 min on first |
| build because of the pre-fetched model) and the Space goes live at: |
| |
| ``` |
| https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space |
| ``` |
| |
| (HF normalises subdomains to lowercase: `enCoder/tiny-vllm` becomes |
| `encoder-tiny-vllm.hf.space`.) |
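
The URL rule above is small enough to write down. `space_url` is a hypothetical helper, and it covers only the lowercasing described here; any other normalisation HF may apply is out of scope.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the public Space URL: <user>-<space>.hf.space, lowercased."""
    return f"https://{username}-{space_name}.hf.space".lower()
```

For example, `space_url("enCoder", "tiny-vllm")` gives `https://encoder-tiny-vllm.hf.space`, matching the case in the note above.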
| |
| The GH Pages page links to this URL as a **"try live →"** pill in the |
| topbar; update `data-hf-space` on `<body>` in `web/index.html` if your |
| Space URL differs. |
| |
| **HF Spaces cost: free.** Cold-start (after ~48 h of inactivity) takes ~30 s |
| while the container wakes; subsequent requests are warm. |
| |
| **Files involved:** |
| - `Dockerfile`: CPU-only torch, pre-downloads the model at build time. |
| - `README.md` frontmatter: HF reads `sdk: docker`, `app_port: 7860`, etc. |
| - `.github/workflows/sync-huggingface.yml`: mirrors GitHub → HF Spaces. |
| - CORS is enabled on the server so the GH Pages frontend can call the HF |
| backend cross-origin (`?mode=live&backend=https://...hf.space` is a |
| potential future addition). |
| |
| ## GitHub Pages demo (replay mode) |
| |
| The visualization can run as a **static page** on GitHub Pages with no |
| backend. It plays back a recorded session from `web/events.jsonl`: |
| |
| 1. The repo ships a fabricated `web/events.jsonl` so the page works on first |
| deploy (run `python scripts/make_demo_recording.py > web/events.jsonl` to |
| regenerate). |
| 2. To use a **real** recording instead, run the server with `--record`: |
| ```bash |
| python -m tiny_vllm.server --record web/events.jsonl |
| # …submit some prompts via the UI or smoke_client… |
| # Ctrl-C the server. events.jsonl now contains the full session. |
| git add web/events.jsonl && git commit -m "fresh demo recording" && git push |
| ``` |
| 3. Enable Pages once: **repo → Settings → Pages → Source: "GitHub Actions"**. |
| The workflow in `.github/workflows/deploy-pages.yml` then publishes |
| `web/` on every push to `main` that touches it. |
| |
| The page auto-detects mode: |
| - Tries `/engine/events` SSE first; if it responds within 2 s it's **live**. |
| - Otherwise falls back to **replay**, fetching `events.jsonl` from the same |
| directory and playing it back with original timing (speed control / pause |
| / restart in the controls row). |
| - Force a mode with `?mode=replay` or `?mode=live`; point at a different |
| recording with `?session=URL`. |
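
Replay "with original timing" amounts to turning recorded timestamps into delays between events. The demo page does this in JavaScript; the sketch below shows the same idea in Python. It assumes each JSONL record carries a numeric timestamp in seconds under a `ts` key, which is a guess at the recording format rather than something documented here.

```python
import json


def replay_schedule(jsonl_text, speed=1.0, ts_field="ts"):
    """Return [(delay_seconds, event)] for playing a recording back.

    `ts_field` is an assumed field name; `speed=2.0` plays twice as fast.
    """
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    schedule, prev_ts = [], None
    for event in events:
        ts = event[ts_field]
        # The first event fires immediately; later ones wait for the
        # recorded gap, scaled by the playback speed.
        delay = 0.0 if prev_ts is None else max(0.0, ts - prev_ts) / speed
        schedule.append((delay, event))
        prev_ts = ts
    return schedule
```

A player would then sleep for each delay before dispatching the event; pausing simply stops consuming the schedule.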
| |
| ## What the demo page shows |
| |
| | Panel | What you're looking at | |
| |---|---| |
| | **Block pool** | One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache. | |
| | **Scheduler** | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below. | |
| | **Sequences** | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. | |
| |
| Click **Send ×2** to fire the same prompt twice: the second send should |
| hit the prefix cache for the entire prompt and start decoding almost immediately. |
| |
| ## Reading order |
| |
| If you want to learn the system: |
| |
| 1. `request.py`: what a request becomes. |
| 2. `block_manager.py`: read `admit()` and `_take_free_block()`; the prefix |
| cache lives here. |
| 3. `scheduler.py`: read `schedule()`; the two-phase loop is the heart of |
| continuous batching. |
| 4. `model_runner.py` → `Qwen2Attention.forward`: see how K/V get written |
| into and read out of the paged cache. |
| 5. `engine.py::_run_loop`: how everything is wired step-by-step. |
| 6. `server.py`: the SSE surface. |
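
Before opening `scheduler.py`, it may help to see the token-budget idea in isolation. The following is one plausible reading of a two-phase step (decodes first, then prefill chunks), not the repo's actual `schedule()`: `Seq`, `schedule_step`, and the decode-first ordering are assumptions, and block allocation and preemption are left out.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: int
    prompt_len: int
    num_computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def in_prefill(self):
        return self.num_computed < self.prompt_len


def schedule_step(running, waiting, token_budget):
    """Plan one engine step as [(seq_id, num_tokens)] within the budget."""
    plan, budget = [], token_budget
    # Phase 1: every sequence that is already decoding gets one token.
    for seq in running:
        if budget == 0:
            break
        if not seq.in_prefill:
            plan.append((seq.seq_id, 1))
            budget -= 1
    # Phase 2: spend what is left on prefill, slicing long prompts so a
    # single request cannot consume more than the remaining budget.
    for seq in [s for s in running if s.in_prefill] + list(waiting):
        if budget == 0:
            break
        chunk = min(seq.prompt_len - seq.num_computed, budget)
        plan.append((seq.seq_id, chunk))
        seq.num_computed += chunk
        budget -= chunk
    return plan
```

With a budget of 8, two decoding sequences, and a new 20-token prompt, each step spends 2 tokens on decode and up to 6 on the prompt, so the prompt finishes prefill over four steps while the decoders keep producing a token every step.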
| |
| ## Known limitations |
| |
| - CPU-friendly defaults; no custom CUDA / Triton kernels. |
| - Per-sequence attention loop inside each layer (not packed/varlen-fused). |
| - Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP). |
| - One completion per prompt (`n=1`); no beam search. |
| - No tensor parallel, no quantization. |
| - Prefix-cache eviction is LRU on the free list, not the full |
| reference-counted radix tree SGLang ships. |
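
"LRU on the free list" can be pictured as follows. This is a sketch under assumptions, not the code in `block_manager.py`: `LruFreeList` and its method names are hypothetical, and it models only blocks whose refcount has dropped to zero. Released blocks keep their content hash until they are handed out again, so a later request with the same prefix can still reclaim them.

```python
from collections import OrderedDict


class LruFreeList:
    """Free blocks in release order; hashed ones stay reusable until evicted."""

    def __init__(self, block_ids):
        # block id -> content hash (None = never hashed). Oldest entry first.
        self.free = OrderedDict((b, None) for b in block_ids)
        self.by_hash = {}  # content hash -> parked block id

    def release(self, block, content_hash=None):
        """Refcount hit zero: park the block at the most-recent end."""
        self.free[block] = content_hash
        if content_hash is not None:
            self.by_hash[content_hash] = block

    def reuse(self, content_hash):
        """Prefix hit on a parked block: pull it out without evicting."""
        block = self.by_hash.pop(content_hash, None)
        if block is not None:
            del self.free[block]
        return block

    def take(self):
        """Need a fresh block: evict the least recently released one."""
        block, old_hash = self.free.popitem(last=False)
        if old_hash is not None and self.by_hash.get(old_hash) == block:
            del self.by_hash[old_hash]  # its cached contents are now gone
        return block
```

Never-used blocks sit at the front of the list, so they are handed out before any cached block is evicted.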
| |
| ## License |
| |
| MIT. |
| |