---
title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine - paged KV + SSE
---
# tiny_vllm
A **minimal continuous-batching LLM engine** built to be read end-to-end. It
re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of
Python:
- **Paged KV cache** with logical block tables: physical blocks are a flat
pool; per-sequence block tables map logical positions → physical slots.
- **Automatic prefix caching** via content-addressed hashes: two requests
with the same prompt prefix share KV blocks.
- **Continuous batching with chunked prefill**: each scheduling step packs a
budget of tokens from any mix of new prefills and ongoing decodes; long
prompts are sliced so they don't starve the decoders.
- **Recompute-style preemption**: when the pool runs dry, the youngest
running sequence is evicted and re-enqueued.
- **SSE streaming** over a thin FastAPI layer: both token deltas
(`/generate`, OpenAI-compatible `/v1/completions`) and a parallel engine
event stream (`/engine/events`) the demo page subscribes to.
- A **visualization demo page** that renders the block pool, scheduler
queues, per-sequence block tables, and live tokens as the engine runs.
It is **not** vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop),
there are no fused or paged-attention kernels, and CPU is the default device.
This is a learning artifact, not a serving stack.
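To make the block-table idea concrete, here is a minimal sketch of the logical-to-physical mapping. `BLOCK_SIZE` and `slot_for` are illustrative assumptions, not names taken from `tiny_vllm`; the real bookkeeping lives in `block_manager.py` and `paged_kv.py`.

```python
BLOCK_SIZE = 16  # tokens per physical block (assumed value for illustration)


def slot_for(block_table: list[int], position: int) -> tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    logical_block, offset = divmod(position, BLOCK_SIZE)
    return block_table[logical_block], offset


# A sequence whose three logical blocks live in scattered physical blocks.
block_table = [7, 2, 9]
print(slot_for(block_table, 0))   # first token of the sequence
print(slot_for(block_table, 37))  # token 37 falls in logical block 2
```

Because the table is just a list of block ids, two sequences can point at the same physical block, which is what makes prefix sharing possible.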
## Quick start
```bash
pip install -r requirements.txt
# or: pip install -e .
python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu
```
Open [http://localhost:8000](http://localhost:8000) for the live
visualization, or hit the API directly:
```bash
# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
-H 'content-type: application/json' \
-d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'
# A simpler endpoint
curl -N http://localhost:8000/generate \
-H 'content-type: application/json' \
-d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
```
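If you would rather consume the stream from Python, the sketch below parses SSE lines into text deltas. It assumes the server follows the usual OpenAI completions streaming shape (`data: {...}` chunks carrying `choices[0].text`, ended by `data: [DONE]`); that shape is an assumption here and has not been checked against `server.py`. `iter_text_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed chunk shape)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]


# Canned lines standing in for a real HTTP response body.
sample = [
    'data: {"choices": [{"text": "Paged"}]}',
    "",
    'data: {"choices": [{"text": " attention"}]}',
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample)))
```

With a real server you would feed this the decoded lines of the streaming response instead of `sample`.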
Smoke test with concurrent requests:
```bash
python examples/smoke_client.py # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo # show prefix-cache speedup
```
## The pieces
| File | What |
|---|---|
| `tiny_vllm/config.py` | `EngineConfig`, `SamplingParams` |
| `tiny_vllm/request.py` | `Sequence`, status enum, KV bookkeeping fields |
| `tiny_vllm/block_manager.py` | Physical block pool, refcounts, prefix-cache (hash-chain) |
| `tiny_vllm/scheduler.py` | Continuous batching + chunked prefill + preemption |
| `tiny_vllm/paged_kv.py` | The actual KV tensors that block ids point into |
| `tiny_vllm/model_runner.py` | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
| `tiny_vllm/sampler.py` | Greedy / top-k / top-p |
| `tiny_vllm/engine.py` | Orchestrator: scheduler → model → sampler → outputs + events |
| `tiny_vllm/server.py` | FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/` |
| `web/` | Static demo page (vanilla HTML/CSS/JS, no framework) |
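As a rough illustration of what greedy, top-k, and top-p selection mean, here is a torch-free sketch. It is not the code in `sampler.py`; the function name `sample_token` and its parameters are assumptions made for this example.

```python
import math
import random


def sample_token(logits, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits; top_k=1 behaves like greedy decoding."""
    # Softmax, shifted by the max logit for numerical stability.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    # Rank candidates from most to least probable.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Nucleus filter: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        mass += prob
        if mass >= top_p:
            break
    probs, ids = zip(*kept)
    # random.choices renormalises the surviving weights for us.
    return rng.choices(ids, weights=probs, k=1)[0]


print(sample_token([0.1, 2.0, -1.0], top_k=1))  # greedy: index of the max logit
```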
The model-free parts (block manager, scheduler) have unit tests:
```bash
pip install pytest
python -m pytest tests/
```
## Hugging Face Space: live demo
For a *live* (not recorded) demo you can talk to from any browser, deploy this
repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM,
2 vCPU) fits Qwen2.5-0.5B comfortably.
**One-time setup:**
1. **Create the Space.** Go to [huggingface.co/new-space](https://huggingface.co/new-space):
- Owner: your HF username
- Space name: e.g. `tiny-vllm` (must match `HF_SPACE_NAME` below)
- SDK: **Docker**
- License: MIT
2. **Generate a write-access token** at
[huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) → New
token → role **Write**.
3. **Add three secrets** to this GitHub repo (Settings → Secrets and variables
→ Actions → New repository secret):
- `HF_TOKEN`: the token from step 2
- `HF_USERNAME`: your HF username
- `HF_SPACE_NAME`: e.g. `tiny-vllm`
On the next push to `main`, the `Sync to Hugging Face Space` workflow mirrors
the repo to the Space. HF then builds the Docker image (~3–5 min on first
build because of the pre-fetched model) and the Space goes live at:
```
https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space
```
(HF normalises subdomains to lowercase: `enCoder/tiny-vllm` becomes
`encoder-tiny-vllm.hf.space`.)
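The URL rule above fits in a one-line helper. `space_url` is a hypothetical name, and the sketch only applies the lowercasing described here; any other normalisation HF may perform on unusual characters is not handled.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the Space URL, lowercasing the subdomain as described above."""
    return f"https://{username}-{space_name}.hf.space".lower()


print(space_url("enCoder", "tiny-vllm"))
```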
The GH Pages page links to this URL as a **"try live →"** pill in the
topbar; update `data-hf-space` on `<body>` in `web/index.html` if your
Space URL differs.
**HF Spaces cost: free.** Cold-start (after ~48 h of inactivity) takes ~30 s
while the container wakes; subsequent requests are warm.
**Files involved:**
- `Dockerfile`: CPU-only torch, pre-downloads the model at build time.
- `README.md` frontmatter: HF reads `sdk: docker`, `app_port: 7860`, etc.
- `.github/workflows/sync-huggingface.yml`: mirrors GitHub → HF Spaces.
- CORS is enabled on the server so the GH Pages frontend can call the HF
backend cross-origin (`?mode=live&backend=https://...hf.space` is a
potential future addition).
## GitHub Pages demo (replay mode)
The visualization can run as a **static page** on GitHub Pages with no
backend. It plays back a recorded session from `web/events.jsonl`:
1. The repo ships a fabricated `web/events.jsonl` so the page works on first
deploy (run `python scripts/make_demo_recording.py > web/events.jsonl` to
regenerate).
2. To use a **real** recording instead, run the server with `--record`:
```bash
python -m tiny_vllm.server --record web/events.jsonl
# …submit some prompts via the UI or smoke_client…
# Ctrl-C the server. events.jsonl now contains the full session.
git add web/events.jsonl && git commit -m "fresh demo recording" && git push
```
3. Enable Pages once: **repo → Settings → Pages → Source: "GitHub Actions"**.
The workflow in `.github/workflows/deploy-pages.yml` then publishes
`web/` on every push to `main` that touches it.
The page auto-detects mode:
- Tries `/engine/events` SSE first; if it responds within 2 s, the page runs **live**.
- Otherwise falls back to **replay**, fetching `events.jsonl` from the same
directory and playing it back with original timing (speed control / pause
/ restart in the controls row).
- Force a mode with `?mode=replay` or `?mode=live`; point at a different
recording with `?session=URL`.
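Playing a recording back "with original timing" boils down to waiting for the gap between consecutive timestamps, divided by the speed factor. The demo page does this in JavaScript; the Python sketch below only illustrates the arithmetic. The field names `t` and `kind` are hypothetical, since the actual `events.jsonl` schema is not shown here.

```python
import json


def replay_delays(jsonl_text: str, speed: float = 1.0):
    """Return (seconds_to_wait, event) pairs that preserve recorded timing."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    schedule, previous = [], None
    for event in events:
        ts = event["t"]  # assumed field: seconds since the recording started
        wait = 0.0 if previous is None else (ts - previous) / speed
        schedule.append((wait, event))
        previous = ts
    return schedule


recording = '{"t": 0.0, "kind": "step"}\n{"t": 0.5, "kind": "token"}\n'
for wait, event in replay_delays(recording, speed=2.0):
    print(wait, event["kind"])
```

A player would sleep for `wait` seconds before dispatching each event; pausing just stops consuming the schedule.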
## What the demo page shows
| Panel | What you're looking at |
|---|---|
| **Block pool** | One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache. |
| **Scheduler** | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below. |
| **Sequences** | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |
Click **Send ×2** to fire the same prompt twice; the second send should
prefix-cache the entire prompt and start decoding almost immediately.
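The reason the second send can hit the cache is that block hashes are chained: each block's hash covers its own tokens plus the hash of the block before it, so identical prefixes produce identical hash sequences. The sketch below assumes SHA-256, a block size of 4, and that only full blocks are hashed; `block_hashes` is a hypothetical helper, not the function in `block_manager.py`.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (small assumed value for readability)


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore the partial tail
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest  # chain: the next hash depends on this one
    return hashes


first = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9])
second = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42])
print(first == second)  # the prompts differ only in the unhashed partial block
```

Chaining matters because a block with the same tokens but a different history must not be shared: its keys and values depend on everything that came before it.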
## Reading order
If you want to learn the system:
1. `request.py`: what a request becomes.
2. `block_manager.py`: read `admit()` and `_take_free_block()`; the prefix
cache lives here.
3. `scheduler.py`: read `schedule()`; the two-phase loop is the heart of
continuous batching.
4. `model_runner.py`, `Qwen2Attention.forward`: see how Q/K/V get written
into and read out of the paged cache.
5. `engine.py::_run_loop`: how everything is wired step-by-step.
6. `server.py`: the SSE surface.
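Before opening `scheduler.py`, it may help to see the shape of a token-budgeted step. The sketch below assumes the two phases are "decodes first, then prefills with whatever budget is left"; that ordering, the `Seq` class, and this `schedule` signature are illustrative assumptions, and preemption is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def in_prefill(self) -> bool:
        return self.computed < self.prompt_len


def schedule(seqs: list[Seq], token_budget: int) -> list[tuple[str, int]]:
    """Return (sequence name, tokens to run this step) within the budget."""
    plan = []
    # Phase 1 (assumed order): each decoding sequence gets one token.
    for seq in seqs:
        if not seq.in_prefill and token_budget > 0:
            plan.append((seq.name, 1))
            token_budget -= 1
    # Phase 2: leftover budget goes to prefills, sliced into chunks.
    for seq in seqs:
        if seq.in_prefill and token_budget > 0:
            chunk = min(seq.prompt_len - seq.computed, token_budget)
            plan.append((seq.name, chunk))
            seq.computed += chunk
            token_budget -= chunk
    return plan


seqs = [Seq("a", prompt_len=8, computed=8), Seq("b", prompt_len=100)]
print(schedule(seqs, token_budget=32))  # "a" decodes, "b" gets a prefill chunk
print(schedule(seqs, token_budget=32))  # "b" continues from where it stopped
```

The 100-token prompt never blocks the decoding sequence: it is processed in slices across several steps while `a` keeps producing one token per step.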
## Known limitations
- CPU-friendly defaults; no custom CUDA / Triton kernels.
- Per-sequence attention loop inside each layer (not packed/varlen-fused).
- Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
- Single-prompt completions (`n=1`); no beam search.
- No tensor parallel, no quantization.
- Prefix-cache eviction is LRU on the free list, not the full
reference-counted radix tree vLLM ships.
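For readers unfamiliar with the "LRU on the free list" idea in the last item, here is a minimal sketch: freed blocks keep their content hash and stay discoverable until the allocator reclaims them, oldest first. The `FreeList` class and its method names are hypothetical, and reference counting for shared in-use blocks is deliberately omitted.

```python
from collections import OrderedDict
from typing import Optional


class FreeList:
    """Free blocks in LRU order; cached blocks stay findable until reused."""

    def __init__(self, num_blocks: int):
        # block id -> content hash (None if the block was never hashed)
        self.free = OrderedDict((b, None) for b in range(num_blocks))
        self.by_hash = {}  # content hash -> block id

    def release(self, block_id: int, content_hash: Optional[str] = None) -> None:
        """Return a block to the tail of the list, keeping its hash if any."""
        self.free[block_id] = content_hash
        if content_hash is not None:
            self.by_hash[content_hash] = block_id

    def take(self) -> int:
        """Evict the least recently released block and drop its cache entry."""
        block_id, content_hash = self.free.popitem(last=False)
        if content_hash is not None:
            self.by_hash.pop(content_hash, None)
        return block_id

    def reuse(self, content_hash: str) -> Optional[int]:
        """Prefix-cache hit: pull a cached block back out of the free list."""
        block_id = self.by_hash.get(content_hash)
        if block_id is not None:
            self.free.pop(block_id, None)  # no longer eligible for eviction
        return block_id


pool = FreeList(3)
block = pool.take()          # oldest free block
pool.release(block, "h0")    # freed, but still cached under its hash
print(pool.reuse("h0"))      # cache hit returns the same block id
```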
## License
MIT.