---
title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine, paged KV + SSE
---

# tiny_vllm

A **minimal continuous-batching LLM engine** built to be read end-to-end.  It
re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of
Python:

- **Paged KV cache** with logical block tables: physical blocks are a flat
  pool; per-sequence block tables map logical positions → physical slots.
- **Automatic prefix caching** via content-addressed hashes: two requests
  with the same prompt prefix share KV blocks.
- **Continuous batching with chunked prefill**: each scheduling step packs a
  budget of tokens from any mix of new prefills and ongoing decodes; long
  prompts are sliced so they don't starve the decoders.
- **Recompute-style preemption**: when the pool runs dry, the youngest
  running sequence is evicted and re-enqueued.
- **SSE streaming** over a thin FastAPI layer: both token deltas
  (`/generate`, OpenAI-compatible `/v1/completions`) and a parallel engine
  event stream (`/engine/events`) the demo page subscribes to.
- A **visualization demo page** that renders the block pool, scheduler
  queues, per-sequence block tables, and live tokens as the engine runs.

It is **not** vLLM.  Attention runs in plain PyTorch SDPA (per-sequence loop),
there are no fused or paged-attention kernels, and CPU is the default device.
This is a learning artifact, not a serving stack.
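
To make the block-table idea concrete, here is a minimal sketch of the logical-to-physical mapping. The names `physical_slot` and `BLOCK_SIZE` and the block size of 4 are illustrative assumptions, not identifiers from this repo:

```python
BLOCK_SIZE = 4  # tokens per physical block (hypothetical value)


def physical_slot(block_table: list[int], position: int,
                  block_size: int = BLOCK_SIZE) -> int:
    """Map a logical token position to a flat slot index in the physical pool."""
    # Which logical block the token falls in, and where inside that block.
    logical_block, offset = divmod(position, block_size)
    # The block table translates the logical block index to a physical block id.
    return block_table[logical_block] * block_size + offset


# A sequence whose three logical blocks live in physical blocks 7, 2 and 9.
block_table = [7, 2, 9]
slots = [physical_slot(block_table, p) for p in range(10)]
print(slots)
```

Consecutive logical positions stay contiguous only within a block; across block boundaries they can land anywhere in the pool, which is what lets blocks be allocated and shared independently.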

## Quick start

```bash
pip install -r requirements.txt
# or: pip install -e .

python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu
```

Open [http://localhost:8000](http://localhost:8000) for the live
visualization, or hit the API directly:

```bash
# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
  -H 'content-type: application/json' \
  -d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'

# A simpler endpoint
curl -N http://localhost:8000/generate \
  -H 'content-type: application/json' \
  -d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
```
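
If you would rather consume the stream from Python, the sketch below parses SSE lines into text deltas. It assumes the usual OpenAI-style wire format (`data: {json}` chunks with the text under `choices[0].text`, terminated by `data: [DONE]`); check `server.py` for the payload this server actually emits. `iter_sse_text` is a hypothetical helper, and the sample lines are made up for illustration:

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an iterable of SSE lines (assumed OpenAI-style)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        yield chunk["choices"][0]["text"]


# Hand-written sample stream, not captured server output.
sample = [
    'data: {"choices": [{"text": "Paged"}]}',
    "",
    'data: {"choices": [{"text": " attention"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))
```

With a real connection, the same function can be fed the decoded lines of a streaming HTTP response.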

Smoke test with concurrent requests:

```bash
python examples/smoke_client.py            # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo   # show prefix-cache speedup
```

## The pieces

| File | What |
|---|---|
| `tiny_vllm/config.py` | `EngineConfig`, `SamplingParams` |
| `tiny_vllm/request.py` | `Sequence`, status enum, KV bookkeeping fields |
| `tiny_vllm/block_manager.py` | Physical block pool, refcounts, prefix-cache (hash-chain) |
| `tiny_vllm/scheduler.py` | Continuous batching + chunked prefill + preemption |
| `tiny_vllm/paged_kv.py` | The actual KV tensors that block ids point into |
| `tiny_vllm/model_runner.py` | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
| `tiny_vllm/sampler.py` | Greedy / top-k / top-p |
| `tiny_vllm/engine.py` | Orchestrator: scheduler → model → sampler → outputs + events |
| `tiny_vllm/server.py` | FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/` |
| `web/` | Static demo page (vanilla HTML/CSS/JS, no framework) |

The model-free parts (block manager, scheduler) have unit tests:

```bash
pip install pytest
python -m pytest tests/
```

## Hugging Face Space: live demo

For a *live* (not recorded) demo you can talk to from any browser, deploy this
repo as a Docker-based Hugging Face Space.  HF's free CPU tier (16 GB RAM,
2 vCPU) fits Qwen2.5-0.5B comfortably.

**One-time setup:**

1. **Create the Space.**  Go to [huggingface.co/new-space](https://huggingface.co/new-space):
   - Owner: your HF username
   - Space name: e.g. `tiny-vllm` (must match `HF_SPACE_NAME` below)
   - SDK: **Docker**
   - License: MIT
2. **Generate a write-access token** at
   [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) → New
   token → role **Write**.
3. **Add three secrets** to this GitHub repo (Settings → Secrets and variables
   → Actions → New repository secret):
   - `HF_TOKEN`: the token from step 2
   - `HF_USERNAME`: your HF username
   - `HF_SPACE_NAME`: e.g. `tiny-vllm`

On the next push to `main`, the `Sync to Hugging Face Space` workflow mirrors
the repo to the Space.  HF then builds the Docker image (~3–5 min on first
build because of the pre-fetched model) and the Space goes live at:

```
https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space
```

(HF normalises subdomains to lowercase: `enCoder/tiny-vllm` becomes
`encoder-tiny-vllm.hf.space`.)
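
As a quick check of the URL you should expect, the rule above can be written as a small helper. `space_url` is a hypothetical name, and the sketch covers only the lowercasing described here, not any other normalisation HF may apply:

```python
def space_url(username: str, space_name: str) -> str:
    """Build the expected Space URL; the whole subdomain is lowercased."""
    return f"https://{username}-{space_name}.hf.space".lower()


print(space_url("enCoder", "tiny-vllm"))
```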

The GH Pages page links to this URL as a **"try live ↗"** pill in the
topbar; update `data-hf-space` on `<body>` in `web/index.html` if your
Space URL differs.

**HF Spaces cost: free.**  Cold-start (after ~48 h of inactivity) takes ~30 s
while the container wakes; subsequent requests are warm.

**Files involved:**
- `Dockerfile`: CPU-only torch, pre-downloads the model at build time.
- `README.md` frontmatter: HF reads `sdk: docker`, `app_port: 7860`, etc.
- `.github/workflows/sync-huggingface.yml`: mirrors GitHub → HF Spaces.
- CORS is enabled on the server so the GH Pages frontend can call the HF
  backend cross-origin (`?mode=live&backend=https://...hf.space` is a
  potential future addition).

## GitHub Pages demo (replay mode)

The visualization can run as a **static page** on GitHub Pages with no
backend.  It plays back a recorded session from `web/events.jsonl`:

1. The repo ships a fabricated `web/events.jsonl` so the page works on first
   deploy (run `python scripts/make_demo_recording.py > web/events.jsonl` to
   regenerate).
2. To use a **real** recording instead, run the server with `--record`:
   ```bash
   python -m tiny_vllm.server --record web/events.jsonl
   # …submit some prompts via the UI or smoke_client…
   # Ctrl-C the server.  events.jsonl now contains the full session.
   git add web/events.jsonl && git commit -m "fresh demo recording" && git push
   ```
3. Enable Pages once: **repo → Settings → Pages → Source: "GitHub Actions"**.
   The workflow in `.github/workflows/deploy-pages.yml` then publishes
   `web/` on every push to `main` that touches it.

The page auto-detects mode:
- Tries `/engine/events` SSE first; if it responds within 2s it's **live**.
- Otherwise falls back to **replay**, fetching `events.jsonl` from the same
  directory and playing it back with original timing (speed control / pause
  / restart in the controls row).
- Force a mode with `?mode=replay` or `?mode=live`; point at a different
  recording with `?session=URL`.
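
The replay timing can be sketched in a few lines. The real player is JavaScript in `web/`; this Python version only illustrates the idea, and the `ts` timestamp field (seconds) and the `replay_delays` helper are assumptions rather than the recording's actual schema:

```python
import json


def replay_delays(jsonl_text: str, speed: float = 1.0,
                  ts_key: str = "ts") -> list[tuple[float, dict]]:
    """Return (seconds to wait before emitting, event) pairs for a recording."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    plan = []
    prev_ts = None
    for event in events:
        ts = event[ts_key]
        # Original gap between consecutive events, scaled by the playback speed.
        gap = 0.0 if prev_ts is None else max(0.0, ts - prev_ts)
        plan.append((gap / speed, event))
        prev_ts = ts
    return plan


# Hand-written recording with a hypothetical schema.
recording = "\n".join([
    '{"ts": 0.0, "type": "step"}',
    '{"ts": 0.5, "type": "token"}',
    '{"ts": 2.0, "type": "step"}',
])
for delay, event in replay_delays(recording, speed=2.0):
    print(f"wait {delay:.2f}s then emit {event['type']}")
```

A player would sleep for each delay before dispatching the event; pausing just stops consuming the plan, and restarting begins again from the first pair.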

## What the demo page shows

| Panel | What you're looking at |
|---|---|
| **Block pool** | One cell per physical block.  Color = state (free / cached-evictable / in-use / shared).  Orange border = the block has been hashed and is discoverable in the prefix cache. |
| **Scheduler** | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count.  Step log scrolls below. |
| **Sequences** | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |

Click **Send ×2** to fire the same prompt twice; the second send should
prefix-cache the entire prompt and start decoding almost immediately.
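
Why the second send can reuse the first one's blocks: with a hash chain, each full block's hash covers its own tokens plus the hash of the block before it, so equal hashes imply an equal prefix. The sketch below is an illustration under assumptions (SHA-256, only full blocks hashed, hypothetical function names); `block_manager.py` has the real scheme:

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int) -> list[str]:
    """Hash each full block together with its parent's hash (a hash chain)."""
    hashes = []
    parent = ""
    for i in range(len(token_ids) // block_size):  # partial tail block is skipped
        block = token_ids[i * block_size:(i + 1) * block_size]
        data = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(data.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: list[int], b: list[int], block_size: int) -> int:
    """Count the leading blocks two token sequences could share."""
    count = 0
    for ha, hb in zip(block_hashes(a, block_size), block_hashes(b, block_size)):
        if ha != hb:
            break
        count += 1
    return count


first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
second = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]
print(shared_prefix_blocks(first, first, 4))   # identical prompts
print(shared_prefix_blocks(first, second, 4))  # same first 8 tokens
```

Chaining matters: a block with the same tokens but a different history gets a different hash, so it is never mistaken for a shareable prefix block.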

## Reading order

If you want to learn the system:

1. `request.py`: what a request becomes.
2. `block_manager.py`: read `admit()` and `_take_free_block()`; the prefix
   cache lives here.
3. `scheduler.py`: read `schedule()`; the two-phase loop is the heart of
   continuous batching.
4. `model_runner.py` → `Qwen2Attention.forward`: see how K/V get written
   into and read out of the paged cache.
5. `engine.py::_run_loop`: how everything is wired step-by-step.
6. `server.py`: the SSE surface.
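
Before opening `scheduler.py`, it may help to see the shape of a token-budgeted, two-phase step in isolation. This is a simplified sketch, not the repo's `schedule()`: the `Seq` class, the field names, and the choice to serve decodes before prefill chunks are assumptions made for illustration, and block allocation and preemption are left out entirely:

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    num_computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def is_decoding(self) -> bool:
        return self.num_computed >= self.prompt_len


def schedule(seqs: list[Seq], token_budget: int) -> list[tuple[str, int]]:
    """Return (sequence name, tokens to run this step) pairs within the budget."""
    plan = []
    budget = token_budget
    # Phase 1 (assumed order): each decoding sequence needs exactly one token.
    for seq in seqs:
        if seq.is_decoding and budget > 0:
            plan.append((seq.name, 1))
            budget -= 1
    # Phase 2: spend what is left on prefill, slicing long prompts into chunks.
    for seq in seqs:
        if not seq.is_decoding and budget > 0:
            chunk = min(budget, seq.prompt_len - seq.num_computed)
            plan.append((seq.name, chunk))
            budget -= chunk
    return plan


seqs = [Seq("a", 5, 5), Seq("b", 3, 3), Seq("c", 100, 0)]
print(schedule(seqs, token_budget=16))
```

Here the 100-token prompt `c` gets only the 14 tokens left after the two decodes, so it finishes prefill over several steps instead of stalling `a` and `b`.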

## Known limitations

- CPU-friendly defaults; no custom CUDA / Triton kernels.
- Per-sequence attention loop inside each layer (not packed/varlen-fused).
- Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
- Single-prompt completions (`n=1`); no beam search.
- No tensor parallel, no quantization.
- Prefix-cache eviction is LRU on the free list, not the full
  reference-counted radix tree SGLang ships.
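
For readers unfamiliar with the "LRU on the free list" approach, here is a minimal sketch of the idea: freed blocks keep their prefix hash while they sit in the free list, and a block only loses its cached content when it is handed out again, oldest first. The `FreeList` class and its method names are hypothetical, not the API of `block_manager.py`:

```python
from collections import OrderedDict
from typing import Optional


class FreeList:
    """Free blocks in least-recently-released order; values are cached hashes."""

    def __init__(self, block_ids: list[int]) -> None:
        self._free: "OrderedDict[int, Optional[str]]" = OrderedDict(
            (block_id, None) for block_id in block_ids
        )

    def release(self, block_id: int, block_hash: Optional[str] = None) -> None:
        """Return a block to the pool; it stays discoverable via its hash."""
        self._free[block_id] = block_hash
        self._free.move_to_end(block_id)  # most recently released goes last

    def reuse(self, block_id: int) -> Optional[str]:
        """Prefix-cache hit: pull a specific cached block back into use."""
        return self._free.pop(block_id)

    def take(self) -> tuple[int, Optional[str]]:
        """Allocate the least recently released block, evicting its cached hash."""
        return self._free.popitem(last=False)


pool = FreeList([0, 1])
a, _ = pool.take()
b, _ = pool.take()
pool.release(b, "hash-b")  # released first, so evicted first
pool.release(a, "hash-a")
print(pool.take())
```

Unlike a radix tree, this structure knows nothing about which cached blocks are prefixes of others; it only orders them by how recently they were released.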

## License

MIT.