metadata

title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine — paged KV + SSE

tiny_vllm

A minimal continuous-batching LLM engine built to be read end-to-end. It re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of Python:

Paged KV cache with logical block tables — physical blocks are a flat pool; per-sequence block tables map logical positions → physical slots.
Automatic prefix caching via content-addressed hashes — two requests with the same prompt prefix share KV blocks.
Continuous batching with chunked prefill — each scheduling step packs a budget of tokens from any mix of new prefills and ongoing decodes; long prompts are sliced so they don't starve the decoders.
Recompute-style preemption — when the pool runs dry, the youngest running sequence is evicted and re-enqueued.
SSE streaming over a thin FastAPI layer — both token deltas (/generate, OpenAI-compatible /v1/completions) and a parallel engine event stream (/engine/events) the demo page subscribes to.
A visualization demo page that renders the block pool, scheduler queues, per-sequence block tables, and live tokens as the engine runs.
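
As a rough illustration of the block-table idea in the first bullet, the sketch below maps a token position to a slot in a flat physical pool. `BlockTable`, `slot`, and the block size of 16 are names and values assumed for this example, not taken from the repo.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; an assumed value, not taken from the repo


@dataclass
class BlockTable:
    """Maps a sequence's logical block index to a physical block id (hypothetical)."""
    physical_ids: list = field(default_factory=list)

    def slot(self, position: int) -> int:
        """Return the flat slot in the physical pool that holds token `position`."""
        logical_block, offset = divmod(position, BLOCK_SIZE)
        return self.physical_ids[logical_block] * BLOCK_SIZE + offset


# Two sequences that share their first block (physical id 7), e.g. a common prefix.
seq_a = BlockTable([7, 2])
seq_b = BlockTable([7, 5])
print(seq_a.slot(0), seq_b.slot(0))    # 112 112 (same physical slot)
print(seq_a.slot(17), seq_b.slot(17))  # 33 81 (diverged after the shared block)
```

Because the table is only a list of ids, sharing a prefix costs nothing beyond pointing two tables at the same physical block.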

It is not vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop), there are no fused or paged-attention kernels, and CPU is the default device. This is a learning artifact, not a serving stack.

Quick start

pip install -r requirements.txt
# or: pip install -e .

python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu

Open http://localhost:8000 for the live visualization, or hit the API directly:

# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
  -H 'content-type: application/json' \
  -d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'

# A simpler endpoint
curl -N http://localhost:8000/generate \
  -H 'content-type: application/json' \
  -d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
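
The same stream can be read from Python with only the standard library. This is a minimal sketch that assumes the server emits standard SSE `data:` lines and an OpenAI-style `[DONE]` sentinel; the shape of each JSON chunk is not specified here, so the helper just yields it as a dict. `iter_sse_data` and `stream_completion` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each SSE event from an iterable of raw lines.

    Handles only `data:` fields, joins multi-line data with newlines, and
    stops at a `[DONE]` payload (assumed OpenAI-style terminator).
    """
    buf = []
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            buf.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buf:  # a blank line closes one event
            payload = "\n".join(buf)
            buf = []
            if payload == "[DONE]":
                return
            yield payload


def stream_completion(prompt: str, base_url: str = "http://localhost:8000") -> Iterator[dict]:
    """POST to /v1/completions with stream=true and yield each decoded chunk."""
    body = json.dumps({"prompt": prompt, "max_tokens": 80, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # the response iterates line by line
        for payload in iter_sse_data(resp):
            yield json.loads(payload)
```

With the server running, `for chunk in stream_completion("hello"): print(chunk)` prints each chunk as it arrives.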

Smoke test with concurrent requests:

python examples/smoke_client.py            # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo   # show prefix-cache speedup

The pieces

File	What
`tiny_vllm/config.py`	`EngineConfig`, `SamplingParams`
`tiny_vllm/request.py`	`Sequence`, status enum, KV bookkeeping fields
`tiny_vllm/block_manager.py`	Physical block pool, refcounts, prefix-cache (hash-chain)
`tiny_vllm/scheduler.py`	Continuous batching + chunked prefill + preemption
`tiny_vllm/paged_kv.py`	The actual KV tensors that block ids point into
`tiny_vllm/model_runner.py`	Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache
`tiny_vllm/sampler.py`	Greedy / top-k / top-p
`tiny_vllm/engine.py`	Orchestrator: scheduler ⟶ model ⟶ sampler ⟶ outputs + events
`tiny_vllm/server.py`	FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/`
`web/`	Static demo page (vanilla HTML/CSS/JS, no framework)
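
To make the sampler row concrete, here is a pure-Python sketch of greedy, top-k, and top-p selection over a list of logits. It is an illustration under assumptions: the repo's `sampler.py` presumably works on tensors, and the function name and parameters below are hypothetical.

```python
import math
import random


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None) -> int:
    """Pick a token id from raw logits (a list of floats)."""
    if temperature == 0.0:
        # Greedy: take the argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    rng = rng or random.Random()

    # Rank token ids by logit, highest first, then apply top-k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Softmax over the surviving tokens (shifted by the max for stability).
    top = logits[ranked[0]]
    weights = [math.exp((logits[i] - top) / temperature) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        cum, cutoff = 0.0, len(ranked)
        for n, p in enumerate(probs, start=1):
            cum += p
            if cum >= top_p:
                cutoff = n
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]

    return rng.choices(ranked, weights=probs, k=1)[0]
```

Setting `temperature=0.0` gives greedy decoding; `top_k=1` has the same effect through the sampling path.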

The model-free parts (block manager, scheduler) have unit tests:

pip install pytest
python -m pytest tests/

Hugging Face Space — live demo

For a live (not recorded) demo you can talk to from any browser, deploy this repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM, 2 vCPU) fits Qwen2.5-0.5B comfortably.

One-time setup:

1. Create the Space. Go to huggingface.co/new-space:
   - Owner: your HF username
   - Space name: e.g. tiny-vllm (must match HF_SPACE_NAME below)
   - SDK: Docker
   - License: MIT
2. Generate a write-access token at huggingface.co/settings/tokens → New token → role Write.
3. Add three secrets to this GitHub repo (Settings → Secrets and variables → Actions → New repository secret):
   - HF_TOKEN: the token from step 2
   - HF_USERNAME: your HF username
   - HF_SPACE_NAME: e.g. tiny-vllm

On the next push to main, the Sync to Hugging Face Space workflow mirrors the repo to the Space. HF then builds the Docker image (~3–5 min on first build because of the pre-fetched model) and the Space goes live at:

https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space

(HF normalises subdomains to lowercase — enCoder/tiny-vllm becomes encoder-tiny-vllm.hf.space.)
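
If you script the deployment, the URL rule above can be written as a one-liner. `space_url` is a hypothetical helper; it lowercases the whole subdomain and assumes no other normalisation is needed for your names.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the *.hf.space URL for a Space, lowercasing the subdomain."""
    return f"https://{username}-{space_name}.hf.space".lower()


print(space_url("enCoder", "tiny-vllm"))  # https://encoder-tiny-vllm.hf.space
```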

The GH Pages page links to this URL as a "try live ↗" pill in the topbar — update data-hf-space on <body> in web/index.html if your Space URL differs.

HF Spaces cost: free. Cold-start (after ~48 h of inactivity) takes ~30 s while the container wakes; subsequent requests are warm.

Files involved:

Dockerfile — CPU-only torch, pre-downloads the model at build time.
README.md frontmatter — HF reads sdk: docker, app_port: 7860, etc.
.github/workflows/sync-huggingface.yml — mirrors GitHub → HF Spaces.
CORS is enabled on the server so the GH Pages frontend can call the HF backend cross-origin (?mode=live&backend=https://...hf.space is a potential future addition).

GitHub Pages demo (replay mode)

The visualization can run as a static page on GitHub Pages with no backend. It plays back a recorded session from web/events.jsonl.

The repo ships a fabricated web/events.jsonl so the page works on first deploy (run python scripts/make_demo_recording.py > web/events.jsonl to regenerate).

To use a real recording instead, run the server with --record:

python -m tiny_vllm.server --record web/events.jsonl
# …submit some prompts via the UI or smoke_client…
# Ctrl-C the server.  events.jsonl now contains the full session.
git add web/events.jsonl && git commit -m "fresh demo recording" && git push

Enable Pages once: repo → Settings → Pages → Source: "GitHub Actions". The workflow in .github/workflows/deploy-pages.yml then publishes web/ on every push to main that touches it.

The page auto-detects mode:

Tries the /engine/events SSE stream first; if it responds within 2 s, the page runs in live mode.
Otherwise falls back to replay, fetching events.jsonl from the same directory and playing it back with original timing (speed control / pause / restart in the controls row).
Force a mode with ?mode=replay or ?mode=live; point at a different recording with ?session=URL.
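
The replay path boils down to re-emitting events with the recorded gaps, scaled by the speed control. The demo page itself is JavaScript; the Python sketch below only illustrates the timing logic and assumes each line of events.jsonl is a JSON object with a numeric `ts` field in seconds. That field name is a guess, so check the real recording format before reusing this.

```python
import json
import time
from typing import Callable, Iterable


def replay(lines: Iterable[str], emit: Callable[[dict], None],
           speed: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> None:
    """Re-emit recorded events, preserving the gaps between them.

    Assumes every non-empty line is a JSON object with a `ts` field (seconds).
    `speed=2.0` plays back twice as fast.
    """
    prev_ts = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if prev_ts is not None:
            # Wait for the recorded gap, shortened or stretched by `speed`.
            sleep(max(0.0, event["ts"] - prev_ts) / speed)
        prev_ts = event["ts"]
        emit(event)
```

Usage: `with open("web/events.jsonl") as f: replay(f, print, speed=2.0)`. Passing a custom `sleep` makes the timing testable without waiting.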

What the demo page shows

Panel	What you're looking at
Block pool	One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache.
Scheduler	Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below.
Sequences	Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text.

Click Send ×2 to fire the same prompt twice: the second request should hit the prefix cache for the whole prompt and start decoding almost immediately.
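
Why the second request can reuse blocks: with content-addressed, chained hashes, each full block's hash depends on its own tokens and on every block before it, so equal hashes imply an equal prefix. The sketch below shows the idea with SHA-256 and a block size of 4; both are choices made for the example, and `block_hashes` is a hypothetical name rather than the repo's API.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every full block, chaining each hash to its predecessor."""
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a, b):
    """Count the leading blocks two prompts could share in the cache."""
    n = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        n += 1
    return n


prompt_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prompt_b = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
print(shared_prefix_blocks(prompt_a, prompt_b))  # 2
```

Only full blocks are hashed in this sketch, so a trailing partial block is always recomputed.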

Reading order

If you want to learn the system:

request.py — what a request becomes.
block_manager.py — read admit() and _take_free_block(); the prefix cache lives here.
scheduler.py — read schedule(); the two-phase loop is the heart of continuous batching.
model_runner.py → Qwen2Attention.forward — see how Q/K/V get written into and read out of the paged cache.
engine.py::_run_loop — how everything is wired step-by-step.
server.py — the SSE surface.
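
Before opening scheduler.py, it may help to see a toy version of a token-budget step. The sketch assumes the two phases are "one token per decoding sequence" followed by "remaining budget to prefill chunks"; the real schedule() may order or bound things differently, and it also handles block allocation and preemption, which are left out here. `Seq` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: str
    prompt_len: int
    computed: int = 0  # prompt tokens whose KV is already in the cache

    @property
    def in_prefill(self) -> bool:
        return self.computed < self.prompt_len


def schedule(running, waiting, token_budget):
    """Return (seq_id, num_tokens) pairs for one engine step."""
    plan = []
    budget = token_budget

    # Phase 1: every sequence that has finished prefill decodes one token.
    for seq in running:
        if budget == 0:
            break
        if not seq.in_prefill:
            plan.append((seq.seq_id, 1))
            budget -= 1

    # Phase 2: spend what is left on prefill, sliced to fit the budget.
    prefilling = [s for s in running if s.in_prefill]
    while budget > 0 and (prefilling or waiting):
        if prefilling:
            seq = prefilling.pop(0)
        else:
            seq = waiting.pop(0)  # admit a new request
            running.append(seq)
        chunk = min(budget, seq.prompt_len - seq.computed)
        plan.append((seq.seq_id, chunk))
        seq.computed += chunk
        budget -= chunk

    return plan


running = [Seq("a", 5, computed=5), Seq("b", 5, computed=5)]  # already decoding
waiting = [Seq("c", 20)]                                       # long new prompt
for _ in range(4):
    print(schedule(running, waiting, token_budget=8))
# "c" is prefilled in chunks of 6, 6, 6, 2 while "a" and "b" keep decoding.
```

The long prompt never blocks the decoders: they are served first in every step, and the prompt only takes the leftover budget.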

Known limitations

CPU-friendly defaults; no custom CUDA / Triton kernels.
Per-sequence attention loop inside each layer (not packed/varlen-fused).
Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
One completion per prompt (n=1); no beam search.
No tensor parallel, no quantization.
Prefix-cache eviction is LRU on the free list, not a reference-counted radix tree of the kind SGLang's RadixAttention uses.
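
A minimal sketch of what "LRU on the free list" can look like, using an `OrderedDict`. It ignores refcounts and sharing, and the class and method names are hypothetical; block_manager.py is the reference for what the repo actually does.

```python
from collections import OrderedDict
from typing import Optional


class FreeList:
    """Free blocks in LRU order; hashed blocks stay discoverable until reused."""

    def __init__(self, num_blocks: int) -> None:
        # block_id -> content hash, or None if the block holds nothing reusable
        self.free = OrderedDict((i, None) for i in range(num_blocks))
        self.by_hash = {}  # content hash -> block_id

    def release(self, block_id: int, content_hash: Optional[str]) -> None:
        """Return a block to the pool as the most recently used entry."""
        self.free[block_id] = content_hash
        self.free.move_to_end(block_id)
        if content_hash is not None:
            self.by_hash[content_hash] = block_id

    def lookup(self, content_hash: str) -> Optional[int]:
        """Reclaim a cached block by hash (a prefix-cache hit), if still present."""
        block_id = self.by_hash.pop(content_hash, None)
        if block_id is not None:
            del self.free[block_id]
        return block_id

    def take(self) -> int:
        """Evict the least recently released block and hand it out.

        Raises KeyError when the pool is empty, which is the point where a
        real engine would have to preempt a running sequence.
        """
        block_id, old_hash = self.free.popitem(last=False)
        if old_hash is not None and self.by_hash.get(old_hash) == block_id:
            del self.by_hash[old_hash]  # its cached content is gone for good
        return block_id


pool = FreeList(2)
a, b = pool.take(), pool.take()
pool.release(a, "hash-a")
pool.release(b, "hash-b")
print(pool.lookup("hash-a") == a)  # True: cache hit, block leaves the free list
```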

License

MIT.