metadata
title: tiny_vllm
emoji: 🪶
colorFrom: gray
colorTo: green
sdk: docker
app_port: 7860
pinned: false
short_description: Minimal continuous-batching engine - paged KV + SSE

tiny_vllm

A minimal continuous-batching LLM engine built to be read end-to-end. It re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of Python:

  • Paged KV cache with logical block tables: physical blocks are a flat pool; per-sequence block tables map logical positions → physical slots.
  • Automatic prefix caching via content-addressed hashes: two requests with the same prompt prefix share KV blocks.
  • Continuous batching with chunked prefill: each scheduling step packs a budget of tokens from any mix of new prefills and ongoing decodes; long prompts are sliced so they don't starve the decoders.
  • Recompute-style preemption: when the pool runs dry, the youngest running sequence is evicted and re-enqueued.
  • SSE streaming over a thin FastAPI layer: both token deltas (/generate, OpenAI-compatible /v1/completions) and a parallel engine event stream (/engine/events) that the demo page subscribes to.
  • A visualization demo page that renders the block pool, scheduler queues, per-sequence block tables, and live tokens as the engine runs.
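
To make the block-table idea concrete, here is a minimal sketch of how a logical token position can resolve to a physical slot. It is not the code in block_manager.py or paged_kv.py: the names BlockPool, slot_for and ensure_capacity and the block size of 16 are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; an assumed value, not taken from the repo


@dataclass
class BlockPool:
    """Flat pool of physical block ids with a simple free list (hypothetical)."""
    num_blocks: int
    free: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Stored in reverse so that pop() hands out low ids first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("block pool exhausted")
        return self.free.pop()


def ensure_capacity(pool: BlockPool, block_table: list[int], num_tokens: int) -> None:
    """Grow a sequence's block table until it covers num_tokens positions."""
    needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
    while len(block_table) < needed:
        block_table.append(pool.allocate())


def slot_for(block_table: list[int], position: int) -> int:
    """Map a logical token position to an index into the flat slot array."""
    block_idx, offset = divmod(position, BLOCK_SIZE)
    return block_table[block_idx] * BLOCK_SIZE + offset
```

Two sequences that grow in an interleaved order end up with non-contiguous physical blocks, yet each still sees a contiguous logical address space through its own table.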

It is not vLLM. Attention runs in plain PyTorch SDPA (per-sequence loop), there are no fused or paged-attention kernels, and CPU is the default device. This is a learning artifact, not a serving stack.

Quick start

pip install -r requirements.txt
# or: pip install -e .

python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu

Open http://localhost:8000 for the live visualization, or hit the API directly:

# OpenAI-style streaming
curl -N http://localhost:8000/v1/completions \
  -H 'content-type: application/json' \
  -d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'

# A simpler endpoint
curl -N http://localhost:8000/generate \
  -H 'content-type: application/json' \
  -d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
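
If you would rather consume the stream from Python than from curl, the sketch below parses SSE lines and concatenates the text deltas. It assumes the OpenAI-style wire format (JSON chunks with choices[0].text, terminated by a [DONE] sentinel); check server.py for the exact payload shape. iter_sse_data and collect_text are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buf: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional space after the colon.
            buf.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:) and ":" comment lines are ignored here.
    if buf:
        yield "\n".join(buf)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate text deltas, assuming OpenAI-style completion chunks."""
    parts = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        parts.append(chunk["choices"][0]["text"])
    return "".join(parts)
```

Feed it decoded lines from any HTTP client that exposes the response body line by line.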

Smoke test with concurrent requests:

python examples/smoke_client.py            # 4 prompts in parallel
python examples/smoke_client.py --prefix-demo   # show prefix-cache speedup

The pieces

| File | What |
| --- | --- |
| tiny_vllm/config.py | EngineConfig, SamplingParams |
| tiny_vllm/request.py | Sequence, status enum, KV bookkeeping fields |
| tiny_vllm/block_manager.py | Physical block pool, refcounts, prefix-cache (hash-chain) |
| tiny_vllm/scheduler.py | Continuous batching + chunked prefill + preemption |
| tiny_vllm/paged_kv.py | The actual KV tensors that block ids point into |
| tiny_vllm/model_runner.py | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
| tiny_vllm/sampler.py | Greedy / top-k / top-p |
| tiny_vllm/engine.py | Orchestrator: scheduler → model → sampler → outputs + events |
| tiny_vllm/server.py | FastAPI: /generate, /v1/completions, /engine/events, / |
| web/ | Static demo page (vanilla HTML/CSS/JS, no framework) |
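
As an illustration of what the sampler row covers, here is a dependency-free sketch of greedy, top-k and top-p sampling over a list of logits. It is not the code in sampler.py (which presumably works on tensors); the function names, the order of the filters, and the convention that temperature 0 means greedy are all assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return {token_id: renormalised prob} after top-k, then top-p filtering."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    if temperature == 0.0:
        # Greedy: pick the arg-max logit (assumed convention).
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax([x / temperature for x in logits])
    dist = filter_top_k_top_p(probs, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```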

The model-free parts (block manager, scheduler) have unit tests:

pip install pytest
python -m pytest tests/

Hugging Face Space: live demo

For a live (not recorded) demo you can talk to from any browser, deploy this repo as a Docker-based Hugging Face Space. HF's free CPU tier (16 GB RAM, 2 vCPU) fits Qwen2.5-0.5B comfortably.

One-time setup:

  1. Create the Space. Go to huggingface.co/new-space:
    • Owner: your HF username
    • Space name: e.g. tiny-vllm (must match HF_SPACE_NAME below)
    • SDK: Docker
    • License: MIT
  2. Generate a write-access token at huggingface.co/settings/tokens → New token → role Write.
  3. Add three secrets to this GitHub repo (Settings → Secrets and variables → Actions → New repository secret):
    • HF_TOKEN: the token from step 2
    • HF_USERNAME: your HF username
    • HF_SPACE_NAME: e.g. tiny-vllm

On the next push to main, the Sync to Hugging Face Space workflow mirrors the repo to the Space. HF then builds the Docker image (~3–5 min on first build because of the pre-fetched model) and the Space goes live at:

https://<lowercased-HF_USERNAME>-<HF_SPACE_NAME>.hf.space

(HF normalises subdomains to lowercase, so enCoder/tiny-vllm becomes encoder-tiny-vllm.hf.space.)
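
If you need the URL in a script, the rule above fits in a one-line helper. space_url is a hypothetical name, and the sketch applies only the lowercasing described here; any other normalisation HF may perform is not covered.

```python
def space_url(username: str, space_name: str) -> str:
    """Build the Space URL from the two secret values, lowercased."""
    return f"https://{username}-{space_name}.hf.space".lower()
```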

The GitHub Pages page links to this URL as a "try live ↗" pill in the topbar; update data-hf-space on <body> in web/index.html if your Space URL differs.

HF Spaces cost: free. Cold-start (after ~48 h of inactivity) takes ~30 s while the container wakes; subsequent requests are warm.

Files involved:

  • Dockerfile: CPU-only torch, pre-downloads the model at build time.
  • README.md frontmatter: HF reads sdk: docker, app_port: 7860, etc.
  • .github/workflows/sync-huggingface.yml: mirrors GitHub → HF Spaces.
  • CORS is enabled on the server so the GH Pages frontend can call the HF backend cross-origin (?mode=live&backend=https://...hf.space is a potential future addition).

GitHub Pages demo (replay mode)

The visualization can run as a static page on GitHub Pages with no backend. It plays back a recorded session from web/events.jsonl:

  1. The repo ships a fabricated web/events.jsonl so the page works on first deploy (run python scripts/make_demo_recording.py > web/events.jsonl to regenerate).
  2. To use a real recording instead, run the server with --record:
    python -m tiny_vllm.server --record web/events.jsonl
    # …submit some prompts via the UI or smoke_client…
    # Ctrl-C the server.  events.jsonl now contains the full session.
    git add web/events.jsonl && git commit -m "fresh demo recording" && git push
    
  3. Enable Pages once: repo → Settings → Pages → Source: "GitHub Actions". The workflow in .github/workflows/deploy-pages.yml then publishes web/ on every push to main that touches it.

The page auto-detects mode:

  • Tries the /engine/events SSE endpoint first; if it responds within 2 s, the page runs in live mode.
  • Otherwise falls back to replay, fetching events.jsonl from the same directory and playing it back with original timing (speed control / pause / restart in the controls row).
  • Force a mode with ?mode=replay or ?mode=live; point at a different recording with ?session=URL.
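
Replay with original timing amounts to sleeping for the gap between consecutive event timestamps, divided by the playback speed. The demo page does this in JavaScript; the Python sketch below only illustrates the idea. It assumes each JSONL line carries a timestamp in seconds under a field named ts, which is a hypothetical name (check a line of events.jsonl for the real schema).

```python
import json
import time
from typing import Callable, Iterable, Iterator


def replay_schedule(jsonl_lines: Iterable[str], speed: float = 1.0) -> Iterator[tuple[float, dict]]:
    """Yield (delay_seconds, event); delay is the wait before emitting the event."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    prev_ts = None
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the recording
        event = json.loads(line)
        ts = float(event["ts"])  # "ts" is an assumed field name
        delay = 0.0 if prev_ts is None else max(0.0, ts - prev_ts) / speed
        prev_ts = ts
        yield delay, event


def replay(jsonl_lines: Iterable[str], handle: Callable[[dict], None],
           speed: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> None:
    """Play events back through handle(), waiting between them."""
    for delay, event in replay_schedule(jsonl_lines, speed):
        sleep(delay)
        handle(event)
```

Passing sleep as a parameter keeps the function testable without real waiting.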

What the demo page shows

| Panel | What you're looking at |
| --- | --- |
| Block pool | One cell per physical block. Color = state (free / cached-evictable / in-use / shared). Orange border = the block has been hashed and is discoverable in the prefix cache. |
| Scheduler | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count. Step log scrolls below. |
| Sequences | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |

Click Send ×2 to fire the same prompt twice: the second send should hit the prefix cache for the entire prompt and start decoding almost immediately.
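
Why the second send hits the cache can be sketched with a hash chain: each full block is hashed together with its parent block's hash, so a block only matches when everything before it matched too. The code below is an illustration under assumptions, not the block manager's implementation: the block size of 4, SHA-256, the PrefixCache class and match_or_insert are all hypothetical, and there is no eviction. It hashes only full blocks, so a trailing partial block would still be recomputed in this sketch.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny for the example; an assumed value


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash every full block together with the hash of the block before it."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)  # chaining makes the hash depend on the whole prefix
        h.update(",".join(map(str, block)).encode())
        parent = h.digest()
        hashes.append(parent.hex())
    return hashes


class PrefixCache:
    """Content-addressed map from chained block hash to physical block id."""

    def __init__(self):
        self.block_by_hash = {}
        self.next_block = 0

    def match_or_insert(self, token_ids):
        """Return (block_table, num_cached_tokens) for the full blocks of a prompt."""
        table, cached_tokens = [], 0
        for h in block_hashes(token_ids):
            if h in self.block_by_hash:
                cached_tokens += BLOCK_SIZE  # KV for this block can be reused
            else:
                self.block_by_hash[h] = self.next_block
                self.next_block += 1
            table.append(self.block_by_hash[h])
        return table, cached_tokens
```

Because of the chaining, two prompts that contain the same block of tokens at different positions, or after different prefixes, do not share that block.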

Reading order

If you want to learn the system:

  1. request.py: what a request becomes.
  2. block_manager.py: read admit() and _take_free_block(); the prefix cache lives here.
  3. scheduler.py: read schedule(); the two-phase loop is the heart of continuous batching.
  4. model_runner.py → Qwen2Attention.forward: see how K/V get written into the paged cache and read back out for attention.
  5. engine.py::_run_loop: how everything is wired step by step.
  6. server.py: the SSE surface.
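
Before opening scheduler.py, it may help to see the shape of a token-budgeted step in isolation. The sketch below assumes the two phases are "decodes first, then prefill chunks"; that ordering, the Seq fields and the schedule() signature are illustrative guesses rather than the repo's actual code, and block allocation and preemption are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: str
    prompt_len: int
    num_computed: int = 0  # prompt tokens already prefilled

    @property
    def in_prefill(self) -> bool:
        return self.num_computed < self.prompt_len


def schedule(running: list, waiting: deque, token_budget: int) -> list:
    """Return (seq_id, num_tokens) work items for one engine step."""
    batch = []
    budget = token_budget

    # Phase 1: each decoding sequence gets one token, so decodes are not starved.
    for seq in running:
        if budget == 0:
            break
        if not seq.in_prefill:
            batch.append((seq.seq_id, 1))
            budget -= 1

    # Phase 2: spend what is left on prefill chunks, resuming partially
    # prefilled sequences before admitting new ones from the waiting queue.
    prefills = [s for s in running if s.in_prefill]
    while budget > 0 and (prefills or waiting):
        if prefills:
            seq = prefills.pop(0)
        else:
            seq = waiting.popleft()
            running.append(seq)
        chunk = min(budget, seq.prompt_len - seq.num_computed)
        seq.num_computed += chunk
        batch.append((seq.seq_id, chunk))
        budget -= chunk
    return batch
```

With a budget of 8, one decoding sequence and a 10-token prompt waiting, the first step yields one decode token plus a 7-token prefill chunk; the remaining 3 prompt tokens are scheduled on the next step.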

Known limitations

  • CPU-friendly defaults; no custom CUDA / Triton kernels.
  • Per-sequence attention loop inside each layer (not packed/varlen-fused).
  • Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
  • Single-prompt completions (n=1); no beam search.
  • No tensor parallel, no quantization.
  • Prefix-cache eviction is LRU on the free list, not a full reference-counted radix tree of the kind SGLang's RadixAttention uses.
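
"LRU on the free list" can be sketched as follows: a released block keeps its hash and stays discoverable until the allocator actually needs it, and the least recently freed block is reclaimed first. This is a simplified illustration, not block_manager.py; LruBlockPool and its method names are hypothetical.

```python
from collections import OrderedDict


class LruBlockPool:
    """Free list in LRU order; freed blocks stay cached until they are reused."""

    def __init__(self, num_blocks):
        self.refcount = [0] * num_blocks
        self.hash_of = {}        # block id -> content hash (hashed blocks only)
        self.block_by_hash = {}  # content hash -> block id
        # OrderedDict used as an ordered set: oldest-freed block comes first.
        self.free = OrderedDict((b, None) for b in range(num_blocks))

    def take_free_block(self):
        """Pop the least recently freed block, evicting its cache entry if any."""
        if not self.free:
            raise RuntimeError("no free blocks; the caller would have to preempt")
        block, _ = self.free.popitem(last=False)
        stale = self.hash_of.pop(block, None)
        if stale is not None:
            del self.block_by_hash[stale]
        self.refcount[block] = 1
        return block

    def publish(self, block, content_hash):
        """Make a filled block discoverable in the prefix cache."""
        self.hash_of[block] = content_hash
        self.block_by_hash[content_hash] = block

    def lookup(self, content_hash):
        """Return a cached block, reviving it from the free list if needed."""
        block = self.block_by_hash.get(content_hash)
        if block is None:
            return None
        if self.refcount[block] == 0:
            del self.free[block]  # no longer evictable
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free[block] = None  # appended at the most-recently-freed end
```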

License

MIT.