Spaces: enCoder/tiny-vllm

enCoder committed 11 days ago

Commit

c32c359

0 Parent(s):

minimal continuous-batching LLM engine

Files changed (23)

.claude/settings.local.json +12 -0
.gitignore +17 -0
LICENSE +21 -0
README.md +116 -0
examples/smoke_client.py +74 -0
pyproject.toml +27 -0
requirements.txt +7 -0
tests/__init__.py +0 -0
tests/test_block_manager.py +121 -0
tests/test_scheduler.py +103 -0
tiny_vllm/__init__.py +28 -0
tiny_vllm/block_manager.py +265 -0
tiny_vllm/config.py +49 -0
tiny_vllm/engine.py +385 -0
tiny_vllm/model_runner.py +392 -0
tiny_vllm/paged_kv.py +70 -0
tiny_vllm/request.py +86 -0
tiny_vllm/sampler.py +53 -0
tiny_vllm/scheduler.py +223 -0
tiny_vllm/server.py +307 -0
web/app.js +272 -0
web/index.html +68 -0
web/style.css +213 -0

.claude/settings.local.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "permissions": {
+    "allow": [
+      "Bash(python -c \"import torch, transformers, fastapi, uvicorn, pydantic; print\\('deps ok'\\)\")",
+      "Bash(python -c \"from tiny_vllm.block_manager import BlockManager; print\\('ok'\\)\")",
+      "Bash(python -m pytest tests/test_block_manager.py tests/test_scheduler.py -v)",
+      "Bash(pip install *)",
+      "Bash(python -m pytest tests/ -v)",
+      "Bash(python -c ' *)"
+    ]
+  }
+}

.gitignore ADDED

	@@ -0,0 +1,17 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+.venv/
+venv/
+.env
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.DS_Store
+*.log
+# HF cache that may land in CWD
+.cache/
+hf_cache/
+# Editor
+.vscode/
+.idea/

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,116 @@

+# tiny_vllm
+A **minimal continuous-batching LLM engine** built to be read end-to-end.  It
+re-implements the load-bearing ideas of vLLM / SGLang in ~1.5k lines of
+Python:
+- **Paged KV cache** with logical block tables — physical blocks are a flat
+  pool; per-sequence block tables map logical positions → physical slots.
+- **Automatic prefix caching** via content-addressed hashes — two requests
+  with the same prompt prefix share KV blocks.
+- **Continuous batching with chunked prefill** — each scheduling step packs a
+  budget of tokens from any mix of new prefills and ongoing decodes; long
+  prompts are sliced so they don't starve the decoders.
+- **Recompute-style preemption** — when the pool runs dry, the youngest
+  running sequence is evicted and re-enqueued.
+- **SSE streaming** over a thin FastAPI layer — both token deltas
+  (`/generate`, OpenAI-compatible `/v1/completions`) and a parallel engine
+  event stream (`/engine/events`) the demo page subscribes to.
+- A **visualization demo page** that renders the block pool, scheduler
+  queues, per-sequence block tables, and live tokens as the engine runs.
+It is **not** vLLM.  Attention runs in plain PyTorch SDPA (per-sequence loop),
+there are no fused or paged-attention kernels, and CPU is the default device.
+This is a learning artifact, not a serving stack.
+## Quick start
+```bash
+pip install -r requirements.txt
+# or: pip install -e .
+python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct --device cpu
+```
+Open [http://localhost:8000](http://localhost:8000) for the live
+visualization, or hit the API directly:
+```bash
+# OpenAI-style streaming
+curl -N http://localhost:8000/v1/completions \
+  -H 'content-type: application/json' \
+  -d '{"prompt":"In two sentences, what is paged attention?","max_tokens":80,"stream":true}'
+# A simpler endpoint
+curl -N http://localhost:8000/generate \
+  -H 'content-type: application/json' \
+  -d '{"prompt":"haiku about KV caches","max_tokens":48,"stream":true}'
+```
+Smoke test with concurrent requests:
+```bash
+python examples/smoke_client.py            # 4 prompts in parallel
+python examples/smoke_client.py --prefix-demo   # show prefix-cache speedup
+```
+## The pieces
+| File | What |
+|---|---|
+| `tiny_vllm/config.py` | `EngineConfig`, `SamplingParams` |
+| `tiny_vllm/request.py` | `Sequence`, status enum, KV bookkeeping fields |
+| `tiny_vllm/block_manager.py` | Physical block pool, refcounts, prefix-cache (hash-chain) |
+| `tiny_vllm/scheduler.py` | Continuous batching + chunked prefill + preemption |
+| `tiny_vllm/paged_kv.py` | The actual KV tensors that block ids point into |
+| `tiny_vllm/model_runner.py` | Minimal Qwen2 forward (RoPE, RMSNorm, GQA) using the paged cache |
+| `tiny_vllm/sampler.py` | Greedy / top-k / top-p |
+| `tiny_vllm/engine.py` | Orchestrator: scheduler ⟶ model ⟶ sampler ⟶ outputs + events |
+| `tiny_vllm/server.py` | FastAPI: `/generate`, `/v1/completions`, `/engine/events`, `/` |
+| `web/` | Static demo page (vanilla HTML/CSS/JS, no framework) |
+The model-free parts (block manager, scheduler) have unit tests:
+```bash
+pip install pytest
+python -m pytest tests/
+```
+## What the demo page shows
+| Panel | What you're looking at |
+|---|---|
+| **Block pool** | One cell per physical block.  Color = state (free / cached-evictable / in-use / shared).  Orange border = the block has been hashed and is discoverable in the prefix cache. |
+| **Scheduler** | Live stats: tokens this step, prefill-vs-decode split, step latency, prefix-cache hit-rate, preemption count.  Step log scrolls below. |
+| **Sequences** | Every active sequence's block table (cell per block, blue = prefix-cache hit, purple = shared), status, generated text. |
+Click **Send ×2** to fire the same prompt twice. The second send should reuse
+cached KV blocks for nearly the whole prompt (the final block is always
+recomputed so the model can produce logits) and start decoding almost immediately.
+## Reading order
+If you want to learn the system:
+1. `request.py` — what a request becomes.
+2. `block_manager.py` — read `admit()` and `_take_free_block()`; the prefix
+   cache lives here.
+3. `scheduler.py` — read `schedule()`; the two-phase loop is the heart of
+   continuous batching.
+4. `model_runner.py` → `Qwen2Attention.forward` — see how Q/K/V get written
+   into and read out of the paged cache.
+5. `engine.py::_run_loop` — how everything is wired step-by-step.
+6. `server.py` — the SSE surface.
+## Known limitations
+- CPU-friendly defaults; no custom CUDA / Triton kernels.
+- Per-sequence attention loop inside each layer (not packed/varlen-fused).
+- Only Llama/Qwen2-style decoder architectures (RMSNorm + RoPE + GQA + SwiGLU MLP).
+- Single-prompt completions (`n=1`); no beam search.
+- No tensor parallel, no quantization.
+- Prefix-cache eviction is LRU on the free list, not the radix-tree cache
+  that SGLang's RadixAttention uses.
+## License
+MIT.
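
Reviewer note on the README's paged-KV bullet: the logical-to-physical mapping it describes can be sketched in a few lines. This is a minimal illustration that follows the `block_table[p // B]`, slot `p % B` rule stated in the `block_manager.py` docstring. The helper name `locate` and the example block ids are hypothetical and do not appear in the repository.

```python
def locate(block_table: list[int], position: int, block_size: int) -> tuple[int, int]:
    """Translate a logical token position into (physical block id, slot in block)."""
    return block_table[position // block_size], position % block_size


# A 10-token sequence with block_size=4 occupies 3 physical blocks.
# The ids are arbitrary: physical blocks need not be contiguous or ordered.
block_table = [5, 2, 7]

print(locate(block_table, 0, 4))  # (5, 0)
print(locate(block_table, 6, 4))  # (2, 2)
print(locate(block_table, 9, 4))  # (7, 1)
```

Because only the table changes when blocks are shared or reallocated, two sequences can point their first entries at the same physical block without copying any KV data.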

examples/smoke_client.py ADDED

	@@ -0,0 +1,74 @@

+"""Fire concurrent prompts at a running tiny_vllm server.
+Run the server first:
+    python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct
+Then in another shell:
+    python examples/smoke_client.py
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import time
+import httpx
+PROMPTS = [
+    "Write a haiku about paged attention.",
+    "Explain GQA in one paragraph.",
+    "What is continuous batching, briefly?",
+    "List three uses of prefix caching.",
+]
+async def one(client: httpx.AsyncClient, prompt: str, idx: int) -> tuple[str, float]:
+    t0 = time.monotonic()
+    print(f"[{idx}] >> {prompt!r}")
+    text_parts: list[str] = []
+    async with client.stream(
+        "POST", "/generate",
+        json={"prompt": prompt, "max_tokens": 48, "temperature": 0.7, "top_p": 0.9, "stream": True},
+        timeout=None,
+    ) as resp:
+        resp.raise_for_status()
+        async for raw in resp.aiter_lines():
+            if not raw.startswith("data: "):
+                continue
+            data = raw[6:]
+            if data == "[DONE]":
+                break
+            chunk = json.loads(data)
+            if chunk.get("text"):
+                text_parts.append(chunk["text"])
+            if chunk.get("finished"):
+                break
+    dt = time.monotonic() - t0
+    text = "".join(text_parts)
+    print(f"[{idx}] << ({dt:.2f}s) {text}")
+    return text, dt
+async def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--base-url", default="http://127.0.0.1:8000")
+    p.add_argument("--rounds", type=int, default=1)
+    p.add_argument("--prefix-demo", action="store_true",
+                   help="send same prompt 3x to show prefix cache speedup")
+    args = p.parse_args()
+    async with httpx.AsyncClient(base_url=args.base_url) as client:
+        if args.prefix_demo:
+            prompt = PROMPTS[0]
+            for i in range(3):
+                await one(client, prompt, i)
+            return
+        for r in range(args.rounds):
+            # `prompt` (not `p`) avoids shadowing the ArgumentParser variable above.
+            tasks = [one(client, prompt, i + r * len(PROMPTS)) for i, prompt in enumerate(PROMPTS)]
+            await asyncio.gather(*tasks)
+if __name__ == "__main__":
+    asyncio.run(main())

pyproject.toml ADDED

	@@ -0,0 +1,27 @@

+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "tiny_vllm"
+version = "0.1.0"
+description = "Minimal continuous-batching LLM engine for learning vLLM/SGLang internals"
+readme = "README.md"
+requires-python = ">=3.10"
+license = { text = "MIT" }
+authors = [{ name = "Tiny vLLM" }]
+dependencies = [
+    "torch>=2.2",
+    "transformers>=4.45",
+    "fastapi>=0.110",
+    "uvicorn[standard]>=0.27",
+    "pydantic>=2.5",
+    "numpy",
+    "httpx>=0.27",
+]
+[project.scripts]
+tiny-vllm-server = "tiny_vllm.server:main"
+[tool.setuptools.packages.find]
+include = ["tiny_vllm*"]

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+torch>=2.2
+transformers>=4.45
+fastapi>=0.110
+uvicorn[standard]>=0.27
+pydantic>=2.5
+numpy
+httpx>=0.27

tests/__init__.py ADDED

File without changes

tests/test_block_manager.py ADDED

	@@ -0,0 +1,121 @@

+"""Unit tests for the BlockManager.  No model required."""
+from __future__ import annotations
+import pytest
+from tiny_vllm.block_manager import BlockManager
+from tiny_vllm.config import SamplingParams
+from tiny_vllm.request import Sequence
+def make_seq(prompt_ids: list[int]) -> Sequence:
+    return Sequence(
+        prompt_token_ids=list(prompt_ids),
+        sampling_params=SamplingParams(),
+        request_id=f"r{prompt_ids[0]}",
+    )
+def test_admit_and_free_round_trips_blocks():
+    bm = BlockManager(num_blocks=8, block_size=4)
+    seq = make_seq(list(range(10)))  # 10 tokens -> needs ceil(10/4)=3 blocks
+    bm.admit(seq)
+    assert len(seq.block_table) == 3
+    assert bm.num_free_blocks == 8 - 3
+    bm.free(seq)
+    # After free, blocks are returned to free pool (cached or uncached).
+    assert bm.num_free_blocks == 8
+def test_prefix_cache_hit_skips_recomputation():
+    bm = BlockManager(num_blocks=16, block_size=4, enable_prefix_caching=True)
+    s1 = make_seq([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # 10 tokens
+    bm.admit(s1)
+    assert s1.num_cached_prefix_tokens == 0  # nothing in cache yet
+    # The two full blocks (positions 0-3, 4-7) get hashed at admit time.
+    s2 = make_seq([1, 2, 3, 4, 5, 6, 7, 8, 99, 100])  # same prefix, diff tail
+    bm.admit(s2)
+    assert s2.num_cached_prefix_tokens == 8  # both full blocks shared
+    # First two blocks of s2 should equal first two of s1 (shared).
+    assert s2.block_table[0] == s1.block_table[0]
+    assert s2.block_table[1] == s1.block_table[1]
+    # Tail blocks differ.
+    assert s2.block_table[2] != s1.block_table[2]
+def test_prefix_cache_never_covers_full_prompt():
+    """If the entire prompt block-aligns AND is cached, we must still leave
+    at least one block for forward-pass (otherwise we'd have no logits)."""
+    bm = BlockManager(num_blocks=8, block_size=4)
+    s1 = make_seq([1, 2, 3, 4, 5, 6, 7, 8])  # exactly 2 blocks
+    bm.admit(s1)
+    s2 = make_seq([1, 2, 3, 4, 5, 6, 7, 8])  # identical
+    bm.admit(s2)
+    # Of the two blocks, one should be cached-shared, the second freshly allocated.
+    assert s2.num_cached_prefix_tokens == 4
+    assert len(s2.block_table) == 2
+    assert s2.block_table[0] == s1.block_table[0]
+    # The second block is freshly allocated: its hash was registered when s1
+    # was admitted, but the full-prompt cap prevents s2 from sharing it.
+    assert s2.block_table[1] != s1.block_table[1]
+    assert s2.num_cached_prefix_tokens < s2.prompt_len
+def test_refcounts_track_sharing():
+    bm = BlockManager(num_blocks=8, block_size=4)
+    s1 = make_seq([1, 2, 3, 4, 5, 6, 7, 8, 9])
+    bm.admit(s1)
+    free_after_s1 = bm.num_free_blocks  # 8 - 3 = 5
+    # s2 shares only the first full block of s1 (tokens 0..3).
+    s2 = make_seq([1, 2, 3, 4, 88, 88, 88, 88, 100])
+    bm.admit(s2)
+    shared_block = s1.block_table[0]
+    assert s2.block_table[0] == shared_block
+    assert bm.blocks[shared_block].ref_count == 2
+    # s2 needs 3 blocks; 1 shared + 2 fresh.
+    assert bm.num_free_blocks == free_after_s1 - 2
+    bm.free(s1)
+    # Shared block drops to refcount 1 (s2 still owns it).
+    assert bm.blocks[shared_block].ref_count == 1
+def test_can_evict_cached_block_under_pressure():
+    """When out of uncached free blocks, an unused cached block can be evicted."""
+    bm = BlockManager(num_blocks=2, block_size=4)
+    s1 = make_seq([1, 2, 3, 4])  # exactly 1 block, will be hashed
+    bm.admit(s1)
+    bm.free(s1)  # block now refcount=0 but cached
+    assert bm.num_free_blocks == 2
+    # Allocate enough to require evicting the cached block.
+    s2 = make_seq([10, 20, 30, 40, 50, 60, 70, 80])  # needs 2 blocks
+    bm.admit(s2)
+    assert len(s2.block_table) == 2
+    # The cached block from s1 should have been evicted (hash_key cleared)
+    # since we have no other choice.
+    used_blocks = set(s2.block_table)
+    assert len(used_blocks) == 2
+def test_append_slot_grows_block_table_when_crossing_boundary():
+    # `append_slot` ensures capacity for the NEXT token (to be sampled this
+    # step), before we actually append it.
+    bm = BlockManager(num_blocks=8, block_size=4)
+    seq = make_seq([1, 2, 3])  # 3 tokens, in 1 block (slot 0..2 used; slot 3 free)
+    bm.admit(seq)
+    assert len(seq.block_table) == 1
+    # Ask for a slot for token at position 3 → still fits in block 0.
+    assert bm.append_slot(seq) is None
+    assert len(seq.block_table) == 1
+    seq.output_token_ids.append(99)  # commit (sampler did the work)
+    # Ask for a slot for token at position 4 → needs a new block.
+    new_blk = bm.append_slot(seq)
+    assert new_blk is not None
+    assert len(seq.block_table) == 2
+    seq.output_token_ids.append(100)
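
Reviewer note: the cached-token counts asserted in these tests (8 of 10 tokens, 4 of 8 tokens) follow from the cap applied in `BlockManager.admit`. The arithmetic can be isolated as below. This is a sketch under the assumption that every eligible block is already cached. The function name `max_cached_prefix_tokens` is hypothetical and is not part of the repository.

```python
def max_cached_prefix_tokens(prompt_len: int, block_size: int) -> int:
    """Upper bound on prompt tokens that admit() may serve from the prefix cache."""
    num_full = prompt_len // block_size
    # A block-aligned prompt gives up its last full block so that at least
    # one token is forwarded through the model to produce logits.
    if prompt_len % block_size == 0:
        num_full = max(0, num_full - 1)
    return num_full * block_size


print(max_cached_prefix_tokens(10, 4))  # 8
print(max_cached_prefix_tokens(8, 4))   # 4
print(max_cached_prefix_tokens(4, 4))   # 0
print(max_cached_prefix_tokens(3, 4))   # 0
```

The result is always strictly less than `prompt_len` for a non-empty prompt, which is the invariant `test_prefix_cache_never_covers_full_prompt` checks.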

tests/test_scheduler.py ADDED

	@@ -0,0 +1,103 @@

+"""Scheduler logic tests, model-free."""
+from __future__ import annotations
+from tiny_vllm.block_manager import BlockManager
+from tiny_vllm.config import EngineConfig, SamplingParams
+from tiny_vllm.request import Sequence, SequenceStatus
+from tiny_vllm.scheduler import Scheduler
+def _engine_cfg(**kw) -> EngineConfig:
+    cfg = EngineConfig(
+        model="ignored", block_size=4, num_blocks=8,
+        max_num_seqs=4, max_num_batched_tokens=8, max_model_len=128,
+    )
+    for k, v in kw.items():
+        setattr(cfg, k, v)
+    return cfg
+def _seq(ids: list[int]) -> Sequence:
+    return Sequence(prompt_token_ids=list(ids),
+                    sampling_params=SamplingParams(max_tokens=4),
+                    request_id=f"r{ids[0]}")
+def test_short_prompt_fully_prefilled_in_one_step():
+    cfg = _engine_cfg()
+    bm = BlockManager(cfg.num_blocks, cfg.block_size)
+    sch = Scheduler(cfg, bm)
+    s = _seq([1, 2, 3, 4, 5])  # 5 tokens, fits in budget=8
+    sch.add(s)
+    out = sch.schedule()
+    assert len(out.scheduled) == 1
+    assert out.scheduled[0].num_tokens == 5
+    assert out.scheduled[0].is_prefill
+    assert s in sch.running
+def test_chunked_prefill_splits_long_prompt_across_steps():
+    cfg = _engine_cfg(max_num_batched_tokens=4)
+    bm = BlockManager(cfg.num_blocks, cfg.block_size)
+    sch = Scheduler(cfg, bm)
+    s = _seq([1, 2, 3, 4, 5, 6, 7, 8, 9])  # 9 tokens vs budget=4
+    sch.add(s)
+    out1 = sch.schedule()
+    assert out1.scheduled[0].num_tokens == 4
+    assert s.status == SequenceStatus.PREFILLING
+    # Engine would update num_computed_tokens after model fwd; simulate:
+    s.num_computed_tokens += 4
+    out2 = sch.schedule()
+    assert out2.scheduled[0].num_tokens == 4
+    s.num_computed_tokens += 4
+    out3 = sch.schedule()
+    # Last chunk: 1 token left → fills, transitions to RUNNING.
+    assert out3.scheduled[0].num_tokens == 1
+    s.num_computed_tokens += 1
+    assert s in sch.running
+def test_decodes_interleave_with_prefills():
+    cfg = _engine_cfg(max_num_batched_tokens=6)
+    bm = BlockManager(cfg.num_blocks, cfg.block_size)
+    sch = Scheduler(cfg, bm)
+    # Get a sequence fully into RUNNING state.
+    runner = _seq([1, 2, 3, 4, 5])
+    sch.add(runner)
+    out0 = sch.schedule()
+    assert out0.scheduled and out0.scheduled[0].num_tokens == 5
+    # Simulate model forward.
+    runner.num_computed_tokens = runner.prompt_len
+    assert runner.status == SequenceStatus.RUNNING
+    # New waiting seq.
+    waiter = _seq([100, 101, 102])
+    sch.add(waiter)
+    out = sch.schedule()
+    # runner decodes 1 token, waiter prefills 3; both fit in budget=6.
+    assert any(not it.is_prefill and it.num_tokens == 1 and it.seq is runner for it in out.scheduled)
+    assert any(it.is_prefill and it.num_tokens == 3 and it.seq is waiter for it in out.scheduled)
+def test_preemption_triggers_when_blocks_exhaust():
+    """When a decoding sequence needs a new block but the pool is dry, the
+    scheduler preempts the youngest running seq (here, only itself) and
+    re-enqueues it.  schedule() must not crash."""
+    cfg = _engine_cfg(num_blocks=2, block_size=4, max_num_batched_tokens=16)
+    bm = BlockManager(cfg.num_blocks, cfg.block_size)
+    sch = Scheduler(cfg, bm)
+    s1 = _seq([1, 2, 3, 4, 5, 6, 7])  # 7 tokens -> 2 blocks (the whole pool), one slot spare
+    sch.add(s1)
+    sch.schedule()
+    s1.num_computed_tokens = s1.prompt_len
+    # Push s1 to the brink: pretend it has decoded enough to fill block 2.
+    s1.output_token_ids.extend([99] * (8 - 7))  # total_len = 8, fits in 2 blocks
+    # Next decode (position 8) would require a 3rd block; only 2 exist.
+    out = sch.schedule()
+    # s1 should have been preempted (and may then be re-admitted in the same
+    # step via prefix cache — what matters is the preempt event fired).
+    assert s1.seq_id in out.preempted
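
Reviewer note: the token counts these tests expect (4, 4, 1 for a 9-token prompt under a budget of 4, and a decode plus a 3-token prefill under a budget of 6) are consistent with a simple two-phase budget split. The sketch below is an assumption about the general shape of such a loop, not the contents of `scheduler.py`: it serves decodes before prefills (an ordering this diff view does not confirm) and ignores block allocation, `max_num_seqs`, and preemption. `Item` and `pack_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    num_tokens: int
    is_prefill: bool


def pack_step(decoding: list[str], prefilling: dict[str, int], budget: int) -> list[Item]:
    """Pack one step. `prefilling` maps sequence name -> remaining prompt tokens."""
    items: list[Item] = []
    # Phase 1: each decoding sequence consumes exactly one token of budget.
    for name in decoding:
        if budget == 0:
            break
        items.append(Item(name, 1, False))
        budget -= 1
    # Phase 2: leftover budget is handed out as prompt chunks.
    for name, remaining in prefilling.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)
        items.append(Item(name, chunk, True))
        budget -= chunk
    return items


print(pack_step(["runner"], {"waiter": 3}, budget=6))
print(pack_step([], {"long": 9}, budget=4))
```

Slicing the prompt with `min(remaining, budget)` is what keeps a long prefill from monopolising a step while other sequences are decoding.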

tiny_vllm/__init__.py ADDED

	@@ -0,0 +1,28 @@

+"""Tiny vLLM — a minimal continuous-batching engine.
+Educational reimplementation of the core vLLM/SGLang ideas:
+paged KV cache, prefix caching, continuous batching with chunked prefill,
+and SSE streaming over a thin HTTP layer.
+"""
+# Lazy re-exports: importing this package should not pull in torch, so the
+# lightweight block_manager/scheduler can be unit-tested without it.
+from .config import EngineConfig, SamplingParams
+from .request import Request, Sequence, SequenceStatus
+__all__ = [
+    "EngineConfig",
+    "SamplingParams",
+    "LLMEngine",
+    "Request",
+    "Sequence",
+    "SequenceStatus",
+]
+def __getattr__(name: str):
+    if name == "LLMEngine":
+        from .engine import LLMEngine
+        return LLMEngine
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

tiny_vllm/block_manager.py ADDED

	@@ -0,0 +1,265 @@

+"""Paged KV-cache block manager with hash-based automatic prefix caching.
+Concepts (matching vLLM / SGLang terminology):
+  Physical block:  a fixed-size slot in the KV-cache pool that holds the K and V
+                   tensors for ``block_size`` consecutive tokens of one sequence.
+  Block table:     per-sequence list of physical block ids that holds the
+                   sequence's KV in logical order.  Position ``p`` of the
+                   sequence lives in physical block ``block_table[p // B]`` at
+                   slot ``p % B``.
+  Prefix cache:    a content-addressed lookup from
+                       hash(prev_block_hash, tuple_of_token_ids_in_block)
+                   to a physical block id.  When two sequences share a prefix
+                   that aligns to a block boundary, the second sequence can
+                   point its block_table at the cached blocks instead of
+                   recomputing KV, and the scheduler can skip those tokens.
+The "chained" hash means two prefixes match iff they are identical from
+position 0 — exactly the property we need for prefix sharing.
+This manager is allocation-only: it does NOT store the KV tensors.  The
+ModelRunner owns the actual ``[num_blocks, ...]`` tensors and consults the
+block tables here to know where to write/read KV.
+"""
+from __future__ import annotations
+from collections import deque
+from dataclasses import dataclass
+from typing import Optional
+from .request import Sequence
+@dataclass
+class Block:
+    block_id: int
+    ref_count: int = 0
+    hash_key: Optional[int] = None  # set when the block is full and registered
+class BlockManager:
+    def __init__(
+        self,
+        num_blocks: int,
+        block_size: int,
+        enable_prefix_caching: bool = True,
+    ) -> None:
+        self.num_blocks = num_blocks
+        self.block_size = block_size
+        self.enable_prefix_caching = enable_prefix_caching
+        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
+        # Two-tier free list: ephemeral (no hash) reused first, then cached
+        # (preserved as long as we have ephemeral capacity).
+        self._free_uncached: deque[int] = deque(range(num_blocks))
+        self._free_cached: deque[int] = deque()
+        self._cache: dict[int, int] = {}  # hash → block_id
+        # Stats (visible via events).
+        self.prefix_cache_hits = 0
+        self.prefix_cache_lookups = 0
+    # ---- introspection --------------------------------------------------
+    @property
+    def num_free_blocks(self) -> int:
+        return len(self._free_uncached) + len(self._free_cached)
+    @property
+    def num_used_blocks(self) -> int:
+        return self.num_blocks - self.num_free_blocks
+    def snapshot(self) -> dict:
+        """Cheap dict for the event stream / UI."""
+        return {
+            "num_blocks": self.num_blocks,
+            "block_size": self.block_size,
+            "num_free_blocks": self.num_free_blocks,
+            "num_cached_entries": len(self._cache),
+            "prefix_cache_hits": self.prefix_cache_hits,
+            "prefix_cache_lookups": self.prefix_cache_lookups,
+            "ref_counts": [b.ref_count for b in self.blocks],
+            "hashed": [b.hash_key is not None for b in self.blocks],
+        }
+    # ---- low-level pool ops --------------------------------------------
+    def _block_hash(self, prev_hash: Optional[int], token_ids: tuple[int, ...]) -> int:
+        # Python's hash() is not guaranteed to be stable across processes, but
+        # that's fine: the cache only lives for the engine's lifetime.
+        return hash((prev_hash, token_ids))
+    def _take_free_block(self) -> int:
+        if self._free_uncached:
+            bid = self._free_uncached.popleft()
+        elif self._free_cached:
+            bid = self._free_cached.popleft()
+            # Evict its cache entry — we're about to repurpose it.
+            blk = self.blocks[bid]
+            if blk.hash_key is not None:
+                self._cache.pop(blk.hash_key, None)
+                blk.hash_key = None
+        else:
+            raise RuntimeError("BlockManager out of free blocks")
+        blk = self.blocks[bid]
+        blk.ref_count = 1
+        return bid
+    def _share(self, block_id: int) -> None:
+        blk = self.blocks[block_id]
+        if blk.ref_count == 0:
+            # Was sitting in the cached free list; pull it out.
+            try:
+                self._free_cached.remove(block_id)
+            except ValueError:
+                pass
+        blk.ref_count += 1
+    def _release(self, block_id: int) -> None:
+        blk = self.blocks[block_id]
+        blk.ref_count -= 1
+        assert blk.ref_count >= 0, f"block {block_id} refcount went negative"
+        if blk.ref_count == 0:
+            if blk.hash_key is not None and self.enable_prefix_caching:
+                self._free_cached.append(block_id)
+            else:
+                self._free_uncached.append(block_id)
+    def _register(self, block_id: int, hash_key: int) -> None:
+        if not self.enable_prefix_caching:
+            return
+        if hash_key in self._cache:
+            # Two sequences independently produced the same content for
+            # different physical blocks. Keep the older one; this one becomes
+            # ephemeral so it gets reclaimed first.
+            return
+        self.blocks[block_id].hash_key = hash_key
+        self._cache[hash_key] = block_id
+    # ---- per-sequence allocation ---------------------------------------
+    def num_blocks_needed_for(self, num_tokens: int) -> int:
+        return (num_tokens + self.block_size - 1) // self.block_size
+    def can_allocate_initial(self, seq: Sequence) -> tuple[bool, int]:
+        """Worst-case allocation check for the prompt of `seq`, ignoring prefix
+        cache hits.  Returns (ok, num_new_blocks_needed)."""
+        need = self.num_blocks_needed_for(seq.prompt_len)
+        return self.num_free_blocks >= need, need
+    def admit(self, seq: Sequence) -> None:
+        """Set up `seq` in the cache.
+        Walks the prompt block-by-block.  For each full block of *prompt* tokens
+        we already know, check the prefix cache: hit → share; miss → allocate
+        fresh and register the hash now (we know the tokens already).
+        The trailing partial block (if any) is allocated fresh and left
+        un-hashed; it will be hashed by `register_filled_blocks` once it fills up.
+        """
+        assert not seq.block_table, "admit called on an already-admitted sequence"
+        prev_hash: Optional[int] = None
+        cached_tokens = 0
+        prompt = seq.prompt_token_ids
+        B = self.block_size
+        num_full = seq.prompt_len // B
+        # IMPORTANT: never let prefix cache cover the entire prompt — we need
+        # at least one token to forward through the model to get logits for
+        # the first sampled token.  If the full prompt block-aligns AND every
+        # block is cached, drop the last cached block.
+        cap_full = num_full
+        if seq.prompt_len % B == 0:
+            cap_full = max(0, num_full - 1)
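+        # Hits only count while the cached prefix is unbroken from position 0:
+        # `num_computed_tokens` below assumes the cached tokens are contiguous.
+        contiguous = True
+        for i in range(num_full):
+            tokens = tuple(prompt[i * B : (i + 1) * B])
+            h = self._block_hash(prev_hash, tokens)
+            self.prefix_cache_lookups += 1
+            if (self.enable_prefix_caching and contiguous
+                    and h in self._cache and i < cap_full):
+                # Cache hit.
+                self.prefix_cache_hits += 1
+                bid = self._cache[h]
+                self._share(bid)
+                seq.block_table.append(bid)
+                cached_tokens += B
+                prev_hash = h
+            else:
+                # Miss: allocate, and since the block content is fully known
+                # (prompt tokens), register its hash right away so the next
+                # request with this prefix can hit.  After the first miss,
+                # later blocks are recomputed too, even if their hash is
+                # still cached (an earlier block may have been evicted first).
+                contiguous = False
+                bid = self._take_free_block()
+                self._register(bid, h)
+                seq.block_table.append(bid)
+                prev_hash = h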
+        # Trailing partial block, if any.
+        if seq.prompt_len % B != 0:
+            bid = self._take_free_block()
+            seq.block_table.append(bid)
+        seq.num_computed_tokens = cached_tokens
+        seq.num_cached_prefix_tokens = cached_tokens
+    def append_slot(self, seq: Sequence) -> Optional[int]:
+        """Ensure `seq` has a slot for one more token (decode path).
+        Returns the block_id that was newly allocated, or None if existing
+        capacity already covered the new token.  Raises if no block available.
+        """
+        new_position = seq.total_len  # 0-indexed slot we are about to write
+        needed_blocks = self.num_blocks_needed_for(new_position + 1)
+        if needed_blocks <= len(seq.block_table):
+            return None
+        if self.num_free_blocks == 0:
+            raise RuntimeError("out of blocks")
+        bid = self._take_free_block()
+        seq.block_table.append(bid)
+        return bid
+    def ensure_blocks_for_chunk(self, seq: Sequence, chunk_tokens: int) -> int:
+        """Prefill path: make sure `seq.block_table` covers
+        `seq.num_computed_tokens + chunk_tokens` tokens.
+        Returns number of newly-allocated blocks.
+        """
+        target = seq.num_computed_tokens + chunk_tokens
+        needed = self.num_blocks_needed_for(target)
+        new_alloc = 0
+        while len(seq.block_table) < needed:
+            bid = self._take_free_block()
+            seq.block_table.append(bid)
+            new_alloc += 1
+        return new_alloc
+    def free(self, seq: Sequence) -> None:
+        for bid in seq.block_table:
+            self._release(bid)
+        seq.block_table.clear()
+    # ---- post-step bookkeeping -----------------------------------------
+    def register_filled_blocks(self, seq: Sequence, prev_computed: int) -> None:
+        """After a forward pass, hash & register any blocks that just became
+        full so future requests can prefix-cache them."""
+        if not self.enable_prefix_caching:
+            return
+        B = self.block_size
+        # Re-chain hashes from the start so we always have prev_hash correct.
+        prev_hash: Optional[int] = None
+        for i in range(seq.num_computed_tokens // B):
+            bid = seq.block_table[i]
+            blk = self.blocks[bid]
+            if blk.hash_key is not None:
+                prev_hash = blk.hash_key
+                continue
+            # This block became full in this step (or earlier but unhashed).
+            if (i + 1) * B > seq.num_computed_tokens:
+                break  # not actually full yet — defensive
+            tokens = tuple(seq.get_token(i * B + j) for j in range(B))
+            h = self._block_hash(prev_hash, tokens)
+            self._register(bid, h)
+            prev_hash = h
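
Reviewer note: the module docstring says the chained hash makes two prefixes match only if they are identical from position 0. A standalone sketch of that property follows. It mirrors the `hash((prev_hash, token_ids))` scheme used by `_block_hash`, but `chain_hashes` itself is a hypothetical helper written for illustration.

```python
from typing import Optional


def chain_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Return one chained hash per full block; a trailing partial block is skipped."""
    hashes: list[int] = []
    prev: Optional[int] = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        prev = hash((prev, block))  # each hash folds in everything before it
        hashes.append(prev)
    return hashes


a = chain_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], block_size=4)
b = chain_hashes([1, 2, 3, 4, 5, 6, 7, 8, 99, 100], block_size=4)
c = chain_hashes([0, 2, 3, 4, 5, 6, 7, 8, 9, 10], block_size=4)

print(len(a))        # 2 full blocks; the 2-token tail is not hashed
print(a == b)        # True: both full blocks are identical
# Block 1 holds the same tokens in `a` and `c`, but its hash depends on
# block 0, so the two are expected to differ (barring a hash collision).
print(a[1] == c[1])
```

This is why a dictionary lookup on a single block hash is enough to confirm the entire preceding prefix, with no need to compare earlier blocks.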

tiny_vllm/config.py ADDED

	@@ -0,0 +1,49 @@

+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Optional
+@dataclass
+class EngineConfig:
+    # Model
+    model: str = "Qwen/Qwen2.5-0.5B-Instruct"
+    dtype: str = "float32"  # "float32" on CPU; "float16"/"bfloat16" on GPU
+    device: str = "cpu"     # "cpu" or "cuda"
+    trust_remote_code: bool = False
+    # Paged KV cache
+    block_size: int = 16          # tokens per physical block
+    num_blocks: int = 512         # total physical blocks in the pool
+    enable_prefix_caching: bool = True
+    # Scheduler
+    max_num_seqs: int = 16                # max sequences in a batch
+    max_num_batched_tokens: int = 512     # total tokens processed per step
+    max_model_len: int = 2048             # upper bound on prompt + generated tokens
+    # Logging / events
+    emit_events: bool = True              # produce engine events for the UI
+    event_buffer: int = 256
+    def __post_init__(self) -> None:
+        if self.max_num_batched_tokens < self.block_size:
+            raise ValueError(
+                "max_num_batched_tokens must be >= block_size "
+                f"({self.max_num_batched_tokens} < {self.block_size})"
+            )
+@dataclass
+class SamplingParams:
+    max_tokens: int = 64
+    temperature: float = 1.0
+    top_p: float = 1.0
+    top_k: int = -1                       # -1 disables top-k
+    stop_token_ids: list[int] = field(default_factory=list)
+    seed: Optional[int] = None
+    ignore_eos: bool = False
+    @property
+    def is_greedy(self) -> bool:
+        return self.temperature <= 0.0
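
As a rough guide to what `block_size` and `num_blocks` cost in memory, the sketch below sizes the KV pool. The helper names are hypothetical, and the model dimensions in the example call (24 layers, 2 KV heads, head dimension 64) are illustrative assumptions rather than values read from this diff.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Physical blocks required to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


def kv_pool_bytes(num_layers: int, num_blocks: int, block_size: int,
                  num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes for per-layer K and V tensors of shape
    [num_blocks, block_size, num_kv_heads, head_dim]."""
    per_tensor = num_blocks * block_size * num_kv_heads * head_dim * bytes_per_elem
    return 2 * num_layers * per_tensor  # factor 2: one tensor for K, one for V


# EngineConfig defaults: 512 blocks of 16 tokens, float32 (4 bytes per element).
capacity_tokens = 512 * 16
pool = kv_pool_bytes(num_layers=24, num_blocks=512, block_size=16,
                     num_kv_heads=2, head_dim=64, bytes_per_elem=4)
print(capacity_tokens, blocks_needed(2048, 16), pool / 2**20)
```

Under these assumptions the default pool holds 8192 token slots, so four sequences at the default `max_model_len` of 2048 would fill it completely.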

tiny_vllm/engine.py ADDED Viewed

	@@ -0,0 +1,385 @@

+"""LLMEngine: orchestrates scheduler + block manager + model runner + sampler.
+Public surface:
+  engine = LLMEngine(EngineConfig(...))
+  await engine.startup()
+  rid = engine.add_request(prompt_text, SamplingParams(...))
+  async for delta in engine.stream(rid):
+      ...
+A single background task (`_run_loop`) drives the model.  Per-request output
+goes through asyncio queues so the HTTP layer can stream incrementally.  A
+second pub/sub channel emits engine-state snapshots for the visualization UI.
+"""
+from __future__ import annotations
+import asyncio
+import itertools
+import time
+import uuid
+from collections import deque
+from dataclasses import dataclass, field
+from typing import AsyncIterator, Optional
+from .block_manager import BlockManager
+from .config import EngineConfig, SamplingParams
+from .model_runner import ModelRunner
+from .request import Sequence, SequenceStatus
+from .sampler import Sampler
+from .scheduler import Scheduler
+@dataclass
+class StreamItem:
+    request_id: str
+    new_text: str
+    new_token_ids: list[int]
+    finished: bool
+    finish_reason: Optional[str] = None
+    cumulative_text: str = ""
+@dataclass
+class EngineEvent:
+    step: int
+    timestamp: float
+    type: str
+    payload: dict = field(default_factory=dict)
+class LLMEngine:
+    def __init__(self, config: EngineConfig) -> None:
+        self.config = config
+        self.model_runner: Optional[ModelRunner] = None
+        self.block_manager: Optional[BlockManager] = None
+        self.scheduler: Optional[Scheduler] = None
+        self.sampler: Optional[Sampler] = None
+        # request_id → asyncio.Queue[StreamItem]
+        self._output_queues: dict[str, asyncio.Queue[StreamItem]] = {}
+        # request_id → Sequence (for inspection / abort)
+        self._sequences: dict[str, Sequence] = {}
+        # tracker for incremental detokenization
+        self._prev_text_len: dict[str, int] = {}
+        # event subscribers
+        self._event_subscribers: list[asyncio.Queue[EngineEvent]] = []
+        # control
+        self._stop = asyncio.Event()
+        self._step_idx = 0
+        self._run_task: Optional[asyncio.Task] = None
+        self._wake = asyncio.Event()
+    # ---- lifecycle ------------------------------------------------------
+    async def startup(self) -> None:
+        # Heavy: model load happens in a worker thread so we don't block the loop.
+        loop = asyncio.get_running_loop()
+        def _build() -> ModelRunner:
+            return ModelRunner(self.config)
+        self.model_runner = await loop.run_in_executor(None, _build)
+        self.block_manager = BlockManager(
+            num_blocks=self.config.num_blocks,
+            block_size=self.config.block_size,
+            enable_prefix_caching=self.config.enable_prefix_caching,
+        )
+        self.scheduler = Scheduler(self.config, self.block_manager)
+        self.sampler = Sampler(self.model_runner.device)
+        self._run_task = asyncio.create_task(self._run_loop())
+    async def shutdown(self) -> None:
+        self._stop.set()
+        self._wake.set()
+        if self._run_task is not None:
+            try:
+                await asyncio.wait_for(self._run_task, timeout=5)
+            except asyncio.TimeoutError:
+                self._run_task.cancel()
+    # ---- request submission --------------------------------------------
+    def add_request(
+        self,
+        prompt: str | list[int],
+        sampling_params: SamplingParams,
+        request_id: Optional[str] = None,
+    ) -> str:
+        if self.model_runner is None:
+            raise RuntimeError("engine not started")
+        if isinstance(prompt, str):
+            token_ids = self.model_runner.encode(prompt)
+        else:
+            token_ids = list(prompt)
+        if not token_ids:
+            raise ValueError("empty prompt")
+        if len(token_ids) >= self.config.max_model_len:
+            raise ValueError(
+                f"prompt length {len(token_ids)} >= max_model_len {self.config.max_model_len}"
+            )
+        rid = request_id or uuid.uuid4().hex
+        seq = Sequence(
+            prompt_token_ids=token_ids,
+            sampling_params=sampling_params,
+            request_id=rid,
+        )
+        self._sequences[rid] = seq
+        self._output_queues[rid] = asyncio.Queue()
+        self._prev_text_len[rid] = 0
+        assert self.scheduler is not None
+        self.scheduler.add(seq)
+        self._wake.set()
+        return rid
+    def abort(self, request_id: str) -> bool:
+        seq = self._sequences.get(request_id)
+        if seq is None:
+            return False
+        assert self.scheduler is not None
+        ok = self.scheduler.abort(seq.seq_id)
+        if ok:
+            self._close_request(request_id, finish_reason="abort")
+        return ok
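+    async def stream(self, request_id: str) -> AsyncIterator[StreamItem]:
+        q = self._output_queues.get(request_id)
+        if q is None:
+            raise KeyError(request_id)
+        while True:
+            item = await q.get()
+            if item.finished:
+                # Drop the queue once the final item has been read so that
+                # finished requests do not accumulate in _output_queues.
+                self._output_queues.pop(request_id, None)
+            yield item
+            if item.finished:
+                break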
+    # ---- event subscriptions -------------------------------------------
+    def subscribe_events(self) -> asyncio.Queue[EngineEvent]:
+        q: asyncio.Queue[EngineEvent] = asyncio.Queue(maxsize=self.config.event_buffer)
+        self._event_subscribers.append(q)
+        return q
+    def unsubscribe_events(self, q: asyncio.Queue[EngineEvent]) -> None:
+        try:
+            self._event_subscribers.remove(q)
+        except ValueError:
+            pass
+    def _emit(self, event_type: str, payload: dict) -> None:
+        if not self.config.emit_events or not self._event_subscribers:
+            return
+        ev = EngineEvent(
+            step=self._step_idx,
+            timestamp=time.monotonic(),
+            type=event_type,
+            payload=payload,
+        )
+        for q in list(self._event_subscribers):
+            try:
+                q.put_nowait(ev)
+            except asyncio.QueueFull:
+                # Drop oldest, push new.
+                try:
+                    q.get_nowait()
+                except asyncio.QueueEmpty:
+                    pass
+                try:
+                    q.put_nowait(ev)
+                except asyncio.QueueFull:
+                    pass
+    # ---- inspection ----------------------------------------------------
+    def snapshot(self) -> dict:
+        assert self.block_manager is not None and self.scheduler is not None
+        def seq_view(s: Sequence) -> dict:
+            return {
+                "seq_id": s.seq_id,
+                "request_id": s.request_id,
+                "status": s.status.value,
+                "prompt_len": s.prompt_len,
+                "num_generated": len(s.output_token_ids),
+                "num_computed_tokens": s.num_computed_tokens,
+                "num_cached_prefix_tokens": s.num_cached_prefix_tokens,
+                "block_table": list(s.block_table),
+            }
+        return {
+            "step": self._step_idx,
+            "block_pool": self.block_manager.snapshot(),
+            "waiting": [seq_view(s) for s in self.scheduler.waiting],
+            "running": [seq_view(s) for s in self.scheduler.running],
+            "config": {
+                "model": self.config.model,
+                "block_size": self.config.block_size,
+                "num_blocks": self.config.num_blocks,
+                "max_num_seqs": self.config.max_num_seqs,
+                "max_num_batched_tokens": self.config.max_num_batched_tokens,
+                "prefix_caching": self.config.enable_prefix_caching,
+            },
+        }
+    # ---- main loop -----------------------------------------------------
+    async def _run_loop(self) -> None:
+        assert self.scheduler is not None and self.model_runner is not None
+        loop = asyncio.get_running_loop()
+        while not self._stop.is_set():
+            if not self.scheduler.has_work:
+                self._wake.clear()
+                try:
+                    await asyncio.wait_for(self._wake.wait(), timeout=1.0)
+                except asyncio.TimeoutError:
+                    pass
+                continue
+            self._step_idx += 1
+            t0 = time.monotonic()
+            sched = self.scheduler.schedule()
+            if sched.is_empty:
+                # Nothing got through this step (probably starved on blocks).
+                await asyncio.sleep(0.01)
+                continue
+            model_input = self.model_runner.prepare_input(sched.scheduled)
+            # Run blocking model forward off-thread.
+            logits = await loop.run_in_executor(None, self.model_runner.execute, model_input)
+            # Update num_computed_tokens AFTER forward (the K/V is now stored).
+            for item in sched.scheduled:
+                item.seq.num_computed_tokens += item.num_tokens
+            # Sample only for sequences whose computed tokens now cover the
+            # whole prompt: those that just finished prefill (the last token
+            # of their chunk was the final prompt token) and those decoding.
+            sampling_indices = [i for i, item in enumerate(sched.scheduled)
+                                if item.seq.num_computed_tokens >= item.seq.prompt_len]
+            sampling_items = [sched.scheduled[i] for i in sampling_indices]
+            new_tokens: dict[int, int] = {}
+            if sampling_items:
+                import torch  # local; cheap
+                sampling_logits = logits.index_select(
+                    0, torch.tensor(sampling_indices, device=logits.device)
+                )
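+                params = [item.seq.sampling_params for item in sampling_items]
+                # Offset the seed by the number of tokens generated so far:
+                # re-seeding with the bare seed on every step would replay the
+                # same random draw for each token of a seeded request.
+                generators = [
+                    (torch.Generator(device=logits.device).manual_seed(
+                        item.seq.sampling_params.seed + len(item.seq.output_token_ids))
+                     if item.seq.sampling_params.seed is not None else None)
+                    for item in sampling_items
+                ]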
+                token_ids = self.sampler.sample(sampling_logits, params, generators)
+                for item, tok in zip(sampling_items, token_ids):
+                    new_tokens[item.seq.seq_id] = tok
+            # Apply new tokens, check stopping, register filled blocks.
+            assert self.block_manager is not None
+            finished_now: list[Sequence] = []
+            for item in sched.scheduled:
+                seq = item.seq
+                if seq.seq_id in new_tokens:
+                    tok = new_tokens[seq.seq_id]
+                    seq.append_output_token(tok)
+                    # The sampled token has no K/V yet; it is written on the
+                    # NEXT step, when this token is the input.  Blocks are
+                    # hashed only once their K/V exists, so this call registers
+                    # the blocks completed by the forward pass that just ran
+                    # (per num_computed_tokens), never the block the new token
+                    # will land in.
+                    self.block_manager.register_filled_blocks(seq, prev_computed=0)
+                    if self._should_stop(seq, tok):
+                        seq.status = SequenceStatus.FINISHED
+                        seq.finish_reason = self._stop_reason(seq, tok)
+                        finished_now.append(seq)
+                else:
+                    # Still in prefill; just register newly filled prompt blocks.
+                    self.block_manager.register_filled_blocks(seq, prev_computed=0)
+            # Free finished sequences.
+            for seq in finished_now:
+                if seq in self.scheduler.running:
+                    self.scheduler.running.remove(seq)
+                self.block_manager.free(seq)
+            # Emit outputs to per-request queues.
+            for item in sched.scheduled:
+                seq = item.seq
+                rid = seq.request_id
+                if seq.seq_id in new_tokens or seq in finished_now:
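+                    # Decode generated tokens only: _prev_text_len starts at 0,
+                    # so decoding prompt + output would put the prompt text in
+                    # the first delta.
+                    new_text, new_text_len = self.model_runner.detokenize_incremental(
+                        seq.output_token_ids, self._prev_text_len.get(rid, 0)
+                    )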
+                    self._prev_text_len[rid] = new_text_len
+                    is_done = seq.status == SequenceStatus.FINISHED
+                    new_toks = [new_tokens[seq.seq_id]] if seq.seq_id in new_tokens else []
+                    si = StreamItem(
+                        request_id=rid,
+                        new_text=new_text,
+                        new_token_ids=new_toks,
+                        finished=is_done,
+                        finish_reason=seq.finish_reason,
+                        cumulative_text=self.model_runner.tokenizer.decode(
+                            seq.output_token_ids, skip_special_tokens=True
+                        ),
+                    )
+                    q = self._output_queues.get(rid)
+                    if q is not None:
+                        await q.put(si)
+                    if is_done:
+                        # Clean up.
+                        self._sequences.pop(rid, None)
+                        self._prev_text_len.pop(rid, None)
+            # Emit engine events for the UI.
+            self._emit("step", {
+                "duration_ms": (time.monotonic() - t0) * 1000,
+                "num_seqs": len(sched.scheduled),
+                "num_tokens": sched.total_tokens,
+                "num_prefill_seqs": sum(1 for it in sched.scheduled if it.is_prefill),
+                "num_decode_seqs": sum(1 for it in sched.scheduled if not it.is_prefill),
+                "preempted": sched.preempted,
+                "newly_admitted": sched.newly_admitted,
+                "finished": [s.request_id for s in finished_now],
+                "snapshot": self.snapshot(),
+            })
+            # Yield control between steps so the HTTP layer can ship bytes.
+            await asyncio.sleep(0)
+    # ---- helpers -------------------------------------------------------
+    def _should_stop(self, seq: Sequence, last_token: int) -> bool:
+        sp = seq.sampling_params
+        if len(seq.output_token_ids) >= sp.max_tokens:
+            return True
+        if not sp.ignore_eos:
+            eos = self.model_runner.eos_token_id if self.model_runner else None
+            if eos is not None and last_token == eos:
+                return True
+        if last_token in sp.stop_token_ids:
+            return True
+        if seq.total_len >= self.config.max_model_len:
+            return True
+        return False
+    def _stop_reason(self, seq: Sequence, last_token: int) -> str:
+        sp = seq.sampling_params
+        if len(seq.output_token_ids) >= sp.max_tokens:
+            return "length"
+        if seq.total_len >= self.config.max_model_len:
+            return "length"
+        return "stop"
+    def _close_request(self, request_id: str, finish_reason: str) -> None:
+        q = self._output_queues.get(request_id)
+        if q is None:
+            return
+        q.put_nowait(StreamItem(
+            request_id=request_id,
+            new_text="",
+            new_token_ids=[],
+            finished=True,
+            finish_reason=finish_reason,
+        ))
+        self._sequences.pop(request_id, None)
+        self._prev_text_len.pop(request_id, None)
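
`_emit` keeps slow UI subscribers from stalling the engine by discarding the oldest queued event when a subscriber's queue is full. The same policy can be written as a standalone helper; this is a minimal sketch, and the name `put_drop_oldest` is hypothetical.

```python
import asyncio


def put_drop_oldest(q: asyncio.Queue, item) -> None:
    """Enqueue without blocking; if the queue is full, evict the oldest entry first."""
    try:
        q.put_nowait(item)
    except asyncio.QueueFull:
        try:
            q.get_nowait()  # discard the oldest entry
        except asyncio.QueueEmpty:
            pass
        try:
            q.put_nowait(item)
        except asyncio.QueueFull:
            pass  # defensive: give up on this item rather than block


async def demo() -> list[int]:
    q: asyncio.Queue = asyncio.Queue(maxsize=2)
    for i in range(5):
        put_drop_oldest(q, i)
    # Drain whatever survived.
    return [q.get_nowait() for _ in range(q.qsize())]


remaining = asyncio.run(demo())
```

With a capacity of two and five events pushed, only the two most recent events remain. The producer never awaits, which is the property the engine loop relies on.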

tiny_vllm/model_runner.py ADDED Viewed

	@@ -0,0 +1,392 @@

+"""Minimal Qwen2 forward pass that consumes a paged KV cache.
+We deliberately re-implement Qwen2 from scratch (rather than using the HF
+forward) so the path of K/V tensors through the cache is fully visible.
+Weights are loaded from a HuggingFace checkpoint by matching parameter names.
+Layout of inputs per step ("varlen" packing):
+  input_ids       [T_total]               concatenated tokens for all seqs
+  positions       [T_total]               position-in-sequence of each token
+  slot_mapping    [T_total]               where to write new K/V in the cache
+  segments        list of AttnSegment(q_start, q_end, block_table, k_len)
+For attention, we loop over `segments`: gather each sequence's full K/V from
+its block table, run SDPA, scatter the result back into a flat buffer.  All
+other ops (norms, MLP, projections) run on the full packed tensor.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .config import EngineConfig
+from .paged_kv import PagedKVCache
+from .request import Sequence
+# ---------------------------------------------------------------------------
+# Qwen2 building blocks
+# ---------------------------------------------------------------------------
+class Qwen2RMSNorm(nn.Module):
+    def __init__(self, hidden_size: int, eps: float = 1e-6) -> None:
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.eps = eps
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # x: [..., hidden]
+        dtype = x.dtype
+        x = x.to(torch.float32)
+        var = x.pow(2).mean(-1, keepdim=True)
+        x = x * torch.rsqrt(var + self.eps)
+        return (self.weight * x).to(dtype)
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    half = x.size(-1) // 2
+    x1, x2 = x[..., :half], x[..., half:]
+    return torch.cat((-x2, x1), dim=-1)
+def _apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
+    """x: [T, H, D], cos/sin: [T, D] → returns [T, H, D]."""
+    cos = cos.unsqueeze(1)
+    sin = sin.unsqueeze(1)
+    return (x * cos) + (_rotate_half(x) * sin)
+class Qwen2MLP(nn.Module):
+    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
+        super().__init__()
+        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
+@dataclass
+class AttnSegment:
+    """One sequence's slice of the packed batch."""
+    q_start: int           # start index in the packed tensor
+    q_end: int             # exclusive
+    block_table: list[int] # KV blocks for this sequence
+    k_len: int             # total K length (= num_computed_tokens + q_len)
+class Qwen2Attention(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        num_kv_heads: int,
+        head_dim: int,
+        layer_idx: int,
+    ) -> None:
+        super().__init__()
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = head_dim
+        self.layer_idx = layer_idx
+        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=True)
+        self.k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=True)
+        self.v_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=True)
+        self.o_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)
+        self.scale = head_dim ** -0.5
+    def forward(
+        self,
+        hidden_states: torch.Tensor,        # [T, hidden]
+        positions: torch.Tensor,            # [T] long
+        slot_mapping: torch.Tensor,         # [T] long
+        cos_table: torch.Tensor,            # [max_pos, head_dim]
+        sin_table: torch.Tensor,            # [max_pos, head_dim]
+        segments: list[AttnSegment],
+        kv_cache: PagedKVCache,
+    ) -> torch.Tensor:
+        T = hidden_states.size(0)
+        q = self.q_proj(hidden_states).view(T, self.num_heads, self.head_dim)
+        k = self.k_proj(hidden_states).view(T, self.num_kv_heads, self.head_dim)
+        v = self.v_proj(hidden_states).view(T, self.num_kv_heads, self.head_dim)
+        cos = cos_table.index_select(0, positions)  # [T, head_dim]
+        sin = sin_table.index_select(0, positions)
+        q = _apply_rope(q, cos, sin)
+        k = _apply_rope(k, cos, sin)
+        # Write the NEW K/V into the paged cache before reading it back.
+        kv_cache.write(self.layer_idx, k, v, slot_mapping)
+        out = torch.empty_like(q)  # [T, num_heads, head_dim]
+        rep = self.num_heads // self.num_kv_heads  # GQA fan-out
+        for seg in segments:
+            q_slice = q[seg.q_start:seg.q_end]                  # [q_len, H_q, D]
+            k_full, v_full = kv_cache.gather(self.layer_idx, seg.block_table, seg.k_len)
+            # GQA: expand K/V heads to match Q heads.
+            if rep > 1:
+                k_full = k_full.repeat_interleave(rep, dim=1)
+                v_full = v_full.repeat_interleave(rep, dim=1)
+            q_len = q_slice.size(0)
+            k_len = seg.k_len
+            num_past = k_len - q_len
+            # Causal mask: Q at logical position (num_past + i) attends to K at
+            # positions [0, num_past + i].  True = participate (SDPA convention).
+            idx_q = torch.arange(q_len, device=q.device).unsqueeze(1) + num_past
+            idx_k = torch.arange(k_len, device=q.device).unsqueeze(0)
+            attn_mask = idx_k <= idx_q  # [q_len, k_len]
+            # SDPA wants [..., heads, q_len, head_dim].  Reshape and run.
+            q_h = q_slice.transpose(0, 1).unsqueeze(0)   # [1, H, q_len, D]
+            k_h = k_full.transpose(0, 1).unsqueeze(0)    # [1, H, k_len, D]
+            v_h = v_full.transpose(0, 1).unsqueeze(0)
+            attn = F.scaled_dot_product_attention(
+                q_h, k_h, v_h,
+                attn_mask=attn_mask.unsqueeze(0).unsqueeze(0),  # [1,1,q_len,k_len]
+                scale=self.scale,
+            )                                            # [1, H, q_len, D]
+            out[seg.q_start:seg.q_end] = attn.squeeze(0).transpose(0, 1)
+        return self.o_proj(out.reshape(T, self.num_heads * self.head_dim))
+class Qwen2DecoderLayer(nn.Module):
+    def __init__(self, cfg: dict, layer_idx: int) -> None:
+        super().__init__()
+        self.input_layernorm = Qwen2RMSNorm(cfg["hidden_size"], eps=cfg["rms_norm_eps"])
+        self.self_attn = Qwen2Attention(
+            hidden_size=cfg["hidden_size"],
+            num_heads=cfg["num_attention_heads"],
+            num_kv_heads=cfg["num_key_value_heads"],
+            head_dim=cfg["head_dim"],
+            layer_idx=layer_idx,
+        )
+        self.post_attention_layernorm = Qwen2RMSNorm(cfg["hidden_size"], eps=cfg["rms_norm_eps"])
+        self.mlp = Qwen2MLP(cfg["hidden_size"], cfg["intermediate_size"])
+    def forward(self, hidden_states, positions, slot_mapping, cos_table, sin_table, segments, kv_cache):
+        residual = hidden_states
+        h = self.input_layernorm(hidden_states)
+        h = self.self_attn(h, positions, slot_mapping, cos_table, sin_table, segments, kv_cache)
+        hidden_states = residual + h
+        residual = hidden_states
+        h = self.post_attention_layernorm(hidden_states)
+        h = self.mlp(h)
+        return residual + h
+class Qwen2Model(nn.Module):
+    def __init__(self, cfg: dict) -> None:
+        super().__init__()
+        self.cfg = cfg
+        self.embed_tokens = nn.Embedding(cfg["vocab_size"], cfg["hidden_size"])
+        self.layers = nn.ModuleList(
+            [Qwen2DecoderLayer(cfg, i) for i in range(cfg["num_hidden_layers"])]
+        )
+        self.norm = Qwen2RMSNorm(cfg["hidden_size"], eps=cfg["rms_norm_eps"])
+    def forward(self, input_ids, positions, slot_mapping, cos_table, sin_table, segments, kv_cache):
+        h = self.embed_tokens(input_ids)
+        for layer in self.layers:
+            h = layer(h, positions, slot_mapping, cos_table, sin_table, segments, kv_cache)
+        return self.norm(h)
+class Qwen2ForCausalLM(nn.Module):
+    def __init__(self, cfg: dict) -> None:
+        super().__init__()
+        self.model = Qwen2Model(cfg)
+        self.lm_head = nn.Linear(cfg["hidden_size"], cfg["vocab_size"], bias=False)
+        self.cfg = cfg
+    def tie_weights(self) -> None:
+        self.lm_head.weight = self.model.embed_tokens.weight
+# ---------------------------------------------------------------------------
+# ModelRunner: prepares inputs, runs forward, extracts last-token logits.
+# ---------------------------------------------------------------------------
+@dataclass
+class ModelInput:
+    input_ids: torch.Tensor
+    positions: torch.Tensor
+    slot_mapping: torch.Tensor
+    segments: list[AttnSegment]
+    # Index in the packed batch of the LAST token of each scheduled seq —
+    # that's where we'll read logits from for sampling.
+    last_token_indices: torch.Tensor
+class ModelRunner:
+    def __init__(self, config: EngineConfig) -> None:
+        from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
+        self.config = config
+        self.device = torch.device(config.device)
+        self.dtype = {
+            "float32": torch.float32,
+            "float16": torch.float16,
+            "bfloat16": torch.bfloat16,
+        }[config.dtype]
+        hf_cfg = AutoConfig.from_pretrained(
+            config.model, trust_remote_code=config.trust_remote_code
+        )
+        self.tokenizer = AutoTokenizer.from_pretrained(
+            config.model, trust_remote_code=config.trust_remote_code
+        )
+        model_type = getattr(hf_cfg, "model_type", "?")
+        if model_type != "qwen2":
+            # Only dense Qwen2 matches this implementation exactly.  Llama
+            # checkpoints typically have no q/k/v biases and qwen2_moe uses
+            # MoE MLP blocks, so their weights would load only partially.
+            # We issue a warning rather than a hard fail.
+            print(f"[tiny_vllm] WARNING: model_type={model_type!r}; expected 'qwen2'. "
+                  "Continuing, but weights may load only partially.")
+        head_dim = getattr(hf_cfg, "head_dim", hf_cfg.hidden_size // hf_cfg.num_attention_heads)
+        cfg = {
+            "vocab_size": hf_cfg.vocab_size,
+            "hidden_size": hf_cfg.hidden_size,
+            "intermediate_size": hf_cfg.intermediate_size,
+            "num_hidden_layers": hf_cfg.num_hidden_layers,
+            "num_attention_heads": hf_cfg.num_attention_heads,
+            "num_key_value_heads": getattr(hf_cfg, "num_key_value_heads",
+                                            hf_cfg.num_attention_heads),
+            "head_dim": head_dim,
+            "rms_norm_eps": getattr(hf_cfg, "rms_norm_eps", 1e-6),
+            "rope_theta": getattr(hf_cfg, "rope_theta", 10000.0),
+            "max_position_embeddings": getattr(hf_cfg, "max_position_embeddings", 4096),
+            "tie_word_embeddings": getattr(hf_cfg, "tie_word_embeddings", False),
+        }
+        self.model_cfg = cfg
+        # Build our own model, then copy HF weights into it.
+        model = Qwen2ForCausalLM(cfg).to(self.device, self.dtype)
+        hf_model = AutoModelForCausalLM.from_pretrained(
+            config.model, torch_dtype=self.dtype,
+            trust_remote_code=config.trust_remote_code,
+        )
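+        missing, unexpected = model.load_state_dict(hf_model.state_dict(), strict=False)
+        if cfg["tie_word_embeddings"] and "lm_head.weight" in missing:
+            model.tie_weights()
+        # With strict=False, any other missing key leaves that parameter at its
+        # random initialisation, so report mismatches instead of ignoring them.
+        not_loaded = [k for k in missing
+                      if not (k == "lm_head.weight" and cfg["tie_word_embeddings"])]
+        if not_loaded or unexpected:
+            print(f"[tiny_vllm] WARNING: {len(not_loaded)} parameter(s) not loaded, "
+                  f"{len(unexpected)} unexpected key(s) in checkpoint.")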
+        del hf_model
+        model.eval()
+        for p in model.parameters():
+            p.requires_grad_(False)
+        self.model = model
+        # Precompute RoPE tables.
+        max_pos = min(cfg["max_position_embeddings"], config.max_model_len)
+        inv_freq = 1.0 / (
+            cfg["rope_theta"]
+            ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
+        )
+        t = torch.arange(max_pos, dtype=torch.float32)
+        freqs = torch.outer(t, inv_freq)        # [max_pos, head_dim/2]
+        emb = torch.cat((freqs, freqs), dim=-1)  # [max_pos, head_dim]
+        self.cos_table = emb.cos().to(self.device, self.dtype)
+        self.sin_table = emb.sin().to(self.device, self.dtype)
+        # Paged KV cache pool.
+        self.kv_cache = PagedKVCache(
+            num_layers=cfg["num_hidden_layers"],
+            num_blocks=config.num_blocks,
+            block_size=config.block_size,
+            num_kv_heads=cfg["num_key_value_heads"],
+            head_dim=head_dim,
+            dtype=self.dtype,
+            device=self.device,
+        )
+    # ---- input building ------------------------------------------------
+    def prepare_input(self, scheduled) -> ModelInput:
+        """`scheduled` is the scheduler's list of items for this step, each
+        exposing `seq`, `num_tokens` and `is_prefill`."""
+        input_ids: list[int] = []
+        positions: list[int] = []
+        slot_mapping: list[int] = []
+        segments: list[AttnSegment] = []
+        last_indices: list[int] = []
+        cursor = 0
+        B = self.config.block_size
+        for item in scheduled:
+            seq = item.seq
+            n = item.num_tokens
+            # Logical token positions this step processes.
+            start_pos = seq.num_computed_tokens
+            for off in range(n):
+                pos = start_pos + off
+                input_ids.append(seq.get_token(pos))
+                positions.append(pos)
+                block_id = seq.block_table[pos // B]
+                slot_mapping.append(block_id * B + (pos % B))
+            q_end = cursor + n
+            segments.append(AttnSegment(
+                q_start=cursor,
+                q_end=q_end,
+                block_table=list(seq.block_table),
+                k_len=start_pos + n,
+            ))
+            last_indices.append(q_end - 1)
+            cursor = q_end
+        return ModelInput(
+            input_ids=torch.tensor(input_ids, dtype=torch.long, device=self.device),
+            positions=torch.tensor(positions, dtype=torch.long, device=self.device),
+            slot_mapping=torch.tensor(slot_mapping, dtype=torch.long, device=self.device),
+            segments=segments,
+            last_token_indices=torch.tensor(last_indices, dtype=torch.long, device=self.device),
+        )
+    # ---- forward -------------------------------------------------------
+    @torch.inference_mode()
+    def execute(self, model_input: ModelInput) -> torch.Tensor:
+        """Run one forward pass.  Returns logits for the LAST token of each
+        scheduled sequence: shape [num_seqs, vocab_size]."""
+        hidden = self.model.model(
+            input_ids=model_input.input_ids,
+            positions=model_input.positions,
+            slot_mapping=model_input.slot_mapping,
+            cos_table=self.cos_table,
+            sin_table=self.sin_table,
+            segments=model_input.segments,
+            kv_cache=self.kv_cache,
+        )                                                # [T, hidden]
+        last_hidden = hidden.index_select(0, model_input.last_token_indices)
+        logits = self.model.lm_head(last_hidden)         # [num_seqs, vocab]
+        return logits
+    # ---- helpers -------------------------------------------------------
+    @property
+    def eos_token_id(self) -> Optional[int]:
+        return self.tokenizer.eos_token_id
+    def encode(self, text: str) -> list[int]:
+        return self.tokenizer.encode(text, add_special_tokens=False)
+    def decode(self, token_ids: list[int]) -> str:
+        return self.tokenizer.decode(token_ids, skip_special_tokens=True)
+    def detokenize_incremental(self, full_ids: list[int], prev_text_len: int) -> tuple[str, int]:
+        """Detokenize the full list, return the new text added since last call
+        and the new total length."""
+        text = self.tokenizer.decode(full_ids, skip_special_tokens=True)
+        return text[prev_text_len:], len(text)
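
The slot arithmetic used when building the model input can be checked on its own. The sketch below is a standalone illustration, not part of the repository: `slot_mapping_for` is a hypothetical helper name, and it assumes the same convention as the code above, `slot = block_id * block_size + (pos % block_size)`.

```python
def slot_mapping_for(block_table, start_pos, n, block_size):
    """Map logical positions [start_pos, start_pos + n) to flat physical slots.

    Hypothetical helper for illustration; mirrors the inner loop above.
    """
    slots = []
    for pos in range(start_pos, start_pos + n):
        block_id = block_table[pos // block_size]  # which physical block
        slots.append(block_id * block_size + pos % block_size)
    return slots


# block_size=4 and block_table=[7, 2]: logical positions 0..3 live in
# physical block 7, positions 4..7 in physical block 2.
slots = slot_mapping_for([7, 2], start_pos=2, n=4, block_size=4)
print(slots)  # [30, 31, 8, 9]
```

Logically adjacent positions 3 and 4 land in non-adjacent slots (31 and 8), which is why the block table has to travel with every sequence.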

tiny_vllm/paged_kv.py ADDED

	@@ -0,0 +1,70 @@

+"""The actual KV tensor pool that the BlockManager indexes into.
+We store one ``[num_blocks, block_size, num_kv_heads, head_dim]`` tensor per
+layer for K and V.  The block_manager owns the *allocation* of block ids; this
+class owns the *bytes*.  Reads and writes happen by (block_id, offset).
+"""
+from __future__ import annotations
+import torch
+class PagedKVCache:
+    def __init__(
+        self,
+        num_layers: int,
+        num_blocks: int,
+        block_size: int,
+        num_kv_heads: int,
+        head_dim: int,
+        dtype: torch.dtype,
+        device: torch.device,
+    ) -> None:
+        self.num_layers = num_layers
+        self.num_blocks = num_blocks
+        self.block_size = block_size
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = head_dim
+        self.dtype = dtype
+        self.device = device
+        shape = (num_blocks, block_size, num_kv_heads, head_dim)
+        self.k_cache = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]
+        self.v_cache = [torch.zeros(shape, dtype=dtype, device=device) for _ in range(num_layers)]
+    def write(
+        self,
+        layer_id: int,
+        k: torch.Tensor,           # [T, num_kv_heads, head_dim]
+        v: torch.Tensor,           # [T, num_kv_heads, head_dim]
+        slot_mapping: torch.Tensor # [T] int64, slot_id = block_id*block_size + offset
+    ) -> None:
+        block_ids = (slot_mapping // self.block_size).long()
+        offsets = (slot_mapping % self.block_size).long()
+        self.k_cache[layer_id][block_ids, offsets] = k.to(self.dtype)
+        self.v_cache[layer_id][block_ids, offsets] = v.to(self.dtype)
+    def gather(
+        self,
+        layer_id: int,
+        block_table: list[int],
+        num_tokens: int,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Return contiguous [num_tokens, num_kv_heads, head_dim] K and V
+        for one sequence, by walking its block table."""
+        if num_tokens == 0:
+            empty = torch.zeros(
+                0, self.num_kv_heads, self.head_dim,
+                dtype=self.dtype, device=self.device,
+            )
+            return empty, empty.clone()
+        num_full = num_tokens // self.block_size
+        tail = num_tokens % self.block_size
+        idxs = block_table[:num_full + (1 if tail else 0)]
+        idx_tensor = torch.as_tensor(idxs, dtype=torch.long, device=self.device)
+        # [P, block_size, H, D]
+        k_blocks = self.k_cache[layer_id].index_select(0, idx_tensor)
+        v_blocks = self.v_cache[layer_id].index_select(0, idx_tensor)
+        # Flatten the first two dims then trim.
+        k_flat = k_blocks.reshape(-1, self.num_kv_heads, self.head_dim)
+        v_flat = v_blocks.reshape(-1, self.num_kv_heads, self.head_dim)
+        return k_flat[:num_tokens], v_flat[:num_tokens]
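
A write followed by a gather should return the tokens in logical order even when the blocks are scattered across the pool. The toy below shows that round trip with plain Python lists instead of tensors. `ToyPagedPool` is a hypothetical stand-in for illustration only; it keeps a single value per slot rather than per-head K and V tensors.

```python
class ToyPagedPool:
    """List-based stand-in for a paged KV pool (illustration only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [[None] * block_size for _ in range(num_blocks)]

    def write(self, values, slot_mapping):
        # slot = block_id * block_size + offset, as in the cache above.
        for value, slot in zip(values, slot_mapping):
            self.blocks[slot // self.block_size][slot % self.block_size] = value

    def gather(self, block_table, num_tokens):
        # Take just enough blocks to cover num_tokens, flatten, then trim.
        needed = -(-num_tokens // self.block_size)  # ceiling division
        flat = [v for b in block_table[:needed] for v in self.blocks[b]]
        return flat[:num_tokens]


pool = ToyPagedPool(num_blocks=4, block_size=2)
block_table = [3, 1]   # logical block 0 -> physical 3, logical 1 -> physical 1
slots = [6, 7, 2]      # slots for logical positions 0, 1, 2
pool.write(["k0", "k1", "k2"], slots)
restored = pool.gather(block_table, num_tokens=3)
print(restored)  # ['k0', 'k1', 'k2']
```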

tiny_vllm/request.py ADDED

	@@ -0,0 +1,86 @@

+from __future__ import annotations
+import enum
+import itertools
+import time
+from dataclasses import dataclass, field
+from typing import Optional
+from .config import SamplingParams
+class SequenceStatus(enum.Enum):
+    WAITING = "waiting"        # not yet started prefill
+    PREFILLING = "prefilling"  # chunked prefill in progress
+    RUNNING = "running"        # in decode loop
+    FINISHED = "finished"
+    PREEMPTED = "preempted"    # evicted; will restart prefill when capacity returns
+_seq_counter = itertools.count()
+def _next_seq_id() -> int:
+    return next(_seq_counter)
+@dataclass
+class Sequence:
+    """One in-flight request.
+    The token sequence is `prompt_token_ids + output_token_ids`.
+    `num_computed_tokens` tracks how many tokens already have their KV
+    materialized in the paged cache.  Anything past that boundary is either
+    waiting prefill (during PREFILLING) or the next token to sample (RUNNING).
+    """
+    prompt_token_ids: list[int]
+    sampling_params: SamplingParams
+    request_id: str
+    arrival_time: float = field(default_factory=time.monotonic)
+    seq_id: int = field(default_factory=_next_seq_id)
+    output_token_ids: list[int] = field(default_factory=list)
+    status: SequenceStatus = SequenceStatus.WAITING
+    # Paged KV bookkeeping (filled in by the BlockManager).
+    block_table: list[int] = field(default_factory=list)
+    num_computed_tokens: int = 0          # tokens with KV in the cache
+    num_cached_prefix_tokens: int = 0     # tokens served from prefix cache hits
+    # Outputs / streaming
+    finish_reason: Optional[str] = None
+    # ---- helpers --------------------------------------------------------
+    @property
+    def prompt_len(self) -> int:
+        return len(self.prompt_token_ids)
+    @property
+    def total_len(self) -> int:
+        return len(self.prompt_token_ids) + len(self.output_token_ids)
+    def all_token_ids(self) -> list[int]:
+        return self.prompt_token_ids + self.output_token_ids
+    def get_token(self, position: int) -> int:
+        if position < len(self.prompt_token_ids):
+            return self.prompt_token_ids[position]
+        return self.output_token_ids[position - len(self.prompt_token_ids)]
+    @property
+    def num_uncomputed_prompt_tokens(self) -> int:
+        return max(0, self.prompt_len - self.num_computed_tokens)
+    def append_output_token(self, token_id: int) -> None:
+        self.output_token_ids.append(token_id)
+@dataclass
+class Request:
+    """A user-submitted request before it becomes a Sequence."""
+    request_id: str
+    prompt_token_ids: list[int]
+    sampling_params: SamplingParams

tiny_vllm/sampler.py ADDED

	@@ -0,0 +1,53 @@

+"""Per-request sampling.  Temperature, top-p, top-k, greedy."""
+from __future__ import annotations
+from typing import Optional
+import torch
+from .config import SamplingParams
+class Sampler:
+    def __init__(self, device: torch.device) -> None:
+        self.device = device
+    def sample(
+        self,
+        logits: torch.Tensor,                 # [num_seqs, vocab]
+        params: list[SamplingParams],
+        generators: Optional[list[Optional[torch.Generator]]] = None,
+    ) -> list[int]:
+        out: list[int] = []
+        for i, p in enumerate(params):
+            row = logits[i]
+            if p.is_greedy:
+                out.append(int(row.argmax().item()))
+                continue
+            # Temperature.
+            row = row / max(p.temperature, 1e-5)
+            # Top-k.
+            if p.top_k > 0 and p.top_k < row.size(-1):
+                topk_vals, _ = torch.topk(row, p.top_k)
+                row = torch.where(row < topk_vals[-1], torch.full_like(row, float("-inf")), row)
+            # Top-p (nucleus).
+            if 0.0 < p.top_p < 1.0:
+                sorted_logits, sorted_idx = torch.sort(row, descending=True)
+                probs = torch.softmax(sorted_logits, dim=-1)
+                cumprobs = probs.cumsum(dim=-1)
+                # Drop every token that comes after the point where the cumulative
+                # probability first exceeds top_p.  Shifting the mask right by one
+                # keeps the boundary token; the top token is always kept.
+                drop = cumprobs > p.top_p
+                drop = drop.roll(shifts=1, dims=0)  # so the boundary token stays
+                drop[0] = False                     # roll wraps the last entry to index 0
+                sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
+                row = torch.full_like(row, float("-inf"))
+                row.scatter_(0, sorted_idx, sorted_logits)
+            probs = torch.softmax(row, dim=-1)
+            gen = generators[i] if generators else None
+            token = torch.multinomial(probs, num_samples=1, generator=gen)
+            out.append(int(token.item()))
+        return out
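
The nucleus (top-p) rule is easier to check on a plain list of probabilities than on logits. The sketch below is a hedged, pure-Python illustration rather than the sampler's own code: `nucleus_keep` is a hypothetical name, and it assumes the rule "keep tokens in descending probability order until the running mass first exceeds `top_p`, including the token that crosses the threshold".

```python
def nucleus_keep(probs, top_p):
    """Return the indices kept by top-p filtering (illustrative helper)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)          # the boundary token is kept as well
        mass += probs[i]
        if mass > top_p:        # stop once the nucleus covers top_p
            break
    return kept


print(nucleus_keep([0.5, 0.3, 0.2], top_p=0.7))     # [0, 1]
print(nucleus_keep([0.95, 0.03, 0.02], top_p=0.9))  # [0]
```

The second call is the edge case worth testing in any implementation: when the top token alone exceeds `top_p`, only that token should survive.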

tiny_vllm/scheduler.py ADDED

	@@ -0,0 +1,223 @@

+"""Continuous-batching scheduler with chunked prefill.
+A scheduling step produces a SchedulerOutput listing which sequences run and
+how many tokens each one advances.  Two phases each step:
+  1. Decodes.  Every RUNNING sequence wants exactly one new token; we must
+     ensure each has space for it.  If a sequence needs a new block and the
+     pool is dry, we *preempt* the most recently admitted running sequence —
+     free its KV blocks and push it back to the front of the waiting queue
+     so it restarts prefill later (recompute-style preemption, as in vLLM).
+  2. Prefill chunks.  With remaining token budget, pull from `waiting`.  A
+     newly waiting sequence is admitted (prompt blocks allocated via the
+     block manager, with prefix-cache hits taken).  Then we plan a chunk of
+     up to `min(remaining_prefill, budget)` tokens.  Chunked prefill lets a
+     long prompt share the budget with concurrent decodes instead of
+     stalling them.
+"""
+from __future__ import annotations
+from collections import deque
+from dataclasses import dataclass, field
+from .block_manager import BlockManager
+from .config import EngineConfig
+from .request import Sequence, SequenceStatus
+@dataclass
+class ScheduledSeq:
+    seq: Sequence
+    num_tokens: int     # how many tokens to forward this step for this seq
+    is_prefill: bool
+@dataclass
+class SchedulerOutput:
+    scheduled: list[ScheduledSeq] = field(default_factory=list)
+    preempted: list[int] = field(default_factory=list)  # seq_ids preempted
+    newly_admitted: list[int] = field(default_factory=list)
+    total_tokens: int = 0
+    @property
+    def is_empty(self) -> bool:
+        return not self.scheduled
+class Scheduler:
+    def __init__(self, config: EngineConfig, block_manager: BlockManager) -> None:
+        self.config = config
+        self.block_manager = block_manager
+        self.waiting: deque[Sequence] = deque()
+        self.running: list[Sequence] = []
+        # Tracks order of admission so preemption picks the youngest first.
+        self._admission_order: list[int] = []
+    # ---- queue ops ------------------------------------------------------
+    def add(self, seq: Sequence) -> None:
+        self.waiting.append(seq)
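+    def abort(self, seq_id: int) -> bool:
+        for s in list(self.waiting):
+            if s.seq_id == seq_id:
+                self.waiting.remove(s)
+                # A partially prefilled sequence waits with blocks already
+                # allocated; release them so they return to the pool.
+                if s.block_table:
+                    self.block_manager.free(s)
+                s.status = SequenceStatus.FINISHED
+                s.finish_reason = "abort"
+                return True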
+        for s in list(self.running):
+            if s.seq_id == seq_id:
+                self.running.remove(s)
+                self.block_manager.free(s)
+                s.status = SequenceStatus.FINISHED
+                s.finish_reason = "abort"
+                return True
+        return False
+    @property
+    def has_work(self) -> bool:
+        return bool(self.waiting) or bool(self.running)
+    # ---- scheduling -----------------------------------------------------
+    def _preempt_one(self) -> Sequence | None:
+        """Free the youngest running sequence and re-enqueue it for restart."""
+        if not self.running:
+            return None
+        victim = self.running.pop()  # youngest by insertion order
+        self.block_manager.free(victim)
+        # Restart: forget computed-token progress; keep generated outputs so
+        # the user-visible sequence is preserved.  (vLLM full-recompute: we'd
+        # discard outputs too; we keep them so streaming makes sense.)
+        victim.num_computed_tokens = 0
+        victim.num_cached_prefix_tokens = 0
+        victim.status = SequenceStatus.PREEMPTED
+        self.waiting.appendleft(victim)
+        return victim
+    def schedule(self) -> SchedulerOutput:
+        out = SchedulerOutput()
+        budget = self.config.max_num_batched_tokens
+        # --- Phase 1: decodes for already-running sequences ---
+        for seq in list(self.running):
+            if seq.status != SequenceStatus.RUNNING:
+                continue
+            if budget <= 0:
+                break
+            # Ensure space for one more token.
+            try:
+                self.block_manager.append_slot(seq)
+            except RuntimeError:
+                # Out of blocks: try to free space by preempting the youngest
+                # running sequence — which may be `seq` itself.
+                victim = self._preempt_one()
+                if victim is seq:
+                    # We preempted ourselves; it's already off `running`.
+                    out.preempted.append(seq.seq_id)
+                    continue
+                if victim is None:
+                    # Nothing to preempt; preempt this seq manually.
+                    self.running.remove(seq)
+                    self.block_manager.free(seq)
+                    seq.num_computed_tokens = 0
+                    seq.num_cached_prefix_tokens = 0
+                    seq.status = SequenceStatus.PREEMPTED
+                    self.waiting.appendleft(seq)
+                    out.preempted.append(seq.seq_id)
+                    continue
+                out.preempted.append(victim.seq_id)
+                try:
+                    self.block_manager.append_slot(seq)
+                except RuntimeError:
+                    # Still no room — give up on this seq this step.
+                    continue
+            out.scheduled.append(ScheduledSeq(seq=seq, num_tokens=1, is_prefill=False))
+            budget -= 1
+            out.total_tokens += 1
+        # --- Phase 2: prefill chunks (admitting new sequences as needed) ---
+        max_concurrent = self.config.max_num_seqs
+        active_count = sum(1 for s in self.running if s.status != SequenceStatus.FINISHED)
+        while self.waiting and budget > 0 and active_count < max_concurrent:
+            seq = self.waiting[0]
+            # Admit if needed.
+            if not seq.block_table:
+                ok, _ = self.block_manager.can_allocate_initial(seq)
+                if not ok:
+                    # Try to free up space by preempting the youngest running
+                    # seq.  If nothing to preempt, we're stuck for this step.
+                    if not self.running:
+                        break
+                    victim = self._preempt_one()
+                    if victim is None:
+                        break
+                    out.preempted.append(victim.seq_id)
+                    continue
+                self.block_manager.admit(seq)
+                out.newly_admitted.append(seq.seq_id)
+                seq.status = SequenceStatus.PREFILLING
+            # Plan a chunk.
+            remaining = seq.num_uncomputed_prompt_tokens
+            chunk = min(remaining, budget)
+            if chunk <= 0:
+                # Prompt already fully cached (shouldn't happen due to admit
+                # capping, but defensive): move straight to RUNNING.
+                self.waiting.popleft()
+                seq.status = SequenceStatus.RUNNING
+                self.running.append(seq)
+                active_count += 1
+                continue
+            # Make sure block_table covers num_computed + chunk.
+            try:
+                self.block_manager.ensure_blocks_for_chunk(seq, chunk)
+            except RuntimeError:
+                # Couldn't expand.  Try preemption; otherwise give up.
+                if self.running:
+                    victim = self._preempt_one()
+                    if victim is not None:
+                        out.preempted.append(victim.seq_id)
+                        continue
+                break
+            out.scheduled.append(ScheduledSeq(seq=seq, num_tokens=chunk, is_prefill=True))
+            budget -= chunk
+            out.total_tokens += chunk
+            if chunk == remaining:
+                # This step finishes prompt ingestion → seq becomes RUNNING.
+                self.waiting.popleft()
+                seq.status = SequenceStatus.RUNNING
+                self.running.append(seq)
+                active_count += 1
+            else:
+                # Still has more prompt to chew through; leave at head of
+                # waiting queue with a partial block_table.
+                break  # one prefill per step keeps things tidy
+        return out
+    # ---- post-step ------------------------------------------------------
+    def finalize_step(self, scheduled: list[ScheduledSeq]) -> list[Sequence]:
+        """Called after the model has produced new tokens.
+        Returns the list of sequences that just finished this step (so the
+        engine can free them and ship the final output to the caller).
+        """
+        finished: list[Sequence] = []
+        for item in scheduled:
+            seq = item.seq
+            self.block_manager.register_filled_blocks(seq, prev_computed=0)
+            if seq.status == SequenceStatus.FINISHED:
+                if seq in self.running:
+                    self.running.remove(seq)
+                self.block_manager.free(seq)
+                finished.append(seq)
+        return finished
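
The budget split between decodes and a chunked prefill can be sketched without any block management. This is a simplified model under stated assumptions: every running sequence gets its one decode token (no preemption), only one prompt is being prefilled, and `plan_step` is a hypothetical helper, not a function in this repository.

```python
def plan_step(num_decodes, prefill_remaining, budget):
    """Split one step's token budget: decodes first, then a prefill chunk."""
    decode_tokens = min(num_decodes, budget)  # one token per running sequence
    budget -= decode_tokens
    chunk = min(prefill_remaining, budget)    # the chunk takes what is left
    return decode_tokens, chunk


# A 1000-token prompt arrives while 3 sequences are decoding, with a
# budget of 512 tokens per step.
steps = []
remaining = 1000
while remaining > 0:
    decodes, chunk = plan_step(3, remaining, budget=512)
    if chunk == 0:   # budget exhausted by decodes; avoid spinning forever
        break
    steps.append((decodes, chunk))
    remaining -= chunk

print(steps)  # [(3, 509), (3, 491)]
```

The three decodes advance in both steps, so the long prompt costs them no stalls; without chunking, the prompt would not fit in a single 512-token step at all.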

tiny_vllm/server.py ADDED

	@@ -0,0 +1,307 @@

+"""FastAPI front-end.
+Endpoints (the first three can stream over SSE):
+  POST /generate         — submit a prompt, stream back token deltas
+  POST /v1/completions   — OpenAI-compatible streaming completions
+  GET  /engine/events    — stream of engine-state snapshots (one per step)
+                            — what the demo page subscribes to
+  GET  /engine/snapshot  — one-shot current state (JSON)
+  GET  /                 — static demo page
+The demo page subscribes to /engine/events and renders the block pool,
+scheduler queues, and live token streams.
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import os
+import time
+from pathlib import Path
+from typing import AsyncIterator, Optional
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.responses import FileResponse, JSONResponse, StreamingResponse
+from fastapi.staticfiles import StaticFiles
+from pydantic import BaseModel, Field
+from .config import EngineConfig, SamplingParams
+from .engine import LLMEngine
+# ---------------------------------------------------------------------------
+# Schemas
+# ---------------------------------------------------------------------------
+class GenerateRequest(BaseModel):
+    prompt: str
+    max_tokens: int = 64
+    temperature: float = 1.0
+    top_p: float = 1.0
+    top_k: int = -1
+    seed: Optional[int] = None
+    ignore_eos: bool = False
+    stream: bool = True
+class CompletionsRequest(BaseModel):
+    model: Optional[str] = None
+    prompt: str | list[str]
+    max_tokens: int = 64
+    temperature: float = 1.0
+    top_p: float = 1.0
+    n: int = 1
+    stream: bool = False
+    stop: Optional[list[str]] = None
+    seed: Optional[int] = None
+# ---------------------------------------------------------------------------
+# App factory
+# ---------------------------------------------------------------------------
+def _sse(data: dict | str) -> bytes:
+    if isinstance(data, dict):
+        data = json.dumps(data, separators=(",", ":"))
+    return f"data: {data}\n\n".encode("utf-8")
+def build_app(config: EngineConfig) -> FastAPI:
+    app = FastAPI(title="tiny_vllm", version="0.1.0")
+    engine = LLMEngine(config)
+    @app.on_event("startup")
+    async def _on_startup() -> None:
+        await engine.startup()
+    @app.on_event("shutdown")
+    async def _on_shutdown() -> None:
+        await engine.shutdown()
+    # ---- root + static -------------------------------------------------
+    static_dir = Path(__file__).parent.parent / "web"
+    if static_dir.exists():
+        app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")
+        @app.get("/")
+        async def root() -> FileResponse:
+            return FileResponse(str(static_dir / "index.html"))
+    else:
+        @app.get("/")
+        async def root() -> dict:
+            return {"name": "tiny_vllm", "status": "ok",
+                    "hint": "demo page not found; POST to /generate"}
+    # ---- introspection -------------------------------------------------
+    @app.get("/engine/snapshot")
+    async def snapshot() -> dict:
+        return engine.snapshot()
+    @app.get("/engine/events")
+    async def events(request: Request) -> StreamingResponse:
+        q = engine.subscribe_events()
+        async def gen() -> AsyncIterator[bytes]:
+            # Push initial snapshot so a freshly-connected client has state.
+            yield _sse({"type": "snapshot", "payload": engine.snapshot()})
+            try:
+                while True:
+                    if await request.is_disconnected():
+                        break
+                    try:
+                        ev = await asyncio.wait_for(q.get(), timeout=15.0)
+                    except asyncio.TimeoutError:
+                        yield b": keepalive\n\n"
+                        continue
+                    yield _sse({
+                        "type": ev.type,
+                        "step": ev.step,
+                        "timestamp": ev.timestamp,
+                        "payload": ev.payload,
+                    })
+            finally:
+                engine.unsubscribe_events(q)
+        return StreamingResponse(gen(), media_type="text/event-stream")
+    # ---- generation ----------------------------------------------------
+    def _params(req: GenerateRequest) -> SamplingParams:
+        return SamplingParams(
+            max_tokens=req.max_tokens,
+            temperature=req.temperature,
+            top_p=req.top_p,
+            top_k=req.top_k,
+            seed=req.seed,
+            ignore_eos=req.ignore_eos,
+        )
+    @app.post("/generate")
+    async def generate(req: GenerateRequest, request: Request) -> StreamingResponse | JSONResponse:
+        try:
+            rid = engine.add_request(req.prompt, _params(req))
+        except ValueError as e:
+            raise HTTPException(status_code=400, detail=str(e))
+        if not req.stream:
+            text_parts: list[str] = []
+            finish_reason: Optional[str] = None
+            async for item in engine.stream(rid):
+                text_parts.append(item.new_text)
+                if item.finished:
+                    finish_reason = item.finish_reason
+                    break
+            return JSONResponse({
+                "request_id": rid,
+                "text": "".join(text_parts),
+                "finish_reason": finish_reason,
+            })
+        async def gen() -> AsyncIterator[bytes]:
+            try:
+                async for item in engine.stream(rid):
+                    if await request.is_disconnected():
+                        engine.abort(rid)
+                        break
+                    yield _sse({
+                        "request_id": rid,
+                        "text": item.new_text,
+                        "finished": item.finished,
+                        "finish_reason": item.finish_reason,
+                    })
+                    if item.finished:
+                        yield b"data: [DONE]\n\n"
+                        break
+            except asyncio.CancelledError:
+                engine.abort(rid)
+                raise
+        return StreamingResponse(gen(), media_type="text/event-stream")
+    @app.post("/v1/completions")
+    async def completions(req: CompletionsRequest, request: Request):
+        # Single-prompt only (n=1) for the minimal impl.
+        if req.n != 1:
+            raise HTTPException(400, "tiny_vllm only supports n=1")
+        if isinstance(req.prompt, list):
+            if len(req.prompt) != 1:
+                raise HTTPException(400, "tiny_vllm only supports a single prompt per call")
+            prompt = req.prompt[0]
+        else:
+            prompt = req.prompt
+        try:
+            rid = engine.add_request(
+                prompt,
+                SamplingParams(
+                    max_tokens=req.max_tokens,
+                    temperature=req.temperature,
+                    top_p=req.top_p,
+                    seed=req.seed,
+                ),
+            )
+        except ValueError as e:
+            raise HTTPException(400, str(e))
+        created = int(time.time())
+        model_id = req.model or config.model
+        if not req.stream:
+            text_parts: list[str] = []
+            finish_reason: Optional[str] = None
+            async for item in engine.stream(rid):
+                text_parts.append(item.new_text)
+                if item.finished:
+                    finish_reason = item.finish_reason
+                    break
+            return JSONResponse({
+                "id": f"cmpl-{rid}",
+                "object": "text_completion",
+                "created": created,
+                "model": model_id,
+                "choices": [{
+                    "text": "".join(text_parts),
+                    "index": 0,
+                    "logprobs": None,
+                    "finish_reason": finish_reason,
+                }],
+            })
+        async def gen() -> AsyncIterator[bytes]:
+            try:
+                async for item in engine.stream(rid):
+                    if await request.is_disconnected():
+                        engine.abort(rid)
+                        break
+                    chunk = {
+                        "id": f"cmpl-{rid}",
+                        "object": "text_completion",
+                        "created": created,
+                        "model": model_id,
+                        "choices": [{
+                            "text": item.new_text,
+                            "index": 0,
+                            "logprobs": None,
+                            "finish_reason": item.finish_reason if item.finished else None,
+                        }],
+                    }
+                    yield _sse(chunk)
+                    if item.finished:
+                        yield b"data: [DONE]\n\n"
+                        break
+            except asyncio.CancelledError:
+                engine.abort(rid)
+                raise
+        return StreamingResponse(gen(), media_type="text/event-stream")
+    @app.post("/abort/{request_id}")
+    async def abort(request_id: str) -> dict:
+        ok = engine.abort(request_id)
+        return {"aborted": ok}
+    return app
+# ---------------------------------------------------------------------------
+# CLI entry
+# ---------------------------------------------------------------------------
+def main() -> None:
+    parser = argparse.ArgumentParser(description="tiny_vllm server")
+    parser.add_argument("--model", default=os.environ.get("TINY_VLLM_MODEL", "Qwen/Qwen2.5-0.5B-Instruct"))
+    parser.add_argument("--device", default=os.environ.get("TINY_VLLM_DEVICE", "cpu"))
+    parser.add_argument("--dtype", default=os.environ.get("TINY_VLLM_DTYPE", "float32"))
+    parser.add_argument("--block-size", type=int, default=16)
+    parser.add_argument("--num-blocks", type=int, default=256)
+    parser.add_argument("--max-num-seqs", type=int, default=8)
+    parser.add_argument("--max-num-batched-tokens", type=int, default=512)
+    parser.add_argument("--max-model-len", type=int, default=2048)
+    parser.add_argument("--disable-prefix-caching", action="store_true")
+    parser.add_argument("--host", default="0.0.0.0")
+    parser.add_argument("--port", type=int, default=8000)
+    args = parser.parse_args()
+    cfg = EngineConfig(
+        model=args.model,
+        device=args.device,
+        dtype=args.dtype,
+        block_size=args.block_size,
+        num_blocks=args.num_blocks,
+        max_num_seqs=args.max_num_seqs,
+        max_num_batched_tokens=args.max_num_batched_tokens,
+        max_model_len=args.max_model_len,
+        enable_prefix_caching=not args.disable_prefix_caching,
+    )
+    import uvicorn
+    app = build_app(cfg)
+    uvicorn.run(app, host=args.host, port=args.port, log_level="info")
+if __name__ == "__main__":
+    main()
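
The wire format produced by `_sse` is simple enough to parse by hand. The sketch below re-creates the framing and a matching reader, assuming the stream layout used above: `data: <json>` frames separated by blank lines, `: keepalive` comment frames, and a final `data: [DONE]` sentinel. `parse_sse` is a hypothetical helper that works on an already-buffered byte string; a real client would read incrementally from the HTTP response.

```python
import json


def sse(data):
    """Frame a payload as one server-sent event (same shape as _sse above)."""
    if isinstance(data, dict):
        data = json.dumps(data, separators=(",", ":"))
    return f"data: {data}\n\n".encode("utf-8")


def parse_sse(raw):
    """Yield decoded JSON payloads; skip comment frames, stop at [DONE]."""
    for frame in raw.decode("utf-8").split("\n\n"):
        if not frame.startswith("data: "):
            continue  # keepalive comments and the trailing empty frame
        body = frame[len("data: "):]
        if body == "[DONE]":
            return
        yield json.loads(body)


wire = (
    sse({"text": "Hel", "finished": False})
    + b": keepalive\n\n"
    + sse({"text": "lo", "finished": True})
    + b"data: [DONE]\n\n"
)
text = "".join(event["text"] for event in parse_sse(wire))
print(text)  # Hello
```

Splitting on blank lines is safe here because `json.dumps` escapes newlines inside strings, so a payload can never contain a raw frame separator.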

web/app.js ADDED

	@@ -0,0 +1,272 @@

+/* tiny_vllm — demo page client.
+ *
+ * Two streams in play:
+ *
+ *   /engine/events    — engine state snapshots (one per scheduling step)
+ *   /generate         — token-level deltas for whatever prompt this page sent
+ *
+ * The page itself is stateless; everything is driven by what comes off the
+ * event stream.  Token deltas from /generate are merged into per-request UI.
+ */
+const $ = (id) => document.getElementById(id);
+const ui = {
+  connection: $("connection"),
+  model: $("model"),
+  pool: $("block-pool"),
+  poolSummary: $("pool-summary"),
+  schedStep: $("sched-step"),
+  statTokens: $("stat-tokens"),
+  statPfDec: $("stat-pfdec"),
+  statMs: $("stat-ms"),
+  statCache: $("stat-cache"),
+  statFree: $("stat-free"),
+  statPre: $("stat-pre"),
+  log: $("log"),
+  seqs: $("seqs"),
+  send: $("send"),
+  sendTwice: $("send-twice"),
+};
+const state = {
+  poolEls: [],
+  numBlocks: 0,
+  blockSize: 16,
+  preempted: 0,
+  // request_id -> { promptText, generated, finished, finishReason }
+  requests: new Map(),
+  // seq_id -> { request_id, blockTable, cachedPrefixBlocks, status, ... }
+  seqsBySeqId: new Map(),
+};
+function logLine(html, cls = "") {
+  const t = new Date().toLocaleTimeString();
+  ui.log.innerHTML += `<span class="${cls}">[${t}] ${html}</span>\n`;
+  ui.log.scrollTop = ui.log.scrollHeight;
+}
+function initPool(numBlocks) {
+  if (state.numBlocks === numBlocks && state.poolEls.length === numBlocks) return;
+  state.numBlocks = numBlocks;
+  ui.pool.innerHTML = "";
+  state.poolEls = [];
+  for (let i = 0; i < numBlocks; i++) {
+    const el = document.createElement("div");
+    el.className = "block free";
+    el.title = `block ${i}`;
+    ui.pool.appendChild(el);
+    state.poolEls.push(el);
+  }
+}
+function renderPool(pool) {
+  initPool(pool.num_blocks);
+  state.blockSize = pool.block_size;
+  for (let i = 0; i < pool.num_blocks; i++) {
+    const el = state.poolEls[i];
+    const rc = pool.ref_counts[i];
+    const hashed = pool.hashed[i];
+    let cls = "block";
+    if (rc === 0) {
+      cls += hashed ? " cached" : " free";
+    } else if (rc === 1) {
+      cls += " used";
+    } else {
+      cls += " shared";
+    }
+    if (hashed) cls += " hashed";
+    el.className = cls;
+    el.title = `block ${i} — refcount=${rc}${hashed ? " — hashed (cacheable)" : ""}`;
+  }
+  ui.poolSummary.textContent =
+    `${pool.num_blocks - pool.num_free_blocks}/${pool.num_blocks} used · ` +
+    `${pool.num_cached_entries} cached entries · ` +
+    `prefix-cache ${pool.prefix_cache_hits}/${pool.prefix_cache_lookups}`;
+  ui.statFree.textContent = pool.num_free_blocks;
+  if (pool.prefix_cache_lookups > 0) {
+    const pct = (100 * pool.prefix_cache_hits / pool.prefix_cache_lookups).toFixed(0);
+    ui.statCache.textContent = `${pct}%`;
+  } else {
+    ui.statCache.textContent = "—";
+  }
+}
+function renderSeqs(snapshot) {
+  ui.schedStep.textContent = ` — step ${snapshot.step}`;
+  const all = [...snapshot.running, ...snapshot.waiting];
+  // index for later token-delta merges
+  state.seqsBySeqId = new Map(all.map(s => [s.seq_id, s]));
+  ui.seqs.innerHTML = "";
+  if (all.length === 0) {
+    ui.seqs.innerHTML = `<div class="muted">(no active sequences — send a prompt above)</div>`;
+    return;
+  }
+  for (const s of all) {
+    const reqRec = state.requests.get(s.request_id);
+    const promptText = reqRec?.promptText ?? "(prompt elided)";
+    const gen = reqRec?.generated ?? "";
+    const div = document.createElement("div");
+    div.className = "seq";
+    div.id = `seq-${s.request_id}`;
+    const cachedBlocks = Math.floor(s.num_cached_prefix_tokens / state.blockSize);
+    const blocksHTML = s.block_table.map((bid, i) => {
+      const klass = i < cachedBlocks ? "seq-block cached-hit"
+                  : (snapshot.block_pool.ref_counts[bid] > 1 ? "seq-block shared" : "seq-block");
+      return `<div class="${klass}" title="block ${bid}${i < cachedBlocks ? ' (prefix-cache hit)' : ''}">${bid}</div>`;
+    }).join("");
+    div.innerHTML = `
+      <div class="seq-header">
+        <span class="seq-id">req=${s.request_id.slice(0, 8)} seq=${s.seq_id}</span>
+        <span class="seq-status ${s.status}">${s.status}</span>
+        <span class="seq-meta">
+          prompt=${s.prompt_len} · generated=${s.num_generated} ·
+          cached=${s.num_cached_prefix_tokens}/${s.prompt_len} ·
+          blocks=${s.block_table.length}
+        </span>
+      </div>
+      <div class="seq-blocks">${blocksHTML || '<span class="muted">(no blocks yet)</span>'}</div>
+      <div class="seq-text"><span class="prompt">${escapeHtml(promptText)}</span><span class="gen">${escapeHtml(gen)}</span>${s.status === 'running' || s.status === 'prefilling' ? '<span class="cursor">&nbsp;</span>' : ''}</div>
+    `;
+    ui.seqs.appendChild(div);
+  }
+}
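+function escapeHtml(s) {
+  // Coerce to a string (null/undefined become "") and also escape single quotes.
+  return String(s ?? "").replace(/[&<>"']/g, c => ({"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;"}[c]));
+}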
+function handleEvent(ev) {
+  if (ev.type === "snapshot") {
+    const snap = ev.payload;
+    ui.model.textContent = `· ${snap.config.model}`;
+    renderPool(snap.block_pool);
+    renderSeqs(snap);
+    return;
+  }
+  if (ev.type === "step") {
+    const p = ev.payload;
+    ui.statTokens.textContent = p.num_tokens;
+    ui.statPfDec.textContent = `${p.num_prefill_seqs} / ${p.num_decode_seqs}`;
+    ui.statMs.textContent = p.duration_ms.toFixed(1);
+    if (p.preempted?.length) state.preempted += p.preempted.length;
+    ui.statPre.textContent = state.preempted;
+    renderPool(p.snapshot.block_pool);
+    renderSeqs(p.snapshot);
+    let msg = `step ${ev.step}: ${p.num_tokens}t (${p.num_prefill_seqs}P/${p.num_decode_seqs}D) in ${p.duration_ms.toFixed(1)}ms`;
+    let cls = "ev-step";
+    if (p.newly_admitted?.length) {
+      msg += ` · admitted seq=${p.newly_admitted.join(",")}`;
+      cls = "ev-admit";
+    }
+    if (p.finished?.length) {
+      msg += ` · finished ${p.finished.map(r => r.slice(0,8)).join(",")}`;
+      cls = "ev-finish";
+    }
+    if (p.preempted?.length) {
+      msg += ` · PREEMPTED seq=${p.preempted.join(",")}`;
+      cls = "ev-preempt";
+    }
+    logLine(msg, cls);
+  }
+}
+function connectEvents() {
+  const es = new EventSource("/engine/events");
+  es.onopen = () => {
+    ui.connection.textContent = "connected";
+    ui.connection.classList.remove("offline");
+    ui.connection.classList.add("online");
+  };
+  es.onerror = () => {
+    ui.connection.textContent = "disconnected";
+    ui.connection.classList.remove("online");
+    ui.connection.classList.add("offline");
+  };
+  es.onmessage = (e) => {
+    if (!e.data) return;
+    try {
+      handleEvent(JSON.parse(e.data));
+    } catch (err) {
+      console.error("bad event", err, e.data);
+    }
+  };
+}
+async function sendPrompt(prompt) {
+  const body = {
+    prompt,
+    max_tokens: parseInt($("max_tokens").value, 10),
+    temperature: parseFloat($("temperature").value),
+    top_p: parseFloat($("top_p").value),
+    stream: true,
+  };
+  const resp = await fetch("/generate", {
+    method: "POST",
+    headers: {"content-type": "application/json"},
+    body: JSON.stringify(body),
+  });
+  if (!resp.ok) {
+    const txt = await resp.text();
+    // The response body is untrusted text; escape it before it reaches innerHTML.
+    logLine(`request failed: ${escapeHtml(txt)}`, "ev-preempt");
+    return;
+  }
+  // Parse SSE manually so we can read each event as it arrives.
+  const reader = resp.body.getReader();
+  const decoder = new TextDecoder();
+  let buf = "";
+  let myReqId = null;
+  while (true) {
+    const { value, done } = await reader.read();
+    if (done) break;
+    buf += decoder.decode(value, { stream: true });
+    // SSE frames end with a blank line; accept CRLF as well as LF line endings.
+    const parts = buf.split(/\r?\n\r?\n/);
+    buf = parts.pop();
+    for (const part of parts) {
+      const line = part.trim();
+      if (!line.startsWith("data:")) continue;
+      const data = line.slice(5).trim();
+      if (data === "[DONE]") return;
+      try {
+        const j = JSON.parse(data);
+        if (!myReqId) {
+          myReqId = j.request_id;
+          state.requests.set(myReqId, { promptText: prompt, generated: "", finished: false });
+        }
+        const rec = state.requests.get(myReqId);
+        if (j.text) rec.generated += j.text;
+        rec.finished = j.finished;
+        rec.finishReason = j.finish_reason;
+        // Repaint the matching seq card if visible.
+        const card = document.getElementById(`seq-${myReqId}`);
+        if (card) {
+          const text = card.querySelector(".seq-text .gen");
+          if (text) text.textContent = rec.generated;
+        }
+      } catch (e) {
+        console.error("bad chunk", e, data);
+      }
+    }
+  }
+}
+ui.send.addEventListener("click", () => sendPrompt($("prompt").value));
+ui.sendTwice.addEventListener("click", async () => {
+  const p = $("prompt").value;
+  // First send fills the prefix cache; second send should hit it.
+  await sendPrompt(p);
+  await new Promise(r => setTimeout(r, 200));
+  await sendPrompt(p);
+});
+$("prompt").addEventListener("keydown", (e) => {
+  if ((e.metaKey || e.ctrlKey) && e.key === "Enter") {
+    sendPrompt(e.target.value);
+  }
+});
+connectEvents();
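
The manual SSE parsing inside `sendPrompt` above can be pulled out into a small standalone function, which makes the buffering logic easier to test in isolation. The following is a minimal sketch, not part of this commit: `createSseParser` is a hypothetical helper name, and it assumes the server only emits `data:` fields (as the client code above appears to expect).

```javascript
// Hypothetical helper (not in the commit): incremental SSE frame splitter.
// Feed it decoded text chunks; it returns the data payloads of every
// complete frame seen so far and buffers any trailing partial frame.
function createSseParser() {
  let buf = "";
  return function feed(chunk) {
    buf += chunk;
    // Frames are terminated by a blank line; accept both LF and CRLF.
    const frames = buf.split(/\r?\n\r?\n/);
    buf = frames.pop(); // last element is an incomplete frame (or "")
    const out = [];
    for (const frame of frames) {
      const dataLines = [];
      for (const line of frame.split(/\r?\n/)) {
        if (line.startsWith("data:")) {
          // The SSE format drops one optional space after the colon.
          dataLines.push(line.slice(5).replace(/^ /, ""));
        }
      }
      // Multiple data lines in one frame are joined with newlines.
      if (dataLines.length) out.push(dataLines.join("\n"));
    }
    return out;
  };
}
```

In `sendPrompt`, each payload returned by `feed(decoder.decode(value, { stream: true }))` would then either be the `[DONE]` sentinel or a JSON string to pass to `JSON.parse`.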

web/index.html ADDED

	@@ -0,0 +1,68 @@

+<!doctype html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>tiny_vllm — engine internals</title>
+<link rel="stylesheet" href="/static/style.css">
+</head>
+<body>
+<header>
+  <h1>tiny_vllm <span class="muted">— minimal continuous-batching engine</span></h1>
+  <div class="status">
+    <span id="connection" class="badge offline">disconnected</span>
+    <span id="model" class="muted"></span>
+  </div>
+</header>
+<section class="prompt-box">
+  <textarea id="prompt" rows="2" placeholder="Type a prompt and press Send (or Cmd/Ctrl+Enter)…">Explain paged attention in two sentences.</textarea>
+  <div class="controls">
+    <label>max_tokens <input id="max_tokens" type="number" value="64" min="1" max="2048"></label>
+    <label>temperature <input id="temperature" type="number" value="0.7" step="0.1" min="0" max="2"></label>
+    <label>top_p <input id="top_p" type="number" value="0.9" step="0.05" min="0" max="1"></label>
+    <button id="send">Send</button>
+    <button id="send-twice" title="Submit the same prompt twice — second should hit prefix cache">Send ×2 (prefix demo)</button>
+  </div>
+</section>
+<main>
+  <section class="card">
+    <h2>Block pool <span class="muted" id="pool-summary"></span></h2>
+    <div id="block-pool" class="block-pool"></div>
+    <div class="legend">
+      <span class="legend-item"><span class="swatch swatch-free"></span>free</span>
+      <span class="legend-item"><span class="swatch swatch-cached"></span>cached (evictable)</span>
+      <span class="legend-item"><span class="swatch swatch-used"></span>in use</span>
+      <span class="legend-item"><span class="swatch swatch-shared"></span>shared (refcount&gt;1)</span>
+      <span class="legend-item"><span class="swatch swatch-hashed-edge"></span>hashed (border)</span>
+    </div>
+  </section>
+  <section class="card">
+    <h2>Scheduler <span class="muted" id="sched-step"></span></h2>
+    <div class="stats">
+      <div class="stat"><div class="stat-label">tokens this step</div><div class="stat-value" id="stat-tokens">0</div></div>
+      <div class="stat"><div class="stat-label">prefill / decode</div><div class="stat-value" id="stat-pfdec">0 / 0</div></div>
+      <div class="stat"><div class="stat-label">step (ms)</div><div class="stat-value" id="stat-ms">0</div></div>
+      <div class="stat"><div class="stat-label">prefix cache hit-rate</div><div class="stat-value" id="stat-cache">0%</div></div>
+      <div class="stat"><div class="stat-label">free blocks</div><div class="stat-value" id="stat-free">0</div></div>
+      <div class="stat"><div class="stat-label">preemptions (total)</div><div class="stat-value" id="stat-pre">0</div></div>
+    </div>
+    <h3>step log</h3>
+    <pre id="log" class="log"></pre>
+  </section>
+  <section class="card grow">
+    <h2>Sequences</h2>
+    <div id="seqs"></div>
+  </section>
+</main>
+<footer>
+  <span class="muted">Subscribed to <code>/engine/events</code>. Source: <a href="https://github.com/yourname/tiny_vllm" target="_blank" rel="noopener noreferrer">github</a>.</span>
+</footer>
+<script src="/static/app.js"></script>
+</body>
+</html>

web/style.css ADDED

	@@ -0,0 +1,213 @@

+:root {
+  --bg: #0e1116;
+  --bg-elev: #161b22;
+  --bg-elev2: #1f2630;
+  --fg: #e6edf3;
+  --muted: #8b949e;
+  --accent: #58a6ff;
+  --green: #3fb950;
+  --purple: #a371f7;
+  --orange: #f0883e;
+  --red: #f85149;
+  --border: #30363d;
+  --mono: ui-monospace, "JetBrains Mono", Menlo, Consolas, monospace;
+}
+* { box-sizing: border-box; }
+body {
+  margin: 0;
+  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Inter, sans-serif;
+  background: var(--bg);
+  color: var(--fg);
+  font-size: 14px;
+}
+header {
+  display: flex; align-items: center; justify-content: space-between;
+  padding: 14px 20px;
+  border-bottom: 1px solid var(--border);
+  background: var(--bg-elev);
+}
+header h1 { font-size: 16px; margin: 0; font-weight: 600; }
+.muted { color: var(--muted); font-weight: 400; }
+.badge {
+  display: inline-block;
+  padding: 2px 8px;
+  border-radius: 10px;
+  font-size: 11px;
+  font-family: var(--mono);
+}
+.badge.online { background: rgba(63, 185, 80, 0.15); color: var(--green); }
+.badge.offline { background: rgba(248, 81, 73, 0.15); color: var(--red); }
+.prompt-box {
+  padding: 12px 20px;
+  border-bottom: 1px solid var(--border);
+  background: var(--bg-elev);
+  display: flex; flex-direction: column; gap: 10px;
+}
+.prompt-box textarea {
+  width: 100%;
+  background: var(--bg);
+  color: var(--fg);
+  border: 1px solid var(--border);
+  border-radius: 6px;
+  padding: 8px;
+  font-family: var(--mono);
+  resize: vertical;
+}
+.controls { display: flex; gap: 12px; align-items: center; flex-wrap: wrap; }
+.controls label { display: flex; gap: 6px; align-items: center; font-size: 12px; color: var(--muted); }
+.controls input {
+  width: 70px; background: var(--bg); color: var(--fg);
+  border: 1px solid var(--border); border-radius: 4px; padding: 3px 6px;
+  font-family: var(--mono);
+}
+button {
+  background: var(--accent); color: white;
+  border: none; border-radius: 4px;
+  padding: 6px 14px; font-weight: 500; cursor: pointer;
+}
+button:hover { filter: brightness(1.1); }
+#send-twice { background: var(--purple); }
+main {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  grid-template-areas: "pool sched" "seqs seqs";
+  gap: 16px;
+  padding: 16px 20px;
+}
+.card {
+  background: var(--bg-elev);
+  border: 1px solid var(--border);
+  border-radius: 8px;
+  padding: 14px;
+}
+.card h2 { font-size: 14px; margin: 0 0 10px; font-weight: 600; }
+.card h3 { font-size: 12px; margin: 14px 0 6px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.06em; }
+.card.grow { grid-area: seqs; }
+.card:nth-child(1) { grid-area: pool; }
+.card:nth-child(2) { grid-area: sched; }
+/* ---- block pool ---- */
+.block-pool {
+  display: grid;
+  grid-template-columns: repeat(auto-fill, 16px);
+  gap: 3px;
+  padding: 8px;
+  background: var(--bg);
+  border-radius: 6px;
+  max-height: 280px; overflow-y: auto;
+}
+.block {
+  width: 16px; height: 16px; border-radius: 3px;
+  background: var(--bg-elev2);
+  position: relative;
+  cursor: help;
+  border: 1px solid transparent;
+}
+.block.free { background: #2a3140; }
+.block.cached { background: #1f3b5c; }   /* free but in prefix cache */
+.block.used { background: var(--green); }
+.block.shared { background: var(--purple); }
+.block.hashed { border-color: var(--orange); }
+.legend { display: flex; gap: 14px; margin-top: 10px; font-size: 11px; color: var(--muted); flex-wrap: wrap; }
+.legend-item { display: flex; align-items: center; gap: 5px; }
+.swatch { width: 12px; height: 12px; border-radius: 3px; display: inline-block; }
+.swatch-free { background: #2a3140; }
+.swatch-cached { background: #1f3b5c; }
+.swatch-used { background: var(--green); }
+.swatch-shared { background: var(--purple); }
+.swatch-hashed-edge { background: var(--bg-elev2); border: 1px solid var(--orange); }
+/* ---- stats ---- */
+.stats { display: grid; grid-template-columns: repeat(3, 1fr); gap: 8px; }
+.stat {
+  background: var(--bg);
+  border-radius: 6px;
+  padding: 8px;
+}
+.stat-label { font-size: 10px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.06em; }
+.stat-value { font-family: var(--mono); font-size: 18px; margin-top: 3px; }
+/* ---- log ---- */
+.log {
+  background: var(--bg);
+  border-radius: 6px;
+  padding: 8px;
+  height: 140px; overflow-y: auto;
+  font-family: var(--mono); font-size: 11px;
+  margin: 0;
+  white-space: pre-wrap; word-break: break-word;
+}
+.log .ev-step { color: var(--muted); }
+.log .ev-admit { color: var(--accent); }
+.log .ev-finish { color: var(--green); }
+.log .ev-preempt { color: var(--red); }
+/* ---- sequences ---- */
+#seqs { display: flex; flex-direction: column; gap: 10px; }
+.seq {
+  background: var(--bg);
+  border: 1px solid var(--border);
+  border-radius: 6px;
+  padding: 10px;
+}
+.seq-header { display: flex; gap: 10px; align-items: center; }
+.seq-id { font-family: var(--mono); color: var(--muted); font-size: 12px; }
+.seq-status {
+  font-size: 10px; text-transform: uppercase; padding: 2px 6px; border-radius: 3px;
+  font-family: var(--mono); letter-spacing: 0.05em;
+}
+.seq-status.waiting { background: rgba(139, 148, 158, 0.2); color: var(--muted); }
+.seq-status.prefilling { background: rgba(88, 166, 255, 0.15); color: var(--accent); }
+.seq-status.running { background: rgba(63, 185, 80, 0.15); color: var(--green); }
+.seq-status.finished { background: rgba(163, 113, 247, 0.15); color: var(--purple); }
+.seq-status.preempted { background: rgba(240, 136, 62, 0.2); color: var(--orange); }
+.seq-meta { color: var(--muted); font-size: 11px; font-family: var(--mono); margin-left: auto; }
+.seq-blocks {
+  margin-top: 8px;
+  display: flex; gap: 2px; flex-wrap: wrap;
+}
+.seq-block {
+  width: 22px; height: 14px;
+  background: var(--bg-elev2);
+  font-size: 9px; line-height: 14px; text-align: center;
+  font-family: var(--mono);
+  border-radius: 2px;
+  color: var(--muted);
+}
+.seq-block.cached-hit { background: #1f3b5c; color: var(--accent); }
+.seq-block.shared { background: var(--purple); color: white; }
+.seq-text {
+  margin-top: 8px;
+  font-family: var(--mono); font-size: 12px;
+  background: var(--bg-elev2);
+  border-radius: 4px; padding: 6px;
+  min-height: 24px;
+  max-height: 180px;
+  overflow-y: auto;
+  white-space: pre-wrap; word-break: break-word;
+}
+.seq-text .prompt { color: var(--muted); }
+.seq-text .gen { color: var(--fg); }
+.seq-text .cursor {
+  display: inline-block; width: 6px; background: var(--accent);
+  animation: blink 1s steps(2, start) infinite;
+}
+@keyframes blink { to { visibility: hidden; } }
+footer {
+  padding: 10px 20px;
+  border-top: 1px solid var(--border);
+  color: var(--muted); font-size: 11px;
+}
+footer a { color: var(--accent); text-decoration: none; }
+@media (max-width: 900px) {
+  main { grid-template-columns: 1fr; grid-template-areas: "pool" "sched" "seqs"; }
+  .stats { grid-template-columns: repeat(2, 1fr); }
+}