enCoder committed on
Commit
39fa862
·
1 Parent(s): 1c4e67d

Add GitHub Pages demo and recording functionality

- Introduced a new demo section in README.md detailing how to deploy a static page on GitHub Pages using a recorded session.
- Added a GitHub Actions workflow for automatic deployment of the demo to GitHub Pages.
- Created a script to generate a demo recording (`scripts/make_demo_recording.py`) that simulates engine events for the demo.
- Updated the engine to support event recording to a JSONL file for replay functionality.
- Enhanced the web client to handle live and replay modes, including UI updates for replay controls and connection status.
- Added styles and elements in the web interface to support the new replay features.
- Updated configuration to allow specifying a recording path for event logging.
- Included a new `web/events.jsonl` file to store demo events for playback.
- Adjusted server arguments to enable recording mode.
- Improved event emission to support recording of engine events for the demo.
- Enhanced README with instructions for using the demo and generating recordings.

.claude/settings.local.json CHANGED
@@ -6,7 +6,9 @@
6
  "Bash(python -m pytest tests/test_block_manager.py tests/test_scheduler.py -v)",
7
  "Bash(pip install *)",
8
  "Bash(python -m pytest tests/ -v)",
9
- "Bash(python -c ' *)"
 
 
10
  ]
11
  }
12
  }
 
6
  "Bash(python -m pytest tests/test_block_manager.py tests/test_scheduler.py -v)",
7
  "Bash(pip install *)",
8
  "Bash(python -m pytest tests/ -v)",
9
+ "Bash(python -c ' *)",
10
+ "Bash(python scripts/make_demo_recording.py)",
11
+ "Bash(python *)"
12
  ]
13
  }
14
  }
.github/workflows/deploy-pages.yml ADDED
@@ -0,0 +1,47 @@
+ name: Deploy demo to GitHub Pages
+
+ # Publishes the static visualization from web/ to GitHub Pages on every push
+ # to main (and on demand from the Actions tab). In Pages mode it runs against
+ # a committed web/events.jsonl recording; see the README for how to capture
+ # a fresh one.
+
+ on:
+   push:
+     branches: [main]
+     paths:
+       - "web/**"
+       - ".github/workflows/deploy-pages.yml"
+   workflow_dispatch:
+
+ permissions:
+   contents: read
+   pages: write
+   id-token: write
+
+ # Allow only one Pages deployment at a time; cancel a queued or in-progress
+ # run if a newer commit arrives.
+ concurrency:
+   group: pages
+   cancel-in-progress: true
+
+ jobs:
+   deploy:
+     environment:
+       name: github-pages
+       url: ${{ steps.deployment.outputs.page_url }}
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout
+         uses: actions/checkout@v4
+
+       - name: Setup Pages
+         uses: actions/configure-pages@v5
+
+       - name: Upload web/ as artifact
+         uses: actions/upload-pages-artifact@v3
+         with:
+           path: web
+
+       - name: Deploy
+         id: deployment
+         uses: actions/deploy-pages@v4
README.md CHANGED
@@ -76,6 +76,33 @@ pip install pytest
  python -m pytest tests/
  ```

+ ## GitHub Pages demo (replay mode)
+
+ The visualization can run as a **static page** on GitHub Pages with no
+ backend. It plays back a recorded session from `web/events.jsonl`:
+
+ 1. The repo ships a fabricated `web/events.jsonl` so the page works on first
+    deploy (run `python scripts/make_demo_recording.py > web/events.jsonl` to
+    regenerate).
+ 2. To use a **real** recording instead, run the server with `--record`:
+    ```bash
+    python -m tiny_vllm.server --record web/events.jsonl
+    # …submit some prompts via the UI or smoke_client…
+    # Ctrl-C the server. events.jsonl now contains the full session.
+    git add web/events.jsonl && git commit -m "fresh demo recording" && git push
+    ```
+ 3. Enable Pages once: **repo → Settings → Pages → Source: "GitHub Actions"**.
+    The workflow in `.github/workflows/deploy-pages.yml` then publishes
+    `web/` on every push to `main` that touches it.
+
+ The page auto-detects mode:
+ - Tries `/engine/events` SSE first; if it responds within 2s it's **live**.
+ - Otherwise falls back to **replay**, fetching `events.jsonl` from the same
+   directory and playing it back with original timing (speed control / pause
+   / restart in the controls row).
+ - Force a mode with `?mode=replay` or `?mode=live`; point at a different
+   recording with `?session=URL`.
+
  ## What the demo page shows

  | Panel | What you're looking at |
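The mode-selection and playback rules described in the README section above can be modelled in a few lines of Python. This is only an illustrative sketch, not the logic in `web/app.js`: the function names `choose_mode`, `session_url`, and `replay_delays` are hypothetical, and the `sse_responded` flag stands in for "the `/engine/events` stream answered within the 2-second window", which the real page would have to measure in the browser.

```python
from __future__ import annotations

from urllib.parse import parse_qs, urlparse


def choose_mode(page_url: str, sse_responded: bool) -> str:
    """Return "live" or "replay" following the rules listed in the README."""
    query = parse_qs(urlparse(page_url).query)
    forced = query.get("mode", [None])[0]
    if forced in ("live", "replay"):
        return forced  # an explicit ?mode=... wins over auto-detection
    return "live" if sse_responded else "replay"


def session_url(page_url: str, default: str = "events.jsonl") -> str:
    """Pick the recording to fetch: ?session=URL if present, else the default."""
    query = parse_qs(urlparse(page_url).query)
    return query.get("session", [default])[0]


def replay_delays(timestamps: list[float], speed: float = 1.0) -> list[float]:
    """Seconds to wait before dispatching each event, scaled by `speed`."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    delays = []
    previous = timestamps[0] if timestamps else 0.0
    for ts in timestamps:
        # Clamp at zero so an out-of-order timestamp never yields a negative sleep.
        delays.append(max(0.0, ts - previous) / speed)
        previous = ts
    return delays
```

Computing delays from consecutive timestamps, rather than sleeping until an absolute time, keeps pause and speed changes simple: a paused player just stops consuming the list.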
scripts/make_demo_recording.py ADDED
@@ -0,0 +1,261 @@
+ """Fabricate a representative events.jsonl for the GH Pages demo.
+
+ We don't need a model or torch here; we just synthesize a sequence of engine
+ events that exercises:
+
+   * a prompt being admitted (request event + first step admitting it)
+   * chunked prefill across two steps
+   * decoding several tokens
+   * a second request that hits the prefix cache on identical prefix tokens
+   * a finished sequence and a freed-but-cached block
+
+ The output is read by `web/app.js`'s replay mode.
+
+ Run:
+     python scripts/make_demo_recording.py > web/events.jsonl
+ """
+ from __future__ import annotations
+
+ import json
+ import sys
+ from copy import deepcopy
+
+ NUM_BLOCKS = 24
+ BLOCK_SIZE = 8
+ MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
+
+ CONFIG = {
+     "model": MODEL,
+     "block_size": BLOCK_SIZE,
+     "num_blocks": NUM_BLOCKS,
+     "max_num_seqs": 8,
+     "max_num_batched_tokens": 32,
+     "prefix_caching": True,
+ }
+
+ # State we evolve.
+ ref = [0] * NUM_BLOCKS
+ hashed = [False] * NUM_BLOCKS
+ lookups = 0
+ hits = 0
+
+ events: list[dict] = []
+ ts = 0.0
+ step = 0
+
+
+ def pool_snapshot() -> dict:
+     return {
+         "num_blocks": NUM_BLOCKS,
+         "block_size": BLOCK_SIZE,
+         "num_free_blocks": sum(1 for r in ref if r == 0),
+         "num_cached_entries": sum(1 for h in hashed if h),
+         "prefix_cache_hits": hits,
+         "prefix_cache_lookups": lookups,
+         "ref_counts": list(ref),
+         "hashed": list(hashed),
+     }
+
+
+ def seq_view(rid, sid, status, prompt_len, num_gen, num_computed, num_cached, block_table) -> dict:
+     return {
+         "seq_id": sid, "request_id": rid, "status": status,
+         "prompt_len": prompt_len, "num_generated": num_gen,
+         "num_computed_tokens": num_computed,
+         "num_cached_prefix_tokens": num_cached,
+         "block_table": list(block_table),
+     }
+
+
+ def emit(ev_type: str, payload: dict) -> None:
+     global ts
+     events.append({"type": ev_type, "step": step, "timestamp": round(ts, 3), "payload": payload})
+
+
+ def emit_snapshot(running, waiting) -> None:
+     emit("snapshot", {
+         "step": step,
+         "config": CONFIG,
+         "block_pool": pool_snapshot(),
+         "running": running,
+         "waiting": waiting,
+     })
+
+
+ def emit_step(dur_ms, n_tok, n_pf, n_dec, deltas, running, waiting,
+               admitted=None, finished=None, preempted=None):
+     emit("step", {
+         "duration_ms": dur_ms,
+         "num_tokens": n_tok,
+         "num_seqs": n_pf + n_dec,
+         "num_prefill_seqs": n_pf,
+         "num_decode_seqs": n_dec,
+         "deltas": deltas or [],
+         "newly_admitted": admitted or [],
+         "finished": finished or [],
+         "preempted": preempted or [],
+         "snapshot": {
+             "step": step,
+             "config": CONFIG,
+             "block_pool": pool_snapshot(),
+             "running": running,
+             "waiting": waiting,
+         },
+     })
+
+
+ # ---------- script the session ----------
+
+ # Initial empty snapshot.
+ emit_snapshot(running=[], waiting=[])
+ ts += 0.4
+
+ # --- request A admitted with a 20-token prompt --------------------------
+ RID_A = "demo-aaaa1111"
+ SID_A = 1
+ PROMPT_A = "Explain paged attention in two sentences. Then explain prefix caching."
+
+ step = 1
+ ts += 0.1
+ emit("request", {"request_id": RID_A, "seq_id": SID_A, "prompt": PROMPT_A,
+                  "prompt_len": 20, "max_tokens": 24})
+
+ # Step 1: chunked prefill, first chunk (16 tokens → 2 blocks).
+ ref[0] = ref[1] = 1
+ hashed[0] = hashed[1] = True
+ ts += 0.05
+ running_A_pf1 = [seq_view(RID_A, SID_A, "prefilling", 20, 0, 16, 0, [0, 1])]
+ emit_step(280, 16, 1, 0, [], [], [running_A_pf1[0]], admitted=[SID_A])
+ # (waiting moved to running at end of prefill; here still prefilling, so it
+ # stays in waiting in our scheduler; emit fix:)
+ events[-1]["payload"]["snapshot"]["running"] = []
+ events[-1]["payload"]["snapshot"]["waiting"] = running_A_pf1
+ ts += 0.3
+
+ # Step 2: finish prefill (4 more tokens). Need 1 more block.
+ step = 2
+ ref[2] = 1  # block 2 holds last 4 tokens (partial)
+ ts += 0.05
+ running_A = [seq_view(RID_A, SID_A, "running", 20, 0, 20, 0, [0, 1, 2])]
+ emit_step(180, 4, 1, 0, [], running_A, [])
+ ts += 0.2
+
+ # Steps 3-7: A decodes 5 tokens. Each step appends one slot; block 2 fills
+ # at token 24, then block 3 starts.
+ TOKS_A = [" Paged", " attention", " splits", " keys", " and"]
+ for i, tok in enumerate(TOKS_A):
+     step += 1
+     ts += 0.12
+     n_gen = i + 1
+     n_comp = 20 + i  # after fwd: num_computed_tokens increments by 1 each decode
+     # Block table grows when crossing boundary.
+     bt = [0, 1, 2]
+     if 20 + n_gen > 24:
+         bt = [0, 1, 2, 3]
+         ref[3] = 1
+     # Maybe hash block 2 when it just filled (after sampling tok at pos 23 → 24 tokens computed).
+     if 20 + n_gen >= 24 and not hashed[2]:
+         hashed[2] = True
+     running_A = [seq_view(RID_A, SID_A, "running", 20, n_gen, 20 + n_gen,
+                           0, bt)]
+     emit_step(95, 1, 0, 1,
+               [{"request_id": RID_A, "new_text": tok, "finished": False, "finish_reason": None}],
+               running_A, [])
+
+ # --- request B admitted, identical prefix → cache hit -------------------
+ step += 1
+ RID_B = "demo-bbbb2222"
+ SID_B = 2
+ PROMPT_B = PROMPT_A  # same → prefix cache hit
+ ts += 0.6
+ emit("request", {"request_id": RID_B, "seq_id": SID_B, "prompt": PROMPT_B,
+                  "prompt_len": 20, "max_tokens": 24})
+
+ step += 1
+ ts += 0.05
+ # Hit lookups: 2 full blocks of B's prompt (block_size=8 → 16/8=2 blocks).
+ # Both hit because A's first 2 blocks are hashed (and currently in use).
+ lookups += 2
+ hits += 2
+ ref[0] += 1
+ ref[1] += 1
+ # B needs a 3rd block (partial, 4 tokens). Fresh: block 4.
+ ref[4] = 1
+ running_B_pf = [seq_view(RID_B, SID_B, "prefilling", 20, 0, 16, 16, [0, 1, 4])]
+ # A is still running and decoding in this step.
+ step_A_n_gen = len(TOKS_A)
+ running_A_now = [seq_view(RID_A, SID_A, "running", 20, step_A_n_gen,
+                           20 + step_A_n_gen, 0, [0, 1, 2, 3])]
+ emit_step(140, 5, 1, 1,
+           [{"request_id": RID_A, "new_text": " values", "finished": False, "finish_reason": None}],
+           [running_A_now[0]], [running_B_pf[0]], admitted=[SID_B])
+ ts += 0.12
+
+ # Step: B finishes prefill (4 tokens), A decodes one more.
+ step += 1
+ running_A_now = [seq_view(RID_A, SID_A, "running", 20, step_A_n_gen + 1,
+                           21 + step_A_n_gen, 0, [0, 1, 2, 3])]
+ running_B = [seq_view(RID_B, SID_B, "running", 20, 0, 20, 16, [0, 1, 4])]
+ emit_step(110, 5, 1, 1,
+           [{"request_id": RID_A, "new_text": " into", "finished": False, "finish_reason": None}],
+           [running_A_now[0], running_B[0]], [])
+ ts += 0.12
+
+ # Steps: both decode in lock-step for a few rounds.
+ A_tokens = [" small", ",", " fixed", "-size", " blocks", " stored", " in"]
+ B_tokens = [" Prefix", " caching", " reuses", " those", " blocks", " across", " requests"]
+ for i in range(7):
+     step += 1
+     ts += 0.10
+     a_gen = step_A_n_gen + 2 + i
+     b_gen = i + 1
+     a_bt = [0, 1, 2, 3]
+     b_bt = [0, 1, 4]
+     # B crosses block boundary at b_gen such that 20 + b_gen > 24 → needs block 5.
+     if 20 + b_gen > 24 and ref[5] == 0:
+         ref[5] = 1
+     if 20 + b_gen > 24:
+         b_bt = [0, 1, 4, 5]
+     if 20 + b_gen >= 24 and not hashed[4]:
+         hashed[4] = True
+     seq_a = seq_view(RID_A, SID_A, "running", 20, a_gen, 20 + a_gen, 0, a_bt)
+     seq_b = seq_view(RID_B, SID_B, "running", 20, b_gen, 20 + b_gen, 16, b_bt)
+     emit_step(105, 2, 0, 2,
+               [{"request_id": RID_A, "new_text": A_tokens[i], "finished": False, "finish_reason": None},
+                {"request_id": RID_B, "new_text": B_tokens[i], "finished": False, "finish_reason": None}],
+               [seq_a, seq_b], [])
+
+ # A finishes (hits EOS or max_tokens).
+ step += 1
+ ts += 0.10
+ ref[0] -= 1  # A releases its blocks; block 0,1 still held by B so refcount drops to 1
+ ref[1] -= 1
+ ref[2] = 0  # block 2 (A only) → goes to cached free list (hashed)
+ ref[3] = 0  # block 3 (A only) → if hashed, cached; else uncached free
+ # Block 2 was hashed earlier; block 3 we never hashed (partial in A's decode). For demo,
+ # hash block 3 to make the cache view richer.
+ hashed[3] = True
+ seq_b = seq_view(RID_B, SID_B, "running", 20, 8, 28, 16, [0, 1, 4, 5])
+ emit_step(85, 1, 0, 1,
+           [{"request_id": RID_A, "new_text": " the GPU.", "finished": True, "finish_reason": "stop"}],
+           [seq_b], [], finished=[RID_A])
+
+ # B finishes too.
+ step += 1
+ ts += 0.10
+ ref[0] -= 1
+ ref[1] -= 1
+ ref[4] = 0
+ ref[5] = 0
+ hashed[5] = True
+ emit_step(80, 1, 0, 1,
+           [{"request_id": RID_B, "new_text": " for the same prompts.", "finished": True, "finish_reason": "stop"}],
+           [], [], finished=[RID_B])
+
+
+ # ---------- emit ----------
+
+ out = sys.stdout
+ for ev in events:
+     out.write(json.dumps(ev, separators=(",", ":")))
+     out.write("\n")
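Since the script above writes one JSON object per line with the keys `type`, `step`, `timestamp`, and `payload`, a small checker can catch a malformed recording before it is committed. The following is a minimal sketch under stated assumptions: `check_recording` is a hypothetical helper that is not part of the repository, and the rule that timestamps never decrease is inferred from how the script advances `ts`, not from a documented contract.

```python
from __future__ import annotations

import json

# Envelope keys used by the script's emit() helper.
REQUIRED_KEYS = {"type", "step", "timestamp", "payload"}


def check_recording(lines: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems: list[str] = []
    last_ts = float("-inf")
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            ev = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(ev, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - ev.keys()
        if missing:
            problems.append(f"line {lineno}: missing keys {sorted(missing)}")
            continue
        if ev["timestamp"] < last_ts:
            problems.append(f"line {lineno}: timestamp goes backwards")
        last_ts = ev["timestamp"]
    return problems
```

To check a file, one could pass `open("web/events.jsonl").read().splitlines()` to `check_recording` and fail the commit if the returned list is non-empty.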
tiny_vllm/config.py CHANGED
@@ -25,6 +25,8 @@ class EngineConfig:
25
  # Logging / events
26
  emit_events: bool = True # produce engine events for the UI
27
  event_buffer: int = 256
 
 
28
 
29
  def __post_init__(self) -> None:
30
  if self.max_num_batched_tokens < self.block_size:
 
25
  # Logging / events
26
  emit_events: bool = True # produce engine events for the UI
27
  event_buffer: int = 256
28
+ record_path: Optional[str] = None # JSONL file that receives every event (truncated on start)
29
+ # (powers the static GH-Pages replay)
30
 
31
  def __post_init__(self) -> None:
32
  if self.max_num_batched_tokens < self.block_size:
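To show what a `record_path` option like the one above could drive, here is a minimal recorder sketch. It is an assumption-laden illustration and not the engine's implementation: `JsonlRecorder` and its `clock` parameter are hypothetical names. The sketch takes both the start time and each event time from the same clock, which is what rebasing timestamps to zero relies on.

```python
from __future__ import annotations

import io
import json
import time
from typing import Callable, Optional, TextIO


class JsonlRecorder:
    """Hypothetical helper: write events as compact JSON lines with
    timestamps measured relative to when the recorder was created."""

    def __init__(self, fh: TextIO, clock: Callable[[], float] = time.monotonic) -> None:
        self._fh: Optional[TextIO] = fh
        self._clock = clock
        self._t0 = clock()  # all later timestamps are offsets from this reading

    def record(self, ev_type: str, step: int, payload: dict) -> None:
        if self._fh is None:
            return  # recorder already closed; ignore late events
        event = {
            "type": ev_type,
            "step": step,
            "timestamp": round(self._clock() - self._t0, 3),
            "payload": payload,
        }
        self._fh.write(json.dumps(event, separators=(",", ":")) + "\n")

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None


# Usage: record two events into an in-memory buffer instead of a real file.
buf = io.StringIO()
rec = JsonlRecorder(buf)
rec.record("snapshot", 0, {"running": [], "waiting": []})
rec.record("step", 1, {"num_tokens": 4})
recorded_lines = buf.getvalue().splitlines()  # read before close() discards the buffer
rec.close()
```

Injecting the clock keeps the recorder testable with fixed times; with a real file, opening it line-buffered means a partially written session is still readable if the process is interrupted.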
tiny_vllm/engine.py CHANGED
@@ -16,11 +16,12 @@ from __future__ import annotations
16
 
17
  import asyncio
18
  import itertools
 
19
  import time
20
  import uuid
21
  from collections import deque
22
  from dataclasses import dataclass, field
23
- from typing import AsyncIterator, Optional
24
 
25
  from .block_manager import BlockManager
26
  from .config import EngineConfig, SamplingParams
@@ -69,6 +70,9 @@ class LLMEngine:
69
  self._step_idx = 0
70
  self._run_task: Optional[asyncio.Task] = None
71
  self._wake = asyncio.Event()
 
 
 
72
 
73
  # ---- lifecycle ------------------------------------------------------
74
 
@@ -87,6 +91,19 @@ class LLMEngine:
87
  )
88
  self.scheduler = Scheduler(self.config, self.block_manager)
89
  self.sampler = Sampler(self.model_runner.device)
90
  self._run_task = asyncio.create_task(self._run_loop())
91
 
92
  async def shutdown(self) -> None:
@@ -97,6 +114,12 @@ class LLMEngine:
97
  await asyncio.wait_for(self._run_task, timeout=5)
98
  except asyncio.TimeoutError:
99
  self._run_task.cancel()
 
 
 
 
 
 
100
 
101
  # ---- request submission --------------------------------------------
102
 
@@ -110,8 +133,10 @@ class LLMEngine:
110
  raise RuntimeError("engine not started")
111
  if isinstance(prompt, str):
112
  token_ids = self.model_runner.encode(prompt)
 
113
  else:
114
  token_ids = list(prompt)
 
115
  if not token_ids:
116
  raise ValueError("empty prompt")
117
  if len(token_ids) >= self.config.max_model_len:
@@ -129,6 +154,13 @@ class LLMEngine:
129
  self._prev_text_len[rid] = 0
130
  assert self.scheduler is not None
131
  self.scheduler.add(seq)
 
 
 
 
 
 
 
132
  self._wake.set()
133
  return rid
134
 
@@ -166,7 +198,7 @@ class LLMEngine:
166
  pass
167
 
168
  def _emit(self, event_type: str, payload: dict) -> None:
169
- if not self.config.emit_events or not self._event_subscribers:
170
  return
171
  ev = EngineEvent(
172
  step=self._step_idx,
@@ -178,7 +210,6 @@ class LLMEngine:
178
  try:
179
  q.put_nowait(ev)
180
  except asyncio.QueueFull:
181
- # Drop oldest, push new.
182
  try:
183
  q.get_nowait()
184
  except asyncio.QueueEmpty:
@@ -187,6 +218,23 @@ class LLMEngine:
187
  q.put_nowait(ev)
188
  except asyncio.QueueFull:
189
  pass
190
 
191
  # ---- inspection ----------------------------------------------------
192
 
@@ -301,7 +349,9 @@ class LLMEngine:
301
  self.scheduler.running.remove(seq)
302
  self.block_manager.free(seq)
303
 
304
- # Emit outputs to per-request queues.
 
 
305
  for item in sched.scheduled:
306
  seq = item.seq
307
  rid = seq.request_id
@@ -325,8 +375,14 @@ class LLMEngine:
325
  q = self._output_queues.get(rid)
326
  if q is not None:
327
  await q.put(si)
 
 
 
 
 
 
 
328
  if is_done:
329
- # Clean up.
330
  self._sequences.pop(rid, None)
331
  self._prev_text_len.pop(rid, None)
332
 
@@ -340,6 +396,7 @@ class LLMEngine:
340
  "preempted": sched.preempted,
341
  "newly_admitted": sched.newly_admitted,
342
  "finished": [s.request_id for s in finished_now],
 
343
  "snapshot": self.snapshot(),
344
  })
345
 
 
16
 
17
  import asyncio
18
  import itertools
19
+ import json
20
  import time
21
  import uuid
22
  from collections import deque
23
  from dataclasses import dataclass, field
24
+ from typing import AsyncIterator, Optional, TextIO
25
 
26
  from .block_manager import BlockManager
27
  from .config import EngineConfig, SamplingParams
 
70
  self._step_idx = 0
71
  self._run_task: Optional[asyncio.Task] = None
72
  self._wake = asyncio.Event()
73
+ # recording (for the static GH-Pages replay)
74
+ self._record_fh: Optional[TextIO] = None
75
+ self._record_t0: float = 0.0
76
 
77
  # ---- lifecycle ------------------------------------------------------
78
 
 
91
  )
92
  self.scheduler = Scheduler(self.config, self.block_manager)
93
  self.sampler = Sampler(self.model_runner.device)
94
+
95
+ # Open the recorder *after* the block manager exists so the initial
96
+ # snapshot we write is valid.
97
+ if self.config.record_path:
98
+ self._record_fh = open(self.config.record_path, "w", buffering=1)
99
+ self._record_t0 = time.monotonic()
100
+ self._record({
101
+ "type": "snapshot",
102
+ "step": 0,
103
+ "timestamp": 0.0,
104
+ "payload": self.snapshot(),
105
+ })
106
+
107
  self._run_task = asyncio.create_task(self._run_loop())
108
 
109
  async def shutdown(self) -> None:
 
114
  await asyncio.wait_for(self._run_task, timeout=5)
115
  except asyncio.TimeoutError:
116
  self._run_task.cancel()
117
+ if self._record_fh is not None:
118
+ try:
119
+ self._record_fh.close()
120
+ except Exception:
121
+ pass
122
+ self._record_fh = None
123
 
124
  # ---- request submission --------------------------------------------
125
 
 
133
  raise RuntimeError("engine not started")
134
  if isinstance(prompt, str):
135
  token_ids = self.model_runner.encode(prompt)
136
+ prompt_text = prompt
137
  else:
138
  token_ids = list(prompt)
139
+ prompt_text = self.model_runner.decode(token_ids)
140
  if not token_ids:
141
  raise ValueError("empty prompt")
142
  if len(token_ids) >= self.config.max_model_len:
 
154
  self._prev_text_len[rid] = 0
155
  assert self.scheduler is not None
156
  self.scheduler.add(seq)
157
+ self._emit("request", {
158
+ "request_id": rid,
159
+ "seq_id": seq.seq_id,
160
+ "prompt": prompt_text,
161
+ "prompt_len": len(token_ids),
162
+ "max_tokens": sampling_params.max_tokens,
163
+ })
164
  self._wake.set()
165
  return rid
166
 
 
198
  pass
199
 
200
  def _emit(self, event_type: str, payload: dict) -> None:
201
+ if not self.config.emit_events:
202
  return
203
  ev = EngineEvent(
204
  step=self._step_idx,
 
210
  try:
211
  q.put_nowait(ev)
212
  except asyncio.QueueFull:
 
213
  try:
214
  q.get_nowait()
215
  except asyncio.QueueEmpty:
 
218
  q.put_nowait(ev)
219
  except asyncio.QueueFull:
220
  pass
221
+ # Mirror into the on-disk recording (timestamps re-based to t0).
222
+ if self._record_fh is not None:
223
+ self._record({
224
+ "type": ev.type,
225
+ "step": ev.step,
226
+ "timestamp": ev.timestamp - self._record_t0,
227
+ "payload": ev.payload,
228
+ })
229
+
230
+ def _record(self, ev: dict) -> None:
231
+ fh = self._record_fh
232
+ if fh is None:
233
+ return
234
+ try:
235
+ fh.write(json.dumps(ev, separators=(",", ":")) + "\n")
236
+ except Exception:
237
+ pass
238
 
239
  # ---- inspection ----------------------------------------------------
240
 
 
349
  self.scheduler.running.remove(seq)
350
  self.block_manager.free(seq)
351
 
352
+ # Emit outputs to per-request queues, and collect per-step deltas
353
+ # for the event stream (powers the replay UI's text panes).
354
+ step_deltas: list[dict] = []
355
  for item in sched.scheduled:
356
  seq = item.seq
357
  rid = seq.request_id
 
375
  q = self._output_queues.get(rid)
376
  if q is not None:
377
  await q.put(si)
378
+ if new_text or is_done:
379
+ step_deltas.append({
380
+ "request_id": rid,
381
+ "new_text": new_text,
382
+ "finished": is_done,
383
+ "finish_reason": seq.finish_reason,
384
+ })
385
  if is_done:
 
386
  self._sequences.pop(rid, None)
387
  self._prev_text_len.pop(rid, None)
388
 
 
396
  "preempted": sched.preempted,
397
  "newly_admitted": sched.newly_admitted,
398
  "finished": [s.request_id for s in finished_now],
399
+ "deltas": step_deltas,
400
  "snapshot": self.snapshot(),
401
  })
402
 
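The `deltas` list added to each `step` event carries `request_id`, `new_text`, `finished`, and `finish_reason`, so a replay client can rebuild each request's generated text without the `/generate` stream. A rough sketch of that folding step follows; `apply_deltas` is a hypothetical name, and the real client in `web/app.js` may organise this differently.

```python
from __future__ import annotations


def apply_deltas(texts: dict[str, str], finished: set[str], deltas: list[dict]) -> None:
    """Fold one step's deltas into per-request text, mutating the arguments."""
    for delta in deltas:
        rid = delta["request_id"]
        # new_text may be empty on a step that only reports completion.
        texts[rid] = texts.get(rid, "") + (delta.get("new_text") or "")
        if delta.get("finished"):
            finished.add(rid)
```

Calling this once per `step` event, in recording order, leaves `texts` holding the full output generated so far for every request seen.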
tiny_vllm/server.py CHANGED
@@ -282,6 +282,9 @@ def main() -> None:
282
  parser.add_argument("--max-num-batched-tokens", type=int, default=512)
283
  parser.add_argument("--max-model-len", type=int, default=2048)
284
  parser.add_argument("--disable-prefix-caching", action="store_true")
 
 
 
285
  parser.add_argument("--host", default="0.0.0.0")
286
  parser.add_argument("--port", type=int, default=8000)
287
  args = parser.parse_args()
@@ -296,6 +299,7 @@ def main() -> None:
296
  max_num_batched_tokens=args.max_num_batched_tokens,
297
  max_model_len=args.max_model_len,
298
  enable_prefix_caching=not args.disable_prefix_caching,
 
299
  )
300
 
301
  import uvicorn
 
282
  parser.add_argument("--max-num-batched-tokens", type=int, default=512)
283
  parser.add_argument("--max-model-len", type=int, default=2048)
284
  parser.add_argument("--disable-prefix-caching", action="store_true")
285
+ parser.add_argument("--record", default=None,
286
+ help="Write every engine event to this JSONL file "
287
+ "(e.g. web/events.jsonl; truncated at startup) to power the static replay demo.")
288
  parser.add_argument("--host", default="0.0.0.0")
289
  parser.add_argument("--port", type=int, default=8000)
290
  args = parser.parse_args()
 
299
  max_num_batched_tokens=args.max_num_batched_tokens,
300
  max_model_len=args.max_model_len,
301
  enable_prefix_caching=not args.disable_prefix_caching,
302
+ record_path=args.record,
303
  )
304
 
305
  import uvicorn
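The wiring from the `--record` flag to the `record_path` config field can be exercised in isolation. The sketch below is a simplified stand-in, not the server's code: `DemoConfig`, `build_parser`, and `config_from_args` are hypothetical names, and only the one flag from this hunk is modelled.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in for EngineConfig with only the fields used here."""
    emit_events: bool = True
    record_path: Optional[str] = None  # None means recording is disabled


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="recording flag sketch")
    parser.add_argument("--record", default=None,
                        help="JSONL file to write engine events to")
    return parser


def config_from_args(argv: list[str]) -> DemoConfig:
    """Parse argv and copy the --record value into the config object."""
    args = build_parser().parse_args(argv)
    return DemoConfig(record_path=args.record)
```

Leaving the default as `None` lets the engine treat "flag absent" and "recording off" as the same case with a single truthiness check.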
web/app.js CHANGED
@@ -1,12 +1,16 @@
1
  /* tiny_vllm — demo page client.
2
  *
3
- * Two streams in play:
4
  *
5
- * /engine/events engine state snapshots (one per scheduling step)
6
- * /generate — token-level deltas for whatever prompt this page sent
 
 
 
7
  *
8
- * The page itself is stateless; everything is driven by what comes off the
9
- * event stream. Token deltas from /generate are merged into per-request UI.
 
10
  */
11
 
12
  const $ = (id) => document.getElementById(id);
@@ -27,6 +31,11 @@ const ui = {
27
  seqs: $("seqs"),
28
  send: $("send"),
29
  sendTwice: $("send-twice"),
 
 
 
 
 
30
  };
31
 
32
  const state = {
@@ -38,6 +47,8 @@ const state = {
38
  requests: new Map(),
39
  // seq_id -> { request_id, blockTable, cachedPrefixBlocks, status, ... }
40
  seqsBySeqId: new Map(),
 
 
41
  };
42
 
43
  function logLine(html, cls = "") {
@@ -46,6 +57,46 @@ function logLine(html, cls = "") {
46
  ui.log.scrollTop = ui.log.scrollHeight;
47
  }
48
 
49
  function initPool(numBlocks) {
50
  if (state.numBlocks === numBlocks && state.poolEls.length === numBlocks) return;
51
  state.numBlocks = numBlocks;
@@ -68,13 +119,9 @@ function renderPool(pool) {
68
  const rc = pool.ref_counts[i];
69
  const hashed = pool.hashed[i];
70
  let cls = "block";
71
- if (rc === 0) {
72
- cls += hashed ? " cached" : " free";
73
- } else if (rc === 1) {
74
- cls += " used";
75
- } else {
76
- cls += " shared";
77
- }
78
  if (hashed) cls += " hashed";
79
  el.className = cls;
80
  el.title = `block ${i} — refcount=${rc}${hashed ? " — hashed (cacheable)" : ""}`;
@@ -95,11 +142,10 @@ function renderPool(pool) {
95
  function renderSeqs(snapshot) {
96
  ui.schedStep.textContent = ` — step ${snapshot.step}`;
97
  const all = [...snapshot.running, ...snapshot.waiting];
98
- // index for later token-delta merges
99
  state.seqsBySeqId = new Map(all.map(s => [s.seq_id, s]));
100
  ui.seqs.innerHTML = "";
101
  if (all.length === 0) {
102
- ui.seqs.innerHTML = `<div class="muted">(no active sequences — send a prompt above)</div>`;
103
  return;
104
  }
105
  for (const s of all) {
@@ -129,7 +175,7 @@ function renderSeqs(snapshot) {
129
  </span>
130
  </div>
131
  <div class="seq-blocks">${blocksHTML || '<span class="muted">(no blocks yet)</span>'}</div>
132
- <div class="seq-text"><span class="prompt">${escapeHtml(promptText)}</span><span class="gen">${escapeHtml(gen)}</span>${s.status === 'running' || s.status === 'prefilling' ? '<span class="cursor">&nbsp;</span>' : ''}</div>
133
  `;
134
  ui.seqs.appendChild(div);
135
  }
@@ -139,6 +185,27 @@ function escapeHtml(s) {
139
  return (s || "").replace(/[&<>"]/g, c => ({"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}[c]));
140
  }
141
 
142
  function handleEvent(ev) {
143
  if (ev.type === "snapshot") {
144
  const snap = ev.payload;
@@ -147,6 +214,17 @@ function handleEvent(ev) {
147
  renderSeqs(snap);
148
  return;
149
  }
150
  if (ev.type === "step") {
151
  const p = ev.payload;
152
  ui.statTokens.textContent = p.num_tokens;
@@ -154,50 +232,134 @@ function handleEvent(ev) {
154
  ui.statMs.textContent = p.duration_ms.toFixed(1);
155
  if (p.preempted?.length) state.preempted += p.preempted.length;
156
  ui.statPre.textContent = state.preempted;
 
157
  renderPool(p.snapshot.block_pool);
158
  renderSeqs(p.snapshot);
159
 
160
  let msg = `step ${ev.step}: ${p.num_tokens}t (${p.num_prefill_seqs}P/${p.num_decode_seqs}D) in ${p.duration_ms.toFixed(1)}ms`;
161
  let cls = "ev-step";
162
- if (p.newly_admitted?.length) {
163
- msg += ` · admitted seq=${p.newly_admitted.join(",")}`;
164
- cls = "ev-admit";
165
- }
166
- if (p.finished?.length) {
167
- msg += ` · finished ${p.finished.map(r => r.slice(0,8)).join(",")}`;
168
- cls = "ev-finish";
169
- }
170
- if (p.preempted?.length) {
171
- msg += ` · PREEMPTED seq=${p.preempted.join(",")}`;
172
- cls = "ev-preempt";
173
- }
174
  logLine(msg, cls);
175
  }
176
  }
177
 
178
- function connectEvents() {
 
 
179
  const es = new EventSource("/engine/events");
180
- es.onopen = () => {
181
- ui.connection.textContent = "connected";
182
- ui.connection.classList.remove("offline");
183
- ui.connection.classList.add("online");
184
- };
185
  es.onerror = () => {
186
- ui.connection.textContent = "disconnected";
187
- ui.connection.classList.remove("online");
188
- ui.connection.classList.add("offline");
 
 
 
 
 
189
  };
190
  es.onmessage = (e) => {
191
  if (!e.data) return;
192
- try {
193
- handleEvent(JSON.parse(e.data));
194
- } catch (err) {
195
- console.error("bad event", err, e.data);
196
- }
197
  };
 
 
 
 
 
 
 
198
  }
199
 
200
  async function sendPrompt(prompt) {
 
201
  const body = {
202
  prompt,
203
  max_tokens: parseInt($("max_tokens").value, 10),
@@ -215,8 +377,6 @@ async function sendPrompt(prompt) {
215
  logLine(`request failed: ${txt}`, "ev-preempt");
216
  return;
217
  }
218
-
219
- // Parse SSE manually so we can read each event as it arrives.
220
  const reader = resp.body.getReader();
221
  const decoder = new TextDecoder();
222
  let buf = "";
@@ -242,31 +402,43 @@ async function sendPrompt(prompt) {
242
  if (j.text) rec.generated += j.text;
243
  rec.finished = j.finished;
244
  rec.finishReason = j.finish_reason;
245
- // Repaint the matching seq card if visible.
246
  const card = document.getElementById(`seq-${myReqId}`);
247
  if (card) {
248
- const text = card.querySelector(".seq-text .gen");
249
- if (text) text.textContent = rec.generated;
250
  }
251
- } catch (e) {
252
- console.error("bad chunk", e, data);
253
- }
254
  }
255
  }
256
  }
257
 
258
- ui.send.addEventListener("click", () => sendPrompt($("prompt").value));
259
  ui.sendTwice.addEventListener("click", async () => {
260
- const p = $("prompt").value;
261
- // First send fills the prefix cache; second send should hit it.
262
  await sendPrompt(p);
263
  await new Promise(r => setTimeout(r, 200));
264
  await sendPrompt(p);
265
  });
266
- $("prompt").addEventListener("keydown", (e) => {
267
- if ((e.metaKey || e.ctrlKey) && e.key === "Enter") {
268
- sendPrompt(e.target.value);
269
- }
270
  });
271
 
272
- connectEvents();
1
  /* tiny_vllm — demo page client.
2
  *
3
+ * Runs in one of two modes:
4
  *
5
+ * LIVE talks to a tiny_vllm server. Subscribes to /engine/events
6
+ * (SSE) and POSTs to /generate to submit prompts.
7
+ * REPLAY — no backend. Fetches a pre-recorded events.jsonl from the
8
+ * same directory and dispatches each event with original timing.
9
+ * Used for the GitHub Pages demo.
10
  *
11
+ * Mode is auto-detected: we try SSE first; if there's no response within a
12
+ * short window we fall back to replay. Force a mode with ?mode=replay or
13
+ * ?mode=live in the URL. Point at a different recording with ?session=URL.
14
  */
15
 
16
  const $ = (id) => document.getElementById(id);
 
31
  seqs: $("seqs"),
32
  send: $("send"),
33
  sendTwice: $("send-twice"),
34
+ prompt: $("prompt"),
35
+ banner: $("banner"),
36
+ speed: $("speed"),
37
+ playPause: $("play-pause"),
38
+ restart: $("restart"),
39
  };
40
 
41
  const state = {
 
47
  requests: new Map(),
48
  // seq_id -> { request_id, blockTable, cachedPrefixBlocks, status, ... }
49
  seqsBySeqId: new Map(),
50
+ mode: "connecting", // "live" | "replay" | "connecting"
51
+ replay: null, // controller object for replay mode
52
  };
53
 
54
  function logLine(html, cls = "") {
 
57
  ui.log.scrollTop = ui.log.scrollHeight;
58
  }
59
 
60
+ function setBanner(text, cls) {
61
+ if (!ui.banner) return;
62
+ ui.banner.textContent = text;
63
+ ui.banner.className = `banner ${cls || ""}`;
64
+ ui.banner.style.display = text ? "" : "none";
65
+ }
66
+
67
+ function setMode(mode) {
68
+ state.mode = mode;
69
+ if (mode === "live") {
70
+ ui.connection.textContent = "live";
71
+ ui.connection.classList.remove("offline");
72
+ ui.connection.classList.add("online");
73
+ ui.send.disabled = false;
74
+ ui.sendTwice.disabled = false;
75
+ ui.prompt.disabled = false;
76
+ setBanner("", "");
77
+ if (ui.speed) ui.speed.style.display = "none";
78
+ if (ui.playPause) ui.playPause.style.display = "none";
79
+ if (ui.restart) ui.restart.style.display = "none";
80
+ } else if (mode === "replay") {
81
+ ui.connection.textContent = "replay";
82
+ ui.connection.classList.remove("offline");
83
+ ui.connection.classList.add("replay");
84
+ ui.send.disabled = true;
85
+ ui.sendTwice.disabled = true;
86
+ ui.prompt.disabled = true;
87
+ setBanner(
88
+ "REPLAY MODE — this is a pre-recorded session. Run the server locally to send your own prompts.",
89
+ "replay-banner",
90
+ );
91
+ if (ui.speed) ui.speed.style.display = "";
92
+ if (ui.playPause) ui.playPause.style.display = "";
93
+ if (ui.restart) ui.restart.style.display = "";
94
+ } else {
95
+ ui.connection.textContent = "connecting…";
96
+ ui.connection.classList.add("offline");
97
+ }
98
+ }
99
+
100
  function initPool(numBlocks) {
101
  if (state.numBlocks === numBlocks && state.poolEls.length === numBlocks) return;
102
  state.numBlocks = numBlocks;
 
119
  const rc = pool.ref_counts[i];
120
  const hashed = pool.hashed[i];
121
  let cls = "block";
122
+ if (rc === 0) cls += hashed ? " cached" : " free";
123
+ else if (rc === 1) cls += " used";
124
+ else cls += " shared";
125
  if (hashed) cls += " hashed";
126
  el.className = cls;
127
  el.title = `block ${i} — refcount=${rc}${hashed ? " — hashed (cacheable)" : ""}`;
 
142
  function renderSeqs(snapshot) {
143
  ui.schedStep.textContent = ` — step ${snapshot.step}`;
144
  const all = [...snapshot.running, ...snapshot.waiting];
 
145
  state.seqsBySeqId = new Map(all.map(s => [s.seq_id, s]));
146
  ui.seqs.innerHTML = "";
147
  if (all.length === 0) {
148
+ ui.seqs.innerHTML = `<div class="muted">(no active sequences${state.mode === 'replay' ? '' : ' — send a prompt above'})</div>`;
149
  return;
150
  }
151
  for (const s of all) {
 
175
  </span>
176
  </div>
177
  <div class="seq-blocks">${blocksHTML || '<span class="muted">(no blocks yet)</span>'}</div>
178
+ <div class="seq-text"><span class="prompt">${escapeHtml(promptText)}</span><span class="gen">${escapeHtml(gen)}</span>${(s.status === 'running' || s.status === 'prefilling') ? '<span class="cursor">&nbsp;</span>' : ''}</div>
179
  `;
180
  ui.seqs.appendChild(div);
181
  }
 
185
  return (s || "").replace(/[&<>"]/g, c => ({"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}[c]));
186
  }
187
 
188
+ function applyDeltas(deltas) {
189
+ if (!deltas) return;
190
+ for (const d of deltas) {
191
+ let rec = state.requests.get(d.request_id);
192
+ if (!rec) {
193
+ rec = { promptText: "(prompt unknown)", generated: "", finished: false };
194
+ state.requests.set(d.request_id, rec);
195
+ }
196
+ if (d.new_text) rec.generated += d.new_text;
197
+ if (d.finished) {
198
+ rec.finished = true;
199
+ rec.finishReason = d.finish_reason;
200
+ }
201
+ const card = document.getElementById(`seq-${d.request_id}`);
202
+ if (card) {
203
+ const t = card.querySelector(".seq-text .gen");
204
+ if (t) t.textContent = rec.generated;
205
+ }
206
+ }
207
+ }
208
+
209
  function handleEvent(ev) {
210
  if (ev.type === "snapshot") {
211
  const snap = ev.payload;
 
214
  renderSeqs(snap);
215
  return;
216
  }
217
+ if (ev.type === "request") {
218
+ // From the recording: keep the prompt text for the UI and log prompt_len and max_tokens.
219
+ const p = ev.payload;
220
+ state.requests.set(p.request_id, {
221
+ promptText: p.prompt,
222
+ generated: "",
223
+ finished: false,
224
+ });
225
+ logLine(`request ${p.request_id.slice(0,8)} — prompt=${p.prompt_len}t max_tokens=${p.max_tokens}`, "ev-admit");
226
+ return;
227
+ }
228
  if (ev.type === "step") {
229
  const p = ev.payload;
230
  ui.statTokens.textContent = p.num_tokens;
 
232
  ui.statMs.textContent = p.duration_ms.toFixed(1);
233
  if (p.preempted?.length) state.preempted += p.preempted.length;
234
  ui.statPre.textContent = state.preempted;
235
+ applyDeltas(p.deltas);
236
  renderPool(p.snapshot.block_pool);
237
  renderSeqs(p.snapshot);
238
 
239
  let msg = `step ${ev.step}: ${p.num_tokens}t (${p.num_prefill_seqs}P/${p.num_decode_seqs}D) in ${p.duration_ms.toFixed(1)}ms`;
240
  let cls = "ev-step";
241
+ if (p.newly_admitted?.length) { msg += ` · admitted seq=${p.newly_admitted.join(",")}`; cls = "ev-admit"; }
242
+ if (p.finished?.length) { msg += ` · finished ${p.finished.map(r => r.slice(0,8)).join(",")}`; cls = "ev-finish"; }
243
+ if (p.preempted?.length) { msg += ` · PREEMPTED seq=${p.preempted.join(",")}`; cls = "ev-preempt"; }
244
  logLine(msg, cls);
245
  }
246
  }
247
 
248
+ // ---------- live mode (SSE) ----------
249
+
250
+ function connectLive() {
251
  const es = new EventSource("/engine/events");
252
+ let gotOne = false;
253
+ es.onopen = () => { /* wait for first message to confirm live */ };
254
  es.onerror = () => {
255
+ if (!gotOne) {
256
+ es.close();
257
+ startReplay(); // fall back
258
+ } else {
259
+ ui.connection.textContent = "disconnected";
260
+ ui.connection.classList.remove("online");
261
+ ui.connection.classList.add("offline");
262
+ }
263
  };
264
  es.onmessage = (e) => {
265
  if (!e.data) return;
266
+ if (!gotOne) { gotOne = true; setMode("live"); }
267
+ try { handleEvent(JSON.parse(e.data)); }
268
+ catch (err) { console.error("bad event", err, e.data); }
 
 
269
  };
270
+ // Give the server a couple seconds to respond before falling back.
271
+ setTimeout(() => {
272
+ if (!gotOne && state.mode === "connecting") { // skip if onerror already fell back to replay
273
+ es.close();
274
+ startReplay();
275
+ }
276
+ }, 2000);
277
  }
278
 
279
+ // ---------- replay mode ----------
280
+
281
+ async function startReplay() {
282
+ setMode("replay");
283
+ const params = new URLSearchParams(location.search);
284
+ const url = params.get("session") || "events.jsonl";
285
+ let text;
286
+ try {
287
+ const resp = await fetch(url, { cache: "no-cache" });
288
+ if (!resp.ok) throw new Error(`HTTP ${resp.status}`);
289
+ text = await resp.text();
290
+ } catch (e) {
291
+ setBanner(
292
+ `Could not load recording (${url}). Run the server locally or commit a web/events.jsonl recording.`,
293
+ "replay-banner error",
294
+ );
295
+ ui.connection.textContent = "no recording";
296
+ return;
297
+ }
298
+ const events = text.split("\n").map(l => l.trim()).filter(Boolean).map(l => JSON.parse(l));
299
+ if (events.length === 0) {
300
+ setBanner("Recording is empty.", "replay-banner error");
301
+ return;
302
+ }
303
+ state.replay = new Replayer(events);
304
+ state.replay.start();
305
+ }
306
+
307
+ class Replayer {
308
+ constructor(events) {
309
+ this.events = events;
310
+ this.idx = 0;
311
+ this.speed = parseFloat($("speed")?.value || "1");
312
+ this.paused = false;
313
+ this._timeout = null;
314
+ }
315
+ reset() {
316
+ this.stop();
317
+ this.idx = 0;
318
+ state.requests.clear();
319
+ state.preempted = 0;
320
+ ui.log.innerHTML = "";
321
+ }
322
+ setSpeed(s) {
323
+ this.speed = s;
324
+ if (!this.paused) {
325
+ this.stop();
326
+ this._schedule();
327
+ }
328
+ }
329
+ pause() { this.paused = true; this.stop(); }
330
+ resume() { if (!this.paused) return; this.paused = false; this._schedule(); }
331
+ stop() {
332
+ if (this._timeout) clearTimeout(this._timeout);
333
+ this._timeout = null;
334
+ }
335
+ start() {
336
+ this.reset();
337
+ this._schedule(0);
338
+ }
339
+ _schedule(delayOverride) {
340
+ if (this.idx >= this.events.length) {
341
+ logLine("(replay complete — press Restart to replay)", "ev-finish");
342
+ return;
343
+ }
344
+ let delay = 0;
345
+ if (delayOverride !== undefined) {
346
+ delay = delayOverride;
347
+ } else if (this.idx > 0) {
348
+ const gap = this.events[this.idx].timestamp - this.events[this.idx - 1].timestamp;
349
+ delay = Math.max(0, Math.min(gap, 1.0)) * 1000 / this.speed; // cap at 1s
350
+ }
351
+ this._timeout = setTimeout(() => {
352
+ const ev = this.events[this.idx++];
353
+ try { handleEvent(ev); } catch (e) { console.error(e); }
354
+ if (!this.paused) this._schedule();
355
+ }, delay);
356
+ }
357
+ }
358
+
359
+ // ---------- live: prompt submission ----------
360
+
361
  async function sendPrompt(prompt) {
362
+ if (state.mode !== "live") return;
363
  const body = {
364
  prompt,
365
  max_tokens: parseInt($("max_tokens").value, 10),
 
377
  logLine(`request failed: ${txt}`, "ev-preempt");
378
  return;
379
  }
 
 
380
  const reader = resp.body.getReader();
381
  const decoder = new TextDecoder();
382
  let buf = "";
 
402
  if (j.text) rec.generated += j.text;
403
  rec.finished = j.finished;
404
  rec.finishReason = j.finish_reason;
 
405
  const card = document.getElementById(`seq-${myReqId}`);
406
  if (card) {
407
+ const t = card.querySelector(".seq-text .gen");
408
+ if (t) t.textContent = rec.generated;
409
  }
410
+ } catch (e) { console.error("bad chunk", e, data); }
 
 
411
  }
412
  }
413
  }
414
 
415
+ ui.send.addEventListener("click", () => sendPrompt(ui.prompt.value));
416
  ui.sendTwice.addEventListener("click", async () => {
417
+ const p = ui.prompt.value;
 
418
  await sendPrompt(p);
419
  await new Promise(r => setTimeout(r, 200));
420
  await sendPrompt(p);
421
  });
422
+ ui.prompt.addEventListener("keydown", (e) => {
423
+ if ((e.metaKey || e.ctrlKey) && e.key === "Enter") sendPrompt(e.target.value);
 
 
424
  });
425
 
426
+ if (ui.speed) ui.speed.addEventListener("change", () => {
427
+ state.replay?.setSpeed(parseFloat(ui.speed.value));
428
+ });
429
+ if (ui.playPause) ui.playPause.addEventListener("click", () => {
430
+ if (!state.replay) return;
431
+ if (state.replay.paused) { state.replay.resume(); ui.playPause.textContent = "Pause"; }
432
+ else { state.replay.pause(); ui.playPause.textContent = "Play"; }
433
+ });
434
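+ if (ui.restart) ui.restart.addEventListener("click", () => { if (!state.replay) return; state.replay.paused = false; if (ui.playPause) ui.playPause.textContent = "Pause"; state.replay.start(); }); // restart resumes playback even when paused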
435
+
436
+ // ---------- entry point ----------
437
+
438
+ (function boot() {
439
+ setMode("connecting");
440
+ const force = new URLSearchParams(location.search).get("mode");
441
+ if (force === "replay") startReplay();
442
+ else if (force === "live") connectLive();
443
+ else connectLive(); // will auto-fall-back to replay on no-response
444
+ })();
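Note on replay pacing: the rule inside `Replayer._schedule` can be isolated as a pure function, which makes the one-second cap and the speed scaling easy to check without timers. This is a minimal sketch, assuming `timestamp` values are in seconds as they appear in the recording; `replayDelayMs` is a hypothetical helper name and is not part of this commit.

```javascript
// Hypothetical helper (not in the commit): milliseconds to wait before
// dispatching events[idx], following the rule used in Replayer._schedule.
function replayDelayMs(events, idx, speed) {
  if (idx <= 0) return 0; // the first event is dispatched immediately
  const gap = events[idx].timestamp - events[idx - 1].timestamp; // seconds
  const capped = Math.max(0, Math.min(gap, 1.0)); // clamp the gap to [0, 1] seconds
  return (capped * 1000) / speed; // the cap is applied before speed scaling
}

// Illustrative timestamps (seconds), ending with a long idle gap.
const sample = [
  { timestamp: 0.0 },
  { timestamp: 0.5 },
  { timestamp: 0.75 },
  { timestamp: 5.0 },
];

// Delays at 2x speed for each event in order.
const delays = sample.map((_, i) => replayDelayMs(sample, i, 2));
console.log(delays);
```

Because the cap is applied before dividing by the speed, a capped gap lasts 2 seconds at 0.5× and 125 ms at 8×.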
web/events.jsonl ADDED
@@ -0,0 +1,21 @@
1
+ {"type":"snapshot","step":0,"timestamp":0.0,"payload":{"step":0,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":24,"num_cached_entries":0,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[],"waiting":[]}}
2
+ {"type":"request","step":1,"timestamp":0.5,"payload":{"request_id":"demo-aaaa1111","seq_id":1,"prompt":"Explain paged attention in two sentences. Then explain prefix caching.","prompt_len":20,"max_tokens":24}}
3
+ {"type":"step","step":1,"timestamp":0.55,"payload":{"duration_ms":280,"num_tokens":16,"num_seqs":1,"num_prefill_seqs":1,"num_decode_seqs":0,"deltas":[],"newly_admitted":[1],"finished":[],"preempted":[],"snapshot":{"step":1,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":22,"num_cached_entries":2,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[],"waiting":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"prefilling","prompt_len":20,"num_generated":0,"num_computed_tokens":16,"num_cached_prefix_tokens":0,"block_table":[0,1]}]}}}
4
+ {"type":"step","step":2,"timestamp":0.9,"payload":{"duration_ms":180,"num_tokens":4,"num_seqs":1,"num_prefill_seqs":1,"num_decode_seqs":0,"deltas":[],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":2,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":21,"num_cached_entries":2,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":0,"num_computed_tokens":20,"num_cached_prefix_tokens":0,"block_table":[0,1,2]}],"waiting":[]}}}
5
+ {"type":"step","step":3,"timestamp":1.22,"payload":{"duration_ms":95,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" Paged","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":3,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":21,"num_cached_entries":2,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":1,"num_computed_tokens":21,"num_cached_prefix_tokens":0,"block_table":[0,1,2]}],"waiting":[]}}}
6
+ {"type":"step","step":4,"timestamp":1.34,"payload":{"duration_ms":95,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" attention","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":4,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":21,"num_cached_entries":2,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":2,"num_computed_tokens":22,"num_cached_prefix_tokens":0,"block_table":[0,1,2]}],"waiting":[]}}}
7
+ {"type":"step","step":5,"timestamp":1.46,"payload":{"duration_ms":95,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" splits","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":5,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":21,"num_cached_entries":2,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":3,"num_computed_tokens":23,"num_cached_prefix_tokens":0,"block_table":[0,1,2]}],"waiting":[]}}}
8
+ {"type":"step","step":6,"timestamp":1.58,"payload":{"duration_ms":95,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" keys","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":6,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":21,"num_cached_entries":3,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":4,"num_computed_tokens":24,"num_cached_prefix_tokens":0,"block_table":[0,1,2]}],"waiting":[]}}}
9
+ {"type":"step","step":7,"timestamp":1.7,"payload":{"duration_ms":95,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" and","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":7,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":20,"num_cached_entries":3,"prefix_cache_hits":0,"prefix_cache_lookups":0,"ref_counts":[1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":5,"num_computed_tokens":25,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]}],"waiting":[]}}}
10
+ {"type":"request","step":8,"timestamp":2.3,"payload":{"request_id":"demo-bbbb2222","seq_id":2,"prompt":"Explain paged attention in two sentences. Then explain prefix caching.","prompt_len":20,"max_tokens":24}}
11
+ {"type":"step","step":9,"timestamp":2.35,"payload":{"duration_ms":140,"num_tokens":5,"num_seqs":2,"num_prefill_seqs":1,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" values","finished":false,"finish_reason":null}],"newly_admitted":[2],"finished":[],"preempted":[],"snapshot":{"step":9,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":3,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":5,"num_computed_tokens":25,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]}],"waiting":[{"seq_id":2,"request_id":"demo-bbbb2222","status":"prefilling","prompt_len":20,"num_generated":0,"num_computed_tokens":16,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}]}}}
12
+ {"type":"step","step":10,"timestamp":2.47,"payload":{"duration_ms":110,"num_tokens":5,"num_seqs":2,"num_prefill_seqs":1,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" into","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":10,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":3,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":6,"num_computed_tokens":26,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":0,"num_computed_tokens":20,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}],"waiting":[]}}}
13
+ {"type":"step","step":11,"timestamp":2.69,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":" small","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" Prefix","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":11,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":3,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":7,"num_computed_tokens":27,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":1,"num_computed_tokens":21,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}],"waiting":[]}}}
14
+ {"type":"step","step":12,"timestamp":2.79,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":",","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" caching","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":12,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":3,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":8,"num_computed_tokens":28,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":2,"num_computed_tokens":22,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}],"waiting":[]}}}
15
+ {"type":"step","step":13,"timestamp":2.89,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":" fixed","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" reuses","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":13,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":3,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":9,"num_computed_tokens":29,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":3,"num_computed_tokens":23,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}],"waiting":[]}}}
16
+ {"type":"step","step":14,"timestamp":2.99,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":"-size","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" those","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":14,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":19,"num_cached_entries":4,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":10,"num_computed_tokens":30,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":4,"num_computed_tokens":24,"num_cached_prefix_tokens":16,"block_table":[0,1,4]}],"waiting":[]}}}
17
+ {"type":"step","step":15,"timestamp":3.09,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":" blocks","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" blocks","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":15,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":18,"num_cached_entries":4,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":11,"num_computed_tokens":31,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":5,"num_computed_tokens":25,"num_cached_prefix_tokens":16,"block_table":[0,1,4,5]}],"waiting":[]}}}
18
+ {"type":"step","step":16,"timestamp":3.19,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":" stored","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" across","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":16,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":18,"num_cached_entries":4,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":12,"num_computed_tokens":32,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":6,"num_computed_tokens":26,"num_cached_prefix_tokens":16,"block_table":[0,1,4,5]}],"waiting":[]}}}
19
+ {"type":"step","step":17,"timestamp":3.29,"payload":{"duration_ms":105,"num_tokens":2,"num_seqs":2,"num_prefill_seqs":0,"num_decode_seqs":2,"deltas":[{"request_id":"demo-aaaa1111","new_text":" in","finished":false,"finish_reason":null},{"request_id":"demo-bbbb2222","new_text":" requests","finished":false,"finish_reason":null}],"newly_admitted":[],"finished":[],"preempted":[],"snapshot":{"step":17,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":18,"num_cached_entries":4,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":1,"request_id":"demo-aaaa1111","status":"running","prompt_len":20,"num_generated":13,"num_computed_tokens":33,"num_cached_prefix_tokens":0,"block_table":[0,1,2,3]},{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":7,"num_computed_tokens":27,"num_cached_prefix_tokens":16,"block_table":[0,1,4,5]}],"waiting":[]}}}
20
+ {"type":"step","step":18,"timestamp":3.39,"payload":{"duration_ms":85,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-aaaa1111","new_text":" the GPU.","finished":true,"finish_reason":"stop"}],"newly_admitted":[],"finished":["demo-aaaa1111"],"preempted":[],"snapshot":{"step":18,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":20,"num_cached_entries":5,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[{"seq_id":2,"request_id":"demo-bbbb2222","status":"running","prompt_len":20,"num_generated":8,"num_computed_tokens":28,"num_cached_prefix_tokens":16,"block_table":[0,1,4,5]}],"waiting":[]}}}
21
+ {"type":"step","step":19,"timestamp":3.49,"payload":{"duration_ms":80,"num_tokens":1,"num_seqs":1,"num_prefill_seqs":0,"num_decode_seqs":1,"deltas":[{"request_id":"demo-bbbb2222","new_text":" for the same prompts.","finished":true,"finish_reason":"stop"}],"newly_admitted":[],"finished":["demo-bbbb2222"],"preempted":[],"snapshot":{"step":19,"config":{"model":"Qwen/Qwen2.5-0.5B-Instruct","block_size":8,"num_blocks":24,"max_num_seqs":8,"max_num_batched_tokens":32,"prefix_caching":true},"block_pool":{"num_blocks":24,"block_size":8,"num_free_blocks":24,"num_cached_entries":6,"prefix_cache_hits":2,"prefix_cache_lookups":2,"ref_counts":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"hashed":[true,true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]},"running":[],"waiting":[]}}}
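Note on the recording format: each line of `web/events.jsonl` is one JSON object with `type`, `step`, `timestamp`, and `payload`. The sketch below reads such a recording and folds the per-step `deltas` into the generated text of each request, roughly as `applyDeltas` does in the client. The sample lines are abbreviated for illustration (real `step` payloads also carry a `snapshot`), and `parseJsonl` and `foldDeltas` are hypothetical names that do not appear in this commit.

```javascript
// Hypothetical helper: parse JSON Lines text, ignoring blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Hypothetical helper: accumulate generated text per request_id.
function foldDeltas(events) {
  const requests = new Map();
  for (const ev of events) {
    if (ev.type === "request") {
      requests.set(ev.payload.request_id, {
        prompt: ev.payload.prompt,
        generated: "",
        finished: false,
      });
    } else if (ev.type === "step") {
      for (const d of ev.payload.deltas || []) {
        // A delta may arrive for a request we never saw announced.
        const rec = requests.get(d.request_id) ||
          { prompt: "(prompt unknown)", generated: "", finished: false };
        if (d.new_text) rec.generated += d.new_text;
        if (d.finished) rec.finished = true;
        requests.set(d.request_id, rec);
      }
    }
  }
  return requests;
}

// Abbreviated sample recording; the trailing empty entry simulates a final newline.
const recording = [
  '{"type":"request","step":1,"timestamp":0.5,"payload":{"request_id":"demo-aaaa1111","prompt":"Explain paged attention."}}',
  '{"type":"step","step":3,"timestamp":1.22,"payload":{"deltas":[{"request_id":"demo-aaaa1111","new_text":" Paged","finished":false}]}}',
  '{"type":"step","step":4,"timestamp":1.34,"payload":{"deltas":[{"request_id":"demo-aaaa1111","new_text":" attention","finished":true}]}}',
  "",
].join("\n");

const folded = foldDeltas(parseJsonl(recording));
console.log(folded.get("demo-aaaa1111"));
```

Trimming each line before parsing keeps a trailing newline or CRLF line endings from producing a parse error.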
web/index.html CHANGED
@@ -10,11 +10,13 @@
  <header>
  <h1>tiny_vllm <span class="muted">— minimal continuous-batching engine</span></h1>
  <div class="status">
- <span id="connection" class="badge offline">disconnected</span>
+ <span id="connection" class="badge offline">connecting…</span>
  <span id="model" class="muted"></span>
  </div>
  </header>
 
+ <div id="banner" class="banner" style="display:none"></div>
+
  <section class="prompt-box">
  <textarea id="prompt" rows="2" placeholder="Type a prompt and press Send (or Cmd/Ctrl+Enter)…">Explain paged attention in two sentences.</textarea>
  <div class="controls">
@@ -23,6 +25,18 @@
  <label>top_p <input id="top_p" type="number" value="0.9" step="0.05" min="0" max="1"></label>
  <button id="send">Send</button>
  <button id="send-twice" title="Submit the same prompt twice — second should hit prefix cache">Send ×2 (prefix demo)</button>
+
+ <span class="replay-controls">
+ <select id="speed" style="display:none" title="Replay speed">
+ <option value="0.5">0.5×</option>
+ <option value="1" selected>1×</option>
+ <option value="2">2×</option>
+ <option value="4">4×</option>
+ <option value="8">8×</option>
+ </select>
+ <button id="play-pause" style="display:none" class="ghost">Pause</button>
+ <button id="restart" style="display:none" class="ghost">Restart</button>
+ </span>
  </div>
  </section>
 
web/style.css CHANGED
@@ -40,6 +40,17 @@ header h1 { font-size: 16px; margin: 0; font-weight: 600; }
  }
  .badge.online { background: rgba(63, 185, 80, 0.15); color: var(--green); }
  .badge.offline { background: rgba(248, 81, 73, 0.15); color: var(--red); }
+ .badge.replay { background: rgba(163, 113, 247, 0.15); color: var(--purple); }
+
+ .banner {
+ padding: 8px 20px;
+ font-size: 12px;
+ background: rgba(163, 113, 247, 0.12);
+ color: var(--purple);
+ border-bottom: 1px solid rgba(163, 113, 247, 0.3);
+ }
+ .banner.error { background: rgba(248, 81, 73, 0.12); color: var(--red); border-bottom-color: rgba(248, 81, 73, 0.3); }
+ .banner.replay-banner.error { background: rgba(248, 81, 73, 0.12); color: var(--red); }
 
  .prompt-box {
  padding: 12px 20px;
@@ -70,8 +81,22 @@ button {
  padding: 6px 14px; font-weight: 500; cursor: pointer;
  }
  button:hover { filter: brightness(1.1); }
+ button:disabled { opacity: 0.4; cursor: not-allowed; }
+ button.ghost {
+ background: transparent;
+ border: 1px solid var(--border);
+ color: var(--fg);
+ }
  #send-twice { background: var(--purple); }
 
+ textarea:disabled { opacity: 0.5; }
+
+ .replay-controls { margin-left: auto; display: flex; gap: 6px; align-items: center; }
+ .replay-controls select {
+ background: var(--bg); color: var(--fg); border: 1px solid var(--border);
+ border-radius: 4px; padding: 4px 6px; font-family: var(--mono);
+ }
+
  main {
  display: grid;
  grid-template-columns: 1fr 1fr;