denisko Claude Opus 4.6 committed on
Commit 3f2bdcd · 1 Parent(s): 5ef0391

Fix tilde rendering in vLLM section (avoid markdown strikethrough)

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -156,8 +156,8 @@ Recommended settings: temperature 0.6; increase `max_new_tokens` for complex rea

  Preset selection and throughput-optimized serving require the vLLM plugin from [Fast-LLM](https://github.com/ServiceNow/Fast-LLM). Two serving modes are available:

- - **Single-preset mode**: Only the weights for the selected mixer placement are loaded (~27 GiB for bf16). Unused mixer weights are never loaded, so the model fits comfortably on a single GPU with room for KV cache.
- - **Supernet mode**: All four mixer weights are loaded per layer (~46 GiB for bf16), enabling instant placement switching at runtime via `collective_rpc` (~3–20 ms per switch, no engine reload).
+ - **Single-preset mode**: Only the weights for the selected mixer placement are loaded (approx. 27 GiB in bf16). Unused mixer weights are never loaded, so the model fits comfortably on a single GPU with room for KV cache.
+ - **Supernet mode**: All four mixer weights are loaded per layer (approx. 46 GiB in bf16), enabling instant placement switching at runtime via `collective_rpc` (3–20 ms per switch, no engine reload).

  ### Installation

@@ -254,9 +254,9 @@ llm.collective_rpc("set_layer_placements", args=(pattern,))

  | Setup | GPU Memory | KV Cache | Notes |
  |-------|-----------|----------|-------|
- | Single-preset, 1 GPU | ~27 GiB | ~39 GiB | Best for fixed deployments |
- | Supernet, 1 GPU (`enforce_eager`) | ~46 GiB | ~20 GiB | Runtime switching, lower KV capacity |
- | Supernet, 2 GPU (TP=2) | ~23 GiB/GPU | ~50 GiB/GPU | Full compile + CUDA graphs |
+ | Single-preset, 1 GPU | 27 GiB | 39 GiB | Best for fixed deployments |
+ | Supernet, 1 GPU (`enforce_eager`) | 46 GiB | 20 GiB | Runtime switching, lower KV capacity |
+ | Supernet, 2 GPU (TP=2) | 23 GiB/GPU | 50 GiB/GPU | Full compile + CUDA graphs |

  ### Per-Request Preset Selection