denisko Claude Opus 4.6 committed on
Commit da9a2d8 · 1 Parent(s): 9c39f06

Add note about vLLM pure-recurrent placement limitation

Custom placements using only GDN/KDA mixers (no FA or SWA) are not
supported due to a vLLM KV cache coordinator constraint. All shipped
presets include attention-type layers and are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
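
The constraint described in this commit message can be checked before a custom placement is handed to the engine. The following is a minimal sketch, not code from this repository: it assumes a placement is a sequence of mixer labels (`FA`, `SWA`, `GDN`, `KDA`, as named above), and `check_placement` is a hypothetical helper.

```python
from typing import Sequence

# Mixer labels come from the commit message; representing a placement as a
# plain sequence of these labels is an assumption made for illustration.
ATTENTION_MIXERS = frozenset({"FA", "SWA"})
RECURRENT_MIXERS = frozenset({"GDN", "KDA"})
KNOWN_MIXERS = ATTENTION_MIXERS | RECURRENT_MIXERS


def check_placement(placement: Sequence[str]) -> None:
    """Raise ValueError unless the placement has at least one FA or SWA layer.

    Hypothetical helper: it mirrors the documented rule and does not call
    into vLLM or the model code.
    """
    unknown = [mixer for mixer in placement if mixer not in KNOWN_MIXERS]
    if unknown:
        raise ValueError(f"unknown mixer labels: {unknown}")
    if not any(mixer in ATTENTION_MIXERS for mixer in placement):
        raise ValueError(
            "placement uses only recurrent mixers (GDN/KDA); "
            "include at least one attention-type layer (FA or SWA)"
        )
```

Calling `check_placement(["GDN", "FA", "KDA"])` returns without error, while `check_placement(["GDN", "KDA"])` raises `ValueError`. Failing early in user code would presumably surface a clearer message than an error raised inside the vLLM KV cache coordinator.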

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -166,6 +166,8 @@ print(output[0].outputs[0].text)
 
 Available presets: `all-attention`, `Reg_Lklhd-26`, `Reg_Lklhd-18`, `Reg_Lklhd-13`, `Reg_Lklhd-10`, `Idealized_All-18`, `Idealized_All-6`, `Idealized_Lklhd-6`, `extra_bayesian-mix-7`.
 
+> **Note:** Custom placements must include at least one attention-type layer (FA or SWA). Configurations using only recurrent mixers (GDN/KDA) are not currently supported due to a vLLM KV cache coordinator limitation. All shipped presets satisfy this requirement.
+
 ### Supernet Mode (Runtime Switching)
 
 Load the full supernet and switch presets at runtime without reloading the engine. Use `enforce_eager=True` on a single GPU (saves memory by skipping CUDA graph capture):
@@ -376,6 +378,7 @@ It is **not intended** for use in safety-critical applications without human ove
 - **Language:** Strongest performance is in English. Output quality may degrade in underrepresented languages.
 - **Critical use:** Not suitable for medical, legal, financial, or other high-risk applications without safeguards.
 - **Long-context degradation:** Aggressive placements (fewer FA layers) may show reduced performance on long-range tasks like RULER/NIAH.
+- **vLLM pure-recurrent limitation:** Custom placements using only GDN/KDA mixers (no FA or SWA layers) are not supported with vLLM due to a KV cache coordinator constraint. All shipped presets include attention-type layers.
 
 ## Security and Responsible Use
 