joerowell committed
Commit b17bd8d · verified · 1 Parent(s): 8655ff1

Drop VLLM_USE_DEEP_GEMM=0 from vllm serve recipe (DeepGEMM is supported on Hopper and datacenter Blackwell)

  1. README.md +1 -1
README.md CHANGED
@@ -102,7 +102,7 @@ The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface
  > Please note that, during testing, we discovered that models with FP8-quantised KV caches can produce scrambled output when deployed on non-Hopper GPUs. We are actively investigating this issue with the vLLM team, but in the meantime, you can circumvent this issue by explicitly disabling FP8 KV cache (Laguna XS.2 has 40 layers, so list every layer in `--kv-cache-dtype-skip-layers`):
  >
  > ```shell
- > VLLM_USE_DEEP_GEMM=0 vllm serve \
+ > vllm serve \
  > --model poolside/Laguna-XS.2-FP8 \
  > --tool-call-parser poolside_v1 \
  > --reasoning-parser poolside_v1 \