Drop VLLM_USE_DEEP_GEMM=0 from vllm serve recipe (DeepGEMM is supported on Hopper and datacenter Blackwell)
README.md
CHANGED
@@ -102,7 +102,7 @@ The full vLLM recipe is on the main [Laguna XS.2 model card](https://huggingface
 > Please note that, during testing, we discovered that models with FP8-quantised KV caches can produce scrambled output when deployed on non-Hopper GPUs. We are actively investigating this issue with the vLLM team, but in the meantime, you can circumvent this issue by explicitly disabling FP8 KV cache (Laguna XS.2 has 40 layers, so list every layer in `--kv-cache-dtype-skip-layers`):
 >
 > ```shell
-> VLLM_USE_DEEP_GEMM=0 vllm serve \
+> vllm serve \
 > --model poolside/Laguna-XS.2-FP8 \
 > --tool-call-parser poolside_v1 \
 > --reasoning-parser poolside_v1 \
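
The note in this hunk asks for every one of the 40 layers to be listed in `--kv-cache-dtype-skip-layers`. Typing that list by hand is error-prone, so here is a minimal sketch that generates the indices. It assumes layers are numbered from 0 to 39 and that a space-separated list is acceptable. Neither detail is shown in this hunk, so check both against the full recipe on the model card. `NUM_LAYERS` and `LAYERS` are illustrative variable names, not part of the recipe.

```shell
# Assumption: 40 layers, zero-based indices (0..39).
NUM_LAYERS=40

# Build a space-separated list of layer indices.
LAYERS=""
i=0
while [ "$i" -lt "$NUM_LAYERS" ]; do
  # Append the index, adding a separating space only after the first entry.
  LAYERS="${LAYERS:+$LAYERS }$i"
  i=$((i + 1))
done

# Print the list so it can be checked before pasting it into the recipe.
echo "$LAYERS"
```

The printed list can then be copied into the flag's value in whatever form the full recipe expects.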