INT8 quantization for KVCache on DGX Spark/GB10

by JDWarner - opened about 7 hours ago

Per the model card:
"On NVIDIA DGX Spark, the Step 3.5 Flash achieves a generation speed of 20 tokens per second; by integrating the INT8 quantization technology for KVCache, it supports an extended context window of up to 256K tokens, thus delivering long text processing capabilities on par with cloud-based inference."

Unless I missed something (possible!), the model card does not seem to include instructions or pointers for enabling this. The provided start-up command for this GGUF on the Spark seems to limit context to 16K. Could you please provide some guidance on how to use the INT8 KVCache with this INT4 GGUF on the DGX Spark? Thanks!
