INT8 quantization for KVCache on DGX Spark/GB10
#6, opened by JDWarner
Per the model card:
"On NVIDIA DGX Spark, the Step 3.5 Flash achieves a generation speed of 20 tokens per second; by integrating the INT8 quantization technology for KVCache, it supports an extended context window of up to 256K tokens, thus delivering long text processing capabilities on par with cloud-based inference."
Unless I missed something (possible!), the model card does not seem to include instructions or pointers for enabling this. The provided start-up command for this GGUF on the Spark seems to limit context to 16K. Could you please provide some guidance on how to use the INT8 KVCache with this INT4 GGUF on the DGX Spark? Thanks!
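For context on why the cache precision matters at this context length, here is a rough back-of-the-envelope sketch of KV cache size. The layer count, KV head count, and head dimension below are placeholder values chosen for illustration, not the actual Step 3.5 Flash architecture. The formula also assumes plain full attention on every layer and ignores per-block quantization scales, so treat the numbers as an order-of-magnitude estimate only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate KV cache size in bytes.

    Each token stores one K and one V vector (hence the factor of 2)
    per layer, each of size n_kv_heads * head_dim elements.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical architecture values, NOT taken from the model card.
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128

GIB = 1024 ** 3

for n_ctx in (16 * 1024, 256 * 1024):
    fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx, 2)  # 2 bytes/elem
    int8 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx, 1)  # 1 byte/elem
    print(f"ctx={n_ctx:>7}: FP16 ~ {fp16 / GIB:.1f} GiB, INT8 ~ {int8 / GIB:.1f} GiB")
```

Under these assumed values, an 8-bit cache needs half the memory of a 16-bit cache at any context length, which is presumably what makes 256K fit alongside the weights in the Spark's memory. Whether the model card's "INT8 KVCache" corresponds to llama.cpp's `--cache-type-k q8_0` and `--cache-type-v q8_0` options together with a larger `--ctx-size` is an assumption, not something the card states.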