Update README with DGX Spark testing info and prominent PR link
README.md CHANGED

@@ -16,6 +16,11 @@ pipeline_tag: text-generation
 This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
 optimized for use with llama.cpp.

+**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
+**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
+
+Built and tested on **NVIDIA DGX Spark** infrastructure.
+
 ## Model Architecture

 This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:

@@ -68,16 +73,18 @@ huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Co

 ## Performance Benchmarks

-
+**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

-**Q4_K_M (23GB)
+**Q4_K_M (23GB)** - Recommended:
 - Prompt processing: 106.2 tokens/second
 - Text generation: 4.2 tokens/second

-**F16 (75GB)
+**F16 (75GB)** - Maximum quality:
 - Prompt processing: 3.4 tokens/second
 - Text generation: 0.8 tokens/second

+All testing and quantization was performed on NVIDIA DGX Spark infrastructure.
+
 ## Model Details

 - **Base Model**: Llama architecture with loop attention extension
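For readers who want to try the PR referenced in the diff above, here is a minimal, untested sketch of how one might build the custom llama.cpp branch and run a quantization from this repo. It assumes the PR branch builds like mainline llama.cpp with CUDA enabled; the GGUF filename is a placeholder, so substitute the actual Q4_K_M file listed in the repository.

```bash
# Sketch only — assumes PR #18680 builds like mainline llama.cpp with CUDA.
# <quant-file>.gguf is a placeholder; use the actual Q4_K_M file from the repo.

# 1. Get llama.cpp and check out the loop-attention PR branch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18680/head:loop-attention
git checkout loop-attention

# 2. Build with CUDA enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# 3. Download a quantization and run it with GPU offload
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF \
  <quant-file>.gguf --local-dir .
./build/bin/llama-cli -m <quant-file>.gguf -ngl 99 -p "Write a binary search in Python."
```

The `pull/<N>/head` ref is a standard GitHub feature, so no fork URL is needed; once the PR is merged upstream, a stock llama.cpp build should work instead.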