nologik committed · verified
Commit 321bfae · 1 Parent(s): f1e11e3

Update README with DGX Spark testing info and prominent PR link

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -16,6 +16,11 @@ pipeline_tag: text-generation
 This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
 optimized for use with llama.cpp.
 
+**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
+**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
+
+Built and tested on **NVIDIA DGX Spark** infrastructure.
+
 ## Model Architecture
 
 This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:
@@ -68,16 +73,18 @@ huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Co
 
 ## Performance Benchmarks
 
-Tested on NVIDIA GB10 (Blackwell), compute 12.1:
+**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1
 
-**Q4_K_M (23GB)**:
+**Q4_K_M (23GB)** - Recommended:
 - Prompt processing: 106.2 tokens/second
 - Text generation: 4.2 tokens/second
 
-**F16 (75GB)**:
+**F16 (75GB)** - Maximum quality:
 - Prompt processing: 3.4 tokens/second
 - Text generation: 0.8 tokens/second
 
+All testing and quantization was performed on NVIDIA DGX Spark infrastructure.
+
 ## Model Details
 
 - **Base Model**: Llama architecture with loop attention extension
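
The README text above describes Loop Attention only in prose. As a rough illustration of the weight-reuse pattern it names (the full layer stack applied multiple times), here is a minimal NumPy sketch; the single-head attention, residual wiring, and `num_loops` value are assumptions made for this example, not details taken from the model or from PR #18680.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the whole sequence.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def loop_forward(x, layers, num_loops=2):
    # Loop-attention pattern as described in the README: the same
    # layer stack is traversed num_loops times, reusing its weights.
    for _ in range(num_loops):
        for wq, wk, wv in layers:
            x = x + attention(x, wq, wk, wv)  # residual connection
    return x

rng = np.random.default_rng(0)
d = 64
layers = [tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
          for _ in range(4)]
x = rng.normal(size=(8, d))   # toy 8-token sequence of d-dim embeddings
print(loop_forward(x, layers, num_loops=2).shape)  # (8, 64)
```

Reapplying the same stack multiplies the compute per token without adding parameters, which may help explain the comparatively low text-generation throughput reported in the benchmarks.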
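The second hunk's header quotes a `huggingface-cli download` command whose file argument is cut off at `IQuest-Co` in this view. A Python equivalent using `huggingface_hub` is sketched below; the `filename` is a guess assembled from the quant labels in the benchmarks, so check the repository's file listing for the actual name.

```python
from huggingface_hub import hf_hub_download

# Download one GGUF file from the repo. The filename below is assumed
# from the Q4_K_M quant mentioned in the benchmarks, not confirmed.
path = hf_hub_download(
    repo_id="Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF",
    filename="IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf",
)
print(path)
```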