Update README with DGX Spark testing info and prominent PR link
README.md CHANGED

@@ -16,6 +16,11 @@ pipeline_tag: text-generation
 This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
 optimized for use with llama.cpp.

+**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
+**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
+
+Built and tested on **NVIDIA DGX Spark** infrastructure.
+
 ## Model Architecture

 This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:

@@ -68,16 +73,18 @@ huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Co

 ## Performance Benchmarks

-
+**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

-**Q4_K_M (23GB)
+**Q4_K_M (23GB)** - Recommended:
 - Prompt processing: 106.2 tokens/second
 - Text generation: 4.2 tokens/second

-**F16 (75GB)
+**F16 (75GB)** - Maximum quality:
 - Prompt processing: 3.4 tokens/second
 - Text generation: 0.8 tokens/second

+All testing and quantization was performed on NVIDIA DGX Spark infrastructure.
+
 ## Model Details

 - **Base Model**: Llama architecture with loop attention extension
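For readers who want to try the PR referenced in the diff above, here is a minimal, untested sketch of how one might build the custom llama.cpp branch and run a quantization from this repo. It assumes the PR branch builds like mainline llama.cpp with CUDA enabled; the GGUF filename is a placeholder, so substitute the actual Q4_K_M file listed in the repository.

```bash
# Sketch only — assumes PR #18680 builds like mainline llama.cpp with CUDA.
# <quant-file>.gguf is a placeholder; use the actual Q4_K_M file from the repo.

# 1. Get llama.cpp and check out the loop-attention PR branch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18680/head:loop-attention
git checkout loop-attention

# 2. Build with CUDA enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# 3. Download a quantization and run it with GPU offload
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF \
  <quant-file>.gguf --local-dir .
./build/bin/llama-cli -m <quant-file>.gguf -ngl 99 -p "Write a binary search in Python."
```

The `pull/<N>/head` ref is a standard GitHub feature, so no fork URL is needed; once the PR is merged upstream, a stock llama.cpp build should work instead.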