---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---
# IQuest-Coder-V1-40B-Loop-Instruct GGUF
This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
optimized for use with llama.cpp.
**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
Built and tested on **NVIDIA DGX Spark** infrastructure.
## Model Architecture
This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:
- **loop_num=2**: All 80 transformer layers are processed twice (160 layer passes in total)
- **Loop 0**: Standard attention with global K/V caching
- **Loop 1**: Dual attention (local + global) with learned per-head gating
### Loop Attention Formula
```
gate = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
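To make the gating step concrete, here is a minimal NumPy sketch of the loop-1 combination. This is an illustration of the formula above, not the actual llama.cpp kernel: the tensor shapes, the per-head/per-position gate layout, and the names `loop1_combine`, `local_attn`, and `global_attn` are assumptions for this sketch.
```python
import numpy as np

def loop1_combine(Q, local_attn, global_attn, gate_weight, gate_bias):
    """Blend local and global attention outputs with a learned per-head gate.

    Shapes (assumed for illustration):
      Q           : (n_heads, seq_len, head_dim)  query states in loop 1
      local_attn  : (n_heads, seq_len, head_dim)  local attention output
      global_attn : (n_heads, seq_len, head_dim)  global attention output
      gate_weight : (n_heads, head_dim)           learned gate projection
      gate_bias   : (n_heads,)                    learned gate bias
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias), reduced over head_dim,
    # yielding one scalar gate per head and position.
    logits = np.einsum("hsd,hd->hs", Q, gate_weight) + gate_bias[:, None]
    gate = 1.0 / (1.0 + np.exp(-logits))            # (n_heads, seq_len)
    # output = local + gate * (global - local): a gate near 0 keeps the local
    # path, a gate near 1 recovers pure global attention.
    return local_attn + gate[..., None] * (global_attn - local_attn)

# Tiny smoke test with random tensors.
rng = np.random.default_rng(0)
h, s, d = 4, 8, 16
Q = rng.standard_normal((h, s, d))
out = loop1_combine(Q,
                    rng.standard_normal((h, s, d)),
                    rng.standard_normal((h, s, d)),
                    rng.standard_normal((h, d)),
                    rng.standard_normal(h))
print(out.shape)  # (4, 8, 16)
```
Because the gate is a convex blend, the loop-1 output interpolates smoothly between the local and global attention results rather than hard-switching between them.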
## llama.cpp Support
**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp.
See PR: https://github.com/ggml-org/llama.cpp/pull/18680
### Quick Start
```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention
# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .
# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```
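For instruction-style prompts, you can also serve the model over llama.cpp's OpenAI-compatible HTTP API. A minimal sketch, assuming the custom build includes the standard `llama-server` binary and that you have started it with `./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf` (listens on port 8080 by default):
```python
from openai import OpenAI  # pip install openai

# llama-server exposes an OpenAI-compatible endpoint; the API key is not checked.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    # The model name is informational here; llama-server uses whatever it loaded.
    model="IQuest-Coder-V1-40B-Loop-Instruct",
    messages=[{"role": "user",
               "content": "Write a Python function that returns the first n Fibonacci numbers."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```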
## Available Models
| Filename | Quantization | Size | Description | Use Case |
|----------|-------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |
## Performance Benchmarks
**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1
**Q4_K_M (23GB)** - Recommended:
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second
**F16 (75GB)** - Maximum quality:
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second
All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
## Model Details
- **Base Model**: Llama architecture with loop attention extension
- **Parameters**: 40B
- **Context Length**: 32,768 tokens
- **Training**: Fine-tuned for code generation and instruction following
- **License**: Apache 2.0
## Citation
If you use this model, please cite:
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title  = {IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author = {IQuestLab and Community Contributors},
  year   = {2025},
  url    = {https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
## Original Model
Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
## Conversion
These models were converted using the custom GGUF converter available in the llama.cpp branch above; the snippet below also sketches how the quantized variants can be produced with `llama-quantize`.
```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
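# Quantized variants can then be produced with llama.cpp's llama-quantize tool.
# Example (the converter's F16 output filename is assumed; adjust to match yours):
./build/bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M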
```
## World's First
This is the **world's first implementation** of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!
---
**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.