---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---
# IQuest-Coder-V1-40B-Loop-Instruct GGUF
This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
optimized for use with llama.cpp.
**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
Built and tested on **NVIDIA DGX Spark** infrastructure.
## Model Architecture
This model implements **Loop Attention**, a novel recurrent attention mechanism that runs the full transformer layer stack multiple times per forward pass:
- **loop_num=2**: All 80 transformer layers are processed twice (160 layer passes in total)
- **Loop 0**: Standard attention with global K/V caching
- **Loop 1**: Dual attention (local + global) with learned per-head gating
### Loop Attention Formula
```
# Per attention head: the gate is a scalar in (0, 1), computed from the query
gate = sigmoid(sum(Q * gate_weight) + gate_bias)
# gate -> 0 gives purely local attention; gate -> 1 gives purely global attention
output = local_attn + gate * (global_attn - local_attn)
```
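Written out per head, one consistent reading of the pseudocode above (here $h$ indexes attention heads and $d_h$ is the per-head dimension):

$$
g_h = \sigma\!\left(\sum_{i=1}^{d_h} Q_{h,i}\, w^{\mathrm{gate}}_{h,i} + b^{\mathrm{gate}}_h\right),
\qquad
o_h = a^{\mathrm{local}}_h + g_h \left(a^{\mathrm{global}}_h - a^{\mathrm{local}}_h\right)
$$

Since $g_h \in (0, 1)$, each head interpolates smoothly between purely local attention ($g_h \to 0$) and purely global attention ($g_h \to 1$).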
## llama.cpp Support
**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp.
See PR: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)
### Quick Start
```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention
# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .
# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```
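The same build also produces `llama-server`, which serves the model over HTTP. A minimal invocation from the `build` directory (the `-c` and `-ngl` values below are illustrative; tune them to your context needs and available VRAM):

```bash
# Serve the model over HTTP on port 8080;
# -c sets the context window (the model supports up to 32,768 tokens),
# -ngl offloads transformer layers to the GPU
./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -c 32768 -ngl 99 --port 8080
```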
## Available Models
| Filename | Quantization | Size | Description | Use Case |
|----------|-------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |
## Performance Benchmarks
**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1
| Quant | Prompt processing | Text generation |
|-------|-------------------|-----------------|
| Q4_K_M (23GB), recommended | 106.2 tokens/s | 4.2 tokens/s |
| F16 (75GB), maximum quality | 3.4 tokens/s | 0.8 tokens/s |
All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
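To reproduce comparable numbers on your own hardware, the build also includes the standard `llama-bench` tool (the `-p`/`-n` token counts below are illustrative):

```bash
# Benchmark prompt processing (512 tokens) and text generation (128 tokens)
./bin/llama-bench -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p 512 -n 128
```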
## Model Details
- **Base Model**: Llama architecture with loop attention extension
- **Parameters**: 40B
- **Context Length**: 32,768 tokens
- **Training**: Fine-tuned for code generation and instruction following
- **License**: Apache 2.0
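These properties can be checked directly against a downloaded file with the `gguf-dump` utility from the `gguf` Python package (any loop-attention-specific metadata keys depend on the converter in the branch above):

```bash
# Print GGUF metadata (architecture, context length, quantization, ...)
pip install gguf
gguf-dump IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | head -n 40
```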
## Citation
If you use this model, please cite:
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
## Original Model
Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
## Conversion
These models were converted using the custom GGUF converter available in the llama.cpp branch above.
```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```
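The smaller variants listed in the table above can then be produced from the F16 file with `llama-quantize` from the same build (paths here assume you run from the llama.cpp checkout root; adjust to your layout):

```bash
# Quantize the F16 conversion down to the recommended Q4_K_M
./build/bin/llama-quantize \
  IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
  IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf \
  Q4_K_M
```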
## World's First
This is the **world's first implementation** of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!
---
**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.