# IQuest-Coder-V1-40B-Loop-Instruct GGUF
This repository contains GGUF-format models for IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct, optimized for use with llama.cpp.

🚨 **IMPORTANT:** This model requires a custom llama.cpp build with loop attention support! See PR: llama.cpp#18680

Built and tested on NVIDIA DGX Spark infrastructure.
## Model Architecture
This model implements **Loop Attention**, a novel recurrent attention mechanism that runs the full transformer stack more than once:

- `loop_num=2`: all 80 transformer layers are executed twice (160 layer passes in total)
- Loop 0: standard attention with global K/V caching
- Loop 1: dual attention (local + global) combined with learned per-head gating
### Loop Attention Formula

```text
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
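For intuition, here is a minimal sketch of the Loop 1 gating step. The shapes, names, and use of numpy are illustrative assumptions, not the actual llama.cpp implementation:

```python
# Illustrative numpy sketch of the Loop 1 gating step; shapes and names
# are assumptions, not the llama.cpp implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loop1_gated_attention(q, local_attn, global_attn, gate_weight, gate_bias):
    """Blend local and global attention outputs with a learned per-head gate.

    q:           (n_heads, seq_len, head_dim) query states
    local_attn:  (n_heads, seq_len, head_dim) local-attention output
    global_attn: (n_heads, seq_len, head_dim) global-attention output
    gate_weight: (n_heads, head_dim) learned gate projection, one per head
    gate_bias:   (n_heads,) learned gate bias, one per head
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias): one scalar per
    # head and position, reducing over the head dimension.
    logits = np.einsum("hsd,hd->hs", q, gate_weight) + gate_bias[:, None]
    gate = sigmoid(logits)[..., None]  # (n_heads, seq_len, 1)
    # output = local_attn + gate * (global_attn - local_attn)
    return local_attn + gate * (global_attn - local_attn)
```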
## llama.cpp Support

**IMPORTANT:** Loop attention support requires a custom branch of llama.cpp. See PR: https://github.com/ggml-org/llama.cpp/pull/18680
### Quick Start

```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build with CUDA support
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```
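For programmatic access, the same build also produces `llama-server`, which exposes an HTTP completion endpoint. A minimal client sketch, assuming the server was started from the build directory with `./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --port 8080`:

```python
# Minimal sketch of querying llama.cpp's llama-server /completion endpoint.
# Host, port, and parameters assume the server invocation described above.
import json
import urllib.request

payload = {"prompt": "def fibonacci(n):", "n_predict": 200}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```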
## Available Models
| Filename | Quantization | Size | Description | Use Case |
|---|---|---|---|---|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | Recommended |
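The files can also be fetched programmatically. A small sketch using the `huggingface_hub` package (filenames as in the table above):

```python
# Sketch: download one quantized variant with huggingface_hub
# (pip install huggingface_hub). Filename matches the table above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF",
    filename="IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf",
)
print(path)  # local cache path to pass to llama-cli via -m
```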
## Performance Benchmarks

Testing platform: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1.

**Q4_K_M (23GB), recommended:**
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second

**F16 (75GB), maximum quality:**
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
## Model Details

- **Base model**: Llama architecture with the loop attention extension
- **Parameters**: 40B
- **Context length**: 32,768 tokens (readable from the GGUF metadata, as sketched below)
- **Training**: fine-tuned for code generation and instruction following
- **License**: Apache 2.0
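To confirm details like the context length locally, the GGUF header can be inspected with the `gguf` Python package maintained in the llama.cpp repository (`pip install gguf`). A minimal sketch; key names follow standard llama.cpp conventions, and the exact keys this model exposes (including any loop-attention parameters) are assumptions:

```python
# Hedged sketch: dump scalar GGUF metadata fields (context length,
# layer count, ...) with the gguf package from llama.cpp.
from gguf import GGUFReader

reader = GGUFReader("IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf")
for name, field in reader.fields.items():
    # Each field stores its payload as numpy parts indexed by `data`;
    # scalar fields have a single entry.
    if len(field.data) == 1:
        print(f"{name} = {field.parts[field.data[0]]}")
```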
## Citation
If you use this model, please cite:
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title  = {IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author = {IQuestLab and Community Contributors},
  year   = {2025},
  url    = {https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
## Original Model

Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
## Conversion

These models were converted with the custom GGUF converter included in the llama.cpp branch above:

```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```
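The smaller variants in the table are then produced from the F16 conversion with llama.cpp's `llama-quantize` tool. A minimal sketch of that step driven from Python; the binary path and filenames are assumptions matching the Quick Start build layout:

```python
# Hedged sketch: quantize the F16 conversion down to Q4_K_M with
# llama.cpp's llama-quantize binary, run from the repository root.
import subprocess

subprocess.run(
    [
        "./build/bin/llama-quantize",
        "IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf",
        "IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf",
        "Q4_K_M",
    ],
    check=True,
)
```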
## World's First

This is the world's first implementation of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!

**Questions or issues?** Please open an issue on the llama.cpp PR or the original model repository.