---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---

# IQuest-Coder-V1-40B-Loop-Instruct GGUF

This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct), optimized for use with llama.cpp.

**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support! **See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)

Built and tested on **NVIDIA DGX Spark** infrastructure.

## Model Architecture

This model implements **Loop Attention**, a novel recurrent attention mechanism that processes the full layer stack multiple times:

- **loop_num=2**: all 80 transformer layers are processed twice (160 layer passes in total)
- **Loop 0**: standard attention with global K/V caching
- **Loop 1**: dual attention (local + global) with learned per-head gating

### Loop Attention Formula

```
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```

(A minimal PyTorch sketch of this gating appears after the benchmark section below.)

## llama.cpp Support

**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp. See PR: https://github.com/ggml-org/llama.cpp/pull/18680

### Quick Start

```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```

## Available Models

| Filename | Quantization | Size | Description | Use Case |
|----------|-------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |

## Performance Benchmarks

**Testing platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

**Q4_K_M (23GB)** - recommended:
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second

**F16 (75GB)** - maximum quality:
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
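Numbers like these are typically gathered with llama.cpp's bundled `llama-bench` tool. The invocation below is a sketch, not the exact command used for the figures above: the prompt length (`-p`), generation length (`-n`), and GPU offload (`-ngl`) values are assumptions.

```bash
# From the llama.cpp build directory (loop-attention branch above).
# -p 512 measures prompt processing, -n 128 measures text generation,
# -ngl 99 offloads all layers to the GPU.
./bin/llama-bench -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```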
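And, as promised above, a minimal PyTorch sketch of the loop-attention gating formula. This is an illustration only: the tensor shapes (`[batch, heads, seq, head_dim]`) and the `gate_weight`/`gate_bias` names are assumptions for readability, not the model's actual module API.

```python
import torch

def loop_gate(local_attn: torch.Tensor,
              global_attn: torch.Tensor,
              q: torch.Tensor,
              gate_weight: torch.Tensor,
              gate_bias: torch.Tensor) -> torch.Tensor:
    """Blend local and global attention outputs per head (illustrative).

    Assumed shapes: q, local_attn, global_attn -> [batch, heads, seq, head_dim]
                    gate_weight -> [heads, head_dim], gate_bias -> [heads]
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias),
    # one scalar per (batch, head, position)
    gate = torch.sigmoid(
        (q * gate_weight[None, :, None, :]).sum(dim=-1, keepdim=True)
        + gate_bias[None, :, None, None]
    )
    # output = local_attn + gate * (global_attn - local_attn)
    return local_attn + gate * (global_attn - local_attn)
```

A gate near 0 keeps the purely local attention output for that head and position; a gate near 1 swaps in the global result, so the model learns per head how much long-range context to mix back in on the second loop.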
## Model Details

- **Base Model**: Llama architecture with loop attention extension
- **Parameters**: 40B
- **Context Length**: 32,768 tokens
- **Training**: Fine-tuned for code generation and instruction following
- **License**: Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```

## Original Model

Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)

## Conversion

These models were converted using the custom GGUF converter available in the llama.cpp branch above (a sketch of the follow-up quantization step appears at the end of this card):

```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```

## World's First

This is the **world's first implementation** of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!

---

**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.
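As referenced in the Conversion section above: the converter produces the F16 GGUF, and the smaller files in the table were then produced with llama.cpp's `llama-quantize` tool. The exact commands are not recorded on this card, so the following is a sketch with illustrative filenames.

```bash
# Quantize the F16 GGUF down to Q4_K_M (run from the llama.cpp build directory).
# Input and output filenames are illustrative.
./bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M
```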