+---
+license: other
+license_name: iquestcoder
+license_link: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
+base_model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
+tags:
+  - gguf
+  - quantized
+  - loop-attention
+  - recurrent-transformer
+  - code-generation
+  - iquest
+language:
+  - en
+pipeline_tag: text-generation
+---
+# IQuest-Coder-V1-40B-Loop-Instruct - GGUF
+**World's first GGUF conversion** of IQuestLab's IQuest-Coder-V1-40B-Loop-Instruct, a code model that uses a recurrent loop attention mechanism.
+## Model Details
+- **Base Model**: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
+- **Architecture**: Llama with Loop Attention (recurrent transformer, 2 iterations)
+- **Parameters**: 40B
+- **Context Length**: 131,072 tokens
+- **Vocabulary**: 76,800 tokens
+- **Conversion Date**: 2026-01-07
+- **Converted By**: Avarok (Dual NVIDIA DGX Spark with GB10 GPUs)
+## Files Included
+| Filename | Size | Quant Type | Use Case |
+|----------|------|------------|----------|
+| `IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf` | 75GB | F16 | Unquantized 16-bit reference |
+| `IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf` | 40GB | Q8_0 | Excellent quality, minimal loss |
+| `IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf` | 27GB | Q5_K_M | Good quality balance |
+| `IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf` | 23GB | Q4_K_M | **RECOMMENDED** - Best size/quality balance |
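As a rough aid for choosing among the files above, the following sketch picks the largest quantization whose file fits a given memory budget. The `pick_quant` helper and its `headroom_gb` margin are illustrative assumptions, not part of this repository. Real memory use also depends on context length and KV cache size, which are not modeled here.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "f16": 75,
    "q8_0": 40,
    "q5_k_m": 27,
    "q4_k_m": 23,
}


def pick_quant(memory_gb, headroom_gb=0.0):
    """Return the largest quant whose file fits in memory_gb, or None.

    headroom_gb is a hypothetical safety margin for the KV cache and
    runtime buffers; tune it for your own setup.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # A larger file means less aggressive quantization.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for mem in (24, 48, 128):
        print(mem, "GB ->", pick_quant(mem, headroom_gb=4))
```

File size is only a lower bound on the memory needed, so treat a borderline fit as a reason to step down one quantization level.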
+## SHA256 Checksums
+```
+b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289  IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf
+a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06  IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf
+a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba  IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf
+b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3  IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf
+```
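A downloaded file can be checked against the digests above with a few lines of standard-library Python. This is a minimal sketch: the `sha256_of` and `verify` helper names are illustrative, and the files are assumed to sit in the current directory under their original names.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the checksum list above.
EXPECTED_SHA256 = {
    "IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf":
        "b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289",
    "IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf":
        "a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06",
    "IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf":
        "a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba",
    "IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf":
        "b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {filename: True/False} for each listed file found in directory."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = sha256_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print("OK      " if ok else "MISMATCH", name)
```

Files that are not present are skipped, so the script works after downloading only one quantization.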
+## Current Status
+⚠️ **IMPORTANT**: These GGUF files contain all loop attention tensors and metadata, but **runtime support is pending** in llama.cpp.
+**What Works**:
+- ✅ GGUF files load correctly
+- ✅ All 883 tensors preserved (721 standard + 160 loop gates + 2 embeddings)
+- ✅ Loop parameters stored in metadata (loop_num=2, loop_window_size=64)
+- ✅ Quantization tested and verified
+**What's Pending**:
+- ⏳ Loop attention runtime implementation in llama.cpp
+- ⏳ Inference will fail until runtime support is added
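Because inference is not yet possible, one way to sanity-check a file is to read its fixed-size GGUF header and compare the tensor count with the 883 stated above. The sketch below assumes the little-endian GGUF version 2 or 3 header layout (magic, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). The `read_gguf_header` helper is illustrative and does not parse the metadata itself.

```python
import os
import struct

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count,
# uint64 metadata key/value count (assumed GGUF v2/v3 layout).
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Return version, tensor count, and metadata KV count from a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file too short to be a GGUF file")
    magic, version, tensor_count, kv_count = HEADER.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}


if __name__ == "__main__":
    path = "IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf"
    if os.path.exists(path):
        header = read_gguf_header(path)
        print(header)
        print("883 tensors present:", header["tensor_count"] == 883)
```

Reading only the first 24 bytes keeps the check fast even for the 75GB F16 file.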
+## Technical Details
+### Loop Architecture
+The IQuest Loop Coder uses a **recurrent transformer design** with:
+- **loop_num**: 2 iterations of attention per layer
+- **loop_window_size**: 64-token attention window
+- **Gate Projections**: 160 additional tensors for the gating mechanism (one weight and one bias per layer)
+  - `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
+  - `blk.{0-79}.loop_gate.bias`: [40] per layer
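The tensor counts in this card can be cross-checked with simple arithmetic. The sketch below assumes each layer contributes exactly one gate weight and one gate bias, which implies 80 layers. That layer count is inferred from the 160 gate tensors and is not stated explicitly here.

```python
# Figures taken from this model card.
STANDARD_TENSORS = 721
EMBEDDING_TENSORS = 2
GATE_TENSORS = 160
GATE_WEIGHT_SHAPE = (128, 40)
GATE_BIAS_SHAPE = (40,)

# Assumption: one weight tensor and one bias tensor per layer.
TENSORS_PER_LAYER = 2
num_layers = GATE_TENSORS // TENSORS_PER_LAYER

total_tensors = STANDARD_TENSORS + GATE_TENSORS + EMBEDDING_TENSORS

# Parameters held by one layer's gate: weight elements plus bias elements.
params_per_layer = GATE_WEIGHT_SHAPE[0] * GATE_WEIGHT_SHAPE[1] + GATE_BIAS_SHAPE[0]
gate_params = num_layers * params_per_layer

print("layers:", num_layers)              # 80
print("total tensors:", total_tensors)    # 883
print("gate parameters:", gate_params)    # 412800
```

Under these assumptions the gates add roughly 0.4M parameters, which is negligible next to the 40B total, so they have almost no effect on file size.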
+### Conversion Process
+Converted using a custom `IQuestLoopCoderModel` class that:
+- Inherits from `LlamaModel` (compatible base architecture)
+- Maps `gate_projections` to GGUF tensor names
+- Preserves loop parameters in metadata
+- Has been tested with every quantization type listed above
+Conversion time: **2-7 minutes** per quantization on NVIDIA GB10
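The gate tensor mapping step can be pictured as a name-rewriting rule. The sketch below is not the converter's actual code. The source-side pattern `model.gate_projections.<i>.<weight|bias>` and the `map_gate_name` helper are assumptions for illustration, and only the `blk.<i>.loop_gate.*` target names come from this card.

```python
import re

# Assumed source-side name pattern (hypothetical; check the checkpoint's real keys).
_GATE_PATTERN = re.compile(r"^model\.gate_projections\.(\d+)\.(weight|bias)$")


def map_gate_name(hf_name):
    """Map an assumed source gate tensor name to its GGUF name, or return None."""
    match = _GATE_PATTERN.match(hf_name)
    if match is None:
        # Not a gate tensor; the base-class mapping would handle it instead.
        return None
    layer, kind = match.groups()
    return f"blk.{layer}.loop_gate.{kind}"


if __name__ == "__main__":
    print(map_gate_name("model.gate_projections.0.weight"))
    print(map_gate_name("model.layers.0.self_attn.q_proj.weight"))
```

Returning `None` for unmatched names lets a caller fall through to the standard Llama tensor mapping for everything that is not a loop gate.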
+## Usage (When Runtime Support Available)
+### With Ollama
+```bash
+# Create Modelfile
+cat > Modelfile <<EOF
+FROM IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf
+PARAMETER temperature 0.7
+PARAMETER top_p 0.9
+EOF
+# Create model
+ollama create iquest-loop:q4 -f Modelfile
+# Run
+ollama run iquest-loop:q4 "Write a Python function for fibonacci"
+```
+### With llama.cpp
+```bash
+./llama-cli \
+    --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
+    --prompt "def fibonacci(n):" \
+    --n-predict 100
+```
+**Note**: Inference will fail until the loop attention runtime is implemented in llama.cpp.
+## Implementation Status
+### Converter ✅ (Complete)
+The converter successfully creates GGUF files with all loop-specific components:
+- Custom tensor mapping for gate projections
+- Loop parameter metadata storage
+- Tested with the 40B-parameter model
+- All quantization levels verified
+### Runtime ⏳ (In Progress)
+Runtime implementation requires:
+1. C++ implementation of loop attention mechanism
+2. CUDA kernels for GPU acceleration
+3. Integration into llama.cpp forward pass
+4. Testing against PyTorch reference
+See `RUNTIME_IMPLEMENTATION_GUIDE.md` for detailed implementation requirements.
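Step 4 could take the form of a numerical comparison between logits from the PyTorch reference and logits from the new runtime for the same prompt. The following is a minimal sketch over plain Python lists. The `compare_logits` helper and its default tolerance are illustrative assumptions, and quantized files will likely need a looser tolerance than F16.

```python
import math


def compare_logits(reference, candidate, atol=1e-2):
    """Compare two equal-length logit vectors.

    Returns the max absolute difference, the cosine similarity, and whether
    the max difference is within atol (an illustrative tolerance).
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(
        (abs(r - c) for r, c in zip(reference, candidate)), default=0.0
    )
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    denom = norm_ref * norm_cand
    cosine = dot / denom if denom > 0 else float("nan")
    return {"max_abs_diff": max_abs_diff, "cosine": cosine, "ok": max_abs_diff <= atol}


if __name__ == "__main__":
    ref = [1.0, -2.0, 0.5]
    out = [1.001, -1.999, 0.502]
    print(compare_logits(ref, out))
```

Reporting both metrics is deliberate: the maximum difference catches a single badly wrong logit, while cosine similarity shows whether the overall distribution still points the same way.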
+## Contribution & Support
+- **Converter Implementation**: Available in llama.cpp PR (pending)
+- **Runtime Development**: Community contributions welcome
+- **Technical Documentation**: Included in this repository
+## Resources
+- **Original Model**: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
+- **Conversion Guide**: See `CONVERSION_SUMMARY.md`
+- **Runtime Guide**: See `RUNTIME_IMPLEMENTATION_GUIDE.md`
+- **llama.cpp Issue**: [#18517](https://github.com/ggerganov/llama.cpp/issues/18517)
+- **vLLM Support**: [PR #31575](https://github.com/vllm-project/vllm/pull/31575)
+## Credits
+- **Original Model**: IQuestLab team
+- **Conversion**: Avarok (Dual DGX Spark hardware)
+- **Tools**: llama.cpp (ggerganov), vLLM project
+- **Achievement**: First Loop-Instruct variant in GGUF format
+## License
+Same as base model: IQuestCoder license
+- Link: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
+## Acknowledgments
+This is the first publicly available GGUF conversion of an IQuest Loop-Instruct model. The conversion preserves all architectural components needed for loop attention, paving the way for future runtime support.
+---
+**Status**: Converter complete ✅ | Runtime pending ⏳ | Community contributions welcome 🤝