Avarok
/

IQuest-Coder-V1-40B-Loop-Instruct-GGUF

@@ -1,167 +1,120 @@
 ---
-license: other
-license_name: iquestcoder
-license_link: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
-base_model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
 tags:
-  - gguf
-  - quantized
   - loop-attention
-  - recurrent-transformer
-  - code-generation
-  - iquest
 language:
   - en
 pipeline_tag: text-generation
 ---
-# IQuest-Coder-V1-40B-Loop-Instruct - GGUF
-**World's first GGUF conversion** of IQuestLab's IQuest-Coder-V1-40B-Loop-Instruct model with recurrent loop attention mechanism.
-## Model Details
-- **Base Model**: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
-- **Architecture**: Llama with Loop Attention (recurrent transformer, 2 iterations)
-- **Parameters**: 40B
-- **Context Length**: 131,072 tokens
-- **Vocabulary**: 76,800 tokens
-- **Conversion Date**: 2026-01-07
-- **Converted By**: Avarok (Dual NVIDIA DGX Spark with GB10 GPUs)
-## Files Included
-| Filename | Size | Quant Type | Use Case |
-|----------|------|------------|----------|
-| `IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf` | 75GB | F16 | Full precision reference |
-| `IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf` | 40GB | Q8_0 | Excellent quality, minimal loss |
-| `IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf` | 27GB | Q5_K_M | Good quality balance |
-| `IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf` | 23GB | Q4_K_M | **RECOMMENDED** - Best size/quality balance |
-## SHA256 Checksums
 ```
-b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289  IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf
-a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06  IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf
-a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba  IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf
-b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3  IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf
 ```
-## Current Status
-⚠️ **IMPORTANT**: These GGUF files contain all loop attention tensors and metadata, but **runtime support is pending** in llama.cpp.
-**What Works**:
-- ✅ GGUF files load correctly
-- ✅ All 883 tensors preserved (721 standard + 160 loop gates + 2 embeddings)
-- ✅ Loop parameters stored in metadata (loop_num=2, loop_window_size=64)
-- ✅ Quantization tested and verified
-**What's Pending**:
-- ⏳ Loop attention runtime implementation in llama.cpp
-- ⏳ Inference will fail until runtime support added
-## Technical Details
-### Loop Architecture
-The IQuest Loop Coder uses a **recurrent transformer design** with:
-- **loop_num**: 2 iterations of attention per layer
-- **loop_window_size**: 64 token attention window
-- **Gate Projections**: 160 additional tensors for gating mechanism
-  - `blk.-79.loop_gate.weight`: [128, 40] per layer
-  - `blk.-79.loop_gate.bias`: [40] per layer
-### Conversion Process
-Converted using custom `IQuestLoopCoderModel` class:
-- Inherits from LlamaModel (compatible base architecture)
-- Maps gate_projections to GGUF tensor names
-- Preserves loop parameters in metadata
-- Tested with all quantization levels
-Conversion time: **2-7 minutes** per quantization on NVIDIA GB10
-## Usage (When Runtime Support Available)
-### With Ollama
-```bash
-# Create Modelfile
-cat > Modelfile <<EOF
-FROM IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf
-PARAMETER temperature 0.7
-PARAMETER top_p 0.9
-EOF
-# Create model
-ollama create iquest-loop:q4 -f Modelfile
-# Run
-ollama run iquest-loop:q4 "Write a Python function for fibonacci"
 ```
-### With llama.cpp
-```bash
-./llama-cli \
-    --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
-    --prompt "def fibonacci(n):" \
-    --n-predict 100
-```
-**Note**: Will fail until loop attention runtime is implemented.
-## Implementation Status
-### Converter ✅ (Complete)
-The converter successfully creates GGUF files with all loop-specific components:
-- Custom tensor mapping for gate projections
-- Loop parameter metadata storage
-- Tested with 40B parameter model
-- All quantization levels verified
-### Runtime ⏳ (In Progress)
-Runtime implementation requires:
-1. C++ implementation of loop attention mechanism
-2. CUDA kernels for GPU acceleration
-3. Integration into llama.cpp forward pass
-4. Testing against PyTorch reference
-See `RUNTIME_IMPLEMENTATION_GUIDE.md` for detailed implementation requirements.
-## Contribution & Support
-- **Converter Implementation**: Available in llama.cpp PR (pending)
-- **Runtime Development**: Community contribution welcome
-- **Technical Documentation**: Included in this repository
-## Resources
-- **Original Model**: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
-- **Conversion Guide**: See `CONVERSION_SUMMARY.md`
-- **Runtime Guide**: See `RUNTIME_IMPLEMENTATION_GUIDE.md`
-- **llama.cpp Issue**: [#18517](https://github.com/ggerganov/llama.cpp/issues/18517)
-- **vLLM Support**: [PR #31575](https://github.com/vllm-project/vllm/pull/31575)
-## Credits
-- **Original Model**: IQuestLab team
-- **Conversion**: Avarok (Dual DGX Spark hardware)
-- **Tools**: llama.cpp (ggerganov), vLLM project
-- **Achievement**: First Loop-Instruct variant in GGUF format
-## License
-Same as base model: IQuestCoder license
-- Link: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
-## Acknowledgments
-This is the first publicly available GGUF conversion of an IQuest Loop-Instruct model. The conversion preserves all architectural components needed for loop attention, paving the way for future runtime support.
 ---
-**Status**: Converter complete ✅ | Runtime pending ⏳ | Community contributions welcome 🤝

 ---
+license: apache-2.0
 tags:
+  - code
+  - llama
   - loop-attention
+  - gguf
+  - llama.cpp
 language:
   - en
 pipeline_tag: text-generation
 ---
+# IQuest-Coder-V1-40B-Loop-Instruct GGUF
+This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
+optimized for use with llama.cpp.
+## Model Architecture
+This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:
+- **loop_num=2**: All 80 transformer layers are processed twice (160 total operations)
+- **Loop 0**: Standard attention with global K/V caching
+- **Loop 1**: Dual attention (local + global) with learned per-head gating
+### Loop Attention Formula
 ```
+gate = sigmoid(sum(Q * gate_weight) + gate_bias)
+output = local_attn + gate * (global_attn - local_attn)
 ```
+## llama.cpp Support
+**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp.
+See PR: https://github.com/ggml-org/llama.cpp/pull/18680
+### Quick Start
+```bash
+# Clone llama.cpp with loop attention support
+git clone https://github.com/tbraun96/llama.cpp
+cd llama.cpp
+git checkout feature/iquest-loop-attention
+# Build
+mkdir build && cd build
+cmake .. -DGGML_CUDA=ON
+cmake --build . --config Release -j$(nproc)
+# Download a quantized model
+huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .
+# Run inference
+./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
 ```
+## Available Models
+| Filename | Quantization | Size | Description | Use Case |
+|----------|-------------|------|-------------|----------|
+| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
+| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
+| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
+| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |
+## Performance Benchmarks
+Tested on NVIDIA GB10 (Blackwell), compute 12.1:
+**Q4_K_M (23GB)**:
+- Prompt processing: 106.2 tokens/second
+- Text generation: 4.2 tokens/second
+**F16 (75GB)**:
+- Prompt processing: 3.4 tokens/second
+- Text generation: 0.8 tokens/second
+## Model Details
+- **Base Model**: Llama architecture with loop attention extension
+- **Parameters**: 40B
+- **Context Length**: 32,768 tokens
+- **Training**: Fine-tuned for code generation and instruction following
+- **License**: Apache 2.0
+## Citation
+If you use this model, please cite:
+```bibtex
+@software{iquest_loop_instruct_gguf_2025,
+  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
+  author={IQuestLab and Community Contributors},
+  year={2025},
+  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
+}
+```
+## Original Model
+Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
+## Conversion
+These models were converted using the custom GGUF converter available in the llama.cpp branch above.
+```bash
+python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
+```
+## World's First
+This is the **world's first implementation** of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!
 ---
+**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.