# IQuest-Coder-V1-40B-Loop-Instruct GGUF
This repository contains GGUF-format models for IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct, optimized for use with llama.cpp.

🚨 **IMPORTANT:** This model requires a custom llama.cpp build with loop attention support! See PR: llama.cpp#18680

Built and tested on NVIDIA DGX Spark infrastructure.
## Model Architecture
This model implements **Loop Attention**, a novel recurrent attention mechanism that runs the full transformer stack more than once:

- `loop_num=2`: all 80 transformer layers are executed twice (160 layer passes in total)
- Loop 0: standard attention with global K/V caching
- Loop 1: dual attention (local + global) combined with learned per-head gating
### Loop Attention Formula

```text
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
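For intuition, here is a minimal sketch of the Loop 1 gating step. The shapes, names, and use of numpy are illustrative assumptions, not the actual llama.cpp implementation:

```python
# Illustrative numpy sketch of the Loop 1 gating step; shapes and names
# are assumptions, not the llama.cpp implementation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loop1_gated_attention(q, local_attn, global_attn, gate_weight, gate_bias):
    """Blend local and global attention outputs with a learned per-head gate.

    q:           (n_heads, seq_len, head_dim) query states
    local_attn:  (n_heads, seq_len, head_dim) local-attention output
    global_attn: (n_heads, seq_len, head_dim) global-attention output
    gate_weight: (n_heads, head_dim) learned gate projection, one per head
    gate_bias:   (n_heads,) learned gate bias, one per head
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias): one scalar per
    # head and position, reducing over the head dimension.
    logits = np.einsum("hsd,hd->hs", q, gate_weight) + gate_bias[:, None]
    gate = sigmoid(logits)[..., None]  # (n_heads, seq_len, 1)
    # output = local_attn + gate * (global_attn - local_attn)
    return local_attn + gate * (global_attn - local_attn)
```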
## llama.cpp Support

**IMPORTANT:** Loop attention support requires a custom branch of llama.cpp. See PR: https://github.com/ggml-org/llama.cpp/pull/18680
### Quick Start

```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build with CUDA support
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```
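For programmatic access, the same build also produces `llama-server`, which exposes an HTTP completion endpoint. A minimal client sketch, assuming the server was started from the build directory with `./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --port 8080`:

```python
# Minimal sketch of querying llama.cpp's llama-server /completion endpoint.
# Host, port, and parameters assume the server invocation described above.
import json
import urllib.request

payload = {"prompt": "def fibonacci(n):", "n_predict": 200}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```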
## Available Models
| Filename | Quantization | Size | Description | Use Case |
|---|---|---|---|---|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | Recommended |
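The files can also be fetched programmatically. A small sketch using the `huggingface_hub` package (filenames as in the table above):

```python
# Sketch: download one quantized variant with huggingface_hub
# (pip install huggingface_hub). Filename matches the table above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF",
    filename="IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf",
)
print(path)  # local cache path to pass to llama-cli via -m
```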
## Performance Benchmarks

Testing platform: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1.

**Q4_K_M (23GB), recommended:**
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second

**F16 (75GB), maximum quality:**
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
## Model Details

- **Base model**: Llama architecture with the loop attention extension
- **Parameters**: 40B
- **Context length**: 32,768 tokens (readable from the GGUF metadata, as sketched below)
- **Training**: fine-tuned for code generation and instruction following
- **License**: Apache 2.0
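To confirm details like the context length locally, the GGUF header can be inspected with the `gguf` Python package maintained in the llama.cpp repository (`pip install gguf`). A minimal sketch; key names follow standard llama.cpp conventions, and the exact keys this model exposes (including any loop-attention parameters) are assumptions:

```python
# Hedged sketch: dump scalar GGUF metadata fields (context length,
# layer count, ...) with the gguf package from llama.cpp.
from gguf import GGUFReader

reader = GGUFReader("IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf")
for name, field in reader.fields.items():
    # Each field stores its payload as numpy parts indexed by `data`;
    # scalar fields have a single entry.
    if len(field.data) == 1:
        print(f"{name} = {field.parts[field.data[0]]}")
```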
## Citation
If you use this model, please cite:
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title  = {IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author = {IQuestLab and Community Contributors},
  year   = {2025},
  url    = {https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
## Original Model

Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)
## Conversion

These models were converted with the custom GGUF converter included in the llama.cpp branch above:

```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```
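The smaller variants in the table are then produced from the F16 conversion with llama.cpp's `llama-quantize` tool. A minimal sketch of that step driven from Python; the binary path and filenames are assumptions matching the Quick Start build layout:

```python
# Hedged sketch: quantize the F16 conversion down to Q4_K_M with
# llama.cpp's llama-quantize binary, run from the repository root.
import subprocess

subprocess.run(
    [
        "./build/bin/llama-quantize",
        "IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf",
        "IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf",
        "Q4_K_M",
    ],
    check=True,
)
```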
## World's First

This is the world's first implementation of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!

**Questions or issues?** Please open an issue on the llama.cpp PR or the original model repository.