IQuest-Coder-V1-40B-Loop-Instruct GGUF

This repository contains GGUF format models for IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct, optimized for use with llama.cpp.

🚨 IMPORTANT: This model requires a custom llama.cpp build with loop attention support! See PR: llama.cpp#18680

Built and tested on NVIDIA DGX Spark infrastructure.

Model Architecture

This model implements Loop Attention, a recurrent attention mechanism that runs the full transformer stack multiple times per forward pass:

  • loop_num=2: all 80 transformer layers are executed twice (160 layer passes in total)
  • Loop 0: standard attention, populating a global K/V cache
  • Loop 1: dual attention (local + global), combined with a learned per-head gate

Loop Attention Formula

gate = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
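
A minimal NumPy sketch of this gating step (illustrative only; the shapes and names below are assumptions, not the llama.cpp implementation):

import numpy as np

def loop_gate_combine(q, local_attn, global_attn, gate_weight, gate_bias):
    """Blend local and global attention outputs with a learned per-head gate.

    Shapes (assumed for illustration):
      q:           (n_heads, seq_len, head_dim)  query states in loop 1
      local_attn:  (n_heads, seq_len, head_dim)  local attention output
      global_attn: (n_heads, seq_len, head_dim)  global attention output
      gate_weight: (n_heads, head_dim)           learned gate weights
      gate_bias:   (n_heads,)                    learned gate biases
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias): one scalar per head and position
    logits = np.einsum("hsd,hd->hs", q, gate_weight) + gate_bias[:, None]
    gate = 1.0 / (1.0 + np.exp(-logits))
    # output = local_attn + gate * (global_attn - local_attn)
    return local_attn + gate[..., None] * (global_attn - local_attn)

The gate interpolates between the purely local result (gate ≈ 0) and the global result (gate ≈ 1), so each head can learn how much long-range context to mix in.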

llama.cpp Support

IMPORTANT: Loop attention support requires a custom branch of llama.cpp.

See PR: https://github.com/ggml-org/llama.cpp/pull/18680

Quick Start

# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
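
For an HTTP API instead of the interactive CLI, the same build's llama-server can serve the model (standard llama.cpp flags; the context size and port below are just example values):

# Serve the model over HTTP (OpenAI-compatible API)
./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -c 8192 --port 8080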

Available Models

| Filename | Quantization | Size | Description | Use Case |
|---|---|---|---|---|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75 GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40 GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27 GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23 GB | Good quality | Recommended |

Performance Benchmarks

Testing Platform: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

Q4_K_M (23GB) - Recommended:

  • Prompt processing: 106.2 tokens/second
  • Text generation: 4.2 tokens/second

F16 (75GB) - Maximum quality:

  • Prompt processing: 3.4 tokens/second
  • Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
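
Throughput figures like these can be measured with llama-bench from the same build (a standard llama.cpp benchmark invocation, not necessarily the exact one used above):

# Measure prompt-processing (pp) and text-generation (tg) throughput
./bin/llama-bench -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p 512 -n 128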

Model Details

  • Base Model: Llama architecture with loop attention extension
  • Parameters: 40B
  • Context Length: 32,768 tokens
  • Training: Fine-tuned for code generation and instruction following
  • License: Apache 2.0
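
To verify these details in a downloaded file, the gguf Python package (shipped with llama.cpp and on PyPI) can list the embedded metadata; a minimal sketch:

import sys
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader(sys.argv[1])
# Print all GGUF metadata keys (architecture, context length, quantization, ...)
for name in reader.fields:
    print(name)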

Citation

If you use this model, please cite:

@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}

Original Model

Original PyTorch model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct

Conversion

These models were converted using the custom GGUF converter available in the llama.cpp branch above.

python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
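
The lower-precision files were then produced from the F16 conversion with llama.cpp's standard quantization tool (the invocation below is illustrative):

# Quantize the F16 GGUF down to Q4_K_M
./bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M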

World's First

To our knowledge, this is the first implementation of loop attention in GGUF format, bringing a recurrent attention mechanism to llama.cpp!


Questions or Issues? Please open an issue on the llama.cpp PR or the original model repository.
