---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---

# IQuest-Coder-V1-40B-Loop-Instruct GGUF

This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct), optimized for use with llama.cpp.

**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support! **See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)

Built and tested on **NVIDIA DGX Spark** infrastructure.

## Model Architecture

This model implements **Loop Attention**, a novel recurrent attention mechanism that processes the full layer stack multiple times:

- **loop_num=2**: all 80 transformer layers are processed twice (160 layer passes in total)
- **Loop 0**: standard attention with global K/V caching
- **Loop 1**: dual attention (local + global) with learned per-head gating

### Loop Attention Formula

```
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```

(A minimal PyTorch sketch of this gating appears after the benchmark section below.)

## llama.cpp Support

**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp. See PR: https://github.com/ggml-org/llama.cpp/pull/18680

### Quick Start

```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```

## Available Models

| Filename | Quantization | Size | Description | Use Case |
|----------|-------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |

## Performance Benchmarks

**Testing platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

**Q4_K_M (23GB)** - recommended:
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second

**F16 (75GB)** - maximum quality:
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.
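Numbers like these are typically gathered with llama.cpp's bundled `llama-bench` tool. The invocation below is a sketch, not the exact command used for the figures above: the prompt length (`-p`), generation length (`-n`), and GPU offload (`-ngl`) values are assumptions.

```bash
# From the llama.cpp build directory (loop-attention branch above).
# -p 512 measures prompt processing, -n 128 measures text generation,
# -ngl 99 offloads all layers to the GPU.
./bin/llama-bench -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```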
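And, as promised above, a minimal PyTorch sketch of the loop-attention gating formula. This is an illustration only: the tensor shapes (`[batch, heads, seq, head_dim]`) and the `gate_weight`/`gate_bias` names are assumptions for readability, not the model's actual module API.

```python
import torch

def loop_gate(local_attn: torch.Tensor,
              global_attn: torch.Tensor,
              q: torch.Tensor,
              gate_weight: torch.Tensor,
              gate_bias: torch.Tensor) -> torch.Tensor:
    """Blend local and global attention outputs per head (illustrative).

    Assumed shapes: q, local_attn, global_attn -> [batch, heads, seq, head_dim]
                    gate_weight -> [heads, head_dim], gate_bias -> [heads]
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias),
    # one scalar per (batch, head, position)
    gate = torch.sigmoid(
        (q * gate_weight[None, :, None, :]).sum(dim=-1, keepdim=True)
        + gate_bias[None, :, None, None]
    )
    # output = local_attn + gate * (global_attn - local_attn)
    return local_attn + gate * (global_attn - local_attn)
```

A gate near 0 keeps the purely local attention output for that head and position; a gate near 1 swaps in the global result, so the model learns per head how much long-range context to mix back in on the second loop.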
## Model Details

- **Base Model**: Llama architecture with loop attention extension
- **Parameters**: 40B
- **Context Length**: 32,768 tokens
- **Training**: Fine-tuned for code generation and instruction following
- **License**: Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```

## Original Model

Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)

## Conversion

These models were converted using the custom GGUF converter available in the llama.cpp branch above (a sketch of the follow-up quantization step appears at the end of this card):

```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```

## World's First

This is the **world's first implementation** of loop attention in GGUF format, bringing recurrent attention mechanisms to llama.cpp!

---

**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.
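As referenced in the Conversion section above: the converter produces the F16 GGUF, and the smaller files in the table were then produced with llama.cpp's `llama-quantize` tool. The exact commands are not recorded on this card, so the following is a sketch with illustrative filenames.

```bash
# Quantize the F16 GGUF down to Q4_K_M (run from the llama.cpp build directory).
# Input and output filenames are illustrative.
./bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M
```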