---
license: apache-2.0
tags:
  - code
  - llama
  - loop-attention
  - gguf
  - llama.cpp
language:
  - en
pipeline_tag: text-generation
---

# IQuest-Coder-V1-40B-Loop-Instruct GGUF

This repository contains GGUF format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct),
optimized for use with llama.cpp.

**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support!
**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680)

Built and tested on **NVIDIA DGX Spark** infrastructure.

## Model Architecture

This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times:

- **loop_num=2**: All 80 transformer layers are processed twice (160 layer passes in total)
- **Loop 0**: Standard attention with global K/V caching
- **Loop 1**: Dual attention (local + global) with learned per-head gating

### Loop Attention Formula

```
gate = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
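The gating formula above can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `gated_dual_attention` and the tensor shapes are assumptions for clarity, not the exact tensors or kernel layout used by the llama.cpp implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_dual_attention(q, local_attn, global_attn, gate_weight, gate_bias):
    """Blend local and global attention outputs with a learned per-head gate.

    Illustrative shapes (single token):
      q:           (n_heads, head_dim)  query
      local_attn:  (n_heads, head_dim)  output of windowed (local) attention
      global_attn: (n_heads, head_dim)  output of full-cache (global) attention
      gate_weight: (n_heads, head_dim)  learned gating projection
      gate_bias:   (n_heads,)           learned gating bias
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias): one scalar per head
    gate = sigmoid((q * gate_weight).sum(axis=-1) + gate_bias)  # (n_heads,)
    # output = local_attn + gate * (global_attn - local_attn):
    # gate -> 0 keeps the local result, gate -> 1 recovers the global result
    return local_attn + gate[:, None] * (global_attn - local_attn)
```

Note that the output interpolates per head between the two attention paths: a strongly negative bias collapses a head to pure local attention, a strongly positive one to pure global attention.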

## llama.cpp Support

Loop attention support requires the custom llama.cpp branch from [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680).

### Quick Start

```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
```

## Available Models

| Filename | Quantization | Size | Description | Use Case |
|----------|-------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |

## Performance Benchmarks

**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1

**Q4_K_M (23GB)** - Recommended:
- Prompt processing: 106.2 tokens/second
- Text generation: 4.2 tokens/second

**F16 (75GB)** - Maximum quality:
- Prompt processing: 3.4 tokens/second
- Text generation: 0.8 tokens/second

All testing and quantization were performed on NVIDIA DGX Spark infrastructure.

## Model Details

- **Base Model**: Llama architecture with a loop attention extension
- **Parameters**: 40B
- **Context Length**: 32,768 tokens
- **Training**: Fine-tuned for code generation and instruction following
- **License**: Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```

## Original Model

Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct)

## Conversion

These models were converted using the custom GGUF converter available in the llama.cpp branch above.

```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
```

## World's First

To our knowledge, this is the **first implementation** of loop attention in GGUF format, bringing a recurrent attention mechanism to llama.cpp!

---

**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository.