Add RUNTIME_IMPLEMENTATION_GUIDE.md
RUNTIME_IMPLEMENTATION_GUIDE.md (ADDED, +402 -0)
@@ -0,0 +1,402 @@
# IQuest Loop Attention Runtime Implementation Guide

**Status**: Converter implemented ✅ | Runtime support needed ⏳

## Overview

This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) successfully creates GGUF files with all loop-specific tensors, but the inference runtime has not been implemented yet.

## What We Know

### Architecture Summary

**Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (`loop_num=2`)

**Key Parameters**:
- `llama.loop.num`: 2 (iterations of recurrent processing)
- `llama.loop.window_size`: 64 (attention window for loop mechanism)

**Additional Tensors** (160 total):
- `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
- `blk.{0-79}.loop_gate.bias`: [40] per layer

### Tensor Layout in GGUF

```
Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight        [5120, 5120]
├── blk.{0-79}.attn_k.weight        [5120, 1024]
├── blk.{0-79}.attn_v.weight        [5120, 1024]
├── blk.{0-79}.attn_output.weight   [5120, 5120]
├── blk.{0-79}.attn_norm.weight     [5120]
├── blk.{0-79}.ffn_gate.weight      [5120, 27648]
├── blk.{0-79}.ffn_up.weight        [5120, 27648]
├── blk.{0-79}.ffn_down.weight      [27648, 5120]
└── blk.{0-79}.ffn_norm.weight      [5120]

Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight     [128, 40]  ← NEW
└── blk.{0-79}.loop_gate.bias       [40]       ← NEW

Embeddings (2):
├── token_embd.weight               [5120, 76800]
└── output.weight                   [5120, 76800]
```
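
The counts quoted in this layout can be cross-checked with a little arithmetic. The sketch below is only a sanity check: the listing names nine per-layer tensors (720 across 80 layers), so it assumes the 721st "standard" tensor is a single non-layer tensor, most likely the final output norm. The layout does not name that tensor, so treat `ASSUMED_EXTRA` as a guess.

```python
# Cross-check of the tensor counts quoted in the GGUF layout.
N_LAYER = 80

# Per-layer tensors named in the listing: attn_q, attn_k, attn_v, attn_output,
# attn_norm, ffn_gate, ffn_up, ffn_down, ffn_norm.
STANDARD_PER_LAYER = 9
LOOP_PER_LAYER = 2   # loop_gate.weight + loop_gate.bias
EMBEDDINGS = 2       # token_embd.weight + output.weight

# ASSUMPTION: one extra non-layer tensor explains 721 = 80 * 9 + 1.
ASSUMED_EXTRA = 1


def tensor_counts(n_layer=N_LAYER):
    """Return the per-group and total tensor counts."""
    standard = n_layer * STANDARD_PER_LAYER + ASSUMED_EXTRA
    loop = n_layer * LOOP_PER_LAYER
    return {
        "standard": standard,
        "loop": loop,
        "embeddings": EMBEDDINGS,
        "total": standard + loop + EMBEDDINGS,
    }


def loop_gate_elements(n_layer=N_LAYER, head_dim=128, n_head=40):
    """Number of scalar parameters added by the loop gates."""
    return n_layer * (head_dim * n_head + n_head)


if __name__ == "__main__":
    print(tensor_counts())
    print(loop_gate_elements())
```

Under that assumption the groups add up to the 883 tensors checked in the testing section, and the loop gates contribute only a few hundred thousand parameters, which is negligible next to the attention and FFN weights.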

### Gate Projection Shape Analysis

- **Weight**: [128, 40] = [head_dim, num_heads]
- **Bias**: [40] = [num_heads]
- **Per layer**: 1 weight + 1 bias tensor
- **Total layers**: 80
- **Total loop tensors**: 160

These shapes suggest that the gate projects each head's 128-value slice of the hidden state (5120 = 40 × 128) down to a single scalar, giving one gate per head and 40 gates per token.
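
To make that reading of the shapes concrete, here is a minimal pure-Python sketch of a per-head gate for a single token. It is an interpretation, not the confirmed algorithm: the function name `per_head_gates`, the use of a sigmoid, and the choice to feed the hidden state (rather than, say, the query) into the gate are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def per_head_gates(hidden, gate_weight, gate_bias):
    """Compute one gate per head for a single token (hypothetical helper).

    hidden:      flat list of n_head * head_dim values
    gate_weight: nested list indexed [head_dim][n_head]
    gate_bias:   list of n_head values
    """
    n_head = len(gate_bias)
    head_dim = len(gate_weight)
    if len(hidden) != n_head * head_dim:
        raise ValueError("hidden size must equal n_head * head_dim")

    gates = []
    for n in range(n_head):
        # Slice belonging to head n, dotted with column n of the weight.
        head = hidden[n * head_dim:(n + 1) * head_dim]
        logit = sum(head[h] * gate_weight[h][n] for h in range(head_dim))
        gates.append(sigmoid(logit + gate_bias[n]))  # ASSUMPTION: sigmoid
    return gates


if __name__ == "__main__":
    # Toy sizes: head_dim = 2, n_head = 2.
    w = [[0.5, -1.0],
         [0.25, 0.0]]
    print(per_head_gates([1.0, 2.0, 3.0, 4.0], w, [0.0, 3.0]))
```

With the real shapes this loop would run over 40 heads of 128 values each; a reference like this is mainly useful for checking a ggml kernel on tiny inputs.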

## Runtime Implementation Requirements

### 1. GGUF Metadata Reading

**File**: `llama.cpp` (or equivalent model loader)

Add support for reading loop parameters:

```cpp
// In llama_model_loader or similar
uint32_t loop_num         = 0;
uint32_t loop_window_size = 0;

// Read from GGUF metadata. gguf_find_key() returns a negative index when the
// key is missing, so check it before calling gguf_get_val_u32(), which
// returns the value for a valid key index.
const auto kid_num = gguf_find_key(ctx, "llama.loop.num");
if (kid_num >= 0) {
    loop_num = gguf_get_val_u32(ctx, kid_num);
}
const auto kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_win >= 0) {
    loop_window_size = gguf_get_val_u32(ctx, kid_win);
}

// Store in model struct
model->hparams.loop_num         = loop_num;
model->hparams.loop_window_size = loop_window_size;
```

### 2. Tensor Loading

**File**: `llama.cpp` tensor loading section

Add loop gate tensor loading:

```cpp
// In tensor loading loop.
// LLM_TENSOR_LOOP_GATE_W / LLM_TENSOR_LOOP_GATE_B are new enum values that
// still have to be added and mapped to "blk.%d.loop_gate".
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: Load loop gate tensors
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head}
    );
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head}
    );
}
```

### 3. Loop Attention Forward Pass (Conceptual)

Based on available information, the loop attention likely works as follows:

```python
# Conceptual implementation (needs verification)
import torch


def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)

    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x

    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention (self_attention is a placeholder, not defined here)
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output
        )

        # Apply loop gating mechanism
        # Gate shape: [batch, seq_len, num_heads, 1]
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size
        )

        # Blend attention output with residual using gates
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated combination, applied per head.
            # Split hidden_dim into heads so the gates broadcast over head_dim.
            batch, seq_len, hidden_dim = hidden_state.shape
            num_heads = gates.shape[2]
            attn_heads = attn_output.reshape(batch, seq_len, num_heads, -1)
            state_heads = hidden_state.reshape(batch, seq_len, num_heads, -1)
            blended = gates * attn_heads + (1 - gates) * state_heads
            hidden_state = blended.reshape(batch, seq_len, hidden_dim)
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x

    return hidden_state


def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window (unused here; its role is unknown)

    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads

    x = hidden_state.reshape(batch, seq_len, num_heads, head_dim)

    # Project head n through its own column gate_weight[:, n]:
    # [batch, seq_len, num_heads, head_dim] x [head_dim, num_heads] -> [batch, seq_len, num_heads]
    # This gives one activation per head
    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias

    # Apply sigmoid for gating in [0, 1]; the trailing axis broadcasts over head_dim
    gates = torch.sigmoid(gate_logits).unsqueeze(-1)

    return gates
```

### 4. C++/CUDA Implementation Outline

**File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)

Required kernel functions:

```cpp
// Kernel 1: Compute loop gates
struct ggml_tensor * ggml_loop_gate(
        struct ggml_context * ctx,
        struct ggml_tensor  * hidden_state,  // [batch, seq_len, n_embd]
        struct ggml_tensor  * gate_weight,   // [n_embd_head, n_head]
        struct ggml_tensor  * gate_bias,     // [n_head]
        int                   window_size
) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project each head through its own column of gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
    return NULL; // outline only: not implemented yet
}

// Kernel 2: Gated residual combination
struct ggml_tensor * ggml_gated_residual(
        struct ggml_context * ctx,
        struct ggml_tensor  * attn_output,   // [batch, seq_len, n_embd]
        struct ggml_tensor  * residual,      // [batch, seq_len, n_embd]
        struct ggml_tensor  * gates          // [batch, seq_len, n_head, 1]
) {
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting over n_embd_head
    return NULL; // outline only: not implemented yet
}

// Main loop attention function
struct ggml_tensor * ggml_loop_attention(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct llama_layer  * layer,
        int                   loop_num,
        int                   loop_window_size
) {
    struct ggml_tensor * hidden_state = x;

    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention. ggml_attention is a placeholder for the existing
        // attention graph, not an existing ggml function.
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */
        );

        // Compute gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state,
            layer->loop_gate_w,
            layer->loop_gate_b,
            loop_window_size
        );

        // Apply gated residual
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(
                ctx, attn_output, hidden_state, gates
            );
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }

    return hidden_state;
}
```
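
When implementing the gated residual kernel, a tiny reference that is easy to verify by hand helps with unit tests. The sketch below shows the per-head broadcast for a single token in plain Python; `gated_residual` is a hypothetical helper name, and the blend formula is the one assumed in this guide, not a confirmed part of the model.

```python
def gated_residual(attn_out, residual, gates):
    """Blend two flat hidden vectors head by head (hypothetical reference).

    attn_out, residual: flat lists of n_head * head_dim values
    gates:              list of n_head values in [0, 1]
    Each gate is broadcast over the head_dim values of its own head.
    """
    n_head = len(gates)
    if len(attn_out) != len(residual) or len(attn_out) % n_head != 0:
        raise ValueError("inputs must have equal length divisible by n_head")
    head_dim = len(attn_out) // n_head

    out = []
    for n, g in enumerate(gates):
        for h in range(head_dim):
            i = n * head_dim + h
            # output = gate * attn_output + (1 - gate) * residual
            out.append(g * attn_out[i] + (1.0 - g) * residual[i])
    return out


if __name__ == "__main__":
    # Three heads of size two: gate 1 keeps attention, gate 0 keeps the
    # residual, gate 0.5 averages the two.
    print(gated_residual(
        [1.0, 1.0, 2.0, 2.0, 4.0, 4.0],
        [0.0, 0.0, 8.0, 8.0, 0.0, 0.0],
        [1.0, 0.0, 0.5],
    ))
```

A CPU or CUDA kernel can be run on the same small inputs and compared element by element against this function.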

### 5. Integration Points

**Files to modify**:

1. **`llama.h`**: Add loop parameters to `llama_hparams`
2. **`llama.cpp`**:
   - Read loop metadata from GGUF
   - Load loop_gate tensors
   - Integrate `ggml_loop_attention` into forward pass
3. **`ggml.h`**: Add loop attention operation declarations
4. **`ggml.c`**: Implement CPU kernels for loop gates
5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
7. **`convert_hf_to_gguf.py`**: Already done! ✅

## Testing Strategy

### 1. Tensor Loading Test

Verify all 883 tensors load correctly:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
```

Expected output:
- 80 × loop_gate.weight tensors [128, 40]
- 80 × loop_gate.bias tensors [40]
- loop_num = 2
- loop_window_size = 64
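
The expected-output list can also be checked mechanically once tensor names and shapes have been dumped from the GGUF file. The following is a minimal sketch: it assumes the dump is already available as a plain `name -> shape` dictionary (how it is produced is left open), and `check_loop_gate_tensors` is a hypothetical helper, not part of llama.cpp.

```python
import re

# Expected shapes, taken from the list above.
EXPECTED = {"weight": (128, 40), "bias": (40,)}
PATTERN = re.compile(r"^blk\.(\d+)\.loop_gate\.(weight|bias)$")


def check_loop_gate_tensors(tensors, n_layer=80):
    """Return a list of problems found in a {name: shape} mapping."""
    problems = []
    seen = set()
    for name, shape in tensors.items():
        m = PATTERN.match(name)
        if not m:
            continue  # not a loop gate tensor
        layer, kind = int(m.group(1)), m.group(2)
        seen.add((layer, kind))
        if tuple(shape) != EXPECTED[kind]:
            problems.append(
                f"{name}: shape {tuple(shape)}, expected {EXPECTED[kind]}"
            )
    # Every layer must provide both tensors.
    for layer in range(n_layer):
        for kind in EXPECTED:
            if (layer, kind) not in seen:
                problems.append(f"blk.{layer}.loop_gate.{kind}: missing")
    return problems


if __name__ == "__main__":
    dump = {}
    for i in range(80):
        dump[f"blk.{i}.loop_gate.weight"] = (128, 40)
        dump[f"blk.{i}.loop_gate.bias"] = (40,)
    print(check_loop_gate_tensors(dump) or "all loop gate tensors present")
```

An empty result means all 160 loop gate tensors are present with the expected shapes; anything else lists the offending names.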

### 2. Forward Pass Test

Compare output with PyTorch reference:

```python
# Generate reference output with HuggingFace
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(...)  # supply the tokenizer arguments here

input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    # Greedy decoding keeps the reference deterministic
    pytorch_output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

print("Reference:", tokenizer.decode(pytorch_output[0]))
```

Then test llama.cpp:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "def fibonacci(n):" --n-predict 50 --temp 0
```

Compare the two outputs token by token. Both runs use greedy decoding so that sampling does not mask real differences; some divergence caused by quantization (Q4_K_M versus the full-precision reference) is still possible.
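
A token-by-token comparison is easiest to act on when it reports where the two sequences first disagree. The helper below is a small sketch of that idea; `first_divergence` is a hypothetical name, and it assumes both outputs are already available as lists of token IDs (or token strings).

```python
def first_divergence(reference, candidate):
    """Return the index of the first differing token, or None if identical.

    If one sequence is a strict prefix of the other, the index of the first
    missing token (the shorter length) is returned.
    """
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    ref = [101, 7, 42, 42, 9]
    out = [101, 7, 42, 13, 9]
    idx = first_divergence(ref, out)
    if idx is None:
        print("outputs match")
    else:
        print(f"first mismatch at token {idx}")
```

An early mismatch usually points to a structural problem such as a missing loop iteration, while a late one is more consistent with accumulated numerical or quantization differences.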

### 3. Performance Benchmarks

- **Throughput**: tokens/second
- **Latency**: time to first token
- **Memory**: peak GPU/CPU memory usage
- **Quality**: Compare perplexity with reference
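
For the throughput and quality metrics, the arithmetic is simple enough to pin down in code. The sketch below uses the standard definition of perplexity (the exponential of the mean negative log-likelihood per token); the function names are illustrative, and it assumes natural-log token probabilities have already been collected from each runtime.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def tokens_per_second(n_tokens, seconds):
    """Throughput for a generation run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / seconds


if __name__ == "__main__":
    # A model that assigns probability 0.5 to every token has perplexity 2.
    print(perplexity([math.log(0.5)] * 4))
    print(tokens_per_second(50, 2.0))
```

Comparing the perplexity of the llama.cpp build against the reference on the same text gives a single number that summarises how much the runtime (or the quantization) changes the model's predictions.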

## Unknown Implementation Details

The following details need to be verified against the original implementation or the technical paper:

1. **Gate activation function**: Sigmoid? Tanh? Softmax?
2. **Gate application**: Per-head? Per-token? Global?
3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
4. **Residual connection**: Standard or modified for loops?
5. **Positional encoding**: Modified during loop iterations?
6. **KV cache**: Recomputed each loop? Shared across iterations?
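
The loop window question matters because the two readings mentioned in item 3 produce different attention masks. The sketch below spells out both as visibility predicates so they can be compared against the reference model's behaviour. Both functions are hypotheses for illustration only; neither is confirmed as what the model does with `window_size=64`.

```python
def sliding_window_visible(query_pos, key_pos, window_size=64):
    """Hypothesis A: causal sliding window.

    A query sees itself and the window_size - 1 positions before it.
    """
    return 0 <= query_pos - key_pos < window_size


def chunked_visible(query_pos, key_pos, window_size=64):
    """Hypothesis B: causal attention inside fixed, non-overlapping chunks."""
    same_chunk = query_pos // window_size == key_pos // window_size
    return key_pos <= query_pos and same_chunk


if __name__ == "__main__":
    # Position 100 looking back at position 63: visible under the sliding
    # window (distance 37) but not under chunking (chunks 1 and 0 differ).
    print(sliding_window_visible(100, 63), chunked_visible(100, 63))
```

Positions where the two predicates disagree, such as the pair in the example, are good probes: if the reference attention weights are available, checking whether they are zero at those positions would distinguish the two readings.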

## References for Implementation

1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
   - Shows integration patterns
   - LoopCoderNorm → RMSNorm refactoring noted

2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
   - Contains: loop_num=2, loop_window_size=64

3. **Converted GGUFs**: `/workspace/models/converted/`
   - Reference for tensor shapes and names
   - Test files for validation

4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
   - Community request for Loop support

## Recommended Approach

### Phase 1: Minimal Implementation
1. Load loop_gate tensors (no-op in forward pass)
2. Verify GGUF files load without errors
3. Run standard Llama forward pass (ignoring loop for now)
4. **Result**: Model runs, but without the loop mechanism its output will likely not match the reference

### Phase 2: Basic Loop Implementation
1. Implement `ggml_loop_gate` CPU kernel
2. Implement gated residual combination
3. Integrate 2-iteration loop in forward pass
4. Test on CPU with small models

### Phase 3: GPU Acceleration
1. Port kernels to CUDA
2. Optimize memory layout for coalesced access
3. Implement fused kernels where beneficial
4. Benchmark against CPU

### Phase 4: Optimization
1. Profile hotspots
2. Implement kernel fusion
3. Add quantization support for loop gates
4. Optimize KV cache handling

## Community Contribution

This implementation requires significant C++/CUDA expertise. Recommended contributors:

- **C++ developers**: Familiar with ggml tensor operations
- **CUDA developers**: For GPU kernel implementation
- **ML researchers**: To verify loop attention correctness

**Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.

## Current Status

✅ **Completed**:
- Converter implementation (IQuestLoopCoderModel)
- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
- Tensor mapping documentation
- Loop parameter preservation

⏳ **Needed**:
- Runtime loop attention mechanism
- CUDA/CPU kernel implementation
- Testing against PyTorch reference
- Performance optimization

---

**Last Updated**: 2026-01-07
**Contributors**: First GGUF conversion and converter implementation
**Next Steps**: Submit PR with converter + documentation, community implements runtime