nologik committed on
Commit
a2f2377
·
verified ·
1 Parent(s): ae85f11

Add RUNTIME_IMPLEMENTATION_GUIDE.md

  1. RUNTIME_IMPLEMENTATION_GUIDE.md +402 -0
RUNTIME_IMPLEMENTATION_GUIDE.md ADDED
@@ -0,0 +1,402 @@
+ # IQuest Loop Attention Runtime Implementation Guide
+
+ **Status**: Converter implemented ✅ | Runtime support needed ⏳
+
+ ## Overview
+
+ This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) successfully creates GGUF files with all loop-specific tensors, but the inference runtime needs to be implemented.
+
+ ## What We Know
+
+ ### Architecture Summary
+
+ **Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (loop_num=2)
+
+ **Key Parameters**:
+ - `llama.loop.num`: 2 (iterations of recurrent processing)
+ - `llama.loop.window_size`: 64 (attention window for loop mechanism)
+
+ **Additional Tensors** (160 total):
+ - `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
+ - `blk.{0-79}.loop_gate.bias`: [40] per layer
+
+ ### Tensor Layout in GGUF
+
+ ```
+ Standard Llama tensors (721):
+ ├── blk.{0-79}.attn_q.weight       [5120, 5120]
+ ├── blk.{0-79}.attn_k.weight       [5120, 1024]
+ ├── blk.{0-79}.attn_v.weight       [5120, 1024]
+ ├── blk.{0-79}.attn_output.weight  [5120, 5120]
+ ├── blk.{0-79}.attn_norm.weight    [5120]
+ ├── blk.{0-79}.ffn_gate.weight     [5120, 27648]
+ ├── blk.{0-79}.ffn_up.weight       [5120, 27648]
+ ├── blk.{0-79}.ffn_down.weight     [27648, 5120]
+ └── blk.{0-79}.ffn_norm.weight     [5120]
+
+ Loop-specific tensors (160):
+ ├── blk.{0-79}.loop_gate.weight    [128, 40]  ← NEW
+ └── blk.{0-79}.loop_gate.bias      [40]       ← NEW
+
+ Embeddings (2):
+ ├── token_embd.weight              [5120, 76800]
+ └── output.weight                  [5120, 76800]
+ ```
+
+ ### Gate Projection Shape Analysis
+
+ - **Weight**: [128, 40] = [head_dim, num_heads]
+ - **Bias**: [40] = [num_heads]
+ - **Per layer**: 1 weight + 1 bias tensor
+ - **Total layers**: 80
+ - **Total loop tensors**: 160
+
+ This suggests that the gate maps each head's 128-dimensional (`head_dim`) vector to a single scalar, giving 40 gate values per token, one per attention head.
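The counts in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in this guide (80 layers, 40 heads, head_dim 128, 721 standard tensors, 2 embedding tensors); the helper names are illustrative and not part of the converter or of llama.cpp.

```python
# Figures copied from the tensor layout in this guide.
N_LAYER = 80
N_HEAD = 40
HEAD_DIM = 128
STANDARD_TENSORS = 721
EMBEDDING_TENSORS = 2


def loop_tensor_count(n_layer):
    """One loop_gate.weight plus one loop_gate.bias per layer."""
    return 2 * n_layer


def loop_gate_params(n_layer, n_head, head_dim):
    """Parameters added by the gates: a [head_dim, n_head] weight and an [n_head] bias per layer."""
    per_layer = head_dim * n_head + n_head
    return per_layer * n_layer


def total_tensor_count(n_layer):
    """Standard tensors + loop tensors + embeddings."""
    return STANDARD_TENSORS + loop_tensor_count(n_layer) + EMBEDDING_TENSORS


if __name__ == "__main__":
    print("loop tensors:", loop_tensor_count(N_LAYER))                      # 160
    print("total tensors:", total_tensor_count(N_LAYER))                    # 883
    print("gate parameters:", loop_gate_params(N_LAYER, N_HEAD, HEAD_DIM))  # 412800
    print("n_head * head_dim:", N_HEAD * HEAD_DIM)                          # 5120, the attn_q width
```

The gates add roughly 0.4M parameters in total, which is negligible next to the rest of the model, so their storage format matters far less than how they are applied at runtime.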
+
+ ## Runtime Implementation Requirements
+
+ ### 1. GGUF Metadata Reading
+
+ **File**: `llama.cpp` (or equivalent model loader)
+
+ Add support for reading loop parameters:
+
+ ```cpp
+ // In llama_model_loader or similar
+ uint32_t loop_num         = 0;
+ uint32_t loop_window_size = 0;
+
+ // Read from GGUF metadata. gguf_find_key() returns a negative id when the
+ // key is absent, and gguf_get_val_u32() returns the value directly.
+ const auto kid_num = gguf_find_key(ctx, "llama.loop.num");
+ if (kid_num >= 0) {
+     loop_num = gguf_get_val_u32(ctx, kid_num);
+ }
+ const auto kid_win = gguf_find_key(ctx, "llama.loop.window_size");
+ if (kid_win >= 0) {
+     loop_window_size = gguf_get_val_u32(ctx, kid_win);
+ }
+
+ // Store in model struct
+ model->hparams.loop_num         = loop_num;
+ model->hparams.loop_window_size = loop_window_size;
+ ```
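For inspecting a converted file outside llama.cpp, the same two keys can be read with a small standard-library parser. This is a minimal sketch that assumes the GGUF v2/v3 header layout (magic, version, tensor count, key/value count, then the key/value pairs). `read_gguf_metadata` and `build_demo_header` are hypothetical helper names, and the demo header is synthetic rather than taken from a real model file.

```python
import io
import struct

# GGUF metadata value types: type id -> struct format (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read and unpack a single fixed-size value."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return all metadata key/value pairs from an open binary GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("this sketch only handles GGUF v2 and later")
    _read(f, "<Q")             # tensor count, not needed here
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "<I")
        meta[key] = _read_value(f, vtype)
    return meta


def build_demo_header():
    """Build a tiny synthetic GGUF header holding only the two loop keys."""
    def kv_u32(key, value):
        raw = key.encode("utf-8")
        return struct.pack("<Q", len(raw)) + raw + struct.pack("<II", 4, value)

    body = kv_u32("llama.loop.num", 2) + kv_u32("llama.loop.window_size", 64)
    return b"GGUF" + struct.pack("<IQQ", 3, 0, 2) + body


if __name__ == "__main__":
    meta = read_gguf_metadata(io.BytesIO(build_demo_header()))
    print(meta["llama.loop.num"], meta["llama.loop.window_size"])
```

On a real file, pass `open(path, "rb")` instead of the in-memory demo stream. Array values such as the tokenizer vocabulary are read element by element, which is slow but keeps the sketch short.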
+
+ ### 2. Tensor Loading
+
+ **File**: `llama.cpp` tensor loading section
+
+ Add loop gate tensor loading:
+
+ ```cpp
+ // In tensor loading loop
+ for (int i = 0; i < n_layer; i++) {
+     // Existing tensors...
+
+     // NEW: Load loop gate tensors
+     model.layers[i].loop_gate_w = ml.create_tensor(
+         ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head}
+     );
+     model.layers[i].loop_gate_b = ml.create_tensor(
+         ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head}
+     );
+ }
+ ```
+
+ ### 3. Loop Attention Forward Pass (Conceptual)
+
+ Based on available information, the loop attention likely works as follows:
+
+ ```python
+ # Conceptual implementation (needs verification)
+ import torch
+
+ def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
+     """
+     Recurrent attention with loop_num iterations
+
+     Args:
+         x: input tensor [batch, seq_len, hidden_dim]
+         layer: transformer layer with loop_gate weights
+         loop_num: number of recurrent iterations (default: 2)
+         loop_window_size: attention window size (default: 64)
+
+     Returns:
+         output tensor [batch, seq_len, hidden_dim]
+     """
+     hidden_state = x
+
+     # Recurrent loop with shared parameters
+     for loop_iter in range(loop_num):
+         # Standard self-attention (self_attention is a placeholder, not defined here)
+         attn_output = self_attention(
+             hidden_state,
+             q_proj=layer.attn_q,
+             k_proj=layer.attn_k,
+             v_proj=layer.attn_v,
+             output_proj=layer.attn_output
+         )
+
+         # Apply loop gating mechanism
+         # Gate shape: [batch, seq_len, num_heads, 1]
+         gates = compute_loop_gates(
+             hidden_state,
+             gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
+             gate_bias=layer.loop_gate.bias,      # [num_heads]
+             window_size=loop_window_size
+         )
+
+         # Blend attention output with residual using gates
+         if loop_iter < loop_num - 1:
+             # Intermediate iterations: gated combination, broadcast per head
+             b, s, n, _ = gates.shape
+             attn_heads = attn_output.reshape(b, s, n, -1)
+             prev_heads = hidden_state.reshape(b, s, n, -1)
+             hidden_state = (gates * attn_heads + (1 - gates) * prev_heads).reshape(x.shape)
+         else:
+             # Final iteration: standard residual
+             hidden_state = attn_output + x
+
+     return hidden_state
+
+ def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
+     """
+     Compute per-head gating values
+
+     Args:
+         hidden_state: [batch, seq_len, hidden_dim]
+         gate_weight: [head_dim, num_heads]
+         gate_bias: [num_heads]
+         window_size: local attention window (unused here, its role is still unknown)
+
+     Returns:
+         gates: [batch, seq_len, num_heads, 1]
+     """
+     # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
+     batch, seq_len, hidden_dim = hidden_state.shape
+     num_heads = gate_bias.shape[0]
+     head_dim = hidden_dim // num_heads
+
+     x = hidden_state.view(batch, seq_len, num_heads, head_dim)
+
+     # Per-head dot product: head n uses column n of gate_weight.
+     # [batch, seq_len, num_heads, head_dim] x [head_dim, num_heads] -> [batch, seq_len, num_heads]
+     gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias
+
+     # Apply sigmoid for gating in [0, 1]
+     gates = torch.sigmoid(gate_logits)
+
+     # Trailing singleton axis so the gates broadcast over head_dim
+     return gates.unsqueeze(-1)
+ ```
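To make the shapes concrete, here is a dependency-free worked example of the per-head gate and the gated blend for a single token. It is a sketch under the same unverified assumptions as the conceptual code (sigmoid activation, one scalar gate per head, column `n` of the weight belonging to head `n`); `per_head_gates` and `gated_blend` are illustrative names only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def per_head_gates(hidden, gate_weight, gate_bias):
    """Compute one gate per head for a single token.

    hidden:      flat list of num_heads * head_dim values
    gate_weight: head_dim rows, each with num_heads columns
    gate_bias:   num_heads values
    """
    num_heads = len(gate_bias)
    head_dim = len(hidden) // num_heads
    gates = []
    for n in range(num_heads):
        head = hidden[n * head_dim:(n + 1) * head_dim]
        # Dot product of this head's slice with column n of the weight.
        logit = sum(head[h] * gate_weight[h][n] for h in range(head_dim))
        gates.append(sigmoid(logit + gate_bias[n]))
    return gates


def gated_blend(attn_out, residual, gates):
    """Blend per element, using the gate of the head each element belongs to."""
    head_dim = len(attn_out) // len(gates)
    return [
        gates[i // head_dim] * a + (1.0 - gates[i // head_dim]) * r
        for i, (a, r) in enumerate(zip(attn_out, residual))
    ]


if __name__ == "__main__":
    # Two heads with head_dim = 2; zero weights and biases give gates of 0.5.
    hidden = [1.0, -1.0, 0.5, 2.0]
    weight = [[0.0, 0.0], [0.0, 0.0]]
    bias = [0.0, 0.0]
    gates = per_head_gates(hidden, weight, bias)
    print(gates)
    print(gated_blend([2.0, 2.0, 4.0, 4.0], [0.0, 0.0, 0.0, 0.0], gates))
```

With a gate of 0.5 the blend is the plain average of the attention output and the previous hidden state. Gates near 1 pass the attention output through, and gates near 0 keep the previous state.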
+
+ ### 4. C++/CUDA Implementation Outline
+
+ **File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)
+
+ Required kernel functions:
+
+ ```cpp
+ // Kernel 1: Compute loop gates
+ struct ggml_tensor * ggml_loop_gate(
+     struct ggml_context * ctx,
+     struct ggml_tensor * hidden_state,  // [batch, seq_len, n_embd]
+     struct ggml_tensor * gate_weight,   // [n_embd_head, n_head]
+     struct ggml_tensor * gate_bias,     // [n_head]
+     int window_size
+ ) {
+     // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
+     // 2. Project through gate_weight
+     // 3. Add gate_bias
+     // 4. Apply sigmoid activation
+     // 5. Return gates [batch, seq_len, n_head, 1]
+     return NULL; // outline only, not implemented yet
+ }
+
+ // Kernel 2: Gated residual combination
+ struct ggml_tensor * ggml_gated_residual(
+     struct ggml_context * ctx,
+     struct ggml_tensor * attn_output,  // [batch, seq_len, n_embd]
+     struct ggml_tensor * residual,     // [batch, seq_len, n_embd]
+     struct ggml_tensor * gates         // [batch, seq_len, n_head, 1]
+ ) {
+     // output = gates * attn_output + (1 - gates) * residual
+     // Per-head gating needs broadcasting
+     return NULL; // outline only, not implemented yet
+ }
+
+ // Main loop attention function
+ struct ggml_tensor * ggml_loop_attention(
+     struct ggml_context * ctx,
+     struct ggml_tensor * x,
+     struct llama_layer * layer,
+     int loop_num,
+     int loop_window_size
+ ) {
+     struct ggml_tensor * hidden_state = x;
+
+     for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
+         // Standard attention (ggml_attention is a placeholder, not an existing ggml function)
+         struct ggml_tensor * attn_output = ggml_attention(
+             ctx, hidden_state, layer, /* ... */
+         );
+
+         // Compute gates
+         struct ggml_tensor * gates = ggml_loop_gate(
+             ctx, hidden_state,
+             layer->loop_gate_w,
+             layer->loop_gate_b,
+             loop_window_size
+         );
+
+         // Apply gated residual
+         if (loop_iter < loop_num - 1) {
+             hidden_state = ggml_gated_residual(
+                 ctx, attn_output, hidden_state, gates
+             );
+         } else {
+             hidden_state = ggml_add(ctx, attn_output, x);
+         }
+     }
+
+     return hidden_state;
+ }
+ ```
+
+ ### 5. Integration Points
+
+ **Files to modify**:
+
+ 1. **`llama.h`**: Add loop parameters to `llama_hparams`
+ 2. **`llama.cpp`**:
+    - Read loop metadata from GGUF
+    - Load loop_gate tensors
+    - Integrate `ggml_loop_attention` into forward pass
+ 3. **`ggml.h`**: Add loop attention operation declarations
+ 4. **`ggml.c`**: Implement CPU kernels for loop gates
+ 5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
+ 6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
+ 7. **`convert_hf_to_gguf.py`**: Already done! ✅
+
+ ## Testing Strategy
+
+ ### 1. Tensor Loading Test
+
+ Verify all 883 tensors load correctly:
+
+ ```bash
+ ./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
+ ```
+
+ Expected output:
+ - 80 × loop_gate.weight tensors [128, 40]
+ - 80 × loop_gate.bias tensors [40]
+ - loop_num = 2
+ - loop_window_size = 64
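The expected tensor list can also be generated and checked programmatically. The sketch below builds the 160 expected names and shapes from the figures in this guide and compares them against a name-to-shape mapping. How that mapping is obtained from a GGUF file is left open, and both function names are hypothetical.

```python
def expected_loop_tensors(n_layer=80, n_head=40, head_dim=128):
    """Names and shapes of the loop gate tensors described in this guide."""
    expected = {}
    for i in range(n_layer):
        expected[f"blk.{i}.loop_gate.weight"] = (head_dim, n_head)
        expected[f"blk.{i}.loop_gate.bias"] = (n_head,)
    return expected


def find_problems(loaded, expected):
    """Return human-readable problems; an empty list means everything matched.

    loaded: mapping of tensor name -> shape (any sequence of ints)
    """
    problems = []
    for name, shape in expected.items():
        if name not in loaded:
            problems.append(f"missing: {name}")
        elif tuple(loaded[name]) != shape:
            problems.append(
                f"shape mismatch: {name} has {tuple(loaded[name])}, expected {shape}"
            )
    return problems


if __name__ == "__main__":
    expected = expected_loop_tensors()
    # Simulate a loader that dropped one bias and mis-shaped one weight.
    loaded = dict(expected)
    del loaded["blk.3.loop_gate.bias"]
    loaded["blk.7.loop_gate.weight"] = (40, 128)
    for line in find_problems(loaded, expected):
        print(line)
```

A transposed weight such as `(40, 128)` is flagged as a mismatch, which is useful because the dimension order in GGUF listings is easy to confuse with PyTorch's.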
+
+ ### 2. Forward Pass Test
+
+ Compare output with PyTorch reference:
+
+ ```python
+ # Generate reference output with HuggingFace
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
+     trust_remote_code=True
+ )
+ tokenizer = AutoTokenizer.from_pretrained(...)
+
+ input_text = "def fibonacci(n):"
+ inputs = tokenizer(input_text, return_tensors="pt")
+
+ with torch.no_grad():
+     # Greedy decoding keeps the reference output deterministic
+     pytorch_output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
+
+ print("Reference:", tokenizer.decode(pytorch_output[0]))
+ ```
+
+ Then test llama.cpp:
+
+ ```bash
+ ./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
+     --prompt "def fibonacci(n):" --n-predict 50
+ ```
+
+ Compare the outputs token by token. For the comparison to be meaningful, both runs need deterministic (greedy) decoding, and a quantized GGUF such as Q4_K_M may diverge from the full-precision reference after a few tokens even when the implementation is correct.
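A small helper makes the token-level comparison less error-prone than reading two outputs side by side. This is a minimal sketch; `first_divergence` is a hypothetical name, and the token IDs in the demo are made up.

```python
def first_divergence(reference, candidate):
    """Return the index of the first differing token, or None if the sequences match.

    If one sequence is a strict prefix of the other, the index where the
    shorter one ends is returned.
    """
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    reference = [101, 7, 42, 42, 9]   # made-up token IDs
    candidate = [101, 7, 42, 13, 9]
    idx = first_divergence(reference, candidate)
    if idx is None:
        print("outputs match")
    else:
        print(f"first mismatch at position {idx}")
```

A mismatch within the first few tokens usually points at an implementation error, whereas a late mismatch is more likely to come from quantization or floating-point differences.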
+
+ ### 3. Performance Benchmarks
+
+ - **Throughput**: tokens/second
+ - **Latency**: time to first token
+ - **Memory**: peak GPU/CPU memory usage
+ - **Quality**: Compare perplexity with reference
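Two of these metrics reduce to simple formulas: throughput is generated tokens divided by elapsed time, and perplexity is the exponential of the negative mean log-probability per token. The sketch below assumes per-token natural-log probabilities have already been collected from each runtime; the function names are illustrative.

```python
import math


def throughput(n_tokens, elapsed_seconds):
    """Generated tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_tokens / elapsed_seconds


def perplexity(token_logprobs):
    """exp(-mean(log p)) over natural-log probabilities of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # A model that assigns probability 0.5 to every token has perplexity 2.
    print(perplexity([math.log(0.5)] * 4))
    print(throughput(50, 2.0))
```

When comparing against the reference, both perplexities should be computed on the same text with the same tokenization. Otherwise the numbers are not comparable.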
+
+ ## Unknown Implementation Details
+
+ The following need verification from original implementation or technical paper:
+
+ 1. **Gate activation function**: Sigmoid? Tanh? Softmax?
+ 2. **Gate application**: Per-head? Per-token? Global?
+ 3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
+ 4. **Residual connection**: Standard or modified for loops?
+ 5. **Positional encoding**: Modified during loop iterations?
+ 6. **KV cache**: Recomputed each loop? Shared across iterations?
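To illustrate one of the candidate answers to the loop-window question, the sketch below builds a causal sliding-window mask in which each query position attends to itself and at most `window_size - 1` earlier positions. This is only one possible reading of `window_size=64`, not a confirmed description of the model, and the function name is hypothetical.

```python
def sliding_window_causal_mask(seq_len, window_size):
    """mask[q][k] is True when query position q may attend to key position k.

    Causal: k <= q. Windowed: the key is among the last window_size positions.
    """
    return [
        [(k <= q) and (q - k < window_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


if __name__ == "__main__":
    # Small example: 4 positions, window of 2.
    for row in sliding_window_causal_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this reading, sequences shorter than the window see ordinary causal attention, so a test prompt would need more than 64 tokens before the window has any visible effect.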
+
+ ## References for Implementation
+
+ 1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
+    - Shows integration patterns
+    - LoopCoderNorm → RMSNorm refactoring noted
+
+ 2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
+    - Contains: loop_num=2, loop_window_size=64
+
+ 3. **Converted GGUFs**: `/workspace/models/converted/`
+    - Reference for tensor shapes and names
+    - Test files for validation
+
+ 4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
+    - Community request for Loop support
+
+ ## Recommended Approach
+
+ ### Phase 1: Minimal Implementation
+ 1. Load loop_gate tensors (no-op in forward pass)
+ 2. Verify GGUF files load without errors
+ 3. Run standard Llama forward pass (ignoring loop for now)
+ 4. **Result**: Model loads and runs, but without the loop mechanism, so output quality is not expected to match the reference
+
+ ### Phase 2: Basic Loop Implementation
+ 1. Implement `ggml_loop_gate` CPU kernel
+ 2. Implement gated residual combination
+ 3. Integrate 2-iteration loop in forward pass
+ 4. Test on CPU with small models
+
+ ### Phase 3: GPU Acceleration
+ 1. Port kernels to CUDA
+ 2. Optimize memory layout for coalesced access
+ 3. Implement fused kernels where beneficial
+ 4. Benchmark against CPU
+
+ ### Phase 4: Optimization
+ 1. Profile hotspots
+ 2. Implement kernel fusion
+ 3. Add quantization support for loop gates
+ 4. Optimize KV cache handling
+
+ ## Community Contribution
+
+ This implementation requires significant C++/CUDA expertise. Recommended contributors:
+
+ - **C++ developers**: Familiar with ggml tensor operations
+ - **CUDA developers**: For GPU kernel implementation
+ - **ML researchers**: To verify loop attention correctness
+
+ **Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.
+
+ ## Current Status
+
+ ✅ **Completed**:
+ - Converter implementation (IQuestLoopCoderModel)
+ - GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
+ - Tensor mapping documentation
+ - Loop parameter preservation
+
+ ⏳ **Needed**:
+ - Runtime loop attention mechanism
+ - CUDA/CPU kernel implementation
+ - Testing against PyTorch reference
+ - Performance optimization
+
+ ---
+
+ **Last Updated**: 2026-01-07
+ **Contributors**: First GGUF conversion and converter implementation
+ **Next Steps**: Submit PR with converter + documentation, community implements runtime