# IQuest Loop Attention Runtime Implementation Guide

**Status**: Converter implemented ✅ | Runtime support needed ⏳

## Overview

This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) successfully creates GGUF files with all loop-specific tensors, but the inference runtime still needs to be implemented.
## What We Know

### Architecture Summary

**Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (loop_num=2)

**Key Parameters**:

- `llama.loop.num`: 2 (iterations of recurrent processing)
- `llama.loop.window_size`: 64 (attention window for the loop mechanism)

**Additional Tensors** (160 total):

- `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
- `blk.{0-79}.loop_gate.bias`: [40] per layer
### Tensor Layout in GGUF

```
Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight      [5120, 5120]
├── blk.{0-79}.attn_k.weight      [5120, 1024]
├── blk.{0-79}.attn_v.weight      [5120, 1024]
├── blk.{0-79}.attn_output.weight [5120, 5120]
├── blk.{0-79}.attn_norm.weight   [5120]
├── blk.{0-79}.ffn_gate.weight    [5120, 27648]
├── blk.{0-79}.ffn_up.weight      [5120, 27648]
├── blk.{0-79}.ffn_down.weight    [27648, 5120]
└── blk.{0-79}.ffn_norm.weight    [5120]

Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight   [128, 40]   ← NEW
└── blk.{0-79}.loop_gate.bias     [40]        ← NEW

Embeddings (2):
├── token_embd.weight             [5120, 76800]
└── output.weight                 [5120, 76800]
```
### Gate Projection Shape Analysis

- **Weight**: [128, 40] = [head_dim, num_heads]
- **Bias**: [40] = [num_heads]
- **Per layer**: 1 weight + 1 bias tensor
- **Total layers**: 80
- **Total loop tensors**: 160

This suggests the gate projects each head's 128-dimensional slice of the hidden state down to a single per-head gate value.
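
A quick shape sketch of that reading (purely illustrative; the actual gate semantics still need verification, see "Unknown Implementation Details" below):

```python
# Hypothetical per-head gate projection: each head's 128-dim slice is projected
# by its own column of loop_gate.weight, yielding one gate logit per head.
import torch

num_heads, head_dim = 40, 128
hidden_dim = num_heads * head_dim             # 5120

x      = torch.randn(1, 8, hidden_dim)        # [batch, seq_len, hidden_dim]
gate_w = torch.randn(head_dim, num_heads)     # blk.N.loop_gate.weight
gate_b = torch.randn(num_heads)               # blk.N.loop_gate.bias

x_heads = x.view(1, 8, num_heads, head_dim)   # split hidden dim into heads
logits  = torch.einsum('bsnh,hn->bsn', x_heads, gate_w) + gate_b
print(logits.shape)                           # torch.Size([1, 8, 40]) -> one gate per head
```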
## Runtime Implementation Requirements

### 1. GGUF Metadata Reading

**File**: `llama.cpp` (or equivalent model loader)

Add support for reading loop parameters:
```cpp
// In llama_model_loader or similar
uint32_t loop_num         = 0;
uint32_t loop_window_size = 0;

// Read from GGUF metadata (gguf_get_val_u32 returns the value; check the key exists first)
const auto kid_num = gguf_find_key(ctx, "llama.loop.num");
const auto kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_num >= 0) { loop_num         = gguf_get_val_u32(ctx, kid_num); }
if (kid_win >= 0) { loop_window_size = gguf_get_val_u32(ctx, kid_win); }

// Store in model struct
model->hparams.loop_num         = loop_num;
model->hparams.loop_window_size = loop_window_size;
```
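
Before any C++ work starts, the converted GGUF can also be sanity-checked from Python with the `gguf` package bundled with llama.cpp (a hedged sketch; the file name is illustrative and the scalar-value extraction follows the gguf-py reader examples):

```python
# Check that the converted file carries the two loop keys the loader must read.
from gguf import GGUFReader

reader = GGUFReader("IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf")

for key in ("llama.loop.num", "llama.loop.window_size"):
    field = reader.get_field(key)
    if field is None:
        print(f"{key}: MISSING")
    else:
        # For scalar fields, parts[data[0]] holds the stored value
        print(f"{key} = {int(field.parts[field.data[0]][0])}")
```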
### 2. Tensor Loading

**File**: `llama.cpp` tensor loading section

Add loop gate tensor loading:

```cpp
// In tensor loading loop
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: Load loop gate tensors
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head}
    );
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head}
    );
}
```
### 3. Loop Attention Forward Pass (Conceptual)

Based on the available information, the loop attention likely works as follows:

```python
# Conceptual implementation (needs verification against the original model code)
import torch


def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations.

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)

    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x

    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output,
        )

        # Apply loop gating mechanism
        # Gate shape: [batch, seq_len, num_heads, 1]
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size,
        )

        # Blend attention output with residual using gates
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated combination.
            # Expand the per-head gates across head_dim so they broadcast
            # against the [batch, seq_len, hidden_dim] tensors.
            batch, seq_len, hidden_dim = hidden_state.shape
            num_heads = gates.shape[2]
            g = gates.expand(batch, seq_len, num_heads, hidden_dim // num_heads)
            g = g.reshape(batch, seq_len, hidden_dim)
            hidden_state = g * attn_output + (1 - g) * hidden_state
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x

    return hidden_state


def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values.

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window (exact usage unknown, see below)

    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads
    x = hidden_state.view(batch, seq_len, num_heads, head_dim)

    # Project each head's head_dim slice through its own column of gate_weight,
    # giving one gate logit per head, then add the per-head bias
    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias

    # Apply sigmoid for gating in [0, 1]
    return torch.sigmoid(gate_logits).unsqueeze(-1)
```
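
As a quick sanity check of the sketch above, the control flow and gate shapes can be exercised with random weights and a stubbed-out attention (toy code, same module as the functions above; the real attention is deliberately out of scope):

```python
# Toy smoke test: identity "attention" so only the loop/gating plumbing is exercised.
import torch
from types import SimpleNamespace

def self_attention(x, q_proj=None, k_proj=None, v_proj=None, output_proj=None):
    return x  # stub; the projections are ignored on purpose

num_heads, head_dim = 40, 128
layer = SimpleNamespace(
    attn_q=None, attn_k=None, attn_v=None, attn_output=None,  # unused by the stub
    loop_gate=SimpleNamespace(
        weight=torch.randn(head_dim, num_heads),
        bias=torch.zeros(num_heads),
    ),
)

x = torch.randn(1, 16, num_heads * head_dim)
y = loop_attention_forward(x, layer, loop_num=2, loop_window_size=64)
print(y.shape)  # expected: torch.Size([1, 16, 5120])
```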
### 4. C++/CUDA Implementation Outline

**File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)

Required kernel functions:

```cpp
// Kernel 1: Compute loop gates
struct ggml_tensor * ggml_loop_gate(
    struct ggml_context * ctx,
    struct ggml_tensor  * hidden_state,  // [batch, seq_len, n_embd]
    struct ggml_tensor  * gate_weight,   // [n_embd_head, n_head]
    struct ggml_tensor  * gate_bias,     // [n_head]
    int window_size
) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project through gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
}

// Kernel 2: Gated residual combination
struct ggml_tensor * ggml_gated_residual(
    struct ggml_context * ctx,
    struct ggml_tensor  * attn_output,   // [batch, seq_len, n_embd]
    struct ggml_tensor  * residual,      // [batch, seq_len, n_embd]
    struct ggml_tensor  * gates          // [batch, seq_len, n_head, 1]
) {
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting across n_embd_head
}

// Main loop attention function
struct ggml_tensor * ggml_loop_attention(
    struct ggml_context * ctx,
    struct ggml_tensor  * x,
    struct llama_layer  * layer,
    int loop_num,
    int loop_window_size
) {
    struct ggml_tensor * hidden_state = x;

    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention (placeholder for the existing attention graph build)
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */
        );

        // Compute gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state,
            layer->loop_gate_w,
            layer->loop_gate_b,
            loop_window_size
        );

        // Apply gated residual
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(
                ctx, attn_output, hidden_state, gates
            );
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }

    return hidden_state;
}
```
### 5. Integration Points

**Files to modify**:

1. **`llama.h`**: Add loop parameters to `llama_hparams`
2. **`llama.cpp`**:
   - Read loop metadata from GGUF
   - Load loop_gate tensors
   - Integrate `ggml_loop_attention` into the forward pass
3. **`ggml.h`**: Add loop attention operation declarations
4. **`ggml.c`**: Implement CPU kernels for loop gates
5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
7. **`convert_hf_to_gguf.py`**: Already done! ✅
## Testing Strategy

### 1. Tensor Loading Test

Verify that all 883 tensors load correctly:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
```

Expected output (a verification sketch follows the list):

- 80 × loop_gate.weight tensors [128, 40]
- 80 × loop_gate.bias tensors [40]
- loop_num = 2
- loop_window_size = 64
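
The same check can be scripted with gguf-py so it does not depend on runtime support landing first (hedged sketch; the file name is illustrative, and GGUF may report shapes in reversed order relative to the table above):

```python
# Count the loop tensors and confirm their shapes straight from the GGUF file.
from gguf import GGUFReader

reader = GGUFReader("IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf")

loop_w = [t for t in reader.tensors if t.name.endswith("loop_gate.weight")]
loop_b = [t for t in reader.tensors if t.name.endswith("loop_gate.bias")]

print("total tensors:   ", len(reader.tensors))                              # expected 883
print("loop_gate.weight:", len(loop_w), loop_w[0].shape if loop_w else None)  # expected 80
print("loop_gate.bias:  ", len(loop_b), loop_b[0].shape if loop_b else None)  # expected 80
```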
### 2. Forward Pass Test

Compare output with a PyTorch reference:

```python
# Generate reference output with HuggingFace
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(...)

input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    pytorch_output = model.generate(**inputs, max_new_tokens=50)

print("Reference:", tokenizer.decode(pytorch_output[0]))
```

Then test llama.cpp:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "def fibonacci(n):" --n-predict 50
```

Compare the outputs token by token (a small helper sketch follows).
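
A minimal helper for that comparison (illustrative only; exact agreement is only meaningful when both sides decode greedily):

```python
# Report the first position where the two completions diverge at the token level.
def first_divergence(tokenizer, ref_text, test_text):
    ref_ids  = tokenizer(ref_text,  add_special_tokens=False)["input_ids"]
    test_ids = tokenizer(test_text, add_special_tokens=False)["input_ids"]
    for i, (a, b) in enumerate(zip(ref_ids, test_ids)):
        if a != b:
            return i, tokenizer.decode([a]), tokenizer.decode([b])
    return None  # identical up to the shorter length

# Usage: first_divergence(tokenizer, reference_completion, llamacpp_completion)
```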
### 3. Performance Benchmarks

- **Throughput**: tokens/second
- **Latency**: time to first token
- **Memory**: peak GPU/CPU memory usage
- **Quality**: perplexity compared with the PyTorch reference

## Unknown Implementation Details

The following need verification from the original implementation or a technical paper:

1. **Gate activation function**: Sigmoid? Tanh? Softmax?
2. **Gate application**: Per-head? Per-token? Global?
3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
4. **Residual connection**: Standard or modified for loops?
5. **Positional encoding**: Modified during loop iterations?
6. **KV cache**: Recomputed each loop? Shared across iterations?
## References for Implementation

1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
   - Shows integration patterns
   - LoopCoderNorm → RMSNorm refactoring noted
2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
   - Contains: loop_num=2, loop_window_size=64
3. **Converted GGUFs**: `/workspace/models/converted/`
   - Reference for tensor shapes and names
   - Test files for validation
4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
   - Community request for Loop support
## Recommended Approach

### Phase 1: Minimal Implementation

1. Load loop_gate tensors (no-op in forward pass)
2. Verify GGUF files load without errors
3. Run standard Llama forward pass (ignoring the loop for now)
4. **Result**: Model runs but without loop benefits

### Phase 2: Basic Loop Implementation

1. Implement `ggml_loop_gate` CPU kernel
2. Implement gated residual combination
3. Integrate the 2-iteration loop into the forward pass
4. Test on CPU with small models

### Phase 3: GPU Acceleration

1. Port kernels to CUDA
2. Optimize memory layout for coalesced access
3. Implement fused kernels where beneficial
4. Benchmark against CPU

### Phase 4: Optimization

1. Profile hotspots
2. Implement kernel fusion
3. Add quantization support for loop gates
4. Optimize KV cache handling

## Community Contribution

This implementation requires significant C++/CUDA expertise. Recommended contributors:

- **C++ developers**: familiar with ggml tensor operations
- **CUDA developers**: for GPU kernel implementation
- **ML researchers**: to verify loop attention correctness

**Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.
## Current Status

✅ **Completed**:

- Converter implementation (IQuestLoopCoderModel)
- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
- Tensor mapping documentation
- Loop parameter preservation

⏳ **Needed**:

- Runtime loop attention mechanism
- CUDA/CPU kernel implementation
- Testing against PyTorch reference
- Performance optimization

---

**Last Updated**: 2026-01-07
**Contributors**: First GGUF conversion and converter implementation
**Next Steps**: Submit PR with converter + documentation, community implements runtime