# IQuest Loop Attention Runtime Implementation Guide

**Status**: Converter implemented ✅ | Runtime support needed ⏳

## Overview

This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) already produces GGUF files containing all loop-specific tensors; what is still missing is the inference runtime.

## What We Know

### Architecture Summary

**Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (loop_num=2)

**Key Parameters**:

- `llama.loop.num`: 2 (iterations of recurrent processing)
- `llama.loop.window_size`: 64 (attention window for the loop mechanism)

**Additional Tensors** (160 total):

- `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
- `blk.{0-79}.loop_gate.bias`: [40] per layer

### Tensor Layout in GGUF

```
Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight       [5120, 5120]
├── blk.{0-79}.attn_k.weight       [5120, 1024]
├── blk.{0-79}.attn_v.weight       [5120, 1024]
├── blk.{0-79}.attn_output.weight  [5120, 5120]
├── blk.{0-79}.attn_norm.weight    [5120]
├── blk.{0-79}.ffn_gate.weight     [5120, 27648]
├── blk.{0-79}.ffn_up.weight       [5120, 27648]
├── blk.{0-79}.ffn_down.weight     [27648, 5120]
└── blk.{0-79}.ffn_norm.weight     [5120]

Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight    [128, 40]   ← NEW
└── blk.{0-79}.loop_gate.bias      [40]        ← NEW

Embeddings (2):
├── token_embd.weight              [5120, 76800]
└── output.weight                  [5120, 76800]
```

### Gate Projection Shape Analysis

- **Weight**: [128, 40] = [head_dim, num_heads]
- **Bias**: [40] = [num_heads]
- **Per layer**: 1 weight + 1 bias tensor
- **Total layers**: 80
- **Total loop tensors**: 160

This suggests the gate projects each head's activations (head_dim values) down to a single per-head gate.

## Runtime Implementation Requirements

### 1. GGUF Metadata Reading

**File**: `llama.cpp` (or the equivalent model loader)

Add support for reading the loop parameters:

```cpp
// In llama_model_loader or similar
uint32_t loop_num         = 0;
uint32_t loop_window_size = 0;

// Read from GGUF metadata (gguf_find_key returns a negative id if the key is missing)
const int64_t kid_num = gguf_find_key(ctx, "llama.loop.num");
const int64_t kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_num >= 0) loop_num         = gguf_get_val_u32(ctx, kid_num);
if (kid_win >= 0) loop_window_size = gguf_get_val_u32(ctx, kid_win);

// Store in model struct
model->hparams.loop_num         = loop_num;
model->hparams.loop_window_size = loop_window_size;
```

### 2. Tensor Loading

**File**: `llama.cpp`, tensor loading section

Add loading of the loop gate tensors:

```cpp
// In tensor loading loop
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: load loop gate tensors
    // LLM_TENSOR_LOOP_GATE is a new id: it must be added to the llm_tensor enum
    // and mapped to the base name "blk.%d.loop_gate" in the tensor-name table
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE, "weight", i), {n_embd_head, n_head}
    );
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE, "bias", i), {n_head}
    );
}
```
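Both snippets above assume new fields on the model structs. A minimal sketch of those additions is shown below; the field names are assumptions chosen to match the code above, and the exact struct locations (`llama.h`, `src/llama-hparams.h`, `src/llama-model.h`, ...) vary between llama.cpp versions.

```cpp
// Sketch only: new fields assumed by the metadata-reading and tensor-loading code above.

struct llama_hparams {
    // ... existing fields ...
    uint32_t loop_num         = 0;   // from llama.loop.num         (2 for this model)
    uint32_t loop_window_size = 0;   // from llama.loop.window_size (64 for this model)
};

struct llama_layer {
    // ... existing per-layer tensors ...
    struct ggml_tensor * loop_gate_w = nullptr;  // [n_embd_head, n_head] = [128, 40]
    struct ggml_tensor * loop_gate_b = nullptr;  // [n_head] = [40]
};
```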
### 3. Loop Attention Forward Pass (Conceptual)

Based on the available information, the loop attention likely works as follows:

```python
# Conceptual implementation (needs verification)
import torch


def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)

    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x

    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output,
        )

        # Apply the loop gating mechanism
        # One gate per head per position: [batch, seq_len, num_heads, 1]
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size,
        )

        # Blend attention output with the residual using the gates
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated combination
            # (view per-head so the [..., num_heads, 1] gates broadcast over head_dim)
            b, s, d = hidden_state.shape
            n = gates.shape[2]
            attn_h  = attn_output.view(b, s, n, d // n)
            state_h = hidden_state.view(b, s, n, d // n)
            hidden_state = (gates * attn_h + (1 - gates) * state_h).view(b, s, d)
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x

    return hidden_state


def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window

    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads
    x = hidden_state.view(batch, seq_len, num_heads, head_dim)

    # Per-head projection: head n is reduced against column n of gate_weight,
    # giving one logit per head per position
    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias

    # Apply sigmoid for gating in [0, 1]
    gates = torch.sigmoid(gate_logits).unsqueeze(-1)

    return gates
```

### 4. C++/CUDA Implementation Outline

**File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)

Required kernel functions:

```cpp
// Kernel 1: Compute loop gates
struct ggml_tensor * ggml_loop_gate(
    struct ggml_context * ctx,
    struct ggml_tensor * hidden_state,  // [batch, seq_len, n_embd]
    struct ggml_tensor * gate_weight,   // [n_embd_head, n_head]
    struct ggml_tensor * gate_bias,     // [n_head]
    int window_size
) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project through gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
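    //
    // One possible realization of steps 1-5 using existing ggml ops, so a first
    // CPU version would not need a new kernel. This is a sketch under the same
    // unverified assumptions as the Python above: hidden_state is the usual
    // activation of shape [n_embd, n_tokens] in ggml's innermost-first layout,
    // and window_size is ignored because its exact role is still unknown
    // (see "Unknown Implementation Details" below).
    (void) window_size;

    const int64_t n_embd_head = gate_weight->ne[0];   // 128
    const int64_t n_head      = gate_weight->ne[1];   // 40
    const int64_t n_tokens    = hidden_state->ne[1];

    // [n_embd, n_tokens] -> [n_embd_head, n_head, n_tokens]
    struct ggml_tensor * xh = ggml_reshape_3d(ctx, hidden_state, n_embd_head, n_head, n_tokens);

    // Per-head dot product with that head's gate_weight column,
    // broadcast over the token dimension
    struct ggml_tensor * xw     = ggml_mul(ctx, xh, gate_weight);   // [n_embd_head, n_head, n_tokens]
    struct ggml_tensor * logits = ggml_sum_rows(ctx, xw);           // [1, n_head, n_tokens]

    // Add the per-head bias (reshaped so it broadcasts over tokens)
    logits = ggml_add(ctx, logits, ggml_reshape_3d(ctx, gate_bias, 1, n_head, 1));

    // Sigmoid keeps the gates in [0, 1]: one gate per head per token,
    // i.e. the ggml-layout equivalent of [batch, seq_len, n_head, 1]
    return ggml_sigmoid(ctx, logits);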
}

// Kernel 2: Gated residual combination
struct ggml_tensor * ggml_gated_residual(
    struct ggml_context * ctx,
    struct ggml_tensor * attn_output,  // [batch, seq_len, n_embd]
    struct ggml_tensor * residual,     // [batch, seq_len, n_embd]
    struct ggml_tensor * gates         // [batch, seq_len, n_head, 1]
) {
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting over the head dimension
}

// Main loop attention function
struct ggml_tensor * ggml_loop_attention(
    struct ggml_context * ctx,
    struct ggml_tensor * x,
    struct llama_layer * layer,
    int loop_num,
    int loop_window_size
) {
    struct ggml_tensor * hidden_state = x;

    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention (placeholder for the existing attention graph build)
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */
        );

        // Compute gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state,
            layer->loop_gate_w,
            layer->loop_gate_b,
            loop_window_size
        );

        // Apply gated residual
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(
                ctx, attn_output, hidden_state, gates
            );
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }

    return hidden_state;
}
```

### 5. Integration Points

**Files to modify**:

1. **`llama.h`**: Add loop parameters to `llama_hparams`
2. **`llama.cpp`**:
   - Read loop metadata from GGUF
   - Load loop_gate tensors
   - Integrate `ggml_loop_attention` into the forward pass
3. **`ggml.h`**: Add loop attention operation declarations
4. **`ggml.c`**: Implement CPU kernels for loop gates
5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
7. **`convert_hf_to_gguf.py`**: Already done! ✅

## Testing Strategy

### 1. Tensor Loading Test

Verify that all 883 tensors load correctly:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
```

Expected output:

- 80 × loop_gate.weight tensors [128, 40]
- 80 × loop_gate.bias tensors [40]
- loop_num = 2
- loop_window_size = 64

### 2. Forward Pass Test

Compare output with the PyTorch reference:

```python
# Generate reference output with HuggingFace
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(...)

input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    pytorch_output = model.generate(**inputs, max_new_tokens=50)

print("Reference:", tokenizer.decode(pytorch_output[0]))
```

Then test llama.cpp:

```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "def fibonacci(n):" --n-predict 50
```

Compare the outputs token by token.

### 3. Performance Benchmarks

- **Throughput**: tokens/second
- **Latency**: time to first token
- **Memory**: peak GPU/CPU memory usage
- **Quality**: compare perplexity with the reference

## Unknown Implementation Details

The following need verification from the original implementation or a technical paper:

1. **Gate activation function**: Sigmoid? Tanh? Softmax?
2. **Gate application**: Per-head? Per-token? Global?
3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
4. **Residual connection**: Standard or modified for loops?
5. **Positional encoding**: Modified during loop iterations?
6. **KV cache**: Recomputed each loop? Shared across iterations?
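Until these questions are answered, the runtime can fall back to the standard Llama forward pass even when the loop tensors are present; this is exactly Phase 1 of the recommended approach below. A minimal graph-build sketch of that guard, assuming the hparams fields introduced earlier (`build_standard_attention` and the enable flag are hypothetical placeholders, not existing llama.cpp APIs):

```cpp
// Sketch: gate the experimental loop path behind a flag during bring-up.
const bool enable_loop_attention = false;  // flip on once the loop semantics are verified

struct ggml_tensor * cur;
if (enable_loop_attention && hparams.loop_num > 1) {
    // Experimental path: recurrent gated attention (see ggml_loop_attention above)
    cur = ggml_loop_attention(ctx, inpL, &model.layers[il],
                              hparams.loop_num, hparams.loop_window_size);
} else {
    // Phase 1 fallback: standard attention; loop_gate tensors are loaded but unused
    cur = build_standard_attention(ctx, inpL, &model.layers[il]);
}
```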
## References for Implementation

1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
   - Shows integration patterns
   - LoopCoderNorm → RMSNorm refactoring noted
2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
   - Contains: loop_num=2, loop_window_size=64
3. **Converted GGUFs**: `/workspace/models/converted/`
   - Reference for tensor shapes and names
   - Test files for validation
4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
   - Community request for Loop support

## Recommended Approach

### Phase 1: Minimal Implementation

1. Load loop_gate tensors (no-op in forward pass)
2. Verify GGUF files load without errors
3. Run standard Llama forward pass (ignoring loop for now)
4. **Result**: Model runs but without loop benefits

### Phase 2: Basic Loop Implementation

1. Implement `ggml_loop_gate` CPU kernel
2. Implement gated residual combination
3. Integrate 2-iteration loop in forward pass
4. Test on CPU with small models

### Phase 3: GPU Acceleration

1. Port kernels to CUDA
2. Optimize memory layout for coalesced access
3. Implement fused kernels where beneficial
4. Benchmark against CPU

### Phase 4: Optimization

1. Profile hotspots
2. Implement kernel fusion
3. Add quantization support for loop gates
4. Optimize KV cache handling

## Community Contribution

This implementation requires significant C++/CUDA expertise. Recommended contributors:

- **C++ developers**: Familiar with ggml tensor operations
- **CUDA developers**: For GPU kernel implementation
- **ML researchers**: To verify loop attention correctness

**Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.

## Current Status

✅ **Completed**:
- Converter implementation (`IQuestLoopCoderModel`)
- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
- Tensor mapping documentation
- Loop parameter preservation

⏳ **Needed**:
- Runtime loop attention mechanism
- CUDA/CPU kernel implementation
- Testing against PyTorch reference
- Performance optimization

---

**Last Updated**: 2026-01-07
**Contributors**: First GGUF conversion and converter implementation
**Next Steps**: Submit PR with converter + documentation, community implements runtime