+# IQuest Loop Attention Runtime Implementation Guide
+**Status**: Converter implemented ✅ | Runtime support needed ⏳
+## Overview
+This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) already produces GGUF files containing all loop-specific tensors, but the inference runtime does not yet exist.
+## What We Know
+### Architecture Summary
+**Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (loop_num=2)
+**Key Parameters**:
+- `llama.loop.num`: 2 (iterations of recurrent processing)
+- `llama.loop.window_size`: 64 (attention window for loop mechanism)
+**Additional Tensors** (160 total):
+- `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
+- `blk.{0-79}.loop_gate.bias`: [40] per layer
+### Tensor Layout in GGUF
+```
+Standard Llama tensors (721):
+├── blk.{0-79}.attn_q.weight [5120, 5120]
+├── blk.{0-79}.attn_k.weight [5120, 1024]
+├── blk.{0-79}.attn_v.weight [5120, 1024]
+├── blk.{0-79}.attn_output.weight [5120, 5120]
+├── blk.{0-79}.attn_norm.weight [5120]
+├── blk.{0-79}.ffn_gate.weight [5120, 27648]
+├── blk.{0-79}.ffn_up.weight [5120, 27648]
+├── blk.{0-79}.ffn_down.weight [27648, 5120]
+└── blk.{0-79}.ffn_norm.weight [5120]
+Loop-specific tensors (160):
+├── blk.{0-79}.loop_gate.weight [128, 40]  ← NEW
+└── blk.{0-79}.loop_gate.bias [40]         ← NEW
+Embeddings (2):
+├── token_embd.weight [5120, 76800]
+└── output.weight [5120, 76800]
+```
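The counts in the layout can be cross-checked with a little arithmetic. The sketch below uses only the shapes listed above; the one standard tensor that the 721 total implies but the list omits is assumed (not confirmed) to be `output_norm.weight`, and it is left out of the parameter sum.

```python
import math

N_LAYER = 80

# Per-layer shapes exactly as listed in the layout above
STANDARD = {
    "attn_q.weight": (5120, 5120),
    "attn_k.weight": (5120, 1024),
    "attn_v.weight": (5120, 1024),
    "attn_output.weight": (5120, 5120),
    "attn_norm.weight": (5120,),
    "ffn_gate.weight": (5120, 27648),
    "ffn_up.weight": (5120, 27648),
    "ffn_down.weight": (27648, 5120),
    "ffn_norm.weight": (5120,),
}
LOOP = {"loop_gate.weight": (128, 40), "loop_gate.bias": (40,)}
EMBEDDINGS = {"token_embd.weight": (5120, 76800), "output.weight": (5120, 76800)}


def param_count(shapes):
    """Sum the element counts of every shape in the mapping."""
    return sum(math.prod(shape) for shape in shapes.values())


listed_standard = len(STANDARD) * N_LAYER     # per-layer tensors that are listed
unlisted_standard = 721 - listed_standard     # ASSUMPTION: output_norm.weight
loop_tensors = len(LOOP) * N_LAYER
total_tensors = listed_standard + unlisted_standard + loop_tensors + len(EMBEDDINGS)

loop_params = param_count(LOOP) * N_LAYER
listed_params = param_count(STANDARD) * N_LAYER + loop_params + param_count(EMBEDDINGS)

print(total_tensors, loop_params, round(listed_params / 1e9, 1))
```

The listed shapes come to roughly 39.8 billion parameters, which is consistent with the 40B name, and the loop gates contribute only 412,800 of them.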
+### Gate Projection Shape Analysis
+- **Weight**: [128, 40] = [head_dim, num_heads]
+- **Bias**: [40] = [num_heads]
+- **Per layer**: 1 weight + 1 bias tensor
+- **Total layers**: 80
+- **Total loop tensors**: 160
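+This suggests that each head's 128-dimensional slice is projected to a single scalar, giving 40 gates per token per layer (40 × 128 = 5120, which matches the hidden size).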
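One reading of these shapes, sketched in plain Python for a single token: head `n` takes its own slice of the hidden vector and column `n` of the gate weight. The sigmoid and the function name `per_head_gates` are assumptions for illustration; the real activation is still unverified.

```python
import math


def per_head_gates(hidden, gate_weight, gate_bias):
    """Return one gate per head for a single token.

    hidden:      flat list of length num_heads * head_dim
    gate_weight: head_dim rows, each with num_heads columns
    gate_bias:   num_heads values
    """
    num_heads = len(gate_bias)
    head_dim = len(gate_weight)
    assert len(hidden) == num_heads * head_dim
    gates = []
    for n in range(num_heads):
        head_slice = hidden[n * head_dim:(n + 1) * head_dim]
        # Dot product of this head's slice with column n of the weight
        logit = sum(head_slice[h] * gate_weight[h][n] for h in range(head_dim))
        logit += gate_bias[n]
        gates.append(1.0 / (1.0 + math.exp(-logit)))  # ASSUMPTION: sigmoid
    return gates


# Toy sizes standing in for head_dim=128, num_heads=40
head_dim, num_heads = 4, 3
hidden = [0.1 * i for i in range(head_dim * num_heads)]
weight = [[0.0] * num_heads for _ in range(head_dim)]
bias = [0.0, 2.0, -2.0]
gates = per_head_gates(hidden, weight, bias)
print(len(gates))
```

With zero weights the gates depend only on the bias, which makes the toy case easy to reason about.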
+## Runtime Implementation Requirements
+### 1. GGUF Metadata Reading
+**File**: `llama.cpp` (or equivalent model loader)
+Add support for reading loop parameters:
+```cpp
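+// In llama_model_loader or similar
+uint32_t loop_num = 0;
+uint32_t loop_window_size = 0;
+// Read from GGUF metadata. gguf_find_key returns a negative id when the key is
+// absent, and gguf_get_val_u32 returns the value instead of writing to a pointer.
+const auto kid_loop_num = gguf_find_key(ctx, "llama.loop.num");
+if (kid_loop_num >= 0) {
+    loop_num = gguf_get_val_u32(ctx, kid_loop_num);
+}
+const auto kid_loop_window = gguf_find_key(ctx, "llama.loop.window_size");
+if (kid_loop_window >= 0) {
+    loop_window_size = gguf_get_val_u32(ctx, kid_loop_window);
+}
+// Store in model struct
+model->hparams.loop_num = loop_num;
+model->hparams.loop_window_size = loop_window_size;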
+```
+### 2. Tensor Loading
+**File**: `llama.cpp` tensor loading section
+Add loop gate tensor loading:
+```cpp
+// In tensor loading loop
+for (int i = 0; i < n_layer; i++) {
+    // Existing tensors...
+    // NEW: Load loop gate tensors (LLM_TENSOR_LOOP_GATE_W / _B are proposed enum entries, not yet in llama.cpp)
+    model.layers[i].loop_gate_w = ml.create_tensor(
+        ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head}
+    );
+    model.layers[i].loop_gate_b = ml.create_tensor(
+        ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head}
+    );
+}
+```
+### 3. Loop Attention Forward Pass (Conceptual)
+Based on available information, the loop attention likely works as follows:
+```python
+# Conceptual implementation (needs verification against the reference model).
+# self_attention() is a placeholder for the layer's standard attention.
+import torch
+def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
+    """
+    Recurrent attention with loop_num iterations
+    Args:
+        x: input tensor [batch, seq_len, hidden_dim]
+        layer: transformer layer with loop_gate weights
+        loop_num: number of recurrent iterations (default: 2)
+        loop_window_size: attention window size (default: 64)
+    Returns:
+        output tensor [batch, seq_len, hidden_dim]
+    """
+    batch, seq_len, hidden_dim = x.shape
+    hidden_state = x
+    # Recurrent loop with shared parameters
+    for loop_iter in range(loop_num):
+        # Standard self-attention
+        attn_output = self_attention(
+            hidden_state,
+            q_proj=layer.attn_q,
+            k_proj=layer.attn_k,
+            v_proj=layer.attn_v,
+            output_proj=layer.attn_output
+        )
+        # Apply loop gating mechanism
+        # Gate shape: [batch, seq_len, num_heads, 1]
+        gates = compute_loop_gates(
+            hidden_state,
+            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
+            gate_bias=layer.loop_gate.bias,       # [num_heads]
+            window_size=loop_window_size
+        )
+        # Blend attention output with residual using gates
+        if loop_iter < loop_num - 1:
+            # Intermediate iterations: gated combination.
+            # Split into heads so each gate broadcasts over its head_dim slice.
+            num_heads = gates.shape[2]
+            attn_heads = attn_output.reshape(batch, seq_len, num_heads, -1)
+            state_heads = hidden_state.reshape(batch, seq_len, num_heads, -1)
+            blended = gates * attn_heads + (1 - gates) * state_heads
+            hidden_state = blended.reshape(batch, seq_len, hidden_dim)
+        else:
+            # Final iteration: standard residual
+            hidden_state = attn_output + x
+    return hidden_state
+def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
+    """
+    Compute per-head gating values
+    Args:
+        hidden_state: [batch, seq_len, hidden_dim]
+        gate_weight: [head_dim, num_heads]
+        gate_bias: [num_heads]
+        window_size: local attention window (unused here; its role is still unknown)
+    Returns:
+        gates: [batch, seq_len, num_heads, 1]
+    """
+    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
+    batch, seq_len, hidden_dim = hidden_state.shape
+    num_heads = gate_bias.shape[0]
+    head_dim = hidden_dim // num_heads
+    x = hidden_state.reshape(batch, seq_len, num_heads, head_dim)
+    # Head n is projected with column n of gate_weight, giving one logit per head:
+    # [batch, seq_len, num_heads, head_dim] x [head_dim, num_heads] -> [batch, seq_len, num_heads]
+    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias
+    # Apply sigmoid for gating in [0, 1]; the trailing axis broadcasts over head_dim
+    gates = torch.sigmoid(gate_logits).unsqueeze(-1)
+    return gates
+```
+### 4. C++/CUDA Implementation Outline
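+**File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)
+Required functions (signatures only; the bodies are placeholders, and `ggml_attention` stands in for the existing attention graph code):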
+```cpp
+// Kernel 1: Compute loop gates
+struct ggml_tensor * ggml_loop_gate(
+    struct ggml_context * ctx,
+    struct ggml_tensor * hidden_state,  // [batch, seq_len, n_embd]
+    struct ggml_tensor * gate_weight,   // [n_embd_head, n_head]
+    struct ggml_tensor * gate_bias,     // [n_head]
+    int window_size
+) {
+    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
+    // 2. Project through gate_weight
+    // 3. Add gate_bias
+    // 4. Apply sigmoid activation
+    // 5. Return gates [batch, seq_len, n_head, 1]
+}
+// Kernel 2: Gated residual combination
+struct ggml_tensor * ggml_gated_residual(
+    struct ggml_context * ctx,
+    struct ggml_tensor * attn_output,  // [batch, seq_len, n_embd]
+    struct ggml_tensor * residual,     // [batch, seq_len, n_embd]
+    struct ggml_tensor * gates         // [batch, seq_len, n_head, 1]
+) {
+    // output = gates * attn_output + (1 - gates) * residual
+    // Per-head gating needs broadcasting
+}
+// Main loop attention function
+struct ggml_tensor * ggml_loop_attention(
+    struct ggml_context * ctx,
+    struct ggml_tensor * x,
+    struct llama_layer * layer,
+    int loop_num,
+    int loop_window_size
+) {
+    struct ggml_tensor * hidden_state = x;
+    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
+        // Standard attention
+        struct ggml_tensor * attn_output = ggml_attention(
+            ctx, hidden_state, layer, /* ... */
+        );
+        // Compute gates
+        struct ggml_tensor * gates = ggml_loop_gate(
+            ctx, hidden_state,
+            layer->loop_gate_w,
+            layer->loop_gate_b,
+            loop_window_size
+        );
+        // Apply gated residual
+        if (loop_iter < loop_num - 1) {
+            hidden_state = ggml_gated_residual(
+                ctx, attn_output, hidden_state, gates
+            );
+        } else {
+            hidden_state = ggml_add(ctx, attn_output, x);
+        }
+    }
+    return hidden_state;
+}
+```
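The broadcasting that `ggml_gated_residual` needs can be shown for a single token in plain Python: one gate per head is applied to every element of that head's slice. This is a sketch of the blend formula from the outline, not of ggml's memory layout, and the function name is only illustrative.

```python
def gated_residual(attn_output, residual, gates, head_dim):
    """Blend two flat per-token vectors; gates[n] covers all of head n."""
    assert len(attn_output) == len(residual) == len(gates) * head_dim
    out = []
    for i, (a, r) in enumerate(zip(attn_output, residual)):
        g = gates[i // head_dim]  # broadcast one gate over its head's slice
        out.append(g * a + (1.0 - g) * r)
    return out


# Two heads of size 2: head 0 takes attn_output fully, head 1 takes a quarter
blended = gated_residual([1.0] * 4, [0.0] * 4, [1.0, 0.25], head_dim=2)
print(blended)
```

A gate of 1.0 passes the attention output through unchanged and a gate of 0.0 keeps the previous hidden state, which is why the gate range matters for how much each loop iteration can change the state.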
+### 5. Integration Points
+**Files to modify**:
+1. **`llama.h`** (or wherever `llama_hparams` is defined in the target tree, e.g. `llama-hparams.h`): Add loop parameters to `llama_hparams`
+2. **`llama.cpp`**:
+   - Read loop metadata from GGUF
+   - Load loop_gate tensors
+   - Integrate `ggml_loop_attention` into forward pass
+3. **`ggml.h`**: Add loop attention operation declarations
+4. **`ggml.c`**: Implement CPU kernels for loop gates
+5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
+6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
+7. **`convert_hf_to_gguf.py`**: Already done! ✅
+## Testing Strategy
+### 1. Tensor Loading Test
+Verify all 883 tensors load correctly:
+```bash
+./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
+```
+Expected output:
+- 80 × loop_gate.weight tensors [128, 40]
+- 80 × loop_gate.bias tensors [40]
+- loop_num = 2
+- loop_window_size = 64
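Until a runtime exists, the metadata values can be checked independently of llama.cpp by parsing the GGUF header directly. The following is a minimal sketch, assuming a little-endian GGUF file of version 2 or later (64-bit counts); the helper names are hypothetical and belong to neither llama.cpp nor gguf-py.

```python
import io
import struct

# GGUF scalar value type ids mapped to struct formats
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8")


def _read_value(stream, vtype):
    if vtype in _SCALARS:
        return _read(stream, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(stream)
    if vtype == _ARRAY:
        elem_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(stream):
    """Return (version, tensor_count, metadata dict) from a binary stream."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(stream, "<I")
    tensor_count = _read(stream, "<Q")
    kv_count = _read(stream, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        vtype = _read(stream, "<I")
        metadata[key] = _read_value(stream, vtype)
    return version, tensor_count, metadata


def check_loop_metadata(path):
    """Open a GGUF file and return (tensor_count, loop_num, loop_window_size)."""
    with open(path, "rb") as f:
        _, tensor_count, meta = read_gguf_metadata(f)
    return tensor_count, meta.get("llama.loop.num"), meta.get("llama.loop.window_size")
```

For the converted files, `check_loop_metadata` would be expected to return 883, 2 and 64 if the converter wrote the values described above.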
+### 2. Forward Pass Test
+Compare output with PyTorch reference:
+```python
+# Generate reference output with HuggingFace
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(...)
+input_text = "def fibonacci(n):"
+inputs = tokenizer(input_text, return_tensors="pt")
+with torch.no_grad():
+    # Greedy decoding keeps the reference output deterministic
+    pytorch_output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
+print("Reference:", tokenizer.decode(pytorch_output[0]))
+```
+Then test llama.cpp:
+```bash
+./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
+    --prompt "def fibonacci(n):" --n-predict 50 --temp 0
+```
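+Compare the outputs token by token, using greedy decoding on both sides. Quantized files such as Q4_K_M can legitimately diverge from the PyTorch reference after a few tokens, so the F16 GGUF is the fairer comparison.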
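The comparison itself is simple once both runs are available as lists of token IDs. A minimal sketch, where `first_divergence` is a hypothetical helper name:

```python
from itertools import zip_longest


def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter list.
    """
    pairs = zip_longest(reference_ids, candidate_ids)
    for index, (ref_tok, cand_tok) in enumerate(pairs):
        if ref_tok != cand_tok:
            return index
    return None


print(first_divergence([1, 2, 3], [1, 9, 3]))
```

Reporting the index rather than a pass/fail flag makes it easier to tell an early divergence, which usually points at a logic error, from a late one, which is more typical of quantization drift.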
+### 3. Performance Benchmarks
+- **Throughput**: tokens/second
+- **Latency**: time to first token
+- **Memory**: peak GPU/CPU memory usage
+- **Quality**: Compare perplexity with reference
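For reference, the throughput and perplexity figures reduce to two short formulas. The sketch below assumes per-token natural-log probabilities are already available from whichever runtime is being measured; llama.cpp also ships a `llama-perplexity` tool for measuring perplexity on a GGUF file.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def tokens_per_second(n_tokens, elapsed_seconds):
    """Throughput over a generation run."""
    return n_tokens / elapsed_seconds


# A model that assigns probability 0.5 to every token has perplexity 2
ppl = perplexity([math.log(0.5)] * 4)
rate = tokens_per_second(50, 2.0)
print(ppl, rate)
```

Perplexity is only comparable between the two runtimes when both score the same text with the same tokenization.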
+## Unknown Implementation Details
+The following details need verification against the original implementation or a technical paper:
+1. **Gate activation function**: Sigmoid? Tanh? Softmax?
+2. **Gate application**: Per-head? Per-token? Global?
+3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
+4. **Residual connection**: Standard or modified for loops?
+5. **Positional encoding**: Modified during loop iterations?
+6. **KV cache**: Recomputed each loop? Shared across iterations?
+## References for Implementation
+1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
+   - Shows integration patterns
+   - LoopCoderNorm → RMSNorm refactoring noted
+2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
+   - Contains: loop_num=2, loop_window_size=64
+3. **Converted GGUFs**: `/workspace/models/converted/`
+   - Reference for tensor shapes and names
+   - Test files for validation
+4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
+   - Community request for Loop support
+## Recommended Approach
+### Phase 1: Minimal Implementation
+1. Load loop_gate tensors (no-op in forward pass)
+2. Verify GGUF files load without errors
+3. Run standard Llama forward pass (ignoring loop for now)
+4. **Result**: Model runs but without loop benefits
+### Phase 2: Basic Loop Implementation
+1. Implement `ggml_loop_gate` CPU kernel
+2. Implement gated residual combination
+3. Integrate 2-iteration loop in forward pass
+4. Test on CPU with small models
+### Phase 3: GPU Acceleration
+1. Port kernels to CUDA
+2. Optimize memory layout for coalesced access
+3. Implement fused kernels where beneficial
+4. Benchmark against CPU
+### Phase 4: Optimization
+1. Profile hotspots
+2. Implement kernel fusion
+3. Add quantization support for loop gates
+4. Optimize KV cache handling
+## Community Contribution
+This implementation requires significant C++/CUDA expertise. Recommended contributors:
+- **C++ developers**: Familiar with ggml tensor operations
+- **CUDA developers**: For GPU kernel implementation
+- **ML researchers**: To verify loop attention correctness
+**Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.
+## Current Status
+✅ **Completed**:
+- Converter implementation (IQuestLoopCoderModel)
+- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
+- Tensor mapping documentation
+- Loop parameter preservation
+⏳ **Needed**:
+- Runtime loop attention mechanism
+- CUDA/CPU kernel implementation
+- Testing against PyTorch reference
+- Performance optimization
+---
+**Last Updated**: 2026-01-07
+**Contributors**: First GGUF conversion and converter implementation
+**Next Steps**: Submit a PR with the converter and documentation; the community then implements the runtime