
IQuest Loop Attention Runtime Implementation Guide

Status: Converter implemented ✅ | Runtime support needed ⏳

Overview

This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (IQuestLoopCoderModel) successfully creates GGUF files with all loop-specific tensors, but the inference runtime needs to be implemented.

What We Know

Architecture Summary

Loop Mechanism: Recurrent transformer design with shared parameters across two iterations (loop_num=2)

Key Parameters:

  • llama.loop.num: 2 (iterations of recurrent processing)
  • llama.loop.window_size: 64 (attention window for loop mechanism)

Additional Tensors (160 total):

  • blk.{0-79}.loop_gate.weight: [128, 40] per layer
  • blk.{0-79}.loop_gate.bias: [40] per layer

Tensor Layout in GGUF

Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight [5120, 5120]
├── blk.{0-79}.attn_k.weight [5120, 1024]
├── blk.{0-79}.attn_v.weight [5120, 1024]
├── blk.{0-79}.attn_output.weight [5120, 5120]
├── blk.{0-79}.attn_norm.weight [5120]
├── blk.{0-79}.ffn_gate.weight [5120, 27648]
├── blk.{0-79}.ffn_up.weight [5120, 27648]
├── blk.{0-79}.ffn_down.weight [27648, 5120]
└── blk.{0-79}.ffn_norm.weight [5120]

Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight [128, 40]  ← NEW
└── blk.{0-79}.loop_gate.bias [40]         ← NEW

Embeddings (2):
├── token_embd.weight [5120, 76800]
└── output.weight [5120, 76800]

Gate Projection Shape Analysis

  • Weight: [128, 40] = [head_dim, num_heads]
  • Bias: [40] = [num_heads]
  • Per layer: 1 weight + 1 bias tensor
  • Total layers: 80
  • Total loop tensors: 160

This suggests the gate maps each head's 128-dimensional vector to a single per-head gate value, e.g. gate_h = activation(x_h · W[:, h] + b[h]), where the activation is not yet confirmed (see Unknown Implementation Details). Per layer this adds 128 × 40 + 40 = 5,160 parameters, roughly 0.41 M across all 80 layers.

Runtime Implementation Requirements

1. GGUF Metadata Reading

File: llama.cpp (or equivalent model loader)

Add support for reading loop parameters:

// In llama_model_loader or similar
uint32_t loop_num = 0;
uint32_t loop_window_size = 0;

// Read from GGUF metadata (gguf_get_val_u32 returns the value directly; check that the key exists first)
const int64_t kid_num = gguf_find_key(ctx, "llama.loop.num");
const int64_t kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_num >= 0) { loop_num         = gguf_get_val_u32(ctx, kid_num); }
if (kid_win >= 0) { loop_window_size = gguf_get_val_u32(ctx, kid_win); }

// Store in model struct
model->hparams.loop_num = loop_num;
model->hparams.loop_window_size = loop_window_size;
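
In current llama.cpp, hyperparameters are normally read through the registered LLM_KV key table and llama_model_loader::get_key rather than raw gguf_* calls. A hedged sketch of that wiring follows; the ids LLM_KV_LOOP_NUM / LLM_KV_LOOP_WINDOW_SIZE and the hparams field names are assumptions, not existing identifiers:

// llama-arch.h (sketch): hypothetical new key ids
//   LLM_KV_LOOP_NUM,
//   LLM_KV_LOOP_WINDOW_SIZE,

// llama-arch.cpp (sketch): key-name registration ("%s" expands to the architecture name, here "llama")
//   { LLM_KV_LOOP_NUM,         "%s.loop.num"         },
//   { LLM_KV_LOOP_WINDOW_SIZE, "%s.loop.window_size" },

// hparams loading (sketch): optional keys, so existing non-loop models still load
ml.get_key(LLM_KV_LOOP_NUM,         hparams.loop_num,         /*required =*/ false);
ml.get_key(LLM_KV_LOOP_WINDOW_SIZE, hparams.loop_window_size, /*required =*/ false);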

2. Tensor Loading

File: llama.cpp tensor loading section

Add loop gate tensor loading:

// In the per-layer tensor loading loop
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: load loop gate tensors (one tensor id, "weight"/"bias" suffixes, following the usual convention)
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE, "weight", i), {n_embd_head, n_head}
    );
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE, "bias", i), {n_head}
    );
}
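
For tn(...) to resolve these names, the new tensor id also needs an entry in the architecture's tensor-name table, and llama_layer needs members to hold the weights. A sketch, with LLM_TENSOR_LOOP_GATE and the members loop_gate_w / loop_gate_b as assumed identifiers:

// llama-arch.h (sketch): hypothetical new tensor id
//   LLM_TENSOR_LOOP_GATE,

// llama-arch.cpp (sketch): name pattern, so tn(LLM_TENSOR_LOOP_GATE, "weight", i)
// resolves to "blk.<i>.loop_gate.weight" and tn(..., "bias", i) to "blk.<i>.loop_gate.bias"
//   { LLM_TENSOR_LOOP_GATE, "blk.%d.loop_gate" },

// llama_layer (sketch): members referenced by the loading code above
//   struct ggml_tensor * loop_gate_w = nullptr;  // [n_embd_head, n_head]
//   struct ggml_tensor * loop_gate_b = nullptr;  // [n_head]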

3. Loop Attention Forward Pass (Conceptual)

Based on available information, the loop attention likely works as follows:

# Conceptual implementation (needs verification)
import torch

def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)

    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x

    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output
        )

        # Apply loop gating mechanism
        # Gates: one value per head, [batch, seq_len, num_heads, 1]
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size
        )

        # Blend attention output with residual using gates
        # (in a full implementation, attn_output and hidden_state would be viewed as
        # [batch, seq_len, num_heads, head_dim] so the per-head gates broadcast over head_dim)
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated combination
            hidden_state = gates * attn_output + (1 - gates) * hidden_state
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x

    return hidden_state

def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window

    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads

    x = hidden_state.view(batch, seq_len, num_heads, head_dim)

    # Per-head dot product: head n's head_dim-slice against column n of gate_weight,
    # giving one gate logit per head
    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias

    # Apply sigmoid for gating in [0, 1]; keep a trailing dim for broadcasting over head_dim
    gates = torch.sigmoid(gate_logits).unsqueeze(-1)  # [batch, seq_len, num_heads, 1]

    return gates

4. C++/CUDA Implementation Outline

File: ggml-cuda.cu (CUDA kernels) or ggml.c (CPU implementation)

Required kernel functions:

// Kernel 1: Compute loop gates
struct ggml_tensor * ggml_loop_gate(
    struct ggml_context * ctx,
    struct ggml_tensor * hidden_state,  // [batch, seq_len, n_embd]
    struct ggml_tensor * gate_weight,   // [n_embd_head, n_head]
    struct ggml_tensor * gate_bias,     // [n_head]
    int window_size
) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project through gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
}

// Kernel 2: Gated residual combination
struct ggml_tensor * ggml_gated_residual(
    struct ggml_context * ctx,
    struct ggml_tensor * attn_output,  // [batch, seq_len, n_embd]
    struct ggml_tensor * residual,     // [batch, seq_len, n_embd]
    struct ggml_tensor * gates         // [batch, seq_len, n_head, 1]
) {
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting
}

// Main loop attention function
struct ggml_tensor * ggml_loop_attention(
    struct ggml_context * ctx,
    struct ggml_tensor * x,
    struct llama_layer * layer,
    int loop_num,
    int loop_window_size
) {
    struct ggml_tensor * hidden_state = x;

    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */
        );

        // Compute gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state,
            layer->loop_gate_w,
            layer->loop_gate_b,
            loop_window_size
        );

        // Apply gated residual
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(
                ctx, attn_output, hidden_state, gates
            );
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }

    return hidden_state;
}
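
Before writing dedicated kernels, both helpers above can likely be composed from existing ggml graph ops, which makes a first CPU-only implementation (and execution on the existing backends) much cheaper to prototype. The sketch below assumes the per-head dot-product interpretation of the gate, which is itself unverified, and deliberately ignores window_size because its role is still unknown (see Unknown Implementation Details); the helper names are placeholders:

// Sketch: loop gates composed from existing ggml ops (no new kernel).
// Shapes follow ggml convention: ne[0] is the innermost dimension.
static struct ggml_tensor * build_loop_gates(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,      // [n_embd, n_tokens]
        struct ggml_tensor  * gate_w,   // [n_embd_head, n_head]
        struct ggml_tensor  * gate_b,   // [n_head]
        int64_t n_head, int64_t n_embd_head) {
    const int64_t n_tokens = cur->ne[1];

    // [n_embd, n_tokens] -> [n_embd_head, n_head, n_tokens]
    struct ggml_tensor * x = ggml_reshape_3d(ctx, ggml_cont(ctx, cur), n_embd_head, n_head, n_tokens);

    // per-head dot product with the gate weight: broadcast-multiply, then sum over head_dim
    struct ggml_tensor * logits = ggml_sum_rows(ctx, ggml_mul(ctx, x, gate_w)); // [1, n_head, n_tokens]
    logits = ggml_reshape_2d(ctx, logits, n_head, n_tokens);                    // [n_head, n_tokens]
    logits = ggml_add(ctx, logits, gate_b);                                     // broadcast bias over tokens

    return ggml_sigmoid(ctx, logits);                                           // [n_head, n_tokens]
}

// Sketch: gated residual, rewritten as residual + g*(attn - residual) to avoid forming (1 - g).
static struct ggml_tensor * build_gated_residual(
        struct ggml_context * ctx,
        struct ggml_tensor  * attn,   // [n_embd, n_tokens]
        struct ggml_tensor  * resid,  // [n_embd, n_tokens]
        struct ggml_tensor  * gates,  // [n_head, n_tokens]
        int64_t n_head, int64_t n_embd_head) {
    const int64_t n_tokens = attn->ne[1];

    struct ggml_tensor * a = ggml_reshape_3d(ctx, ggml_cont(ctx, attn),  n_embd_head, n_head, n_tokens);
    struct ggml_tensor * r = ggml_reshape_3d(ctx, ggml_cont(ctx, resid), n_embd_head, n_head, n_tokens);
    struct ggml_tensor * g = ggml_reshape_3d(ctx, gates, 1, n_head, n_tokens); // broadcast over head_dim

    struct ggml_tensor * out = ggml_add(ctx, r, ggml_mul(ctx, ggml_sub(ctx, a, r), g));
    return ggml_reshape_2d(ctx, ggml_cont(ctx, out), n_embd_head * n_head, n_tokens);
}

If profiling later shows these composed ops to be a bottleneck, they can be replaced by the fused kernels outlined above.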

5. Integration Points

Files to modify:

  1. llama.h: Add loop parameters to llama_hparams (see the sketch after this list)
  2. llama.cpp:
    • Read loop metadata from GGUF
    • Load loop_gate tensors
    • Integrate ggml_loop_attention into forward pass
  3. ggml.h: Add loop attention operation declarations
  4. ggml.c: Implement CPU kernels for loop gates
  5. ggml-cuda.cu: Implement CUDA kernels for GPU acceleration
  6. ggml-metal.m: Implement Metal shaders for Apple Silicon
  7. convert_hf_to_gguf.py: Already done! ✅
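
For item 1, a minimal sketch of the new hyperparameter fields (field names and defaults are assumptions; in recent llama.cpp trees the struct lives in src/llama-hparams.h rather than llama.h):

struct llama_hparams {
    // ... existing fields ...

    // Hypothetical loop-attention parameters; defaults keep non-loop models on the standard path
    uint32_t loop_num         = 1;   // 2 for IQuest Loop models
    uint32_t loop_window_size = 0;   // 64 for IQuest Loop models; 0 = unused
};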

Testing Strategy

1. Tensor Loading Test

Verify all 883 tensors load correctly:

./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose

Expected output:

  • 80 × loop_gate.weight tensors [128, 40]
  • 80 × loop_gate.bias tensors [40]
  • loop_num = 2
  • loop_window_size = 64
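
The same checks can be scripted without any runtime changes by reading the file through ggml's gguf API. A minimal standalone sketch; the file name is an example, and in older ggml trees the gguf declarations live in ggml.h rather than gguf.h:

// check_loop_gguf.c: print loop metadata and count loop_gate tensors
#include <stdio.h>
#include <string.h>
#include "gguf.h"

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf";

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", fname); return 1; }

    // loop metadata (expected: loop_num = 2, loop_window_size = 64)
    int64_t kid = gguf_find_key(ctx, "llama.loop.num");
    if (kid >= 0) printf("llama.loop.num         = %u\n", gguf_get_val_u32(ctx, kid));
    kid = gguf_find_key(ctx, "llama.loop.window_size");
    if (kid >= 0) printf("llama.loop.window_size = %u\n", gguf_get_val_u32(ctx, kid));

    // tensor inventory (expected: 883 total, 160 containing "loop_gate")
    int n_loop = 0;
    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; i++) {
        if (strstr(gguf_get_tensor_name(ctx, i), "loop_gate") != NULL) n_loop++;
    }
    printf("tensors: %lld total, %d loop_gate\n", (long long) n_tensors, n_loop);

    gguf_free(ctx);
    return 0;
}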

2. Forward Pass Test

Compare output with PyTorch reference:

# Generate reference output with HuggingFace
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    pytorch_output = model.generate(**inputs, max_new_tokens=50)

print("Reference:", tokenizer.decode(pytorch_output[0]))

Then test llama.cpp:

./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "def fibonacci(n):" --n-predict 50

Compare token-by-token outputs.

3. Performance Benchmarks

  • Throughput: tokens/second
  • Latency: time to first token
  • Memory: peak GPU/CPU memory usage
  • Quality: Compare perplexity with reference

Unknown Implementation Details

The following need verification from the original implementation or a technical paper:

  1. Gate activation function: Sigmoid? Tanh? Softmax?
  2. Gate application: Per-head? Per-token? Global?
  3. Loop window: How is window_size=64 used? Sliding window? Chunking?
  4. Residual connection: Standard or modified for loops?
  5. Positional encoding: Modified during loop iterations?
  6. KV cache: Recomputed each loop? Shared across iterations?

References for Implementation

  1. vLLM PR #31575: https://github.com/vllm-project/vllm/pull/31575

    • Shows integration patterns
    • LoopCoderNorm → RMSNorm refactoring noted
  2. Model Config: /workspace/.cache/huggingface/.../config.json

    • Contains: loop_num=2, loop_window_size=64
  3. Converted GGUFs: /workspace/models/converted/

    • Reference for tensor shapes and names
    • Test files for validation
  4. Issue #18517: https://github.com/ggerganov/llama.cpp/issues/18517

    • Community request for Loop support

Recommended Approach

Phase 1: Minimal Implementation

  1. Load loop_gate tensors (no-op in forward pass)
  2. Verify GGUF files load without errors
  3. Run standard Llama forward pass (ignoring loop for now; see the sketch after this list)
  4. Result: Model runs but without loop benefits
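
As a concrete shape for steps 1-3, one option is to guard the new code path on the loop metadata so that Phase 1 (and any non-loop model) falls back to the unmodified graph. A sketch only; build_standard_attention / build_loop_attention are placeholders for the existing and new graph-building code, and the hparams fields are the assumed ones from above:

// Sketch: use the standard graph unless loop metadata and tensors are present
const uint32_t loop_num = hparams.loop_num > 0 ? hparams.loop_num : 1;

if (loop_num == 1 || model.layers[il].loop_gate_w == nullptr) {
    // Phase 1: standard Llama attention block; loop_gate tensors are loaded but unused
    cur = build_standard_attention(/* ... */);
} else {
    // Phase 2+: recurrent loop with gated residuals (see ggml_loop_attention above)
    cur = build_loop_attention(/* ... */);
}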

Phase 2: Basic Loop Implementation

  1. Implement ggml_loop_gate CPU kernel
  2. Implement gated residual combination
  3. Integrate 2-iteration loop in forward pass
  4. Test on CPU with small models

Phase 3: GPU Acceleration

  1. Port kernels to CUDA
  2. Optimize memory layout for coalesced access
  3. Implement fused kernels where beneficial
  4. Benchmark against CPU

Phase 4: Optimization

  1. Profile hotspots
  2. Implement kernel fusion
  3. Add quantization support for loop gates
  4. Optimize KV cache handling

Community Contribution

This implementation requires significant C++/CUDA expertise. Recommended contributors:

  • C++ developers: Familiar with ggml tensor operations
  • CUDA developers: For GPU kernel implementation
  • ML researchers: To verify loop attention correctness

Coordination: Use llama.cpp Issue #18517 for discussion and implementation tracking.

Current Status

✅ Completed:

  • Converter implementation (IQuestLoopCoderModel)
  • GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
  • Tensor mapping documentation
  • Loop parameter preservation

⏳ Needed:

  • Runtime loop attention mechanism
  • CUDA/CPU kernel implementation
  • Testing against PyTorch reference
  • Performance optimization

Last Updated: 2026-01-07
Contributors: First GGUF conversion and converter implementation
Next Steps: Submit PR with converter + documentation; community implements runtime