IQuest Loop Attention Runtime Implementation Guide
Status: Converter implemented ✅ | Runtime support needed ⏳
Overview
This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (IQuestLoopCoderModel) successfully creates GGUF files with all loop-specific tensors, but the inference runtime needs to be implemented.
What We Know
Architecture Summary
Loop Mechanism: Recurrent transformer design with shared parameters across two iterations (loop_num=2)
Key Parameters:
- llama.loop.num: 2 (iterations of recurrent processing)
- llama.loop.window_size: 64 (attention window for loop mechanism)
Additional Tensors (160 total):
- blk.{0-79}.loop_gate.weight: [128, 40] per layer
- blk.{0-79}.loop_gate.bias: [40] per layer
Tensor Layout in GGUF
Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight [5120, 5120]
├── blk.{0-79}.attn_k.weight [5120, 1024]
├── blk.{0-79}.attn_v.weight [5120, 1024]
├── blk.{0-79}.attn_output.weight [5120, 5120]
├── blk.{0-79}.attn_norm.weight [5120]
├── blk.{0-79}.ffn_gate.weight [5120, 27648]
├── blk.{0-79}.ffn_up.weight [5120, 27648]
├── blk.{0-79}.ffn_down.weight [27648, 5120]
└── blk.{0-79}.ffn_norm.weight [5120]
Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight [128, 40] ← NEW
└── blk.{0-79}.loop_gate.bias [40] ← NEW
Embeddings (2):
├── token_embd.weight [5120, 76800]
└── output.weight [5120, 76800]
Gate Projection Shape Analysis
- Weight: [128, 40] = [head_dim, num_heads]
- Bias: [40] = [num_heads]
- Per layer: 1 weight + 1 bias tensor
- Total layers: 80
- Total loop tensors: 160
This suggests the gate projects from head dimension to per-head gates.
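A quick consistency check on that reading (plain Python; the per-head interpretation is an assumption until verified against the reference implementation):

```python
head_dim, num_heads = 128, 40          # loop_gate.weight shape [128, 40]
hidden_dim = head_dim * num_heads
print(hidden_dim)  # 5120, matches the model's embedding width

# One gate per head: each head's 128-dim slice is dotted with one
# 128-dim column of loop_gate.weight, then offset by one bias entry.
params_per_layer = head_dim * num_heads + num_heads
print(params_per_layer)  # 5160 gate parameters per layer
```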
Runtime Implementation Requirements
1. GGUF Metadata Reading
File: llama.cpp (or equivalent model loader)
Add support for reading loop parameters:
// In llama_model_loader or similar
uint32_t loop_num = 0;
uint32_t loop_window_size = 0;
// Read from GGUF metadata
// Read from GGUF metadata (gguf_get_val_u32 returns the value; check the
// key exists first since these keys are absent in non-loop models)
const int kid_num = gguf_find_key(ctx, "llama.loop.num");
if (kid_num >= 0) loop_num = gguf_get_val_u32(ctx, kid_num);
const int kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_win >= 0) loop_window_size = gguf_get_val_u32(ctx, kid_win);
// Store in model struct
model->hparams.loop_num = loop_num;
model->hparams.loop_window_size = loop_window_size;
2. Tensor Loading
File: llama.cpp tensor loading section
Add loop gate tensor loading:
// In tensor loading loop
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: Load loop gate tensors
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head});
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head});
}
3. Loop Attention Forward Pass (Conceptual)
Based on available information, the loop attention likely works as follows:
# Conceptual implementation (needs verification)
import torch

def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations.

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)
    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x
    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output,
        )
        # Apply loop gating mechanism
        # Gate shape: [num_heads, 1] per position
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size,
        )
        # Blend attention output with residual using gates
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated per-head combination.
            # Reshape to [batch, seq_len, num_heads, head_dim] so the
            # [batch, seq_len, num_heads, 1] gates broadcast within each head.
            b, s, d = hidden_state.shape
            n = gates.shape[2]
            attn_h = attn_output.view(b, s, n, -1)
            hidden_h = hidden_state.view(b, s, n, -1)
            hidden_state = (gates * attn_h + (1 - gates) * hidden_h).view(b, s, d)
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x
    return hidden_state

def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values.

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window (role still unverified)
    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads
    x = hidden_state.view(batch, seq_len, num_heads, head_dim)
    # Per-head projection: head n is dotted with column n of gate_weight,
    # yielding one scalar activation per head
    gate_logits = torch.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias
    # Apply sigmoid for gating in [0, 1]
    gates = torch.sigmoid(gate_logits).unsqueeze(-1)
    return gates
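The gate computation can be exercised with NumPy standing in for torch (a sketch under the per-head projection assumption; window_size is omitted since its role is still unverified):

```python
import numpy as np

def compute_loop_gates_np(hidden_state, gate_weight, gate_bias):
    """NumPy stand-in for the torch sketch: per-head sigmoid gates."""
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads
    x = hidden_state.reshape(batch, seq_len, num_heads, head_dim)
    # Head n is dotted with column n of gate_weight
    logits = np.einsum('bsnh,hn->bsn', x, gate_weight) + gate_bias
    gates = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    return gates[..., None]                # [batch, seq_len, num_heads, 1]

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 8, 5120)).astype(np.float32)
w = (0.02 * rng.standard_normal((128, 40))).astype(np.float32)
b = np.zeros(40, dtype=np.float32)
g = compute_loop_gates_np(h, w, b)
print(g.shape)  # (2, 8, 40, 1)
```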
4. C++/CUDA Implementation Outline
File: ggml-cuda.cu (CUDA kernels) or ggml.c (CPU implementation)
Required kernel functions:
// Kernel 1: Compute loop gates
struct ggml_tensor * ggml_loop_gate(
        struct ggml_context * ctx,
        struct ggml_tensor  * hidden_state, // [batch, seq_len, n_embd]
        struct ggml_tensor  * gate_weight,  // [n_embd_head, n_head]
        struct ggml_tensor  * gate_bias,    // [n_head]
        int window_size) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project through gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
}

// Kernel 2: Gated residual combination
struct ggml_tensor * ggml_gated_residual(
        struct ggml_context * ctx,
        struct ggml_tensor  * attn_output, // [batch, seq_len, n_embd]
        struct ggml_tensor  * residual,    // [batch, seq_len, n_embd]
        struct ggml_tensor  * gates) {     // [batch, seq_len, n_head, 1]
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting
}
// Main loop attention function
struct ggml_tensor * ggml_loop_attention(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct llama_layer  * layer,
        int loop_num,
        int loop_window_size) {
    struct ggml_tensor * hidden_state = x;
    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */);
        // Compute gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state, layer->loop_gate_w, layer->loop_gate_b, loop_window_size);
        // Apply gated residual
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(ctx, attn_output, hidden_state, gates);
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }
    return hidden_state;
}
5. Integration Points
Files to modify:
- llama.h: Add loop parameters to llama_hparams
- llama.cpp:
  - Read loop metadata from GGUF
  - Load loop_gate tensors
  - Integrate ggml_loop_attention into the forward pass
- ggml.h: Add loop attention operation declarations
- ggml.c: Implement CPU kernels for loop gates
- ggml-cuda.cu: Implement CUDA kernels for GPU acceleration
- ggml-metal.m: Implement Metal shaders for Apple Silicon
- convert_hf_to_gguf.py: Already done! ✅
Testing Strategy
1. Tensor Loading Test
Verify all 883 tensors load correctly:
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
Expected output:
- 80 × loop_gate.weight tensors [128, 40]
- 80 × loop_gate.bias tensors [40]
- loop_num = 2
- loop_window_size = 64
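As a sanity check, the 883 figure follows from the counts in the tensor layout section:

```python
standard   = 721     # standard Llama tensors (tensor layout section)
loop       = 2 * 80  # loop_gate.weight + loop_gate.bias per layer
embeddings = 2       # token_embd.weight + output.weight
total = standard + loop + embeddings
print(total)  # 883
```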
2. Forward Pass Test
Compare output with PyTorch reference:
# Generate reference output with HuggingFace
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(...)
input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    pytorch_output = model.generate(**inputs, max_new_tokens=50)
print("Reference:", tokenizer.decode(pytorch_output[0]))
Then test llama.cpp:
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
--prompt "def fibonacci(n):" --n-predict 50
Compare token-by-token outputs.
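A comparison like this can be scripted once both token streams are captured (a sketch; the token IDs below are placeholders, not real model output):

```python
def first_divergence(ref_tokens, test_tokens):
    """Index of the first mismatching token, or None if aligned."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    return None

# Placeholder token IDs; a real run would compare tokenizer output
# from the HuggingFace and llama.cpp stacks
ref  = [101, 202, 303, 404]
test = [101, 202, 999, 404]
print(first_divergence(ref, test))  # 2
```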
3. Performance Benchmarks
- Throughput: tokens/second
- Latency: time to first token
- Memory: peak GPU/CPU memory usage
- Quality: Compare perplexity with reference
Unknown Implementation Details
The following need verification from original implementation or technical paper:
- Gate activation function: Sigmoid? Tanh? Softmax?
- Gate application: Per-head? Per-token? Global?
- Loop window: How is window_size=64 used? Sliding window? Chunking?
- Residual connection: Standard or modified for loops?
- Positional encoding: Modified during loop iterations?
- KV cache: Recomputed each loop? Shared across iterations?
References for Implementation
vLLM PR #31575: https://github.com/vllm-project/vllm/pull/31575
- Shows integration patterns
- LoopCoderNorm → RMSNorm refactoring noted
Model Config:
/workspace/.cache/huggingface/.../config.json
- Contains: loop_num=2, loop_window_size=64
Converted GGUFs:
/workspace/models/converted/
- Reference for tensor shapes and names
- Test files for validation
Issue #18517: https://github.com/ggerganov/llama.cpp/issues/18517
- Community request for Loop support
Recommended Approach
Phase 1: Minimal Implementation
- Load loop_gate tensors (no-op in forward pass)
- Verify GGUF files load without errors
- Run standard Llama forward pass (ignoring loop for now)
- Result: Model runs but without loop benefits
Phase 2: Basic Loop Implementation
- Implement ggml_loop_gate CPU kernel
- Implement gated residual combination
- Integrate 2-iteration loop in forward pass
- Test on CPU with small models
Phase 3: GPU Acceleration
- Port kernels to CUDA
- Optimize memory layout for coalesced access
- Implement fused kernels where beneficial
- Benchmark against CPU
Phase 4: Optimization
- Profile hotspots
- Implement kernel fusion
- Add quantization support for loop gates
- Optimize KV cache handling
Community Contribution
This implementation requires significant C++/CUDA expertise. Recommended contributors:
- C++ developers: Familiar with ggml tensor operations
- CUDA developers: For GPU kernel implementation
- ML researchers: To verify loop attention correctness
Coordination: Use llama.cpp Issue #18517 for discussion and implementation tracking.
Current Status
✅ Completed:
- Converter implementation (IQuestLoopCoderModel)
- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
- Tensor mapping documentation
- Loop parameter preservation
⏳ Needed:
- Runtime loop attention mechanism
- CUDA/CPU kernel implementation
- Testing against PyTorch reference
- Performance optimization
Last Updated: 2026-01-07
Contributors: First GGUF conversion and converter implementation
Next Steps: Submit PR with converter + documentation; community implements runtime