# IQuest Loop Attention Runtime Implementation Guide
**Status**: Converter implemented ✅ | Runtime support needed ⏳
## Overview
This document outlines the requirements for implementing IQuestLoopCoder runtime support in llama.cpp. The converter (`IQuestLoopCoderModel`) successfully creates GGUF files with all loop-specific tensors, but the inference runtime needs to be implemented.
## What We Know
### Architecture Summary
**Loop Mechanism**: Recurrent transformer design with shared parameters across two iterations (loop_num=2)
**Key Parameters**:
- `llama.loop.num`: 2 (iterations of recurrent processing)
- `llama.loop.window_size`: 64 (attention window for loop mechanism)
**Additional Tensors** (160 total):
- `blk.{0-79}.loop_gate.weight`: [128, 40] per layer
- `blk.{0-79}.loop_gate.bias`: [40] per layer
### Tensor Layout in GGUF
```
Standard Llama tensors (721):
├── blk.{0-79}.attn_q.weight      [5120, 5120]
├── blk.{0-79}.attn_k.weight      [5120, 1024]
├── blk.{0-79}.attn_v.weight      [5120, 1024]
├── blk.{0-79}.attn_output.weight [5120, 5120]
├── blk.{0-79}.attn_norm.weight   [5120]
├── blk.{0-79}.ffn_gate.weight    [5120, 27648]
├── blk.{0-79}.ffn_up.weight      [5120, 27648]
├── blk.{0-79}.ffn_down.weight    [27648, 5120]
└── blk.{0-79}.ffn_norm.weight    [5120]

Loop-specific tensors (160):
├── blk.{0-79}.loop_gate.weight   [128, 40]  ← NEW
└── blk.{0-79}.loop_gate.bias     [40]       ← NEW

Embeddings (2):
├── token_embd.weight             [5120, 76800]
└── output.weight                 [5120, 76800]
```
### Gate Projection Shape Analysis
- **Weight**: [128, 40] = [head_dim, num_heads]
- **Bias**: [40] = [num_heads]
- **Per layer**: 1 weight + 1 bias tensor
- **Total layers**: 80
- **Total loop tensors**: 160
This suggests that each head's 128-dimensional activation is projected down to a single gate value, i.e. column *h* of the weight produces the gate for head *h*.
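The extra capacity is negligible: (128 × 40 + 40) × 80 layers = 412,800 parameters (about 0.4 M) on top of a 40B model. The sketch below illustrates this reading at the shape level only; the actual gating function is unverified (see "Unknown Implementation Details" below), and all names are illustrative.

```python
import torch

# Shapes taken from the tensor dump above; the gating rule itself is an assumption.
num_heads, head_dim = 40, 128
batch, seq_len = 1, 8

gate_weight = torch.randn(head_dim, num_heads)  # blk.N.loop_gate.weight  [128, 40]
gate_bias   = torch.randn(num_heads)            # blk.N.loop_gate.bias    [40]

# Per-head activations: [batch, seq_len, num_heads, head_dim]
x = torch.randn(batch, seq_len, num_heads, head_dim)

# Head n is projected by column n of the weight, yielding one gate logit per head.
gate_logits = torch.einsum('bsnd,dn->bsn', x, gate_weight) + gate_bias
gates = torch.sigmoid(gate_logits)              # [1, 8, 40], one value in [0, 1] per head
print(gates.shape)
```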
## Runtime Implementation Requirements
### 1. GGUF Metadata Reading
**File**: `llama.cpp` (or equivalent model loader)
Add support for reading loop parameters:
```cpp
// In llama_model_loader or similar
uint32_t loop_num         = 0;
uint32_t loop_window_size = 0;

// Read from GGUF metadata (gguf_get_val_u32 returns the value for a key index)
const int64_t kid_num = gguf_find_key(ctx, "llama.loop.num");
const int64_t kid_win = gguf_find_key(ctx, "llama.loop.window_size");
if (kid_num >= 0) { loop_num         = gguf_get_val_u32(ctx, kid_num); }
if (kid_win >= 0) { loop_window_size = gguf_get_val_u32(ctx, kid_win); }

// Store in model struct
model->hparams.loop_num         = loop_num;
model->hparams.loop_window_size = loop_window_size;
```
### 2. Tensor Loading
**File**: `llama.cpp` tensor loading section
Add loop gate tensor loading:
```cpp
// In the per-layer tensor loading loop
for (int i = 0; i < n_layer; i++) {
    // Existing tensors...

    // NEW: load loop gate tensors (LLM_TENSOR_LOOP_GATE_{W,B} are new enums to add)
    model.layers[i].loop_gate_w = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_W, "weight", i), {n_embd_head, n_head});
    model.layers[i].loop_gate_b = ml.create_tensor(
        ctx, tn(LLM_TENSOR_LOOP_GATE_B, "bias", i), {n_head});
}
```
### 3. Loop Attention Forward Pass (Conceptual)
Based on available information, the loop attention likely works as follows:
```python
# Conceptual implementation (needs verification against the reference model)
import torch


def loop_attention_forward(x, layer, loop_num=2, loop_window_size=64):
    """
    Recurrent attention with loop_num iterations.

    Args:
        x: input tensor [batch, seq_len, hidden_dim]
        layer: transformer layer with loop_gate weights
        loop_num: number of recurrent iterations (default: 2)
        loop_window_size: attention window size (default: 64)
    Returns:
        output tensor [batch, seq_len, hidden_dim]
    """
    hidden_state = x

    # Recurrent loop with shared parameters
    for loop_iter in range(loop_num):
        # Standard self-attention (placeholder; reuses the layer's shared projections)
        attn_output = self_attention(
            hidden_state,
            q_proj=layer.attn_q,
            k_proj=layer.attn_k,
            v_proj=layer.attn_v,
            output_proj=layer.attn_output,
        )

        # Apply loop gating mechanism: one gate per head per position
        gates = compute_loop_gates(
            hidden_state,
            gate_weight=layer.loop_gate.weight,  # [head_dim, num_heads]
            gate_bias=layer.loop_gate.bias,      # [num_heads]
            window_size=loop_window_size,
        )  # [batch, seq_len, num_heads, 1]

        # Blend attention output with the running state using the gates
        if loop_iter < loop_num - 1:
            # Intermediate iterations: gated combination, applied per head
            batch, seq_len, hidden_dim = hidden_state.shape
            num_heads = gates.shape[2]
            head_dim = hidden_dim // num_heads
            attn_h   = attn_output.view(batch, seq_len, num_heads, head_dim)
            hidden_h = hidden_state.view(batch, seq_len, num_heads, head_dim)
            blended  = gates * attn_h + (1 - gates) * hidden_h
            hidden_state = blended.view(batch, seq_len, hidden_dim)
        else:
            # Final iteration: standard residual
            hidden_state = attn_output + x

    return hidden_state


def compute_loop_gates(hidden_state, gate_weight, gate_bias, window_size):
    """
    Compute per-head gating values.

    Args:
        hidden_state: [batch, seq_len, hidden_dim]
        gate_weight: [head_dim, num_heads]
        gate_bias: [num_heads]
        window_size: local attention window (usage unverified)
    Returns:
        gates: [batch, seq_len, num_heads, 1]
    """
    # Reshape hidden_state to [batch, seq_len, num_heads, head_dim]
    batch, seq_len, hidden_dim = hidden_state.shape
    num_heads = gate_bias.shape[0]
    head_dim = hidden_dim // num_heads
    x = hidden_state.view(batch, seq_len, num_heads, head_dim)

    # Per-head projection: head n uses column n of gate_weight, giving one logit per head
    gate_logits = torch.einsum('bsnd,dn->bsn', x, gate_weight) + gate_bias

    # Sigmoid keeps gates in [0, 1]; trailing dim added for broadcasting over head_dim
    gates = torch.sigmoid(gate_logits).unsqueeze(-1)
    return gates
```
### 4. C++/CUDA Implementation Outline
**File**: `ggml-cuda.cu` (CUDA kernels) or `ggml.c` (CPU implementation)
Required kernel functions:
```cpp
// Kernel 1: compute loop gates
struct ggml_tensor * ggml_loop_gate(
        struct ggml_context * ctx,
        struct ggml_tensor  * hidden_state, // [batch, seq_len, n_embd]
        struct ggml_tensor  * gate_weight,  // [n_embd_head, n_head]
        struct ggml_tensor  * gate_bias,    // [n_head]
        int                   window_size) {
    // 1. Reshape hidden_state to [batch, seq_len, n_head, n_embd_head]
    // 2. Project through gate_weight
    // 3. Add gate_bias
    // 4. Apply sigmoid activation
    // 5. Return gates [batch, seq_len, n_head, 1]
}

// Kernel 2: gated residual combination
struct ggml_tensor * ggml_gated_residual(
        struct ggml_context * ctx,
        struct ggml_tensor  * attn_output, // [batch, seq_len, n_embd]
        struct ggml_tensor  * residual,    // [batch, seq_len, n_embd]
        struct ggml_tensor  * gates) {     // [batch, seq_len, n_head, 1]
    // output = gates * attn_output + (1 - gates) * residual
    // Per-head gating needs broadcasting across n_embd_head
}

// Main loop attention function (ggml_attention stands in for the existing attention graph)
struct ggml_tensor * ggml_loop_attention(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct llama_layer  * layer,
        int                   loop_num,
        int                   loop_window_size) {
    struct ggml_tensor * hidden_state = x;

    for (int loop_iter = 0; loop_iter < loop_num; loop_iter++) {
        // Standard attention with the layer's shared projections
        struct ggml_tensor * attn_output = ggml_attention(
            ctx, hidden_state, layer, /* ... */);

        // Compute per-head gates
        struct ggml_tensor * gates = ggml_loop_gate(
            ctx, hidden_state,
            layer->loop_gate_w,
            layer->loop_gate_b,
            loop_window_size);

        // Gated residual on intermediate iterations, plain residual on the last
        if (loop_iter < loop_num - 1) {
            hidden_state = ggml_gated_residual(ctx, attn_output, hidden_state, gates);
        } else {
            hidden_state = ggml_add(ctx, attn_output, x);
        }
    }

    return hidden_state;
}
```
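Before writing dedicated fused kernels, the gated residual in Kernel 2 can likely be prototyped from existing element-wise operations, because g·a + (1 − g)·r = r + g·(a − r): one subtract, one broadcast multiply, one add, which map onto the existing `ggml_sub`, `ggml_mul`, and `ggml_add` ops applied to a per-head view. A quick numerical check of that identity, under the assumed per-head gating formulation:

```python
import torch

batch, seq_len, num_heads, head_dim = 1, 8, 40, 128

attn     = torch.randn(batch, seq_len, num_heads, head_dim)
residual = torch.randn(batch, seq_len, num_heads, head_dim)
gates    = torch.rand(batch, seq_len, num_heads, 1)   # per-head gates in [0, 1]

direct    = gates * attn + (1 - gates) * residual      # the form used above
rewritten = residual + gates * (attn - residual)       # sub + mul + add only

print(torch.allclose(direct, rewritten, atol=1e-6))    # True
```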
### 5. Integration Points
**Files to modify**:
1. **`llama.h`**: Add loop parameters to `llama_hparams`
2. **`llama.cpp`**:
- Read loop metadata from GGUF
- Load loop_gate tensors
- Integrate `ggml_loop_attention` into forward pass
3. **`ggml.h`**: Add loop attention operation declarations
4. **`ggml.c`**: Implement CPU kernels for loop gates
5. **`ggml-cuda.cu`**: Implement CUDA kernels for GPU acceleration
6. **`ggml-metal.m`**: Implement Metal shaders for Apple Silicon
7. **`convert_hf_to_gguf.py`**: Already done! ✅
## Testing Strategy
### 1. Tensor Loading Test
Verify all 883 tensors load correctly:
```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --verbose
```
Expected output:
- 80 × loop_gate.weight tensors [128, 40]
- 80 × loop_gate.bias tensors [40]
- loop_num = 2
- loop_window_size = 64
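In addition to the `--verbose` log, the converted file can be cross-checked from Python. This sketch assumes the `gguf` package (gguf-py) that ships with llama.cpp and a local path to the quantized file:

```python
from gguf import GGUFReader

reader = GGUFReader("IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf")

# Loop metadata: scalar fields keep their value in parts[data[0]]
for key in ("llama.loop.num", "llama.loop.window_size"):
    field = reader.fields[key]
    print(key, "=", field.parts[field.data[0]][0])

# Loop tensors: expect 80 weights [128, 40] and 80 biases [40]
loop_tensors = [t for t in reader.tensors if "loop_gate" in t.name]
print("loop tensors:", len(loop_tensors))
for t in loop_tensors[:2]:
    print(t.name, list(t.shape))
```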
### 2. Forward Pass Test
Compare output with PyTorch reference:
```python
# Generate reference output with Hugging Face transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(...)

input_text = "def fibonacci(n):"
inputs = tokenizer(input_text, return_tensors="pt")

with torch.no_grad():
    pytorch_output = model.generate(**inputs, max_new_tokens=50)

print("Reference:", tokenizer.decode(pytorch_output[0]))
```
Then test llama.cpp:
```bash
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
--prompt "def fibonacci(n):" --n-predict 50
```
Compare token-by-token outputs.
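A minimal sketch of that comparison, assuming each run's decoded text has been saved to a local file (file names here are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct", trust_remote_code=True
)

# Re-tokenize both generations and report the first point of divergence.
ref_ids = tokenizer(open("reference_output.txt").read())["input_ids"]
cpp_ids = tokenizer(open("llamacpp_output.txt").read())["input_ids"]

for i, (a, b) in enumerate(zip(ref_ids, cpp_ids)):
    if a != b:
        print(f"First divergence at token {i}: {a} vs {b}")
        break
else:
    print(f"Outputs match for the first {min(len(ref_ids), len(cpp_ids))} tokens")
```

Use greedy decoding (temperature 0) on both sides so the comparison is meaningful; some divergence is still expected from quantization even with a correct runtime.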
### 3. Performance Benchmarks
- **Throughput**: tokens/second
- **Latency**: time to first token
- **Memory**: peak GPU/CPU memory usage
- **Quality**: Compare perplexity with reference
## Unknown Implementation Details
The following need verification from original implementation or technical paper:
1. **Gate activation function**: Sigmoid? Tanh? Softmax?
2. **Gate application**: Per-head? Per-token? Global?
3. **Loop window**: How is window_size=64 used? Sliding window? Chunking?
4. **Residual connection**: Standard or modified for loops?
5. **Positional encoding**: Modified during loop iterations?
6. **KV cache**: Recomputed each loop? Shared across iterations?
## References for Implementation
1. **vLLM PR #31575**: https://github.com/vllm-project/vllm/pull/31575
- Shows integration patterns
- LoopCoderNorm → RMSNorm refactoring noted
2. **Model Config**: `/workspace/.cache/huggingface/.../config.json`
- Contains: loop_num=2, loop_window_size=64
3. **Converted GGUFs**: `/workspace/models/converted/`
- Reference for tensor shapes and names
- Test files for validation
4. **Issue #18517**: https://github.com/ggerganov/llama.cpp/issues/18517
- Community request for Loop support
## Recommended Approach
### Phase 1: Minimal Implementation
1. Load loop_gate tensors (no-op in forward pass)
2. Verify GGUF files load without errors
3. Run standard Llama forward pass (ignoring loop for now)
4. **Result**: Model runs but without loop benefits
### Phase 2: Basic Loop Implementation
1. Implement `ggml_loop_gate` CPU kernel
2. Implement gated residual combination
3. Integrate 2-iteration loop in forward pass
4. Test on CPU with small models
### Phase 3: GPU Acceleration
1. Port kernels to CUDA
2. Optimize memory layout for coalesced access
3. Implement fused kernels where beneficial
4. Benchmark against CPU
### Phase 4: Optimization
1. Profile hotspots
2. Implement kernel fusion
3. Add quantization support for loop gates
4. Optimize KV cache handling
## Community Contribution
This implementation requires significant C++/CUDA expertise. Recommended contributors:
- **C++ developers**: Familiar with ggml tensor operations
- **CUDA developers**: For GPU kernel implementation
- **ML researchers**: To verify loop attention correctness
**Coordination**: Use llama.cpp Issue #18517 for discussion and implementation tracking.
## Current Status
✅ **Completed**:
- Converter implementation (IQuestLoopCoderModel)
- GGUF file generation (F16, Q4_K_M, Q5_K_M, Q8_0)
- Tensor mapping documentation
- Loop parameter preservation
⏳ **Needed**:
- Runtime loop attention mechanism
- CUDA/CPU kernel implementation
- Testing against PyTorch reference
- Performance optimization
---
**Last Updated**: 2026-01-07
**Contributions so far**: first GGUF conversion and converter implementation
**Next Steps**: Submit a PR with the converter and documentation; the community implements the runtime