---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---
|
|
|
|
|
# IQuest-Coder-V1-40B-Loop-Instruct GGUF |
|
|
|
|
|
This repository contains GGUF-format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct), optimized for use with llama.cpp.
|
|
|
|
|
**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support! |
|
|
**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680) |
|
|
|
|
|
Built and tested on **NVIDIA DGX Spark** infrastructure. |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times: |
|
|
|
|
|
- **loop_num=2**: All 80 transformer layers are processed twice (160 layer passes in total)
|
|
- **Loop 0**: Standard attention with global K/V caching |
|
|
- **Loop 1**: Dual attention (local + global) with learned per-head gating |
|
|
|
|
|
### Loop Attention Formula |
|
|
|
|
|
```
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
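
For readers who prefer code to formulas, here is a minimal NumPy sketch of the per-head gating applied in loop 1. It mirrors the formula above only; the function name and the assumed tensor shapes are illustrative, not the actual llama.cpp kernel.

```python
import numpy as np

def gated_loop_output(q, local_attn, global_attn, gate_weight, gate_bias):
    """Illustrative per-head gated combination (not the llama.cpp kernel).

    Assumed shapes: q, local_attn, global_attn, gate_weight are
    (n_heads, head_dim); gate_bias is (n_heads,). In loop 0 the global
    K/V cache is filled; in loop 1 this combine is applied.
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias) -> one scalar per head
    gate = 1.0 / (1.0 + np.exp(-((q * gate_weight).sum(axis=-1) + gate_bias)))
    # output = local_attn + gate * (global_attn - local_attn):
    # gate = 0 keeps the local result, gate = 1 uses the global result
    return local_attn + gate[:, None] * (global_attn - local_attn)
```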
|
|
|
|
|
## llama.cpp Support |
|
|
|
|
|
**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp. |
|
|
|
|
|
See PR: https://github.com/ggml-org/llama.cpp/pull/18680 |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
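
# Optional: serve an OpenAI-compatible HTTP API instead of the CLI
./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -ngl 99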
|
|
``` |
|
|
|
|
|
## Available Models |
|
|
|
|
|
| Filename | Quantization | Size | Description | Use Case |
|----------|--------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |
|
|
|
|
|
## Performance Benchmarks |
|
|
|
|
|
**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1 |
|
|
|
|
|
**Q4_K_M (23GB)** - Recommended: |
|
|
- Prompt processing: 106.2 tokens/second |
|
|
- Text generation: 4.2 tokens/second |
|
|
|
|
|
**F16 (75GB)** - Maximum quality: |
|
|
- Prompt processing: 3.4 tokens/second |
|
|
- Text generation: 0.8 tokens/second |
|
|
|
|
|
All testing and quantization was performed on NVIDIA DGX Spark infrastructure. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: Llama architecture with a loop attention extension
|
|
- **Parameters**: 40B |
|
|
- **Context Length**: 32,768 tokens |
|
|
- **Training**: Fine-tuned for code generation and instruction following |
|
|
- **License**: Apache 2.0 |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
|
|
|
|
|
## Original Model |
|
|
|
|
|
Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct) |
|
|
|
|
|
## Conversion |
|
|
|
|
|
These models were converted using the custom GGUF converter available in the llama.cpp branch above. |
|
|
|
|
|
```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
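
# Optionally quantize the F16 output with llama-quantize from the build above
# (filenames here are illustrative)
./build/bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M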
|
|
``` |
|
|
|
|
|
## World's First |
|
|
|
|
|
To our knowledge, this is the **first implementation** of loop attention in GGUF format, bringing a recurrent attention mechanism to llama.cpp!
|
|
|
|
|
--- |
|
|
|
|
|
**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository. |
|
|
|