---
license: apache-2.0
tags:
- code
- llama
- loop-attention
- gguf
- llama.cpp
language:
- en
pipeline_tag: text-generation
---
|
|
|
|
|
# IQuest-Coder-V1-40B-Loop-Instruct GGUF |
|
|
|
|
|
This repository contains GGUF-format models for [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct), optimized for use with llama.cpp.
|
|
|
|
|
**🚨 IMPORTANT**: This model requires a custom llama.cpp build with loop attention support! |
|
|
**See PR**: [llama.cpp#18680](https://github.com/ggml-org/llama.cpp/pull/18680) |
|
|
|
|
|
Built and tested on **NVIDIA DGX Spark** infrastructure. |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
This model implements **Loop Attention**, a novel recurrent attention mechanism that processes all layers multiple times: |
|
|
|
|
|
- **loop_num=2**: All 80 transformer layers are processed twice (160 layer passes in total)
|
|
- **Loop 0**: Standard attention with global K/V caching |
|
|
- **Loop 1**: Dual attention (local + global) with learned per-head gating |
|
|
|
|
|
### Loop Attention Formula |
|
|
|
|
|
```
gate   = sigmoid(sum(Q * gate_weight) + gate_bias)
output = local_attn + gate * (global_attn - local_attn)
```
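
For readers who prefer code to formulas, here is a minimal NumPy sketch of the per-head gating applied in loop 1. It mirrors the formula above only; the function name and the assumed tensor shapes are illustrative, not the actual llama.cpp kernel.

```python
import numpy as np

def gated_loop_output(q, local_attn, global_attn, gate_weight, gate_bias):
    """Illustrative per-head gated combination (not the llama.cpp kernel).

    Assumed shapes: q, local_attn, global_attn, gate_weight are
    (n_heads, head_dim); gate_bias is (n_heads,). In loop 0 the global
    K/V cache is filled; in loop 1 this combine is applied.
    """
    # gate = sigmoid(sum(Q * gate_weight) + gate_bias) -> one scalar per head
    gate = 1.0 / (1.0 + np.exp(-((q * gate_weight).sum(axis=-1) + gate_bias)))
    # output = local_attn + gate * (global_attn - local_attn):
    # gate = 0 keeps the local result, gate = 1 uses the global result
    return local_attn + gate[:, None] * (global_attn - local_attn)
```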
|
|
|
|
|
## llama.cpp Support |
|
|
|
|
|
**IMPORTANT**: Loop attention support requires a custom branch of llama.cpp. |
|
|
|
|
|
See PR: https://github.com/ggml-org/llama.cpp/pull/18680 |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```bash
# Clone llama.cpp with loop attention support
git clone https://github.com/tbraun96/llama.cpp
cd llama.cpp
git checkout feature/iquest-loop-attention

# Build
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Download a quantized model
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf --local-dir .

# Run inference
./bin/llama-cli -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -p "def fibonacci(n):" -n 200
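
# Optional: serve an OpenAI-compatible HTTP API instead of the CLI
./bin/llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf -ngl 99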
|
|
``` |
|
|
|
|
|
## Available Models |
|
|
|
|
|
| Filename | Quantization | Size | Description | Use Case |
|----------|--------------|------|-------------|----------|
| IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf | F16 | 75GB | Unquantized, highest quality | Maximum accuracy |
| IQuest-Coder-V1-40B-Loop-Instruct-Q8_0.gguf | Q8_0 | 40GB | Very high quality | Near-F16 quality |
| IQuest-Coder-V1-40B-Loop-Instruct-Q5_K_M.gguf | Q5_K_M | 27GB | High quality | Balanced quality/size |
| IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf | Q4_K_M | 23GB | Good quality | **Recommended** |
|
|
|
|
|
## Performance Benchmarks |
|
|
|
|
|
**Testing Platform**: NVIDIA DGX Spark with GB10 (Blackwell) GPU, compute capability 12.1 |
|
|
|
|
|
**Q4_K_M (23GB)** - Recommended: |
|
|
- Prompt processing: 106.2 tokens/second |
|
|
- Text generation: 4.2 tokens/second |
|
|
|
|
|
**F16 (75GB)** - Maximum quality: |
|
|
- Prompt processing: 3.4 tokens/second |
|
|
- Text generation: 0.8 tokens/second |
|
|
|
|
|
All testing and quantization was performed on NVIDIA DGX Spark infrastructure. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: Llama architecture with a loop attention extension
|
|
- **Parameters**: 40B |
|
|
- **Context Length**: 32,768 tokens |
|
|
- **Training**: Fine-tuned for code generation and instruction following |
|
|
- **License**: Apache 2.0 |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@software{iquest_loop_instruct_gguf_2025,
  title={IQuest-Coder-V1-40B-Loop-Instruct GGUF},
  author={IQuestLab and Community Contributors},
  year={2025},
  url={https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF}
}
```
|
|
|
|
|
## Original Model |
|
|
|
|
|
Original PyTorch model: [IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct) |
|
|
|
|
|
## Conversion |
|
|
|
|
|
These models were converted using the custom GGUF converter available in the llama.cpp branch above. |
|
|
|
|
|
```bash
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct --outtype f16
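
# Optionally quantize the F16 output with llama-quantize from the build above
# (filenames here are illustrative)
./build/bin/llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-F16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-Q4_K_M.gguf Q4_K_M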
|
|
``` |
|
|
|
|
|
## World's First |
|
|
|
|
|
To our knowledge, this is the **first implementation** of loop attention in GGUF format, bringing a recurrent attention mechanism to llama.cpp!
|
|
|
|
|
--- |
|
|
|
|
|
**Questions or Issues?** Please open an issue on the [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/18680) or the original model repository. |
|
|
|