---
license: apache-2.0
language:
- en
- de
- es
- fr
- it
- pt
- pl
- nl
- tr
- sv
- cs
- el
- hu
- ro
- fi
- uk
- sl
- sk
- da
- lt
- lv
- et
- bg
- 'no'
- ca
- hr
- ga
- mt
- gl
- zh
- ru
- ko
- ja
- ar
- hi
library_name: transformers
base_model:
- utter-project/EuroLLM-22B-2512
tags:
- gguf
- quantization
- imatrix
- multilingual
- jugaad
- ner
- pii
---

# EuroLLM-22B-Instruct-GGUF (Jugaad Optimized)

This repository contains **GGUF format** quantizations of [utter-project/EuroLLM-22B-Instruct](https://huggingface.co/utter-project/EuroLLM-22B-Instruct).

## Why this release?

Unlike standard automated quantizations, this release was **specifically optimized by [Jugaad](https://jugaad.digital)** to balance professional performance with consumer hardware constraints.

We focused on enabling the deployment of this powerful 22B-parameter model on **single 24GB VRAM GPUs** (NVIDIA RTX 3090, RTX 4090, L4) while preserving its capability on critical tasks like **PII/PHI extraction (NER)** across European languages.

### Key Differentiators

1. **Custom Calibration:** Instead of random data, we used a **multilingual professional dataset** (Medical, Legal, Finance, GDPR) for the Importance Matrix (imatrix) calculation.
2. **Verified Performance:** We didn't just quantize; we benchmarked. Our Q4_K_M quantization achieves an **F1 score of ~0.89** on multilingual NER tasks, outperforming even the larger quantizations (see benchmarks below).
3. **Hardware-Ready:** We provide specific memory usage data so you can avoid OOM errors in production.

## 📦 Provided Quantizations

| Filename | Type | Size | Use Case |
|:---|:---|:---|:---|
| **`eurollm-22b-Q4_K_M.gguf`** | **Q4_K_M** | **13.0 GB** | **⭐ RECOMMENDED. Best F1/VRAM balance for 24GB cards.** |
| `eurollm-22b-Q5_K_M.gguf` | Q5_K_M | 15.0 GB | Higher precision if you have >24GB VRAM. |
| `eurollm-22b-Q6_K.gguf` | Q6_K | 18.0 GB | Near-fp16 performance. Tight fit on 24GB (short context only). |
| `eurollm-22b-Q8_0.gguf` | Q8_0 | 23.0 GB | Maximum fidelity. **Not recommended for 24GB cards** (high OOM risk). |
| `eurollm-22b-IQ4_NL.gguf` | IQ4_NL | 13.0 GB | Alternative non-linear quantization. |
| `eurollm-22b-IQ4_XS.gguf` | IQ4_XS | 12.0 GB | Smaller footprint if VRAM is very tight. |
| `eurollm-22b-IQ3_M.gguf` | IQ3_M | 9.8 GB | Low VRAM usage (<12GB). |
| `eurollm-22b-IQ2_M.gguf` | IQ2_M | 7.5 GB | Extreme compression. |

## 🏆 Benchmark Results (Multilingual NER)

We tested these models on a tough PII/PHI extraction task across 5 languages (IT, EN, FR, DE, ES).

| Model | Average F1 Score | Notes |
|:---|:---:|:---|
| **Q4_K_M** | **0.890** | **Highest score across all tested quantizations** |
| IQ4_XS | 0.886 | Excellent efficiency |
| Q8_0 | 0.883 | Surprisingly, slightly lower on this specific task |
| IQ4_NL | 0.881 | Solid performer |

*Detailed results can be found in the [`benchmark_ner_results.md`](./benchmark_ner_results.md) file.*

## ⚙️ Technical Details

- **Base Model:** `utter-project/EuroLLM-22B-2512`
- **Quantization Tool:** `llama.cpp` (build 4358)
- **Calibration Data:** Custom mix of Wikipedia (general) and domain-specific (Medical/Legal/Finance) articles.
- **Languages Covered:** Italian, English, French, German, Spanish, Portuguese, Dutch, Polish.

*Please contact us to receive the calibration file used to compute the optimization imatrix.*
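For reference, the imatrix workflow described above maps onto `llama.cpp`'s standard tooling. A minimal sketch, assuming an f16 GGUF conversion of the base model and a plain-text calibration corpus; the file names here are illustrative placeholders, not the exact files we used:

```bash
# Minimal sketch of an imatrix-guided quantization with llama.cpp.
# File names are placeholders.

# 1. Compute the importance matrix over the calibration corpus
./llama-imatrix -m eurollm-22b-f16.gguf -f calibration_multilingual.txt -o imatrix.dat

# 2. Quantize, letting the imatrix guide per-tensor precision allocation
./llama-quantize --imatrix imatrix.dat eurollm-22b-f16.gguf eurollm-22b-Q4_K_M.gguf Q4_K_M
```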
## 💻 Usage

**CLI:**

```bash
./llama-cli -m eurollm-22b-Q4_K_M.gguf \
  -p "Extract the entities from this text..." \
  -n 512 -c 4096
```

**Python:**

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU
    n_ctx=8192        # The 13 GB model leaves plenty of room for context on a 24GB card
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
print(res["choices"][0]["message"]["content"])
```
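Since PII/PHI extraction is the headline use case, below is a minimal extraction sketch with `llama-cpp-python`. The prompt wording and entity labels (`PERSON`, `DATE_OF_BIRTH`, `ADDRESS`) are illustrative assumptions, not the exact setup from our benchmark:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

# Illustrative input; the label set below is an assumption, not our benchmark schema.
text = "Mario Rossi, geboren am 12.03.1985, wohnt in der Hauptstraße 5, Berlin."
prompt = (
    "Extract all PII entities (PERSON, DATE_OF_BIRTH, ADDRESS) from the text below. "
    'Reply with a JSON list of {"label": ..., "text": ...} objects and nothing else.\n\n'
    f"Text: {text}"
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # Deterministic decoding suits extraction tasks
    max_tokens=512,
)
print(res["choices"][0]["message"]["content"])
```

If you need guaranteed-valid JSON in production, recent versions of `llama-cpp-python` also accept `response_format={"type": "json_object"}` in `create_chat_completion` to constrain decoding.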