---
license: apache-2.0
language:
- en
- de
- es
- fr
- it
- pt
- pl
- nl
- tr
- sv
- cs
- el
- hu
- ro
- fi
- uk
- sl
- sk
- da
- lt
- lv
- et
- bg
- 'no'
- ca
- hr
- ga
- mt
- gl
- zh
- ru
- ko
- ja
- ar
- hi
library_name: transformers
base_model:
- utter-project/EuroLLM-22B-2512
tags:
- gguf
- quantization
- imatrix
- multilingual
- jugaad
- ner
- pii
---

# EuroLLM-22B-Instruct-GGUF (Jugaad Optimized)

This repository contains **GGUF format** quantizations of [utter-project/EuroLLM-22B-Instruct](https://huggingface.co/utter-project/EuroLLM-22B-Instruct).

## Why this release?

Unlike standard automated quantizations, this release was **specifically optimized by [Jugaad](https://jugaad.digital)** to balance professional performance with consumer hardware constraints.

We focused on enabling the deployment of this powerful 22B-parameter model on **single 24GB VRAM GPUs** (NVIDIA RTX 3090, RTX 4090, L4) while preserving its capability on critical tasks like **PII/PHI extraction (NER)** across European languages.

### Key Differentiators

1. **Custom Calibration:** Instead of random data, we used a **multilingual professional dataset** (Medical, Legal, Finance, GDPR) for the Importance Matrix (imatrix) calculation.
2. **Verified Performance:** We didn't just quantize; we benchmarked. Our Q4_K_M quantization achieves an **F1 score of ~0.89** on multilingual NER tasks, outperforming even the larger quantizations (see benchmarks below).
3. **Hardware-Ready:** We provide specific memory usage data so you can avoid OOM errors in production.

## 📦 Provided Quantizations

| Filename | Type | Size | Use Case |
|:---|:---|:---|:---|
| **`eurollm-22b-Q4_K_M.gguf`** | **Q4_K_M** | **13.0 GB** | **⭐ RECOMMENDED. Best F1/VRAM balance for 24GB cards.** |
| `eurollm-22b-Q5_K_M.gguf` | Q5_K_M | 15.0 GB | Higher precision if you have >24GB VRAM. |
| `eurollm-22b-Q6_K.gguf` | Q6_K | 18.0 GB | Near-fp16 performance. Tight fit on 24GB (short context only). |
| `eurollm-22b-Q8_0.gguf` | Q8_0 | 23.0 GB | Maximum fidelity. **Not recommended for 24GB cards** (high OOM risk). |
| `eurollm-22b-IQ4_NL.gguf` | IQ4_NL | 13.0 GB | Alternative non-linear quantization. |
| `eurollm-22b-IQ4_XS.gguf` | IQ4_XS | 12.0 GB | Smaller footprint if VRAM is very tight. |
| `eurollm-22b-IQ3_M.gguf` | IQ3_M | 9.8 GB | Low VRAM usage (<12GB). |
| `eurollm-22b-IQ2_M.gguf` | IQ2_M | 7.5 GB | Extreme compression. |

## 🏆 Benchmark Results (Multilingual NER)

We tested these models on a tough PII/PHI extraction task across 5 languages (IT, EN, FR, DE, ES).

| Model | Average F1 Score | Notes |
|:---|:---:|:---|
| **Q4_K_M** | **0.890** | **Highest score across all tested quantizations** |
| IQ4_XS | 0.886 | Excellent efficiency |
| Q8_0 | 0.883 | Surprisingly, slightly lower on this specific task |
| IQ4_NL | 0.881 | Solid performer |

*Detailed results can be found in the [`benchmark_ner_results.md`](./benchmark_ner_results.md) file.*

## ⚙️ Technical Details

- **Base Model:** `utter-project/EuroLLM-22B-2512`
- **Quantization Tool:** `llama.cpp` (build 4358)
- **Calibration Data:** Custom mix of Wikipedia (general) and domain-specific (Medical/Legal/Finance) articles.
- **Languages Covered:** Italian, English, French, German, Spanish, Portuguese, Dutch, Polish.

*Please contact us to receive the calibration file used to compute the optimization imatrix.*
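For reference, the imatrix workflow described above maps onto `llama.cpp`'s standard tooling. A minimal sketch, assuming an f16 GGUF conversion of the base model and a plain-text calibration corpus; the file names here are illustrative placeholders, not the exact files we used:

```bash
# Minimal sketch of an imatrix-guided quantization with llama.cpp.
# File names are placeholders.

# 1. Compute the importance matrix over the calibration corpus
./llama-imatrix -m eurollm-22b-f16.gguf -f calibration_multilingual.txt -o imatrix.dat

# 2. Quantize, letting the imatrix guide per-tensor precision allocation
./llama-quantize --imatrix imatrix.dat eurollm-22b-f16.gguf eurollm-22b-Q4_K_M.gguf Q4_K_M
```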
## 💻 Usage

**CLI:**

```bash
./llama-cli -m eurollm-22b-Q4_K_M.gguf \
  -p "Extract the entities from this text..." \
  -n 512 -c 4096
```

**Python:**

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU
    n_ctx=8192        # The 13 GB model leaves plenty of room for context on a 24GB card
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
print(res["choices"][0]["message"]["content"])
```
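Since PII/PHI extraction is the headline use case, below is a minimal extraction sketch with `llama-cpp-python`. The prompt wording and entity labels (`PERSON`, `DATE_OF_BIRTH`, `ADDRESS`) are illustrative assumptions, not the exact setup from our benchmark:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

# Illustrative input; the label set below is an assumption, not our benchmark schema.
text = "Mario Rossi, geboren am 12.03.1985, wohnt in der Hauptstraße 5, Berlin."
prompt = (
    "Extract all PII entities (PERSON, DATE_OF_BIRTH, ADDRESS) from the text below. "
    'Reply with a JSON list of {"label": ..., "text": ...} objects and nothing else.\n\n'
    f"Text: {text}"
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # Deterministic decoding suits extraction tasks
    max_tokens=512,
)
print(res["choices"][0]["message"]["content"])
```

If you need guaranteed-valid JSON in production, recent versions of `llama-cpp-python` also accept `response_format={"type": "json_object"}` in `create_chat_completion` to constrain decoding.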