
EuroLLM-22B-Instruct-GGUF (Jugaad Optimized)

This repository contains GGUF format quantizations of utter-project/EuroLLM-22B-Instruct.

Why this release?

Unlike standard automated quantizations, this release was specifically optimized by Jugaad to balance professional performance with consumer hardware constraints.

We focused on enabling the deployment of this powerful 22B parameter model on single 24GB VRAM GPUs (NVIDIA RTX 3090, RTX 4090, L4) while preserving its capability in critical tasks like PII/PHI Extraction (NER) across European languages.

Key Differentiators

  1. Custom Calibration: Instead of random data, we used a multilingual professional dataset (Medical, Legal, Finance, GDPR) for the Importance Matrix (imatrix) calculation.
  2. Verified Performance: We benchmarked every quantization instead of only producing it. Our Q4_K_M quantization achieves an average F1 score of ~0.89 on multilingual NER tasks, slightly ahead of the larger Q8_0 quantization (0.883) in the same benchmark.
  3. Hardware-Ready: We provide specific memory usage data to ensure zero OOM errors in production.

📦 Provided Quantizations

| Filename | Type | Size | Use Case |
|---|---|---|---|
| eurollm-22b-Q4_K_M.gguf | Q4_K_M | 13.0 GB | ⭐ RECOMMENDED. Best F1/VRAM balance for 24GB cards. |
| eurollm-22b-Q5_K_M.gguf | Q5_K_M | 15.0 GB | Higher precision if you have >24GB VRAM. |
| eurollm-22b-Q6_K.gguf | Q6_K | 18.0 GB | Near-fp16 performance. Tight fit on 24GB (short context only). |
| eurollm-22b-Q8_0.gguf | Q8_0 | 23.0 GB | Maximum fidelity. Not recommended for 24GB cards (high OOM risk). |
| eurollm-22b-IQ4_NL.gguf | IQ4_NL | 13.0 GB | Alternative non-linear quantization. |
| eurollm-22b-IQ4_XS.gguf | IQ4_XS | 12.0 GB | Smaller footprint if VRAM is very tight. |
| eurollm-22b-IQ3_M.gguf | IQ3_M | 9.8 GB | Low VRAM usage (<12GB). |
| eurollm-22b-IQ2_M.gguf | IQ2_M | 7.5 GB | Extreme compression. |
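
To make the size trade-off concrete, here is a minimal Python sketch that picks the largest listed file fitting a given VRAM budget. The file sizes are copied from the table above; the `headroom_gb` default (space reserved for the KV cache and runtime buffers) is an assumption for illustration, not a measured figure, and real memory use depends on context length and backend.

```python
# File sizes in GB, copied from the quantization table above.
QUANT_SIZES_GB = {
    "Q8_0": 23.0,
    "Q6_K": 18.0,
    "Q5_K_M": 15.0,
    "Q4_K_M": 13.0,
    "IQ4_NL": 13.0,
    "IQ4_XS": 12.0,
    "IQ3_M": 9.8,
    "IQ2_M": 7.5,
}


def largest_quant_that_fits(vram_gb, headroom_gb=4.0):
    """Return the largest listed quantization whose file fits in
    vram_gb minus headroom_gb, or None if nothing fits.

    headroom_gb is an assumed reserve for KV cache and buffers.
    """
    budget = vram_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None
    # max() compares by size first; ties fall back to the name.
    return max(fitting)[1]


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(vram, "GB ->", largest_quant_that_fits(vram))
```

"Largest that fits" is not the same as "recommended": the benchmark in this card favours Q4_K_M on 24GB cards because it leaves more room for context.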

๐Ÿ† Benchmark Results (Multilingual NER)

We tested these models on a tough PII/PHI extraction task across 5 languages (IT, EN, FR, DE, ES).

| Model | Average F1 Score | Notes |
|---|---|---|
| Q4_K_M | 0.890 | Highest score across all tested quantizations |
| IQ4_XS | 0.886 | Excellent efficiency |
| Q8_0 | 0.883 | Surprisingly slightly lower on this specific task |
| IQ4_NL | 0.881 | Solid performer |
Detailed results can be found in the benchmark_ner_results.md file.
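
For readers unfamiliar with the metric, the sketch below shows one common way to compute an entity-level F1 score: an extracted entity counts as correct only if both its type and its text match a gold annotation exactly. This is an illustrative assumption; the exact matching rules used for the benchmark above are not described in this card, and the example entities are made up.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level F1 over (type, text) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for a single document.
gold = {
    ("PERSON", "Maria Rossi"),
    ("CITY", "Milano"),
    ("IBAN", "IT60X0542811101000000123456"),
}
predicted = {
    ("PERSON", "Maria Rossi"),
    ("CITY", "Milano"),
    ("CITY", "Roma"),  # false positive; the IBAN was missed
}

if __name__ == "__main__":
    print(round(entity_f1(gold, predicted), 3))
```

An "Average F1 Score" such as the one reported in the table would then be the mean of per-language scores, again assuming a simple unweighted average.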

โš™๏ธ Technical Details

  • Base Model: utter-project/EuroLLM-22B-2512
  • Quantization Tool: llama.cpp (build 4358)
  • Calibration Data: Custom mix of Wikipedia (General) + Domain Specific (Medical/Legal/Finance) articles.
  • Languages Covered: Italian, English, French, German, Spanish, Portuguese, Dutch, Polish.

Please contact us to receive the file used to calculate the optimization imatrix.
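
As a rough illustration of how an importance matrix is typically produced and applied with llama.cpp, the sketch below assembles the two command lines involved: `llama-imatrix` to compute the matrix from a calibration text file, and `llama-quantize --imatrix` to use it. All file names are hypothetical placeholders, and the exact commands and options used for this release are not published here, so treat this as an assumption about the general workflow.

```python
import shlex


def imatrix_commands(f16_model, calibration_txt, out_gguf,
                     quant_type="Q4_K_M", imatrix_path="imatrix.dat"):
    """Build argv lists for computing an imatrix and quantizing with it.

    All paths are placeholders supplied by the caller.
    """
    # Step 1: run the full-precision model over the calibration text.
    compute = ["llama-imatrix", "-m", f16_model, "-f", calibration_txt, "-o", imatrix_path]
    # Step 2: quantize, weighting tensors by the collected importance data.
    quantize = ["llama-quantize", "--imatrix", imatrix_path, f16_model, out_gguf, quant_type]
    return compute, quantize


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in imatrix_commands("eurollm-22b-f16.gguf", "calibration.txt",
                                "eurollm-22b-Q4_K_M.gguf"):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or each list can be passed to `subprocess.run(cmd, check=True)`.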

💻 Usage

CLI:

./llama-cli -m eurollm-22b-Q4_K_M.gguf -p "Extract the entities from this text..." -n 512 -c 4096

Python:

from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1, # Offload to GPU
    n_ctx=8192       # 13GB model leaves plenty of room for context on a 24GB card
)

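res = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
# The response follows the OpenAI chat format; print only the assistant's reply
print(res["choices"][0]["message"]["content"])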
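
Since this release targets PII/PHI extraction, the following sketch shows one possible way to wrap the `Llama` object above for that task: a system prompt asks for a JSON list, and the reply is parsed defensively. The prompt wording, the entity types, and the helper names (`build_ner_messages`, `parse_entities`, `extract_entities`) are hypothetical and are not the prompts used in the benchmark.

```python
import json

# Hypothetical prompt; adapt the entity types to your own schema.
SYSTEM_PROMPT = (
    "Extract all personal data from the user's text. "
    'Reply only with a JSON list of objects like {"type": "PERSON", "text": "..."}.'
)


def build_ner_messages(text):
    """Build the chat messages for one extraction request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


def parse_entities(raw_reply):
    """Parse the model reply into (type, text) pairs; return [] on malformed output."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        (item["type"], item["text"])
        for item in data
        if isinstance(item, dict) and "type" in item and "text" in item
    ]


def extract_entities(llm, text):
    """Run one extraction with a llama_cpp.Llama-like object."""
    res = llm.create_chat_completion(
        messages=build_ner_messages(text),
        temperature=0.0,  # deterministic output is usually preferable for extraction
    )
    return parse_entities(res["choices"][0]["message"]["content"])
```

With the `llm` object created above, `extract_entities(llm, "Il paziente Mario Bianchi vive a Torino.")` would return a list of `(type, text)` tuples, or an empty list if the model does not return valid JSON.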