Instructions for using jugaadsrl/EuroLLM-22B-Instruct-GGUF with libraries and local apps.

Libraries

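How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with Transformers (requires the `gguf` package; Transformers dequantizes GGUF weights on load, so this path needs far more memory than the quantized file size):

# Load the model directly from a specific GGUF file in the repository
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id, gguf_file = "jugaadsrl/EuroLLM-22B-Instruct-GGUF", "eurollm-22b-Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file, dtype="auto")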

llama-cpp-python

How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with llama-cpp-python:

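# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub and load it (huggingface-hub is required)
llm = Llama.from_pretrained(
	repo_id="jugaadsrl/EuroLLM-22B-Instruct-GGUF",
	filename="eurollm-22b-Q4_K_M.gguf",
	n_gpu_layers=-1,  # offload all layers to the GPU
)

response = llm.create_chat_completion(
	messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
print(response["choices"][0]["message"]["content"])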
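For long extraction outputs you may prefer to see tokens as they are generated. `create_chat_completion` in llama-cpp-python also accepts `stream=True`, in which case it returns an iterator of OpenAI-style chunks whose `choices[0]["delta"]` may hold a piece of `content`. The sketch below prints and joins those pieces; the helper `collect_stream` and the hypothetical `fake_stream` chunks are our own, included only so the helper can be tried without downloading the model.

```python
from typing import Iterable, Iterator


def collect_stream(chunks: Iterable[dict], echo: bool = True) -> str:
    """Print the text deltas of a streamed chat completion and return the full reply."""
    parts = []
    for chunk in chunks:
        # The first chunk usually carries only the role and the last one only
        # the finish_reason, so "content" may be missing.
        content = chunk["choices"][0].get("delta", {}).get("content")
        if content:
            parts.append(content)
            if echo:
                print(content, end="", flush=True)
    if echo:
        print()
    return "".join(parts)


def fake_stream() -> Iterator[dict]:
    """Hypothetical chunks shaped like a streamed chat completion (no model needed)."""
    yield {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]}
    yield {"choices": [{"index": 0, "delta": {"content": "Rome"}, "finish_reason": None}]}
    yield {"choices": [{"index": 0, "delta": {"content": "."}, "finish_reason": None}]}
    yield {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}


reply = collect_stream(fake_stream())  # prints "Rome."
```

With a loaded model, pass `llm.create_chat_completion(messages=[...], stream=True)` to `collect_stream` in place of `fake_stream()`.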


llama.cpp

How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
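However you install it, `llama-server` exposes an OpenAI-compatible API, by default at `http://127.0.0.1:8080`. The sketch below calls its `/v1/chat/completions` endpoint using only the Python standard library. The helper names `build_chat_request` and `chat` are our own, and `SERVER_URL` assumes the default host and port; adjust it if you started the server with a different `--host` or `--port`.

```python
import json
import urllib.request

# Assumption: llama-server's default address and the OpenAI-compatible chat route.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt: str, url: str = SERVER_URL, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # low temperature tends to suit extraction tasks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of Italy?")` returns the model's reply as a string.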

Use Docker

docker model run hf.co/jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M

Ollama
How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with Ollama:
```
ollama run hf.co/jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jugaadsrl/EuroLLM-22B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for jugaadsrl/EuroLLM-22B-Instruct-GGUF to start chatting

Using Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for jugaadsrl/EuroLLM-22B-Instruct-GGUF to start chatting


Lemonade

How to use jugaadsrl/EuroLLM-22B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull jugaadsrl/EuroLLM-22B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.EuroLLM-22B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

EuroLLM-22B-Instruct-GGUF (Jugaad Optimized)

This repository contains GGUF format quantizations of utter-project/EuroLLM-22B-Instruct.

Why this release?

Unlike standard automated quantizations, this release was optimized by Jugaad to balance accuracy on professional tasks against the limits of consumer hardware.

We focused on making this 22B-parameter model deployable on a single GPU with 24GB of VRAM (e.g. NVIDIA RTX 3090, RTX 4090, or L4) while preserving its accuracy on critical tasks such as PII/PHI extraction (NER) across European languages.

Key Differentiators

Custom Calibration: Instead of random data, we used a multilingual professional dataset (Medical, Legal, Finance, GDPR) for the Importance Matrix (imatrix) calculation.
Verified Performance: We didn't just quantize; we benchmarked. Our Q4_K_M quantization achieves an average F1 score of ~0.89 on multilingual NER tasks, slightly ahead of the larger Q8_0 quantization (0.883) in our tests.
Hardware-Ready: We list the file size and a VRAM recommendation for each quantization to help you avoid out-of-memory (OOM) errors in production.
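The idea behind an importance matrix, in simplified form: run calibration text through the model, record the mean of the squared activations feeding each weight column, and use those values to weight the quantization error so that strongly activated columns are reproduced more faithfully. The sketch below illustrates that principle in plain Python on toy numbers with a naive round-to-grid quantizer; it is not llama.cpp's actual imatrix or quantization code, and all function names are hypothetical.

```python
from typing import List, Sequence


def importance_from_activations(activations: Sequence[Sequence[float]]) -> List[float]:
    """Mean of squared activations per input column, over all calibration tokens."""
    n_tokens = len(activations)
    n_cols = len(activations[0])
    return [sum(tok[j] ** 2 for tok in activations) / n_tokens for j in range(n_cols)]


def quantize(row: Sequence[float], scale: float) -> List[float]:
    """Naive symmetric quantization: snap each weight to a grid of step `scale`."""
    return [round(w / scale) * scale for w in row]


def weighted_error(row: Sequence[float], qrow: Sequence[float], importance: Sequence[float]) -> float:
    """Squared quantization error, weighted by how strongly each column is activated."""
    return sum(imp * (w - q) ** 2 for w, q, imp in zip(row, qrow, importance))


def best_scale(row: Sequence[float], importance: Sequence[float], candidates: Sequence[float]) -> float:
    """Choose the step size that minimizes the importance-weighted error."""
    return min(candidates, key=lambda s: weighted_error(row, quantize(row, s), importance))


calibration = [[3.0, 0.1], [-3.0, -0.1]]  # toy activations: column 0 dominates
importance = importance_from_activations(calibration)
row = [1.0, 0.3]
print(best_scale(row, [1.0, 1.0], [0.5, 0.3]))  # uniform weighting → 0.3
print(best_scale(row, importance, [0.5, 0.3]))  # importance weighting → 0.5
```

With uniform weights the quantizer favors the step that is exact for the small weight; once calibration shows column 0 carries most of the signal, it switches to the step that reproduces column 0 exactly.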

📦 Provided Quantizations

Filename	Type	Size	Use Case
`eurollm-22b-Q4_K_M.gguf`	Q4_K_M	13.0 GB	⭐ RECOMMENDED. Best F1/VRAM balance for 24GB cards.
`eurollm-22b-Q5_K_M.gguf`	Q5_K_M	15.0 GB	Higher precision if you have >24GB VRAM.
`eurollm-22b-Q6_K.gguf`	Q6_K	18.0 GB	Near-fp16 performance. Tight fit on 24GB (short context only).
`eurollm-22b-Q8_0.gguf`	Q8_0	23.0 GB	Maximum fidelity. Not recommended for 24GB cards (high OOM risk).
`eurollm-22b-IQ4_NL.gguf`	IQ4_NL	13.0 GB	Alternative non-linear quantization.
`eurollm-22b-IQ4_XS.gguf`	IQ4_XS	12.0 GB	Smaller footprint if VRAM is very tight.
`eurollm-22b-IQ3_M.gguf`	IQ3_M	9.8 GB	Low VRAM usage (<12GB).
`eurollm-22b-IQ2_M.gguf`	IQ2_M	7.5 GB	Extreme compression.
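As a rough rule of thumb, a file fits when its size plus some headroom for the KV cache and compute buffers stays within your VRAM. The helper below encodes the file sizes from the table; the `headroom_gb` values are our own assumptions rather than measured figures, and the real requirement grows with the context length (`n_ctx`) you choose.

```python
# File sizes in GB, copied from the "Provided Quantizations" table.
QUANT_SIZES_GB = {
    "Q8_0": 23.0,
    "Q6_K": 18.0,
    "Q5_K_M": 15.0,
    "Q4_K_M": 13.0,
    "IQ4_NL": 13.0,
    "IQ4_XS": 12.0,
    "IQ3_M": 9.8,
    "IQ2_M": 7.5,
}


def fitting_quants(vram_gb: float, headroom_gb: float = 3.0) -> list[str]:
    """Quantizations whose file size plus `headroom_gb` fits in `vram_gb`, largest first.

    `headroom_gb` is an assumed allowance for the KV cache and buffers;
    raise it for longer contexts.
    """
    budget = vram_gb - headroom_gb
    fits = [name for name, size in QUANT_SIZES_GB.items() if size <= budget]
    # sorted() stays stable with reverse=True, so equal sizes keep table order.
    return sorted(fits, key=QUANT_SIZES_GB.__getitem__, reverse=True)


def filename_for(quant: str) -> str:
    """Map a quantization type to the file name used in this repository."""
    return f"eurollm-22b-{quant}.gguf"


print(fitting_quants(12, headroom_gb=2.0))  # → ['IQ3_M', 'IQ2_M']
print(filename_for(fitting_quants(24, headroom_gb=10.0)[0]))  # → eurollm-22b-Q4_K_M.gguf
```

Reserving about 10 GB for context on a 24GB card lands on Q4_K_M, in line with the recommendation above; a smaller headroom admits Q6_K, which the table flags as a tight fit for short contexts only.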

🏆 Benchmark Results (Multilingual NER)

We tested these models on a tough PII/PHI extraction task across 5 languages (IT, EN, FR, DE, ES).

Quantization	Average F1 Score	Notes
Q4_K_M	0.890	Highest score across all tested quantizations
IQ4_XS	0.886	Excellent efficiency
Q8_0	0.883	Slightly lower than Q4_K_M on this task despite its higher precision
IQ4_NL	0.881	Solid performer

Detailed results can be found in the benchmark_ner_results.md file.
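Entity-level F1 is the harmonic mean of precision and recall over the extracted entities. The sketch below assumes exact matching on `(entity_type, text)` pairs and a plain average of per-language scores; the scoring rules actually behind the table (partial matches, micro vs. macro averaging) are not stated here, so treat this as an illustration rather than the benchmark's code. The sample entities are invented.

```python
from statistics import mean
from typing import Dict, Set, Tuple

Entity = Tuple[str, str]  # (entity_type, surface text), e.g. ("PERSON", "Mario Rossi")


def f1_score(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Exact-match entity-level F1 for one document."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_f1(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language F1 scores (an assumed aggregation)."""
    return mean(per_language.values())


gold = {("PERSON", "Mario Rossi"), ("DATE", "12/03/1980"), ("ID", "RSSMRA80C12H501Z")}
pred = {("PERSON", "Mario Rossi"), ("DATE", "12/03/1980"), ("LOCATION", "Roma")}
print(round(f1_score(pred, gold), 3))  # → 0.667 (2 of 3 correct, 2 of 3 found)
```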

⚙️ Technical Details

Base Model: utter-project/EuroLLM-22B-2512
Quantization Tool: llama.cpp (build 4358)
Calibration Data: Custom mix of Wikipedia (General) + Domain Specific (Medical/Legal/Finance) articles.
Languages Covered: Italian, English, French, German, Spanish, Portuguese, Dutch, Polish.

Contact us if you would like the calibration file used to compute the imatrix.

💻 Usage

CLI:

./llama-cli -m eurollm-22b-Q4_K_M.gguf -ngl 99 -p "Extract the entities from this text..." -n 512 -c 4096

Python:

from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1, # Offload to GPU
    n_ctx=8192       # 13GB model leaves plenty of room for context on a 24GB card
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
print(res["choices"][0]["message"]["content"])
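For the PII/PHI extraction use case it helps to ask for JSON and to parse the reply defensively, since a quantized model may wrap its answer in extra prose. The prompt wording, the `type`/`text` schema, and the helpers `build_messages` and `parse_entities` are illustrative assumptions, not a format the model is known to follow; the sample reply is made up so the parser can run without the model.

```python
import json
import re
from typing import Dict, List

# Hypothetical instruction; adapt the wording and entity types to your task.
SYSTEM_PROMPT = (
    "Extract all personal data (PII/PHI) from the user's text. "
    'Reply only with a JSON array of objects with keys "type" and "text".'
)


def build_messages(text: str) -> List[Dict[str, str]]:
    """Chat messages suitable for llm.create_chat_completion(messages=...)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


def parse_entities(reply: str) -> List[Dict[str, str]]:
    """Extract the JSON array from a model reply; return [] if none parses."""
    # Greedy match from the first "[" to the last "]" in the reply.
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only entries that have both expected keys.
    return [e for e in data if isinstance(e, dict) and "type" in e and "text" in e]


sample_reply = 'Sure, here are the entities: [{"type": "PERSON", "text": "Mario Rossi"}]'
print(parse_entities(sample_reply))  # → [{'type': 'PERSON', 'text': 'Mario Rossi'}]
```

With the model loaded as above, `parse_entities(llm.create_chat_completion(messages=build_messages(text))["choices"][0]["message"]["content"])` ties the two together.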


Format: GGUF
Model size: 23B params
Architecture: llama


