Instructions to use SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct")
model = AutoModelForCausalLM.from_pretrained("SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
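
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000; `build_chat_request` and `chat` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL = "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct"


def build_chat_request(content, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires the server to be running; for the SGLang server, pass `base_url="http://localhost:30000"`.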

Use Docker

docker model run hf.co/SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct

SGLang

How to use SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct with Docker Model Runner:
```
docker model run hf.co/SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct
```

Access to EuroLLM-ISO24495-1.7b-Instruct (v0.2)

This model is released under CC-BY-NC-4.0 (non-commercial). The form below helps us understand who is using the model and prioritize improvements. Approval is automatic once the form is submitted.

By submitting this form you confirm that (1) your intended use complies with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have read the Limitations section of the model card. For commercial use, please contact hf@semplifica.ai.

EuroLLM-ISO24495-1.7b-Instruct (v0.2)

A fine-tuned EuroLLM-1.7B-Instruct specialised in ISO 24495-1 (Plain Language) compliance analysis of legal, administrative and technical texts across six European languages: Italian, English, Portuguese, Spanish, French, German.

Given a document, the model emits a structured XML analysis with: a compliance score (0–100), a binary verdict, a list of violation spans with character-level offsets and corrective suggestions, and a prioritised checklist of corrective actions.

Version: v0.2, trained on about 23,600 records from the v3 dataset (hybrid synthetic + human-curated): about 18,600 task records, with verdict balance per language, plus a 21 % anti-forgetting mix (EuroBlocks instruct conversations). Target deployment: edge / consumer GPU. The model runs in about 3.4 GB VRAM in bf16 and about 1.7 GB in 8-bit on a single consumer card.

Edge profile

Aspect	Value
Parameters	1.7 B
Architecture	Llama-style decoder, native ChatML chat template
VRAM (bf16 inference)	~3.4 GB
VRAM (8-bit inference)	~1.7 GB
Context window	4,096 tokens
Languages	IT, EN, FR, DE, ES, PT
Target hardware	single consumer GPU (RTX 3060 12 GB and up), or CPU/laptop in 8-bit
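
The two VRAM figures in the table are consistent with a weights-only estimate: parameter count times bytes per parameter. The sketch below reproduces them under that assumption; activations and the KV cache are not counted, so real usage will be somewhat higher.

```python
PARAMS = 1.7e9  # parameter count from the table above


def weight_gb(params, bytes_per_param):
    """Weight memory in GB (10^9 bytes), ignoring activations and KV cache."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_gb(PARAMS, 2)  # bf16 stores 2 bytes per parameter
int8_gb = weight_gb(PARAMS, 1)  # 8-bit stores 1 byte per parameter
print(f"bf16: ~{bf16_gb:.1f} GB, 8-bit: ~{int8_gb:.1f} GB")
```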

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct"

# 8-bit loading → ~1.7 GB VRAM. For bf16 (~3.4 GB) drop the quantization_config.
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
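    # Fall back to EOS only when no pad token is defined (a pad id of 0 is valid).
    pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False, pad_token_id=pad_id)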
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

System prompts in the other five languages: see § Multilingual prompts.

Output format

The model emits a single XML block with four fields:

<ANALYSIS>
<SCORE>42</SCORE>
<VERDICT>non_conforme</VERDICT>

<SPANS>
[
  {
    "text_fragment": "The Parties hereby acknowledge, in light of the foregoing premises...",
    "violation_type": "legalese_overload",
    "suggestion": "Both parties agree, based on the above context, that...",
    "start_char": 0,
    "end_char": 78,
    "severity": "high"
  }
]
</SPANS>

<CHECKLIST>
1. Replace archaic legal formulas with direct expressions.
2. Break long sentences into shorter periods.
3. Define technical terms on first use.
</CHECKLIST>
</ANALYSIS>

Fields

Field	Value	Notes
`<SCORE>`	integer `0`–`100`	100 = fully compliant
`<VERDICT>`	`conforme` | `non_conforme`	internal threshold around 60
`<SPANS>`	inline JSON array	violations with char-level spans
`<CHECKLIST>`	numbered list	corrective actions in priority order

`violation_type` vocabulary (10 ISO-aligned categories)

sentence_too_long, passive_voice_overuse, undefined_jargon, buried_action, nominalization, double_negative, ambiguous_reference, missing_structure, inconsistent_terminology, legalese_overload.

`severity`

low | medium | high

Reference parser

A tolerant Python parser (handles truncated output and non-standard JSON escapes) is shipped together with the model code (see text_utils.py in the training-scripts repository).
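
For readers without access to that repository, the following is a minimal sketch of a tolerant parser for the format above. It is not the authors' `text_utils.py`; the names `parse_analysis` and `_field` are hypothetical, and the recovery strategy (read to the end of the output when a closing tag is missing, drop spans when the JSON does not parse) is an assumption.

```python
import json
import re


def _field(raw, tag):
    """Return the text inside <tag>...</tag>; tolerate a missing closing tag."""
    m = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", raw, re.DOTALL)
    return m.group(1).strip() if m else None


def parse_analysis(raw):
    """Parse the XML-like analysis into a dict; absent fields become None or []."""
    score = _field(raw, "SCORE")
    spans_raw = _field(raw, "SPANS")
    checklist_raw = _field(raw, "CHECKLIST") or ""
    try:
        spans = json.loads(spans_raw) if spans_raw else []
    except json.JSONDecodeError:
        spans = []  # truncated or malformed JSON: drop the spans, keep the rest
    checklist = [
        re.sub(r"^\d+\.\s*", "", line.strip())
        for line in checklist_raw.splitlines()
        if re.match(r"^\s*\d+\.", line)
    ]
    return {
        "score": int(score) if score and score.isdigit() else None,
        "verdict": _field(raw, "VERDICT"),
        "spans": spans,
        "checklist": checklist,
    }
```

A stricter variant could try to salvage the complete objects from a truncated `<SPANS>` array instead of discarding it.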

Examples

Qualitative examples (a real Italian NDA and an English safety manual) will be added in a follow-up commit, with the same per-document structure used for the 9B model card.

Evaluation

Evaluated on 200 blind samples drawn from the v3 held-out test split, stratified by (language × doc_type × difficulty × verdict), never seen during training or validation.

Metrics

Metric	Prod threshold	Acceptable threshold	v0.2 result	Status
`score_mae` (mean absolute error on 0–100 score)	≤ 8.0	≤ 12.0	3.66	✅ PROD
`verdict_f1` (binary F1 conforme / non_conforme)	≥ 0.88	≥ 0.80	0.9396	✅ PROD
`verdict_precision`	—	—	0.9091	(high)
`verdict_recall`	—	—	0.9722	(very high)
`false_positive_rate` (on `conforme` class)	≤ 0.08	≤ 0.15	0.0547	✅ PROD
`span_f1` (IoU char-level ≥ 0.5)	≥ 0.72	≥ 0.62	0.2721	⚠️ below accept
`checklist_rouge_l`	≥ 0.55	≥ 0.45	0.2235	⚠️ below accept (lexical)
`checklist_bertscore_f1` (mdeberta-v3-base)	≥ 0.78	≥ 0.65	0.6952	✅ accept (semantic)
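
To make the `span_f1` row concrete, here is one way a character-level IoU match at threshold 0.5 can be turned into an F1 score. This is an illustrative sketch, not the evaluation script used for the table: the greedy one-to-one matching and the function names are assumptions, and the actual script may add conditions such as requiring the same `violation_type`.

```python
def iou(a, b):
    """Character-level intersection over union of two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def span_f1(predicted, gold, threshold=0.5):
    """F1 over spans; a prediction counts if it matches a not-yet-used gold span."""
    unmatched = list(gold)
    true_pos = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a span whose offsets drift by more than about a third of its length no longer counts as a match, even when the violation itself was identified correctly.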

Interpretation

Strengths

High verdict recall (0.97): the model is conservative on the non-compliant class and rarely misses a problematic document — useful for triage workflows where false negatives are more costly than false positives.
Production-grade score calibration: MAE 3.66 on a 0–100 scale, well below the production threshold of 8. Quantitative agreement with the ground truth is tight despite the small footprint.
Binary verdict F1 above production threshold: 0.94 vs the 0.88 threshold; precision 0.91, recall 0.97 — favouring recall is intentional in a triage context.
False positive rate (5.5 %) under the production cap of 8 %: the model does not over-flag compliant texts.
Strong semantic similarity on the checklist (BERTScore F1 0.70, above the acceptable threshold of 0.65). Although the model is about 5× smaller than the 9B variant, the gap on the semantic dimension of the checklist is only −3 % (vs −16 % on ROUGE-L): the main difference is paraphrasing, not relevance.
Robust XML schema adherence: canonical tags, canonical violation vocabulary, coherent character-level offsets across all six languages.

Measured weaknesses

Span F1 0.27 (below the 0.62 acceptable threshold): on documents with many violations the model reports fewer spans than the ground truth, or with offset drift that fails the IoU ≥ 0.5 match. The reduced parameter count limits memorisation of precise span boundaries — this is the trade-off accepted in exchange for edge deployability.
Checklist ROUGE-L 0.22 (below the 0.45 acceptable threshold): corrective items are semantically plausible but lexically divergent from the ground truth (ROUGE penalises paraphrasing). The semantic metric reported above, BERTScore F1 (0.70), clears its acceptable threshold on the same outputs.
Verdict precision (0.91) lower than recall (0.97): about 1 in 11 flagged documents is a false positive. Acceptable for screening, but if you need high-precision flagging consider a higher-capacity model.
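
The verdict figures quoted above are mutually consistent, which the short check below illustrates using the precision and recall from the metrics table. It assumes `non_conforme` is the positive (flagged) class, as the text implies.

```python
precision = 0.9091
recall = 0.9722

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of flagged documents that are false positives (false discovery rate).
false_discovery = 1 - precision

print(f"F1 = {f1:.4f}")
print(f"about 1 in {1 / false_discovery:.0f} flagged documents is a false positive")
```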

Test set composition

Languages: IT 50 %, EN 15 %, PT 12 %, ES 10 %, FR 8 %, DE 5 % (natural distribution preserved in val/test, balanced in train)
Document types (10): 9 administrative/legal categories plus a catch-all other category for stylistic diversity
Difficulty buckets: easy / medium / hard / very_hard

The aggregate metrics are averaged across all six languages.

Robustness on a newer distribution (v4 test split)

To check that the model does not overfit to the v3 distribution, we also evaluated it on a 200-sample blind set drawn from the v4 dataset test split, a corpus in which 87 % of records are not present in v3 (different sentence-level chunking, additional document types, human-curated re-imports from public datasets).

Metric	test v3 (in-distribution)	test v4 (87 % unseen)	Δ
`score_mae`	3.66	4.83	+31.8 %
`verdict_f1`	0.9396	0.9538	+1.5 %
`verdict_precision`	0.9091	0.9300	+2.3 %
`verdict_recall`	0.9722	0.9789	+0.7 %
`false_positive_rate`	0.0547	0.0667	+1.2 pp
`span_f1`	0.2721	0.2375	−12.7 %
`checklist_rouge_l`	0.2235	0.1831	−18.1 %
`checklist_bertscore_f1`	0.6952	0.6936	−0.2 %

Reading:

The model holds up reasonably well on the v4 distribution: score_mae, verdict_f1, false_positive_rate and checklist_bertscore_f1 stay within the acceptable thresholds documented above, while span_f1 and checklist_rouge_l remain below them, as on v3.
verdict_f1 and verdict_precision actually improve slightly on the v4 split (the v4 corpus contains more decisive conforme/non_conforme cases that the model classifies confidently).
The largest relative degradations are in score_mae (+32 %) and checklist_rouge_l (−18 %). The semantic BERTScore F1 stays essentially flat (−0.2 %), meaning the corrective suggestions remain on-target even on the unseen distribution; only the surface wording diverges more.
span_f1 drops 13 %, confirming that span localisation is the weakest dimension on out-of-distribution records — consistent with the 9B sibling on the same v4 split.

For a model about 5× smaller than the 9B variant, holding binary verdict quality essentially intact on a largely unseen distribution is the key edge-deployment signal.

Intended use

Recommended use cases

Edge / on-device automated triage of contractual, regulatory and administrative documents to flag problematic clauses from a plain-language perspective.
Decision-support tool for editors, compliance officers, in-house legal teams on hardware with limited VRAM.
First-draft generation of accessible rewrites for portions of a document.
Teaching and research on ISO 24495-1 and plain language across multilingual corpora.

Out-of-scope use cases

Fully automated decisions without human review. Output must always be validated by an expert, especially for legally consequential implications.
Domains outside training scope: clinical/medical text, purely academic scientific writing, creative literature. The model is optimised on administrative/legal document types of the training set.
Languages other than the six supported. Performance on any other language is not guaranteed.
Legal or compliance advice substitute. The model identifies readability issues, not legal correctness or compliance with other regulations.

Limitations

Hybrid training set (about 18,600 task records): the first 9,000 or so records are fully synthetic (gemini-2.5-flash + gemini-3.5-flash recovery); the remaining records are built on top of human-curated source documents with partial assisted re-annotation by gemini-3.5-flash under human review. Generator-side biases have not been formally measured.
Not validated on standard public benchmarks. The reported metrics come from an internal blind test set (200 samples) drawn from the same distribution as the training set.
Smaller model capacity than 9B-class variants. Expect lower span_f1 and checklist_rouge_l than a 9B fine-tuned on the same dataset — the trade-off here is edge deployability (1.7 GB in 8-bit vs about 9 GB).
Per-language variability. Italian is the largest single language in training (about 43 % of task split). Expect slightly better calibration on Italian than on German (14 %).
Short context window — hard 4,096-token limit (vs 32K of larger EuroLLM variants). See the dedicated section § Working with the 4K context window for the recommended input-size policy.
Long outputs may be truncated. On documents with many violations the generation can exceed 2,048 tokens; we recommend max_new_tokens=3072+ combined with a parser tolerant of unclosed XML tags.

Working with the 4K context window

The base model has a hard max_position_embeddings = 4,096 (about 3,000 input words as an absolute ceiling). During training we used a sequence length of 4,096 tokens, including the assistant XML output which itself can consume 1,000 to 2,000 tokens for documents with many violations.
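
Because input and output share the same 4,096-token window, the generation budget directly caps the prompt size. The helper names below are hypothetical; the sketch only encodes the arithmetic implied by the paragraph above.

```python
CONTEXT_LIMIT = 4096  # max_position_embeddings stated above


def max_prompt_tokens(max_new_tokens, limit=CONTEXT_LIMIT):
    """Largest prompt (system + user + template tokens) that leaves room for the output."""
    return max(0, limit - max_new_tokens)


def fits(prompt_tokens, max_new_tokens, limit=CONTEXT_LIMIT):
    """True if the prompt plus the requested generation budget stays within the window."""
    return prompt_tokens + max_new_tokens <= limit


print(max_prompt_tokens(3072))  # 1024
print(fits(1500, 3072))         # False
```

In practice `prompt_tokens` can be measured as `len(tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))` before calling `generate`.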

General rule of thumb for using this model in production:

Feed the model complete sentences and complete sections of the document. Do not split mid-sentence. We recommend keeping each single request under 500 words of input text.

As a reference:

Input size	Approximate equivalent (A4)
500 words	~1 page A4 of dense contract text, or ~1.5 pages of standard administrative prose
1,000 words	~2 pages A4 of dense contract text
3,000 words (≈ hard ceiling)	~4–5 pages A4 — leaves little room for the output

If your document exceeds the recommended limit of about 500 words per request:

Pre-chunk at sentence boundaries (never split a sentence), using a sentence splitter that is abbreviation-aware (art., n., Sig., etc.) for the six supported languages. Aim for chunks of 300–500 words each.
Preserve natural document structure as chunk boundaries when possible: article, section, clause. This keeps each request semantically coherent and produces better-scoped span offsets.
Run the model once per chunk and concatenate the resulting <SPANS> arrays. The start_char / end_char offsets are chunk-local — remap them to the original document by adding the chunk offset.
Do not deduplicate spans across chunks: if the same violation appears in two adjacent chunks, both are valid local findings.
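
The chunk-and-remap steps above can be sketched as follows. This is a simplified illustration with hypothetical function names: the regex splitter is deliberately naive and is not abbreviation-aware, so a production pipeline would need a proper sentence segmenter in its place.

```python
import re

# Naive splitter: break after ., ! or ? followed by whitespace. It is NOT
# abbreviation-aware; swap in a proper sentence segmenter for production.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text, max_words=500):
    """Return (offset, chunk) pairs; each chunk holds whole sentences only."""
    starts = [0] + [m.end() for m in _SENTENCE_END.finditer(text)]
    bounds = list(zip(starts, starts[1:] + [len(text)]))
    chunks, chunk_start, words = [], 0, 0
    for start, end in bounds:
        n = len(text[start:end].split())
        if words and words + n > max_words:
            chunks.append((chunk_start, text[chunk_start:start]))
            chunk_start, words = start, 0
        words += n
    if chunk_start < len(text):
        chunks.append((chunk_start, text[chunk_start:]))
    return chunks


def remap_spans(spans, chunk_offset):
    """Shift chunk-local start_char/end_char to offsets in the original document."""
    return [
        {**s, "start_char": s["start_char"] + chunk_offset,
         "end_char": s["end_char"] + chunk_offset}
        for s in spans
    ]
```

Each chunk is sent to the model as its own request; the spans parsed from that response are then passed to `remap_spans` with the chunk's offset before the arrays are concatenated. A single sentence longer than `max_words` is kept whole in its own chunk rather than split.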

Going above the 500-word recommendation generally still works (up to about 1,000 to 1,500 words), but you trade off:

Span offset precision drops (the model has fewer training samples in that input-size bucket).
Recall on violations late in the document drops (attention spreads thin).
Risk of output truncation grows (long input + long output approaches the 4 K ceiling).

For corpora where most documents are longer than about 2,000 words, consider the 9B variant (SemplificaAI/EuroLLM-ISO24495-9b-Instruct, 32 K context) instead.

Training details


Base model	utter-project/EuroLLM-1.7B-Instruct (Apache 2.0)
Architecture	Llama-style decoder, 1.7B parameters, native ChatML chat template
Fine-tuning method	LoRA in bf16 on a bf16 (non-quantised) base
LoRA rank / alpha	32 / 64
LoRA target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Trainable parameters	28,704,768 (1.73 % of total)
Framework	Axolotl 0.16.1; Liger kernel (fused linear + cross-entropy)
Sample packing	yes
Sequence length	4,096 tokens
Epochs	2
Optimizer steps	700 (350 per epoch)
Batch	1 micro × 16 gradient accumulation = 16 effective
Optimizer	Paged AdamW 8-bit (bitsandbytes)
Learning rate	2.0e-4, cosine schedule, 100-step warmup
Loss masking	assistant tokens only (`roles_to_train: ["assistant"]`)
Hardware	1× NVIDIA RTX 4090 (24 GB)
Training time	~1h 40m wall clock
Final loss	~0.42 (from ~0.91 at step 5, −54 %)
Peak VRAM	~8.3 GB / 24 GB
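
The trainable-parameter count in the table can be reproduced from the LoRA rank and the shapes of the target modules. The base-model dimensions used below (hidden size 2048, key/value width 1024, MLP width 5632, 24 layers) are not stated in this card; they are assumptions, chosen because they yield exactly the published figure.

```python
# Assumed base-model dimensions (not stated in this card).
HIDDEN = 2048        # hidden size
KV_DIM = 1024        # key/value projection width (grouped-query attention)
INTERMEDIATE = 5632  # MLP intermediate size
LAYERS = 24
RANK = 32            # LoRA rank from the table above

# (input width, output width) of each LoRA target module.
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A rank-r adapter adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) weights.
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in MODULES.values())
trainable = per_layer * LAYERS
print(f"{trainable:,}")  # 28,704,768
```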

The model published in this repository is the final merge of the LoRA adapter into the base model, saved as a single model.safetensors file in bf16 (about 3.4 GB). For 8-bit inference, load with BitsAndBytesConfig(load_in_8bit=True) as shown in the Quick start.

Dataset

The model was trained on semplifica.Language v3, an internal hybrid (synthetic + human-curated) dataset of 28,410 records (23,589 train / 2,194 validation / 2,627 test) covering six European languages.

Composition

Train mix:
- task_iso24495: 18,589 records (79 %) — the primary compliance task.
- euroblocks_instruct: 5,000 records (21 %) — anti-forgetting, general-purpose instruct conversations to retain broad capability.
Origin of the task records (the task_iso24495 portion):
- First about 9,000 records: fully synthetic, generated with gemini-2.5-flash (with gemini-3.5-flash recovery passes on blocking defects).
- Remaining about 9,500+ records: built on top of human-curated source documents from selected public/proprietary datasets, cleaned and normalised, then partially re-annotated with assistance from gemini-3.5-flash under human review. This phase brought real-world stylistic variety, edge-case clauses, and harder negative examples that pure synthetic generation underproduced.
Format: ChatML triples (system, user, assistant) with structured XML output (matching the schema documented in § Output format).
Languages (task split): IT 43 %, EN 26 %, FR 17 %, PT 16 %, DE 14 %, ES 11 %.
Document types (10): service contracts, privacy notices, general terms & conditions, business letters, internal regulations, tender notices, insurance policies, consent forms, administrative communications, plus an other catch-all for stylistic diversity.
Difficulty buckets: easy / medium / hard / very_hard, with target word counts and violation density scaled accordingly.
Splits: stratified by (lang × doc_type × difficulty × verdict) to keep distribution consistent across train / val / test. Val/test preserve natural distribution; train is balanced for verdict (40–60 % conforme per language).

Generation and curation pipeline

Synthetic generation (first about 9,000 records): initial bulk generation with gemini-2.5-flash, recovery pass with gemini-3.5-flash for blocking defects.
Human-curated phase (later records): source documents from selected datasets, cleaned and normalised, then passed through gemini-3.5-flash for assisted re-annotation, with human review on the violation labels and span boundaries.
Sentence-aware chunking for long documents (max 500 words per chunk, abbreviation-aware for IT/EN/FR/DE/ES/PT).
Algorithmic defect scan and repair across the whole corpus: case-insensitive matching, whitespace normalisation, span re-localization.
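
As an illustration of what span re-localization with case-insensitive matching and whitespace normalisation can look like, the sketch below searches for a span's `text_fragment` in the source text and returns corrected offsets. It is not the pipeline's actual code; `relocate_span` is a hypothetical name and the regex-based approach is an assumption.

```python
import re


def relocate_span(text, fragment):
    """Find fragment in text, ignoring case and whitespace differences.

    Returns (start_char, end_char) of the first match, or None if absent.
    """
    tokens = fragment.split()
    if not tokens:
        return None
    # Allow any run of whitespace between consecutive words of the fragment.
    pattern = r"\s+".join(re.escape(t) for t in tokens)
    m = re.search(pattern, text, re.IGNORECASE)
    return (m.start(), m.end()) if m else None
```

The same idea can be applied at inference time to repair `start_char` / `end_char` values that have drifted, as long as the model's `text_fragment` still matches the source wording.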

Each record carries provenance metadata: id, lang, doc_type, difficulty, score, verdict, source.

Distribution

The dataset is not currently published. The decision on public release is being evaluated separately from this model release.

Multilingual prompts

The model accepts system prompts in all six target languages. Examples optimised to match the training distribution:

SYSTEM_PROMPTS = {
    "it": "Sei un esperto di plain language secondo ISO 24495-1:2023. ...",
    "en": "You are an expert in plain language according to ISO 24495-1:2023. ...",
    "fr": "Vous êtes expert en langage clair selon ISO 24495-1:2023. ...",
    "de": "Sie sind Experte für Verständlichkeit gemäß ISO 24495-1:2023. ...",
    "es": "Eres experto en lenguaje claro según ISO 24495-1:2023. ...",
    "pt": "Você é especialista em linguagem simples segundo a ISO 24495-1:2023. ...",
}

The full set of prompts is available in iso_principles.py in the companion training-scripts repository.

License

The fine-tuned model (this repository) is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Non-commercial use is freely permitted (research, academia, internal evaluation). For commercial use, please contact the authors (see § Contact).

The base model (utter-project/EuroLLM-1.7B-Instruct) is released under the Apache License 2.0 (© 2024 UTTER project). The distribution of this derivative work incorporates and attributes the base model as required by Apache 2.0. See ATTRIBUTION.md for full details.

Citation

If you use this model in academic publications or research materials, please cite as:

@misc{semplifica_iso24495_1_7b_v02_2026,
  title  = {EuroLLM-ISO24495-1.7b-Instruct (v0.2): A Fine-Tuned EuroLLM-1.7B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct},
  note   = {v0.2},
}

Please also cite the base model:

@misc{eurollm1_7b_2024,
  title  = {EuroLLM-1.7B: Open-Weight European LLM},
  author = {UTTER project},
  year   = {2024},
  url    = {https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct},
}

Contact

Commercial use: hf@semplifica.ai
Issues, bugs, qualitative feedback: use the Community tab of this HF repository.
Academic collaboration: contact the authors for joint dataset / benchmark initiatives.
