How to use from SGLang
# Gated model: Login with a HF token with gated access permission
hf auth login
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SemplificaAI/EuroLLM-ISO24495-9b-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SemplificaAI/EuroLLM-ISO24495-9b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SemplificaAI/EuroLLM-ISO24495-9b-Instruct" \
        --host 0.0.0.0 \
        --port 30000

Access to EuroLLM-ISO24495-9b-Instruct (v0.2)

This model is released under CC-BY-NC-4.0 (non-commercial). The form below helps us understand who is using the model and prioritize improvements for v1.0. Approval is automatic once the form is submitted.

By submitting this form you confirm that (1) your intended use complies with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have read the Limitations section of the model card. For commercial use, please contact hf@semplifica.ai.


EuroLLM-ISO24495-9b-Instruct (v0.2)

A fine-tuned EuroLLM-9B-Instruct-2512 specialised in ISO 24495-1 (Plain Language) compliance analysis of legal, administrative and technical texts across six European languages: Italian, English, Portuguese, Spanish, French, German.

Given a document, the model emits a structured XML analysis with: a compliance score (0–100), a binary verdict, a list of violation spans with character-level offsets and corrective suggestions, and a prioritised checklist of corrective actions.

Version: v0.2, built on the v3 dataset of about 28,000 records (hybrid synthetic + human-curated), with per-language verdict balancing and a 21 % anti-forgetting mix (EuroBlocks instruct conversations). Previous: v0.1-base, trained on about 10,000 records (see the git tag). Next: v1.0, which adds manually annotated samples from domain experts (in preparation).

What changed in v0.2

Compared to v0.1-base (the first public release):

  • 2.8× larger dataset (28,410 records vs 10,225): the same 9 document types in 6 EU languages, plus a new catch-all "other" category for greater stylistic diversity.
  • Per-language verdict balance of 40–60 % conforme (v0.1 was skewed to about 30 % conforme): reduces the model's prior bias toward "non_conforme" verdicts on borderline cases.
  • Anti-forgetting mix: 21 % of training is general-purpose instruct conversation (euroblocks_instruct) so the model retains broad instruction-following capability when asked questions outside the ISO 24495-1 task.
  • More balanced language mix: Italian fell from 50 % to 43 %, German nearly tripled (5 % → 14 %), and English nearly doubled (15 % → 26 %).
  • Sentence-aware document chunking: long documents are split at sentence boundaries (max 500 words / chunk) with violation spans re-localized to the new offsets.
  • Conservative training: 2 epochs (instead of 3), learning rate 1.5e-4 (instead of 2e-4), and 100 warmup steps (instead of 50), all chosen to reduce the risk of overfitting on the larger, more diverse corpus.

Headline metric improvements (200-sample blind test)

| Metric | v0.1 | v0.2 | Δ |
|---|---|---|---|
| score_mae (lower is better) | 3.86 | 2.74 | -29 % |
| verdict_f1 | 0.9934 | 0.9577 | -3.6 % * |
| false_positive_rate (lower is better) | 0.0000 | 0.0156 | +1.6 pp |
| span_f1 (IoU ≥ 0.5) | 0.3192 | 0.3653 | +14 % |
| checklist_rouge_l | 0.2375 | 0.2655 | +12 % |

* Not a like-for-like comparison: v0.2 is evaluated on the blind test set (more rigorous), whereas v0.1 was evaluated on the validation set. Verdict F1 remains well above the production threshold (≥ 0.88) in both cases.
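The Δ column can be reproduced from the v0.1 and v0.2 values. The sketch below is only a worked check of that arithmetic; the helper name relative_change is illustrative and is not part of any released tooling.

```python
# Relative change of each headline metric from v0.1 to v0.2,
# using the values reported in the table above.
metrics = {
    "score_mae": (3.86, 2.74),
    "verdict_f1": (0.9934, 0.9577),
    "span_f1": (0.3192, 0.3653),
    "checklist_rouge_l": (0.2375, 0.2655),
}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


deltas = {name: round(relative_change(old, new), 1)
          for name, (old, new) in metrics.items()}

# The false-positive rate starts at 0, so a relative change is undefined;
# only the absolute change in percentage points can be reported.
fpr_delta_pp = round((0.0156 - 0.0000) * 100.0, 2)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} %")
print(f"false_positive_rate: {fpr_delta_pp:+.2f} pp")
```

The false-positive rate is the one row expressed in percentage points, because its v0.1 baseline is zero.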


Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"

# Recommended: 8-bit loading → ~9 GB VRAM (instead of ~18 GB in bf16)
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

System prompts in the other five languages: see § Multilingual prompts.


Output format

The model emits a single XML block with four fields:

<ANALYSIS>
<SCORE>42</SCORE>
<VERDICT>non_conforme</VERDICT>

<SPANS>
[
  {
    "text_fragment": "The Parties hereby acknowledge, in light of the foregoing premises...",
    "violation_type": "legalese_overload",
    "suggestion": "Both parties agree, based on the above context, that...",
    "start_char": 0,
    "end_char": 78,
    "severity": "high"
  }
]
</SPANS>

<CHECKLIST>
1. Replace archaic legal formulas with direct expressions.
2. Break long sentences into shorter periods.
3. Define technical terms on first use.
</CHECKLIST>
</ANALYSIS>

Fields

| Field | Value | Notes |
|---|---|---|
| <SCORE> | integer 0–100 | 100 = fully compliant |
| <VERDICT> | conforme / non_conforme | internal threshold around 60 |
| <SPANS> | inline JSON array | violations with char-level spans |
| <CHECKLIST> | numbered list | corrective actions in priority order |

violation_type vocabulary (10 ISO-aligned categories)

sentence_too_long, passive_voice_overuse, undefined_jargon, buried_action, nominalization, double_negative, ambiguous_reference, missing_structure, inconsistent_terminology, legalese_overload.

severity

low | medium | high
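Because both vocabularies are closed, each span the model returns can be checked before it is used downstream. The following is a minimal sketch of such a check. It assumes that start_char and end_char are 0-based and end-exclusive (Python slice convention), which the example output suggests but the model card does not state. The function name validate_span is hypothetical.

```python
from typing import Dict, List

# Closed vocabularies as listed above.
VIOLATION_TYPES = {
    "sentence_too_long", "passive_voice_overuse", "undefined_jargon",
    "buried_action", "nominalization", "double_negative",
    "ambiguous_reference", "missing_structure",
    "inconsistent_terminology", "legalese_overload",
}
SEVERITIES = {"low", "medium", "high"}


def validate_span(span: Dict, source_text: str) -> List[str]:
    """Return the problems found in one span; an empty list means none were found."""
    problems = []
    if span.get("violation_type") not in VIOLATION_TYPES:
        problems.append("unknown violation_type")
    if span.get("severity") not in SEVERITIES:
        problems.append("unknown severity")
    start, end = span.get("start_char"), span.get("end_char")
    # Assumption: offsets index into the analysed text, end-exclusive.
    in_range = (
        isinstance(start, int) and isinstance(end, int)
        and 0 <= start < end <= len(source_text)
    )
    if not in_range:
        problems.append("offsets out of range")
    return problems
```

The check deliberately does not compare text_fragment with the sliced source text, since the fragment in the example above is abbreviated with an ellipsis.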

Reference parser

A tolerant Python parser (handles truncated output and non-standard JSON escapes) is available in the companion training-scripts repository, in scripts/shared/text_utils.py.
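For readers without access to the companion repository, the sketch below shows one way to read the four fields. It is not the reference parser: the names _field and parse_analysis are hypothetical, and the only tolerance it offers is that a missing closing tag is read up to the end of the output and unparseable span JSON is dropped.

```python
import json
import re
from typing import Optional


def _field(tag: str, output: str) -> Optional[str]:
    """Text inside <tag>...</tag>; if the closing tag is missing, read to the end."""
    match = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_analysis(output: str) -> dict:
    """Extract score, verdict, spans and checklist from the model's XML block."""
    score_raw = _field("SCORE", output)
    spans_raw = _field("SPANS", output)
    checklist_raw = _field("CHECKLIST", output) or ""
    try:
        spans = json.loads(spans_raw) if spans_raw else []
    except json.JSONDecodeError:
        # Truncated or malformed JSON: drop the spans rather than guess.
        spans = []
    checklist = [
        re.sub(r"^\d+\.\s*", "", line.strip())  # strip the "1. " numbering
        for line in checklist_raw.splitlines()
        if line.strip() and not line.strip().startswith("<")
    ]
    return {
        "score": int(score_raw) if score_raw and score_raw.isdigit() else None,
        "verdict": _field("VERDICT", output),
        "spans": spans,
        "checklist": checklist,
    }
```

A production parser would likely try to salvage the complete objects from a truncated span array instead of discarding all of them.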


Examples

Two real runs of the model on documents from different domains and languages, processed end-to-end with greedy decoding (do_sample=False, max_new_tokens=3072).

Example 1 — Italian NDA (legal)

Input (excerpt from a pseudonymised non-disclosure agreement, about 1,500 words):

Su richiesta dell'altra Parte, ovvero alla conclusione o all'interruzione, per qualsiasi motivo, senza alcun pregiudizio per quanto riguarda gli altri impegni di cui al presente Accordo, la Parte ricevente si obbliga a riconsegnare entro 30 giorni all'altra, ovvero, a scelta di quest'ultima, a distruggere e attestare per iscritto la distruzione, ogni copia di tutti i documenti, o altro materiale in qualsiasi forma in possesso della Parte stessa, delle Persone Collegate o di Terzi, che contengano o che si riferiscano alle "Informazioni riservate"...

Selected output fields (full output has 8 spans + 5 checklist items):

SCORE:    15 / 100
VERDICT:  non_conforme
| # | violation_type | severity | Suggestion |
|---|---|---|---|
| 1 | legalese_overload | high | "Se richiesto dall'altra Parte, alla fine o all'interruzione dell'Accordo, la Parte Ricevente deve riconsegnare o distruggere tutte le copie..." |
| 2 | legalese_overload | medium | "Le Parti dichiarano di acconsentire al trattamento dei loro dati personali, in conformità al Regolamento UE 2016/679." |
| 3 | legalese_overload | medium | "Le Parti stipulano quanto segue" (replacing "TUTTO CIÒ PREMESSO / SI STIPULA E CONVIENE QUANTO SEGUE") |

Checklist excerpt:

  1. Semplificare il linguaggio giuridico per renderlo più accessibile.
  2. Eliminare le formule rituali e le espressioni arcaiche.
  3. Riformulare le frasi lunghe e complesse in periodi più brevi e chiari.

Wall-clock: 42 s on a single RTX 4090 (8-bit loading, 1,682 tokens generated).

Example 2 — English technical safety manual

Input (excerpt from an HVAC equipment safety manual, about 2,300 words, OCR-cleaned and brand-anonymised):

PROHIBITION. It is forbidden to use the machine without the safety devices: not working, installed incorrectly. Operating the machine without the safety devices creates potential hazards for the operator. For correct and long-lasting operation of the machine, carry out the scheduled maintenance work as specified by the manufacturer...

Selected output fields (full output has 8 spans + 5 checklist items):

SCORE:    15 / 100
VERDICT:  non_conforme
| # | violation_type | severity | Suggestion |
|---|---|---|---|
| 1 | missing_structure | high | Add a section title (e.g., 'Prohibited Modifications') and use bullet points for the consequences. |
| 2 | missing_structure | high | Add a section title (e.g., 'Safety Device Requirements') and list the consequences of non-compliance. |
| 5 | inconsistent_terminology | medium | Use 'explosion risk areas' consistently instead of 'areas classified as at risk of explosion'. |
| 6 | inconsistent_terminology | medium | Use 'fixed guards' consistently instead of 'fixed guards protecting the moving parts'. |

Checklist excerpt:

  1. Organize the manual into logical sections with clear, bold headings.
  2. Use bulleted lists to present rules, prohibitions, and safety requirements.
  3. Standardize terminology for the machine, fluids, and safety devices throughout the text.
  4. Add a table of contents to help readers navigate the document.

Both documents score 15/100, but for different reasons: the NDA is flagged for legalese overload, the safety manual for missing structure and inconsistent terminology. The model diagnoses a different failure mode for each document type.


Evaluation

Evaluated on 200 blind samples drawn from the v3 held-out test split, stratified by (language × doc_type × difficulty × verdict), never seen during training or validation.

Metrics

| Metric | Prod threshold | Acceptable threshold | v0.2 result | Status |
|---|---|---|---|---|
| score_mae (mean absolute error on 0–100 score) | ≤ 8.0 | ≤ 12.0 | 2.74 | PROD |
| verdict_f1 (binary F1 conforme / non_conforme) | ≥ 0.88 | ≥ 0.80 | 0.9577 | PROD |
| verdict_precision | | | 0.9714 | (high) |
| verdict_recall | | | 0.9444 | (high) |
| false_positive_rate (on conforme class) | ≤ 0.08 | ≤ 0.15 | 0.0156 | PROD |
| span_f1 (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.3653 | ⚠️ below accept |
| checklist_rouge_l | ≥ 0.55 | ≥ 0.45 | 0.2655 | ⚠️ below accept |
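To make the span_f1 row concrete, the sketch below computes a character-level IoU between two spans and an F1 over a document's predicted and gold spans. The model card does not describe the matching procedure used in the evaluation, so the greedy one-to-one matching here is an assumption, as are the function names char_iou and span_f1.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end-exclusive (assumed)


def char_iou(a: Span, b: Span) -> float:
    """Intersection over union of two character ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def span_f1(predicted: List[Span], gold: List[Span], threshold: float = 0.5) -> float:
    """F1 where a predicted span counts as a hit if it matches an unused gold span."""
    unmatched = list(gold)
    true_positives = 0
    for p in predicted:
        # Greedy matching (assumption): take the best still-unmatched gold span.
        best = max(unmatched, key=lambda g: char_iou(p, g), default=None)
        if best is not None and char_iou(p, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One near-exact span, one span whose offsets have drifted, one gold span missed.
example_f1 = span_f1([(0, 78), (100, 150)], [(0, 80), (120, 200), (300, 350)])
```

In this example the second prediction overlaps its gold span with an IoU of only 0.3, so it counts as a miss even though it points at roughly the right region.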

Interpretation

Strengths

  • Excellent score calibration: MAE 2.74 on a 0–100 scale, far below the production threshold (≤ 8). The model's quantitative agreement with the ground truth is very tight.
  • Strong binary classification: verdict F1 0.96 with high precision (0.97) and recall (0.94). Very few false positives on compliant texts (1.6 %).
  • Robust XML schema adherence: canonical tags, canonical violation vocabulary, coherent character-level offsets across all six languages.
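As a quick consistency check, the reported verdict F1 can be recovered from the reported precision and recall, since F1 is their harmonic mean. This is plain arithmetic on the published figures and does not rerun the evaluation.

```python
# Reported verdict metrics for v0.2 (from the table above).
precision = 0.9714
recall = 0.9444

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))
```

Rounded to four decimals the result agrees with the 0.9577 reported in the metrics table.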

Measured weaknesses (improved over v0.1, still below the acceptable threshold)

  • Span F1 0.37: on dense documents the model reports fewer spans than the ground truth, and some of the spans it does report drift far enough in their offsets to fail the IoU ≥ 0.5 threshold. Improvement target for v1.0.
  • Checklist ROUGE-L 0.27: corrective items are semantically plausible but lexically divergent from the ground truth (ROUGE penalises paraphrasing). A semantic metric (BERTScore) would likely reward these outputs more accurately.
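The paraphrasing penalty is easy to see with a small ROUGE-L computation. The sketch below uses the standard definition based on the longest common subsequence (LCS), with lowercased whitespace tokens. The tokenisation and the exact ROUGE implementation used in the evaluation are not stated in the model card, so both are assumptions here, and the candidate sentence is an invented paraphrase.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure on lowercased whitespace tokens (assumed tokenisation)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Break long sentences into shorter periods."
paraphrase = "Split lengthy sentences into several short ones."  # invented example
paraphrase_score = rouge_l_f1(paraphrase, reference)
```

The two sentences give the same advice, yet they share only the subsequence "sentences into", so the score is 4/13, roughly 0.31.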

Test set composition

  • Languages: IT 50 %, EN 15 %, PT 12 %, ES 10 %, FR 8 %, DE 5 % (natural distribution preserved in val/test, balanced in train)
  • Document types (10): 9 administrative/legal categories plus a catch-all other category for stylistic diversity
  • Difficulty buckets: easy / medium / hard / very_hard

The aggregate metrics are averaged across all six languages. A per-language breakdown will be released with v1.0.


Intended use

Recommended use cases

  • Automated triage of contractual, regulatory and administrative documents to flag problematic clauses from a plain-language perspective.
  • Decision-support tool for editors, compliance officers, in-house legal teams.
  • First-draft generation of accessible rewrites for portions of a document.
  • Teaching and research on ISO 24495-1 and plain language across multilingual corpora.

Out-of-scope use cases

  • Fully automated decisions without human review. Output must always be validated by an expert, especially where it has legal consequences.
  • Domains outside training scope: clinical/medical text, purely academic scientific writing, creative literature. The model is optimised on the nine administrative/legal document types of the training set.
  • Languages other than the six supported. Performance outside the EU language set is not guaranteed.
  • A substitute for legal or compliance advice. The model identifies readability issues, not legal correctness or compliance with other regulations.

Limitations

  • Hybrid training set (about 18,600 task records, plus 5,000 general-purpose instruct records): the first ~9,000 task records are fully synthetic (gemini-2.5-flash, with gemini-3.5-flash recovery passes); the remaining task records are built on human-curated source documents, partially re-annotated with assistance from gemini-3.5-flash under human review. Generator-side biases have not been formally measured.
  • Not validated on standard public benchmarks. The reported metrics come from an internal blind test set (200 samples) drawn from the same distribution as the training set. External validation is planned for v1.0.
  • Per-language variability. The training task data is more balanced across languages than v0.1, but Italian is still the largest single language (about 43 % of the task split). Expect slightly better calibration on Italian than on German (14 %).
  • Long outputs may be truncated. On documents with many violations the generation can exceed 2,048 tokens; we recommend setting max_new_tokens to at least 3072 and using a parser that tolerates unclosed XML tags.
  • Sub-optimal span detection (see § Evaluation). On dense documents the model tends to be conservative in the number of spans reported.
  • No support for documents longer than about 30,000 characters (training-time sequence-length limit = 3,072 tokens). For very long documents, pre-chunk at sentence boundaries (≤ 500 words per chunk).
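The pre-chunking recommended in the last item can be sketched as follows. This is not the training pipeline's splitter: it uses a naive sentence boundary (., ! or ? followed by whitespace) and does not handle abbreviations. The names chunk_by_sentence and to_document_offsets are hypothetical. Each chunk is returned with its start offset so that spans reported for a chunk can be mapped back to the full document.

```python
import re
from typing import Dict, List, Tuple

# Naive sentence pattern: runs to ., ! or ? followed by whitespace, or to the end.
_SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s)|\Z)", flags=re.DOTALL)


def chunk_by_sentence(text: str, max_words: int = 500) -> List[Tuple[int, str]]:
    """Group whole sentences into chunks of at most max_words words.

    Returns (start_offset, chunk_text) pairs. A single sentence longer than
    max_words is kept whole and becomes an oversized chunk of its own.
    """
    chunks, start, end, count = [], None, 0, 0
    for match in _SENTENCE.finditer(text):
        words = len(match.group().split())
        if start is not None and count + words > max_words:
            chunks.append((start, text[start:end]))
            start, count = None, 0
        if start is None:
            start = match.start()
        end = match.end()
        count += words
    if start is not None:
        chunks.append((start, text[start:end]))
    return chunks


def to_document_offsets(span: Dict, chunk_start: int) -> Dict:
    """Shift a span's chunk-relative offsets to document-relative offsets."""
    return {
        **span,
        "start_char": span["start_char"] + chunk_start,
        "end_char": span["end_char"] + chunk_start,
    }
```

For example, each chunk text can be sent to the model separately, and every returned span passed through to_document_offsets with that chunk's start offset before the results are merged.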

Training details

Base model: utter-project/EuroLLM-9B-Instruct-2512 (Apache 2.0)
Architecture: Llama-style decoder, 9B parameters, native ChatML chat template
Fine-tuning method: LoRA in bf16 on top of an int8-quantised base (bitsandbytes)
LoRA rank / alpha: 64 / 128
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters: 203,685,888 (2.18 % of total)
Framework: Axolotl 0.16.1; Liger kernel (fused linear + cross-entropy)
Sample packing: yes
Sequence length: 3,072 tokens
Epochs: 2 (vs 3 for v0.1)
Optimizer steps: 896 total (448 per epoch)
Batch: micro-batch 1 × 16 gradient-accumulation steps = 16 effective
Optimizer: Paged AdamW 8-bit (bitsandbytes)
Learning rate: 1.5e-4, cosine schedule, 100-step warmup (vs 2e-4 / 50-step for v0.1)
Loss masking: assistant tokens only (roles_to_train: ["assistant"])
Hardware: 1× NVIDIA RTX 4090 (24 GB) + 128 GB system RAM
Training time: 7h 14m wall clock
Final loss: 0.30 (from 0.65 at step 5, −54 %)
Peak VRAM: ~21 GB / 24 GB
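The learning-rate settings above can be visualised with a small schedule function. The sketch assumes a linear warmup from zero and a cosine decay that reaches zero at the final step, which is a common shape for a cosine schedule with warmup; the exact curve produced by the training framework is not stated in the model card.

```python
import math

PEAK_LR = 1.5e-4      # from the training details above
WARMUP_STEPS = 100    # from the training details above
TOTAL_STEPS = 896     # 448 optimizer steps per epoch, 2 epochs


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 498, 896):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at step 100 and falls back to half the peak at step 498, the midpoint of the decay phase.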

The model published in this repository is the final merge of the LoRA adapter into the base model, saved as a single model.safetensors file in bf16 (about 18 GB). For 8-bit inference, load with BitsAndBytesConfig(load_in_8bit=True) as shown in the Quick start.

The bf16 merge is the "neutral ground": it can be re-quantised post-hoc to any target format (int8, NF4, GGUF Q4_K_M).
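The memory figures quoted in this card follow from the bytes stored per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 9 billion parameters and counts weights alone, ignoring activations, the KV cache and quantisation metadata, so real usage will be higher.

```python
# Assumption: nominal parameter count for a "9B" model.
PARAMS = 9.0e9

# Bytes stored per weight for each format.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_gb(fmt: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(fmt):.1f} GB")
```

The bf16 and int8 estimates line up with the roughly 18 GB and 9 GB mentioned in the Quick start.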


Dataset

The model was trained on semplifica.Language v3, an internal hybrid (synthetic + human-curated) dataset of 28,410 records (23,589 train / 2,194 validation / 2,627 test) covering six European languages, with the following structure:

Composition

  • Train mix:
    • task_iso24495: 18,589 records (79 %) — the primary compliance task.
    • euroblocks_instruct: 5,000 records (21 %) — anti-forgetting, general-purpose instruct conversations to retain broad capability.
  • Origin of the task records (the task_iso24495 portion):
    • The first ~9,000 records: fully synthetic, generated with gemini-2.5-flash (with gemini-3.5-flash recovery passes on blocking defects).
    • The remaining records (about 9,500): built on top of human-curated source documents from selected public/proprietary datasets (text_complexity_de, german4all, plaba, med_easi, porsimples_sent, admin_it, simpitiki), cleaned and normalised, then partially re-annotated with assistance from gemini-3.5-flash under human review. This phase brought real-world stylistic variety, edge-case clauses, and harder negative examples that pure synthetic generation underproduced.
  • Format: ChatML triples (system, user, assistant) with structured XML output (matching the schema documented in § Output format).
  • Languages (task split): IT 43 %, EN 26 %, FR 17 %, PT 16 %, DE 14 %, ES 11 %.
  • Document types (10): service contracts, privacy notices, general terms & conditions, business letters, internal regulations, tender notices, insurance policies, consent forms, administrative communications, plus an other catch-all for stylistic diversity.
  • Difficulty buckets: easy / medium / hard / very_hard, with target word counts and violation density scaled accordingly.
  • Splits: stratified by (lang × doc_type × difficulty × verdict) to keep distribution consistent across train / val / test. Val/test preserve natural distribution; train is balanced for verdict (40–60 % conforme per language).
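The stratified split described in the last item can be sketched in a few lines. This is an illustration of the general technique, not the pipeline's code: the function name stratified_split, the 80/10/10 fractions and the fixed seed are all hypothetical, and the later verdict balancing of the training split is not covered.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    records: List[Dict],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # hypothetical
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split so each (lang, doc_type, difficulty, verdict) stratum is divided alike."""
    strata = defaultdict(list)
    for record in records:
        key = (record["lang"], record["doc_type"],
               record["difficulty"], record["verdict"])
        strata[key].append(record)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(strata):  # sorted for reproducibility
        group = strata[key]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # the remainder goes to test
    return train, val, test
```

Splitting within each stratum is what keeps the language, document-type, difficulty and verdict proportions consistent across the three splits.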

Generation and curation pipeline

  • Synthetic generation (first about 9,000 records): initial bulk generation with gemini-2.5-flash, recovery pass with gemini-3.5-flash for blocking defects.
  • Human-curated phase (later records): source documents from the datasets listed above, cleaned and normalised, then passed through gemini-3.5-flash for assisted re-annotation, with human review on the violation labels and span boundaries.
  • Sentence-aware chunking for long documents (max 500 words per chunk, abbreviation-aware for IT/EN/FR/DE/ES/PT).
  • Algorithmic defect scan and repair across the whole corpus: case-insensitive matching, whitespace normalisation, span re-localization.
  • Verdict balancing via positive sample generation (mix 70 % Gemini 2.5 Flash + 30 % Gemini 3.5 Flash) on the human-curated baselines.

Each record carries provenance metadata: id, lang, doc_type, difficulty, score, verdict, source.

Distribution

The dataset is not currently published. The decision on public release is being evaluated jointly with the v1.0 model release. For collaboration or research access requests please use the contact channel below.


Roadmap

| Version | Status | Training set | Notes |
|---|---|---|---|
| v0.1-base | ✅ released | ~10 K synthetic records | LoRA bf16 + 8-bit base, 3 epochs |
| v0.2 (this) | ✅ released | ~28 K hybrid records (synthetic + human-curated) | + verdict balance, + anti-forgetting mix, + sentence-aware chunking, 2 epochs |
| v1.0 | 🔄 in preparation | ~28 K v0.2 records + manually annotated samples | Domain-expert annotations to capture edge cases (contextual ambiguity, niche jargon, severity nuances) |
| v1.1 / v2 | 🔜 backlog | | DPO post-v1.0 human-feedback alignment on rewrite preferences |

We are building v1.0 by adding manually-annotated samples from domain experts (plain-language editors, legal reviewers, compliance officers) to the synthetic pipeline. The synthetic data has reached diminishing returns on the structural quality dimension; manual annotation is what's needed to close the gap on span_f1 and checklist_rouge_l.

A 1.7 B edge-distilled sub-release (v1.0-mini) for CPU / laptop deployment is also planned.


Multilingual prompts

The model accepts system prompts in all six target languages. Examples optimised to match the training distribution:

SYSTEM_PROMPTS = {
    "it": "Sei un esperto di plain language secondo ISO 24495-1:2023. ...",
    "en": "You are an expert in plain language according to ISO 24495-1:2023. ...",
    "fr": "Vous êtes expert en langage clair selon ISO 24495-1:2023. ...",
    "de": "Sie sind Experte für Verständlichkeit gemäß ISO 24495-1:2023. ...",
    "es": "Eres experto en lenguaje claro según ISO 24495-1:2023. ...",
    "pt": "Você é especialista em linguagem simples segundo a ISO 24495-1:2023. ...",
}

The full set of prompts is available in iso_principles.py in the companion training-scripts repository.


License

The fine-tuned model (this repository) is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Non-commercial use is freely permitted (research, academia, internal evaluation). For commercial use, please contact the authors (see § Contact).

The base model (utter-project/EuroLLM-9B-Instruct-2512) is released under the Apache License 2.0 (© 2024 UTTER project). The distribution of this derivative work incorporates and attributes the base model as required by Apache 2.0. See ATTRIBUTION.md for full details.


Citation

If you use this model in academic publications or research materials, please cite as:

@misc{semplifica_iso24495_9b_v02_2026,
  title  = {EuroLLM-ISO24495-9b-Instruct (v0.2): A Fine-Tuned EuroLLM-9B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct},
  note   = {v0.2},
}

Please also cite the base model:

@misc{eurollm9b_2024,
  title  = {EuroLLM-9B: Open-Weight European LLM},
  author = {UTTER project},
  year   = {2024},
  url    = {https://huggingface.co/utter-project/EuroLLM-9B-Instruct-2512},
}

Contact

  • Commercial use or access to v1.0: hf@semplifica.ai
  • Issues, bugs, qualitative feedback: use the Community tab of this HF repository.
  • Academic collaboration: contact the authors for joint dataset / benchmark initiatives.