---
language:
- it
- en
- pt
- es
- fr
- de
license: cc-by-nc-4.0
license_name: cc-by-nc-4.0
license_link: https://creativecommons.org/licenses/by-nc/4.0/
base_model: utter-project/EuroLLM-1.7B-Instruct
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
tags:
- plain-language
- iso-24495-1
- compliance
- legal-nlp
- multilingual
- eurollm
- lora
- structured-output
- edge
gated: auto
extra_gated_heading: "Access to EuroLLM-ISO24495-1.7b-Instruct (v0.2)"
extra_gated_description: >
  This model is released under CC-BY-NC-4.0 (non-commercial). The form below
  helps us understand who is using the model and prioritize improvements.
  Approval is automatic once the form is submitted.
extra_gated_prompt: >
  By submitting this form you confirm that (1) your intended use complies
  with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have
  read the Limitations section of the model card. For commercial use,
  please contact hf@semplifica.ai.
extra_gated_fields:
  Full name: text
  Organization or affiliation: text
  Country: country
  Intended use:
    type: text
    description: "Briefly describe how you intend to use the model (1-2 sentences)."
  Affiliation type:
    type: select
    options:
      - Academic / Research
      - Public administration
      - Non-profit
      - Industry (non-commercial evaluation only)
      - Individual / Personal
  I agree to non-commercial use only (CC-BY-NC-4.0):
    type: checkbox
extra_gated_button_content: "Request access"
model-index:
- name: EuroLLM-ISO24495-1.7b-Instruct-v0.2
  results:
  - task:
      type: text-generation
      name: ISO 24495-1 Plain Language Compliance Analysis
    dataset:
      name: semplifica.Language synthetic v3 test set (blind)
      type: custom
      config: 200_samples_blind
    metrics:
    - type: mae
      value: 3.66
      name: Score MAE (0–100)
      verified: false
    - type: f1
      value: 0.9396
      name: Verdict F1 (binary)
      verified: false
    - type: precision
      value: 0.9091
      name: Verdict Precision
      verified: false
    - type: recall
      value: 0.9722
      name: Verdict Recall
      verified: false
    - type: false_positive_rate
      value: 0.0547
      name: False Positive Rate
      verified: false
    - type: f1
      value: 0.2721
      name: Span F1 (IoU 0.5)
      verified: false
    - type: rouge
      value: 0.2235
      name: Checklist ROUGE-L
      verified: false
---
# EuroLLM-ISO24495-1.7b-Instruct (v0.2)
A fine-tuned [EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct)
specialised in **ISO 24495-1 (Plain Language)** compliance analysis of legal,
administrative and technical texts across **six European languages**:
Italian, English, Portuguese, Spanish, French, German.
Given a document, the model emits a structured XML analysis with: a
compliance score (0–100), a binary verdict, a list of violation spans with
character-level offsets and corrective suggestions, and a prioritised
checklist of corrective actions.
> **Version**: `v0.2`, trained on about 23,600 records (v3 dataset,
> hybrid synthetic + human-curated): about 18,600 task records with verdict
> balance per language, plus a 21 % anti-forgetting mix (EuroBlocks
> instruct conversations).
>
> **Target deployment**: edge / consumer GPU. The model runs in about
> **3.4 GB VRAM in bf16** and about **1.7 GB in 8-bit** on a single
> consumer card.
## Edge profile
| Aspect | Value |
|---|---|
| Parameters | 1.7 B |
| Architecture | Llama-style decoder, native ChatML chat template |
| VRAM (bf16 inference) | about 3.4 GB |
| VRAM (8-bit inference) | about 1.7 GB |
| Context window | 4,096 tokens |
| Languages | IT, EN, FR, DE, ES, PT |
| Target hardware | single consumer GPU (RTX 3060 12 GB and up), or CPU/laptop in 8-bit |
---
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
REPO = "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct"
# 8-bit loading → ~1.7 GB VRAM. For bf16 (~3.4 GB) drop the quantization_config.
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()
SYSTEM = (
"You are an expert in plain language according to ISO 24495-1:2023. "
"Analyze the provided text and produce: (1) a compliance score 0-100, "
"(2) parts to improve with specific suggestions, "
"(3) an ordered checklist of corrective actions. "
"Reply directly without thinking aloud."
)
text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""
messages = [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
System prompts in the other five languages: see [§ Multilingual prompts](#multilingual-prompts).
---
## Output format
The model emits a single XML block with four fields:
```xml
<ANALYSIS>
<SCORE>42</SCORE>
<VERDICT>non_conforme</VERDICT>
<SPANS>
[
{
"text_fragment": "The Parties hereby acknowledge, in light of the foregoing premises...",
"violation_type": "legalese_overload",
"suggestion": "Both parties agree, based on the above context, that...",
"start_char": 0,
"end_char": 78,
"severity": "high"
}
]
</SPANS>
<CHECKLIST>
1. Replace archaic legal formulas with direct expressions.
2. Break long sentences into shorter periods.
3. Define technical terms on first use.
</CHECKLIST>
</ANALYSIS>
```
### Fields
| Field | Value | Notes |
|---|---|---|
| `<SCORE>` | integer `0`–`100` | 100 = fully compliant |
| `<VERDICT>` | `conforme` \| `non_conforme` | internal threshold around 60 |
| `<SPANS>` | inline JSON array | violations with char-level spans |
| `<CHECKLIST>` | numbered list | corrective actions in priority order |
### `violation_type` vocabulary (10 ISO-aligned categories)
`sentence_too_long`, `passive_voice_overuse`, `undefined_jargon`,
`buried_action`, `nominalization`, `double_negative`, `ambiguous_reference`,
`missing_structure`, `inconsistent_terminology`, `legalese_overload`.
### `severity`
`low` | `medium` | `high`
### Reference parser
A tolerant Python parser (handles truncated output and non-standard JSON
escapes) is shipped together with the model code (see `text_utils.py` in
the training-scripts repository).
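The shipped parser is not reproduced here. As a rough illustration of what "tolerant" can mean for this schema, the sketch below extracts each field with a regular expression that also accepts a missing closing tag, and falls back to an empty span list when the JSON is truncated. The names `parse_analysis` and `_field` and the fallback behaviour are assumptions made for this example, not the contents of `text_utils.py`.

```python
import json
import re


def _field(tag: str, raw: str):
    """Return the body of <TAG>...</TAG>; tolerate a missing closing tag."""
    m = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", raw, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def parse_analysis(raw: str) -> dict:
    """Hypothetical tolerant parser for the <ANALYSIS> block (illustrative only)."""
    score_txt = _field("SCORE", raw)
    score = int(score_txt) if score_txt and score_txt.isdigit() else None

    spans_txt = _field("SPANS", raw) or "[]"
    try:
        spans = json.loads(spans_txt)
    except json.JSONDecodeError:
        spans = []  # truncated or malformed JSON: drop spans, keep the other fields

    checklist_txt = _field("CHECKLIST", raw) or ""
    checklist = [
        re.sub(r"^\d+\.\s*", "", line.strip())  # strip the "1. " numbering
        for line in checklist_txt.splitlines()
        if line.strip()
    ]
    return {
        "score": score,
        "verdict": _field("VERDICT", raw),
        "spans": spans,
        "checklist": checklist,
    }
```

Dropping all spans on a JSON error is the simplest possible policy; a production parser would more likely try to recover the complete objects that precede the truncation point.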
---
## Examples
Qualitative examples (a real Italian NDA and an English safety manual) will
be added in a follow-up commit, with the same per-document structure used
for the 9B model card.
---
## Evaluation
Evaluated on **200 blind samples** drawn from the v3 held-out test split,
stratified by `(language × doc_type × difficulty × verdict)`, never seen
during training or validation.
### Metrics
| Metric | Prod threshold | Acceptable threshold | **v0.2 result** | Status |
|---|---|---|---|---|
| `score_mae` (mean absolute error on 0–100 score) | ≤ 8.0 | ≤ 12.0 | **3.66** | ✅ **PROD** |
| `verdict_f1` (binary F1 conforme / non_conforme) | ≥ 0.88 | ≥ 0.80 | **0.9396** | ✅ **PROD** |
| `verdict_precision` | — | — | **0.9091** | (high) |
| `verdict_recall` | — | — | **0.9722** | (very high) |
| `false_positive_rate` (on `conforme` class) | ≤ 0.08 | ≤ 0.15 | **0.0547** | ✅ **PROD** |
| `span_f1` (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.2721 | ⚠️ below accept |
| `checklist_rouge_l` | ≥ 0.55 | ≥ 0.45 | 0.2235 | ⚠️ below accept |
### Interpretation
**Strengths**
- **High verdict recall (0.97)**: the model is conservative on the
non-compliant class and rarely misses a problematic document — useful
for triage workflows where false negatives are more costly than false
positives.
- **Production-grade score calibration**: MAE 3.66 on a 0–100 scale, well
below the production threshold of 8. Quantitative agreement with the
ground truth is tight despite the small footprint.
- **Binary verdict F1 above production threshold**: 0.94 vs the 0.88
threshold; precision 0.91, recall 0.97 — favouring recall is intentional
in a triage context.
- **False positive rate (5.5 %) under the production cap of 8 %**: the
model does not over-flag compliant texts.
- **Robust XML schema** adherence: canonical tags, canonical violation
vocabulary, coherent character-level offsets across all six languages.
**Measured weaknesses**
- **Span F1 0.27** (below the 0.62 acceptable threshold): on documents
with many violations the model reports fewer spans than the ground
truth, or with offset drift that fails the IoU ≥ 0.5 match. The reduced
parameter count limits memorisation of precise span boundaries — this
is the trade-off accepted in exchange for edge deployability.
- **Checklist ROUGE-L 0.22** (below the 0.45 acceptable threshold):
corrective items are semantically plausible but lexically divergent
from the ground truth (ROUGE penalises paraphrasing). A semantic
metric such as BERTScore would likely reward these outputs more
fairly.
- **Verdict precision (0.91) lower than recall (0.97)**: about 1 in 11
flagged documents is a false positive. Acceptable for screening, but
if you need high-precision flagging consider a higher-capacity model.
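
The verdict metrics in the table are mutually consistent with a single confusion matrix over the 200 test samples, taking `non_conforme` as the positive class. The counts below are inferred from the reported values and are an assumption; this card does not publish them.

```python
# Counts inferred from the reported metrics (assumption, not published).
tp, fp, fn, tn = 70, 7, 2, 121

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)  # share of compliant documents that get flagged

print(f"n={tp + fp + fn + tn} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} fpr={fpr:.4f}")
# n=200 precision=0.9091 recall=0.9722 f1=0.9396 fpr=0.0547
```

With these counts, 7 of 77 flagged documents are false positives, which is the "about 1 in 11" figure quoted above.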
### Test set composition
- **Languages**: IT 50 %, EN 15 %, PT 12 %, ES 10 %, FR 8 %, DE 5 %
(natural distribution preserved in val/test, balanced in train)
- **Document types (10)**: 9 administrative/legal categories plus a
catch-all `other` category for stylistic diversity
- **Difficulty buckets**: easy / medium / hard / very_hard
The aggregate metrics are **averaged across all six languages**.
---
## Intended use
**Recommended use cases**
- **Edge / on-device** automated triage of contractual, regulatory and
administrative documents to flag problematic clauses from a
plain-language perspective.
- Decision-support tool for editors, compliance officers, in-house legal
teams on hardware with limited VRAM.
- First-draft generation of accessible rewrites for portions of a document.
- Teaching and research on ISO 24495-1 and plain language across
multilingual corpora.
**Out-of-scope use cases**
- **Fully automated decisions without human review.** Output must always be
validated by an expert, especially for legally consequential implications.
- **Domains outside training scope**: clinical/medical text, purely academic
scientific writing, creative literature. The model is optimised on
administrative/legal document types of the training set.
- **Languages other than the six supported.** Performance outside the EU
language set is not guaranteed.
- **Legal or compliance advice substitute.** The model identifies
*readability* issues, not legal correctness or compliance with other
regulations.
---
## Limitations
- **Hybrid training set** (about 18,600 task records): roughly the first
9,000 records are fully synthetic (`gemini-2.5-flash` generation with
`gemini-3.5-flash` recovery); the remaining records are built on top of
human-curated source documents with partial assisted re-annotation by
`gemini-3.5-flash` under human review. Generator-side biases have not
been formally measured.
- **Not validated on standard public benchmarks.** The reported metrics
come from an internal blind test set (200 samples) drawn from the same
distribution as the training set.
- **Smaller model capacity than 9B-class variants.** Expect lower
`span_f1` and `checklist_rouge_l` than a 9B model fine-tuned on the same
dataset; the trade-off here is **edge deployability** (1.7 GB in 8-bit
vs about 9 GB).
- **Per-language variability.** Italian is the largest single language in
training (about 43 % of task split). Expect slightly better calibration
on Italian than on German (14 %).
- **Short context window — hard 4,096-token limit** (vs 32K of larger
EuroLLM variants). See the dedicated section
[§ Working with the 4K context window](#working-with-the-4k-context-window)
for the recommended input-size policy.
- **Long outputs may be truncated.** On documents with many violations the
generation can exceed 2,048 tokens; we recommend `max_new_tokens=3072+`
combined with a parser tolerant of unclosed XML tags.
---
## Working with the 4K context window
The base model has a hard `max_position_embeddings = 4,096` (about
**3,000 input words** as an absolute ceiling). During training we used a
sequence length of 4,096 tokens, **including** the assistant XML output
which itself can consume 1,000 to 2,000 tokens for documents with many
violations.
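
Because input and output share the same 4,096-token window, it can help to derive `max_new_tokens` from the actual prompt length instead of fixing it. The following is a minimal sketch: the helper name `generation_budget` and the 1,000-token minimum reserve for the XML output are illustrative assumptions, not part of the model code.

```python
CONTEXT_WINDOW = 4096  # max_position_embeddings of the base model


def generation_budget(prompt_tokens: int, cap: int = 3072, min_output: int = 1000) -> int:
    """Return a max_new_tokens value that keeps prompt + output inside the window.

    `cap` mirrors the Quick start value; `min_output` is an illustrative
    assumption based on the 1,000 to 2,000 token outputs mentioned above.
    """
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining < min_output:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens; only {remaining} left for output, "
            "chunk the input first"
        )
    return min(cap, remaining)
```

With the Quick start variables this would be used as `max_new_tokens=generation_budget(inputs["input_ids"].shape[1])`.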
**General rule of thumb** for using this model in production:
> Feed the model **complete sentences and complete sections** of the
> document. Do **not** split mid-sentence. We recommend keeping each
> single request **under 500 words of input text**.
As a reference:
| Input size | Approximate equivalent (A4) |
|---|---|
| **500 words** | about 1 page A4 of dense contract text, or about 1.5 pages of standard administrative prose |
| **1,000 words** | about 2 pages A4 of dense contract text |
| **3,000 words** (≈ hard ceiling) | about 4–5 pages A4; leaves little room for the output |
If your document exceeds the recommended limit of about 500 words per request:
1. **Pre-chunk at sentence boundaries** (never split a sentence). Aim
for chunks of 300–500 words each, abbreviation-aware (`art.`, `n.`,
`Sig.`, etc.) for the six supported languages.
2. **Preserve natural document structure** as chunk boundaries when
possible: article, section, clause. This keeps each request
semantically coherent and produces better-scoped span offsets.
3. **Run the model once per chunk** and concatenate the resulting
`<SPANS>` arrays. The `start_char` / `end_char` offsets are
chunk-local — remap them to the original document by adding the chunk
offset.
4. **Do not deduplicate spans across chunks**: if the same violation
appears in two adjacent chunks, both are valid local findings.
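
The four steps above can be sketched as follows. This is an illustration, not the chunker used to build the dataset: the function names, the tiny abbreviation list and the regex-based sentence splitter are all assumptions, and `analyze` stands for whatever function wraps the model call and returns the parsed span dicts for one chunk.

```python
import re
from typing import Callable, Iterator

# Illustrative subset; a real list would cover all six supported languages.
ABBREVIATIONS = {"art.", "n.", "sig.", "dott.", "nr.", "no."}


def sentence_ends(text: str) -> Iterator[int]:
    """Yield the offset just past each sentence, skipping known abbreviations."""
    for m in re.finditer(r"(\S*[.!?])\s+", text):
        if m.group(1).lower() not in ABBREVIATIONS:
            yield m.end()
    yield len(text)


def chunk_text(text: str, max_words: int = 500) -> list[tuple[int, str]]:
    """Split at sentence boundaries into (chunk_offset, chunk) pairs.

    Chunks hold at most max_words words, except that a single sentence
    longer than max_words is kept whole rather than split.
    """
    chunks, start, prev_end = [], 0, 0
    for end in sentence_ends(text):
        if len(text[start:end].split()) > max_words and prev_end > start:
            chunks.append((start, text[start:prev_end]))
            start = prev_end
        prev_end = end
    if text[start:].strip():
        chunks.append((start, text[start:]))
    return chunks


def analyze_document(text: str, analyze: Callable[[str], list[dict]],
                     max_words: int = 500) -> list[dict]:
    """Run `analyze` once per chunk and remap chunk-local offsets to the document."""
    spans = []
    for offset, chunk in chunk_text(text, max_words):
        for span in analyze(chunk):
            spans.append({**span,
                          "start_char": span["start_char"] + offset,
                          "end_char": span["end_char"] + offset})
    return spans  # no cross-chunk deduplication, as recommended above
```

Splitting on articles, sections or clauses first (step 2) and applying `chunk_text` only to units that are still too long would keep chunk boundaries closer to the document's own structure.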
Going above the 500-word recommendation generally still works (up to
about 1,000 to 1,500 words), but you trade off:
- Span offset precision drops (the model has fewer training samples in
that input-size bucket).
- Recall on violations late in the document drops (attention spreads
thin).
- Risk of output truncation grows (long input + long output approaches
the 4 K ceiling).
For corpora where most documents are longer than about 2,000 words,
consider the 9B variant (`SemplificaAI/EuroLLM-ISO24495-9b-Instruct`,
32 K context) instead.
---
## Training details
| | |
|---|---|
| **Base model** | [utter-project/EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct) (Apache 2.0) |
| **Architecture** | Llama-style decoder, 1.7B parameters, native ChatML chat template |
| **Fine-tuning method** | LoRA in bf16 on a bf16 (non-quantised) base |
| **LoRA rank / alpha** | 32 / 64 |
| **LoRA target modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Trainable parameters** | 28,704,768 (1.73 % of total) |
| **Framework** | [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) 0.16.1; Liger kernel (fused linear + cross-entropy) |
| **Sample packing** | yes |
| **Sequence length** | 4,096 tokens |
| **Epochs** | 2 |
| **Optimizer steps** | 700 (350 per epoch) |
| **Batch** | 1 micro × 16 gradient accumulation = 16 effective |
| **Optimizer** | Paged AdamW 8-bit (bitsandbytes) |
| **Learning rate** | 2.0e-4, cosine schedule, 100-step warmup |
| **Loss masking** | assistant tokens only (`roles_to_train: ["assistant"]`) |
| **Hardware** | 1× NVIDIA RTX 4090 (24 GB) |
| **Training time** | about 1h 40m wall clock |
| **Final loss** | about 0.42 (from about 0.91 at step 5, −54 %) |
| **Peak VRAM** | about 8.3 GB / 24 GB |
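
The trainable-parameter figure can be cross-checked from the LoRA settings in the table. A LoRA adapter of rank r on a linear layer with input width d_in and output width d_out adds r × (d_in + d_out) weights. The layer dimensions used below are not stated in this card; they are assumed from the base model's published configuration (hidden size 2,048, 24 decoder layers, MLP width 5,632, key/value projection width 1,024) and should be checked against its `config.json`.

```python
RANK = 32
HIDDEN, KV_DIM, MLP, LAYERS = 2048, 1024, 5632, 24  # assumed base-model dimensions

# (d_in, d_out) for each LoRA target module in one decoder layer
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in modules.values())
trainable = per_layer * LAYERS
print(f"{trainable:,}")  # 28,704,768
```

Under these assumed dimensions the result matches the 28,704,768 trainable parameters reported in the table.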
The model published in this repository is the **final merge** of the LoRA
adapter into the base model, saved as a single `model.safetensors` file in
**bf16** (about 3.4 GB). For 8-bit inference, load with
`BitsAndBytesConfig(load_in_8bit=True)` as shown in the Quick start.
---
## Dataset
The model was trained on **`semplifica.Language v3`**, an
internal **hybrid (synthetic + human-curated)** dataset of **28,410 records**
(23,589 train / 2,194 validation / 2,627 test) covering six European languages.
### Composition
- **Train mix**:
  - `task_iso24495`: 18,589 records (79 %), the primary compliance task.
  - `euroblocks_instruct`: 5,000 records (21 %), anti-forgetting
    general-purpose instruct conversations to retain broad capability.
- **Origin of the task records** (the `task_iso24495` portion):
  - Roughly the first **9,000 records**: **fully synthetic**, generated with
    `gemini-2.5-flash` (with `gemini-3.5-flash` recovery passes on
    blocking defects).
  - The remaining **9,500+ records**: built on top of **human-curated
    source documents** from selected public/proprietary datasets, cleaned
    and normalised, then **partially re-annotated with assistance from
    `gemini-3.5-flash`** under human review. This phase brought
    real-world stylistic variety, edge-case clauses, and harder negative
    examples that pure synthetic generation underproduced.
- **Format**: ChatML triples `(system, user, assistant)` with structured
XML output (matching the schema documented in § Output format).
- **Languages** (task split): IT 43 %, EN 26 %, FR 17 %, PT 16 %, DE 14 %,
ES 11 %.
- **Document types (10)**: service contracts, privacy notices, general
terms & conditions, business letters, internal regulations, tender
notices, insurance policies, consent forms, administrative
communications, plus an `other` catch-all for stylistic diversity.
- **Difficulty buckets**: easy / medium / hard / very_hard, with target
word counts and violation density scaled accordingly.
- **Splits**: stratified by `(lang × doc_type × difficulty × verdict)` to
keep distribution consistent across train / val / test. Val/test
preserve natural distribution; train is balanced for verdict
(40–60 % conforme per language).
### Generation and curation pipeline
- **Synthetic generation** (first about 9,000 records):
initial bulk generation with `gemini-2.5-flash`, recovery pass with
`gemini-3.5-flash` for blocking defects.
- **Human-curated phase** (later records):
source documents from selected datasets, cleaned and normalised, then
passed through `gemini-3.5-flash` for assisted re-annotation, with
human review on the violation labels and span boundaries.
- **Sentence-aware chunking** for long documents (max 500 words per
chunk, abbreviation-aware for IT/EN/FR/DE/ES/PT).
- **Algorithmic defect scan and repair** across the whole corpus:
case-insensitive matching, whitespace normalisation, span
re-localization.
Each record carries provenance metadata: `id`, `lang`, `doc_type`,
`difficulty`, `score`, `verdict`, `source`.
### Distribution
The dataset is **not currently published**. The decision on public release
is being evaluated separately from this model release.
---
## Multilingual prompts
The model accepts system prompts in all six target languages. Examples
optimised to match the training distribution:
```python
SYSTEM_PROMPTS = {
    "it": "Sei un esperto di plain language secondo ISO 24495-1:2023. ...",
    "en": "You are an expert in plain language according to ISO 24495-1:2023. ...",
    "fr": "Vous êtes expert en langage clair selon ISO 24495-1:2023. ...",
    "de": "Sie sind Experte für Verständlichkeit gemäß ISO 24495-1:2023. ...",
    "es": "Eres experto en lenguaje claro según ISO 24495-1:2023. ...",
    "pt": "Você é especialista em linguagem simples segundo a ISO 24495-1:2023. ...",
}
```
The full set of prompts is available in `iso_principles.py` in the
companion training-scripts repository.
---
## License
The **fine-tuned model** (this repository) is released under the
**Creative Commons Attribution-NonCommercial 4.0 International ([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/))**
license.
> Non-commercial use is freely permitted (research, academia, internal
> evaluation). For commercial use, please contact the authors (see § Contact).
The **base model**
([utter-project/EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct))
is released under the **Apache License 2.0** (© 2024 UTTER project). The
distribution of this derivative work **incorporates and attributes** the
base model as required by Apache 2.0. See [`ATTRIBUTION.md`](ATTRIBUTION.md)
for full details.
---
## Citation
If you use this model in academic publications or research materials,
please cite as:
```bibtex
@misc{semplifica_iso24495_1_7b_v02_2026,
title = {EuroLLM-ISO24495-1.7b-Instruct (v0.2): A Fine-Tuned EuroLLM-1.7B
for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
author = {SemplificaAI},
year = {2026},
url = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct},
note = {v0.2},
}
```
Please also cite the **base model**:
```bibtex
@misc{eurollm1_7b_2024,
title = {EuroLLM-1.7B: Open-Weight European LLM},
author = {UTTER project},
year = {2024},
url = {https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct},
}
```
---
## Contact
- **Commercial use**: [hf@semplifica.ai](mailto:hf@semplifica.ai)
- **Issues, bugs, qualitative feedback**: use the *Community* tab of this HF repository.
- **Academic collaboration**: contact the authors for joint dataset /
benchmark initiatives.