---
language:
- it
- en
- pt
- es
- fr
- de
license: cc-by-nc-4.0
license_name: cc-by-nc-4.0
license_link: https://creativecommons.org/licenses/by-nc/4.0/
base_model: utter-project/EuroLLM-9B-Instruct-2512
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
tags:
- plain-language
- iso-24495-1
- compliance
- legal-nlp
- multilingual
- eurollm
- lora
- structured-output
gated: auto
extra_gated_heading: "Access to EuroLLM-ISO24495-9b-Instruct (v0.2)"
extra_gated_description: >
  This model is released under CC-BY-NC-4.0 (non-commercial). The form below
  helps us understand who is using the model and prioritize improvements for
  v1.0. Approval is automatic once the form is submitted.
extra_gated_prompt: >
  By submitting this form you confirm that (1) your intended use complies
  with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have
  read the Limitations section of the model card. For commercial use,
  please contact hf@semplifica.ai.
extra_gated_fields:
  Full name: text
  Organization or affiliation: text
  Country: country
  Intended use:
    type: text
    description: "Briefly describe how you intend to use the model (1-2 sentences)."
  Affiliation type:
    type: select
    options:
    - Academic / Research
    - Public administration
    - Non-profit
    - Industry (non-commercial evaluation only)
    - Individual / Personal
  I agree to non-commercial use only (CC-BY-NC-4.0):
    type: checkbox
extra_gated_button_content: "Request access"
model-index:
- name: EuroLLM-ISO24495-9b-Instruct-v0.2
  results:
  - task:
      type: text-generation
      name: ISO 24495-1 Plain Language Compliance Analysis
    dataset:
      name: semplifica.Language synthetic v3 test set (blind)
      type: custom
      config: 200_samples_blind
    metrics:
    - type: mae
      value: 2.74
      name: Score MAE (0–100)
      verified: false
    - type: f1
      value: 0.9577
      name: Verdict F1 (binary)
      verified: false
    - type: precision
      value: 0.9714
      name: Verdict Precision
      verified: false
    - type: recall
      value: 0.9444
      name: Verdict Recall
      verified: false
    - type: false_positive_rate
      value: 0.0156
      name: False Positive Rate
      verified: false
    - type: f1
      value: 0.3653
      name: Span F1 (IoU ≥ 0.5)
      verified: false
    - type: rouge
      value: 0.2655
      name: Checklist ROUGE-L
      verified: false
---

# EuroLLM-ISO24495-9b-Instruct (v0.2)

A fine-tuned [EuroLLM-9B-Instruct-2512](https://huggingface.co/utter-project/EuroLLM-9B-Instruct-2512)
specialised in **ISO 24495-1 (Plain Language)** compliance analysis of legal,
administrative and technical texts across **six European languages**:
Italian, English, Portuguese, Spanish, French, German.

Given a document, the model emits a structured XML analysis with: a
compliance score (0–100), a binary verdict, a list of violation spans with
character-level offsets and corrective suggestions, and a prioritised
checklist of corrective actions.

> **Version**: `v0.2`, trained on about 28,000 records (v3 dataset, hybrid
> synthetic + human-curated), with verdict balance per language and a 21 %
> anti-forgetting mix (EuroBlocks instruct conversations).
>
> **Previous**: [`v0.1-base`](https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct/tree/v0.1), trained on 10 K records, see git tag.
>
> **Next**: `v1.0`, which adds manually-annotated samples from domain experts
> (in preparation).

## What changed in v0.2

Compared to **v0.1-base** (the first public release):

- **2.8× larger training set** (28,410 records vs 10,225): same 9
  document types in 6 EU languages, plus a new `other` catch-all category
  for greater stylistic diversity.
- **Per-language verdict balance** of 40–60 % conforme (v0.1 was skewed
  to about 30 % conforme): reduces the model's prior bias toward
  "non_conforme" verdicts on borderline cases.
- **Anti-forgetting mix**: 21 % of training is general-purpose instruct
  conversation (`euroblocks_instruct`), so the model retains broad
  instruction-following capability when asked questions outside the ISO
  24495-1 task.
- **More balanced language coverage**: Italian's share fell from 50 % to
  43 %, German's nearly tripled (5 % → 14 %) and English's rose from
  15 % to 26 %.
- **Sentence-aware document chunking**: long documents are split at
  sentence boundaries (max 500 words per chunk), with violation spans
  re-localized to the new offsets.
- **Conservative training**: 2 epochs (instead of 3), learning rate
  1.5e-4 (instead of 2e-4), warmup 100 steps (instead of 50), all
  intended to reduce overfitting risk on the larger, more diverse corpus.

### Headline metric improvements (200-sample blind test)

| Metric | v0.1 | **v0.2** | Δ |
|---|---|---|---|
| `score_mae` (lower is better) | 3.86 | **2.74** | **-29 %** |
| `verdict_f1` | 0.9934 | 0.9577 | -3.6 % * |
| `false_positive_rate` (lower is better) | 0.0000 | 0.0156 | +1.6 pp |
| `span_f1` (IoU ≥ 0.5) | 0.3192 | **0.3653** | **+14 %** |
| `checklist_rouge_l` | 0.2375 | **0.2655** | **+12 %** |

\* v0.2 is evaluated on the **blind test set** (more rigorous), whereas v0.1
was evaluated on the validation set, so the two columns are not strictly
comparable. The verdict F1 remains well above the production threshold
(≥ 0.88) in both cases.
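
The Δ column can be re-derived from the two value columns. The sketch below is only a check of that arithmetic: the numbers are copied from the table, while the helper name `relative_change_pct` is made up for this illustration.

```python
# Reported metric values copied from the table above: (v0.1, v0.2).
reported = {
    "score_mae": (3.86, 2.74),
    "verdict_f1": (0.9934, 0.9577),
    "span_f1": (0.3192, 0.3653),
    "checklist_rouge_l": (0.2375, 0.2655),
}


def relative_change_pct(old: float, new: float) -> float:
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


deltas = {name: relative_change_pct(old, new) for name, (old, new) in reported.items()}
for name, pct in deltas.items():
    print(f"{name}: {pct:+.1f} %")

# false_positive_rate starts at 0.0000, so a relative change is undefined;
# the table reports the absolute difference in percentage points instead.
fpr_delta_pp = (0.0156 - 0.0000) * 100.0
print(f"false_positive_rate: {fpr_delta_pp:+.2f} pp")
```

Rounded to whole percentages, the results match the table.
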

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

REPO = "SemplificaAI/EuroLLM-ISO24495-9b-Instruct"

# Recommended: 8-bit loading → ~9 GB VRAM (instead of ~18 GB in bf16)
bnb = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16,
)
model.eval()

SYSTEM = (
    "You are an expert in plain language according to ISO 24495-1:2023. "
    "Analyze the provided text and produce: (1) a compliance score 0-100, "
    "(2) parts to improve with specific suggestions, "
    "(3) an ordered checklist of corrective actions. "
    "Reply directly without thinking aloud."
)

text = """The Parties hereby acknowledge, in light of the foregoing premises
which form an integral and substantive part of this Agreement, that the
Confidential Information shall not include..."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"},
]

# Render the chat template to a string, then tokenize it.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; 3072 new tokens leaves room for long analyses.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=3072, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

System prompts in the other five languages: see [§ Multilingual prompts](#multilingual-prompts).

---

## Output format

The model emits a single XML block with four fields:

```xml
<ANALYSIS>
<SCORE>42</SCORE>
<VERDICT>non_conforme</VERDICT>
<SPANS>
[
  {
    "text_fragment": "The Parties hereby acknowledge, in light of the foregoing premises...",
    "violation_type": "legalese_overload",
    "suggestion": "Both parties agree, based on the above context, that...",
    "start_char": 0,
    "end_char": 78,
    "severity": "high"
  }
]
</SPANS>
<CHECKLIST>
1. Replace archaic legal formulas with direct expressions.
2. Break long sentences into shorter periods.
3. Define technical terms on first use.
</CHECKLIST>
</ANALYSIS>
```

### Fields

| Field | Value | Notes |
|---|---|---|
| `<SCORE>` | integer `0`–`100` | 100 = fully compliant |
| `<VERDICT>` | `conforme` \| `non_conforme` | internal score threshold around 60 |
| `<SPANS>` | inline JSON array | violations with char-level spans |
| `<CHECKLIST>` | numbered list | corrective actions in priority order |

### `violation_type` vocabulary (10 ISO-aligned categories)

`sentence_too_long`, `passive_voice_overuse`, `undefined_jargon`,
`buried_action`, `nominalization`, `double_negative`, `ambiguous_reference`,
`missing_structure`, `inconsistent_terminology`, `legalese_overload`.

### `severity`

`low` | `medium` | `high`

### Reference parser

A tolerant Python parser (handles truncated output and non-standard JSON
escapes) is available in the companion training-scripts repository, in
`scripts/shared/text_utils.py`.
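
For readers without access to that repository, here is a minimal sketch of the same idea using only the standard library. It is not the reference parser: the function names `_field` and `parse_analysis` are made up here, and it tolerates missing closing tags but does not repair non-standard JSON escapes.

```python
import json
import re


def _field(raw: str, tag: str):
    """Return the text inside <tag>...</tag>; tolerate a missing closing tag."""
    match = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", raw, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_analysis(raw: str) -> dict:
    """Best-effort parse of the model output into a plain dict (hypothetical helper)."""
    score_text = _field(raw, "SCORE")
    spans_text = _field(raw, "SPANS")
    checklist_text = _field(raw, "CHECKLIST") or ""

    try:
        score = int(score_text) if score_text is not None else None
    except ValueError:
        score = None

    try:
        spans = json.loads(spans_text) if spans_text else []
    except json.JSONDecodeError:
        spans = None  # truncated or malformed JSON: signal it, do not guess

    # Strip the leading "1. ", "2. ", ... numbering from each checklist line.
    checklist = [
        re.sub(r"^\d+\.\s*", "", line.strip())
        for line in checklist_text.splitlines()
        if line.strip()
    ]
    return {"score": score, "verdict": _field(raw, "VERDICT"),
            "spans": spans, "checklist": checklist}


# Deliberately truncated sample: the CHECKLIST and ANALYSIS tags are never closed.
sample = """<ANALYSIS>
<SCORE>42</SCORE>
<VERDICT>non_conforme</VERDICT>
<SPANS>
[{"text_fragment": "The Parties hereby acknowledge", "violation_type": "legalese_overload",
  "suggestion": "Both parties agree", "start_char": 0, "end_char": 30, "severity": "high"}]
</SPANS>
<CHECKLIST>
1. Replace archaic legal formulas with direct expressions.
2. Break long sentences"""

result = parse_analysis(sample)
print(result["score"], result["verdict"], len(result["spans"]), result["checklist"])
```

Returning `None` for `spans` when the JSON cannot be decoded lets the caller tell "no violations" (`[]`) apart from "output was cut off".
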

---

## Examples

Two real runs of the model on documents from different domains and
languages, processed end-to-end with greedy decoding (`do_sample=False`,
`max_new_tokens=3072`).

### Example 1: Italian NDA (legal)

**Input** (excerpt from a pseudonymised non-disclosure agreement, about 1,500 words):

> *Su richiesta dell'altra Parte, ovvero alla conclusione o all'interruzione,
> per qualsiasi motivo, senza alcun pregiudizio per quanto riguarda gli altri
> impegni di cui al presente Accordo, la Parte ricevente si obbliga a
> riconsegnare entro 30 giorni all'altra, ovvero, a scelta di quest'ultima,
> a distruggere e attestare per iscritto la distruzione, ogni copia di tutti
> i documenti, o altro materiale in qualsiasi forma in possesso della Parte
> stessa, delle Persone Collegate o di Terzi, che contengano o che si
> riferiscano alle "Informazioni riservate"...*

**Selected output fields** (full output has 8 spans + 5 checklist items):

```
SCORE: 15 / 100
VERDICT: non_conforme
```

| # | violation_type | severity | Suggestion |
|---|---|---|---|
| 1 | `legalese_overload` | high | *"Se richiesto dall'altra Parte, alla fine o all'interruzione dell'Accordo, la Parte Ricevente deve riconsegnare o distruggere tutte le copie..."* |
| 2 | `legalese_overload` | medium | *"Le Parti dichiarano di acconsentire al trattamento dei loro dati personali, in conformità al Regolamento UE 2016/679."* |
| 3 | `legalese_overload` | medium | *"Le Parti stipulano quanto segue"* (replacing "TUTTO CIÒ PREMESSO / SI STIPULA E CONVIENE QUANTO SEGUE") |

**Checklist excerpt**:

> 1. Semplificare il linguaggio giuridico per renderlo più accessibile.
> 2. Eliminare le formule rituali e le espressioni arcaiche.
> 3. Riformulare le frasi lunghe e complesse in periodi più brevi e chiari.

**Wall-clock**: 42 s on a single RTX 4090 (8-bit loading, 1,682 tokens generated).

### Example 2: English technical safety manual

**Input** (excerpt from an HVAC equipment safety manual, about 2,300 words, OCR-cleaned and brand-anonymised):

> *PROHIBITION. It is forbidden to use the machine without the safety devices:
> not working, installed incorrectly. Operating the machine without the safety
> devices creates potential hazards for the operator. For correct and
> long-lasting operation of the machine, carry out the scheduled maintenance
> work as specified by the manufacturer...*

**Selected output fields** (full output has 8 spans + 5 checklist items):

```
SCORE: 15 / 100
VERDICT: non_conforme
```

| # | violation_type | severity | Suggestion |
|---|---|---|---|
| 1 | `missing_structure` | high | Add a section title (e.g., *'Prohibited Modifications'*) and use bullet points for the consequences. |
| 2 | `missing_structure` | high | Add a section title (e.g., *'Safety Device Requirements'*) and list the consequences of non-compliance. |
| 5 | `inconsistent_terminology` | medium | Use *'explosion risk areas'* consistently instead of *'areas classified as at risk of explosion'*. |
| 6 | `inconsistent_terminology` | medium | Use *'fixed guards'* consistently instead of *'fixed guards protecting the moving parts'*. |

**Checklist excerpt**:

> 1. Organize the manual into logical sections with clear, bold headings.
> 2. Use bulleted lists to present rules, prohibitions, and safety requirements.
> 3. Standardize terminology for the machine, fluids, and safety devices throughout the text.
> 4. Add a table of contents to help readers navigate the document.

Both documents score 15/100, but for different reasons: the NDA is flagged
for *legalese overload*, the safety manual for *missing structure* and
*inconsistent terminology*. In these two runs the model diagnoses different
failure modes for different document types.

---

## Evaluation

Evaluated on **200 blind samples** drawn from the v3 held-out test split,
stratified by `(language × doc_type × difficulty × verdict)`, never seen
during training or validation.

### Metrics

| Metric | Prod threshold | Acceptable threshold | **v0.2 result** | Status |
|---|---|---|---|---|
| `score_mae` (mean absolute error on 0–100 score) | ≤ 8.0 | ≤ 12.0 | **2.74** | ✅ **PROD** |
| `verdict_f1` (binary F1 conforme / non_conforme) | ≥ 0.88 | ≥ 0.80 | **0.9577** | ✅ **PROD** |
| `verdict_precision` | n/a | n/a | **0.9714** | (high) |
| `verdict_recall` | n/a | n/a | **0.9444** | (high) |
| `false_positive_rate` (on `conforme` class) | ≤ 0.08 | ≤ 0.15 | **0.0156** | ✅ **PROD** |
| `span_f1` (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.3653 | ⚠️ below acceptable |
| `checklist_rouge_l` | ≥ 0.55 | ≥ 0.45 | 0.2655 | ⚠️ below acceptable |
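
The four verdict metrics are tied together by the standard confusion-matrix definitions. The sketch below shows those formulas. The counts are hypothetical: the card does not publish a confusion matrix, and these values were picked only because they sum to 200 samples and reproduce the reported figures to four decimals.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# HYPOTHETICAL counts (not from the card), summing to the 200 test samples.
counts = {"tp": 68, "fp": 2, "fn": 4, "tn": 126}
metrics = binary_metrics(**counts)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Other count combinations may fit the rounded figures equally well, so this is a consistency check, not a reconstruction of the evaluation.
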

### Interpretation

**Strengths**

- **Excellent score calibration**: MAE 2.74 on a 0–100 scale, far below
  the production threshold (≤ 8). The model's quantitative agreement
  with the ground truth is very tight.
- **Strong binary classification**: verdict F1 0.96 with high precision
  (0.97) and recall (0.94). Very few false positives on compliant texts
  (1.6 %).
- **Robust XML schema adherence**: canonical tags, canonical violation
  vocabulary, coherent character-level offsets across all six languages.

**Measured weaknesses** (improved over v0.1, still below the acceptable threshold)

- **Span F1 0.37**: on dense documents the model identifies fewer spans
  than the ground truth, or reports them with offset drift that fails
  the IoU ≥ 0.5 threshold. Improvement target for v1.0.
- **Checklist ROUGE-L 0.27**: corrective items are semantically
  plausible but lexically divergent from the ground truth (ROUGE
  penalises paraphrasing). A semantic metric (BERTScore) would likely
  reward these outputs more accurately.
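
To make the span metric concrete, here is a sketch of a character-level IoU and an F1 built on it. The greedy one-to-one matching is an assumption made for illustration: the card states only the IoU ≥ 0.5 threshold, not how predicted and gold spans are paired. The function names are made up.

```python
def char_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection over union of two half-open character ranges [start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def span_f1(predicted, gold, threshold: float = 0.5) -> float:
    """F1 over spans; each prediction is greedily matched to one unused gold span."""
    unused = list(gold)
    true_positives = 0
    for pred in predicted:
        best = max(unused, key=lambda g: char_iou(pred, g), default=None)
        if best is not None and char_iou(pred, best) >= threshold:
            unused.remove(best)  # a gold span can be matched only once
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: three gold spans, two predictions, the second with drifted offsets.
gold = [(0, 78), (120, 200), (310, 360)]
predicted = [(0, 70), (150, 260)]
score = span_f1(predicted, gold)
print(round(score, 3))
```

In the toy data the first prediction overlaps its gold span with IoU ≈ 0.90 and counts as a hit. The drifted one reaches only ≈ 0.36 and counts as a miss, and the third gold span is never predicted. These are the two failure patterns described above.
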

### Test set composition

- **Languages**: IT 50 %, EN 15 %, PT 12 %, ES 10 %, FR 8 %, DE 5 %
  (natural distribution preserved in val/test, balanced in train)
- **Document types (10)**: 9 administrative/legal categories plus a
  catch-all `other` category for stylistic diversity
- **Difficulty buckets**: easy / medium / hard / very_hard

The aggregate metrics are **averaged across all six languages**. A
per-language breakdown will be released with v1.0.

---

## Intended use

**Recommended use cases**

- Automated triage of contractual, regulatory and administrative documents
  to flag problematic clauses from a plain-language perspective.
- Decision-support tool for editors, compliance officers, in-house legal
  teams.
- First-draft generation of accessible rewrites for portions of a document.
- Teaching and research on ISO 24495-1 and plain language across
  multilingual corpora.

**Out-of-scope use cases**

- **Fully automated decisions without human review.** Output must always be
  validated by an expert, especially where the text has legal consequences.
- **Domains outside training scope**: clinical/medical text, purely academic
  scientific writing, creative literature. The model is optimised on the
  nine administrative/legal document types of the training set.
- **Languages other than the six supported.** Performance outside these six
  languages is not guaranteed.
- **Substitute for legal or compliance advice.** The model identifies
  *readability* issues, not legal correctness or compliance with other
  regulations.

---

## Limitations

- **Hybrid training set** (about 18,600 task records): the first 9,000 or
  so records are fully synthetic (`gemini-2.5-flash` +
  `gemini-3.5-flash` recovery); the remaining records are built on top of
  human-curated source documents with partial assisted re-annotation by
  `gemini-3.5-flash` under human review. Generator-side biases have not
  been formally measured.
- **Not validated on standard public benchmarks.** The reported metrics
  come from an internal blind test set (200 samples) drawn from the same
  distribution as the training set. External validation is planned for v1.0.
- **Per-language variability.** The training task data is more balanced
  across languages than v0.1, but Italian is still the largest single
  language (about 43 % of the task split). Expect slightly better
  calibration on Italian than on German (14 %).
- **Long outputs may be truncated.** On documents with many violations the
  generation can exceed 2,048 tokens; we recommend `max_new_tokens=3072`
  or higher, combined with a parser tolerant of unclosed XML tags.
- **Sub-optimal span detection** (see § Evaluation). On dense documents the
  model tends to be conservative in the number of spans reported.
- **No support for documents longer than about 30,000 characters**
  (training-time sequence-length limit = 3,072 tokens). For very long
  documents, pre-chunk at sentence boundaries (≤ 500 words per chunk).
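
The pre-chunking advice in the last item can be sketched as follows. This is an illustrative helper, not the chunker used to build the training set: `chunk_by_sentence` is a made-up name, and the regular expression is a naive sentence splitter that is not abbreviation-aware (it will split after "art." or "e.g.").

```python
import re


def chunk_by_sentence(text: str, max_words: int = 500) -> list[tuple[int, str]]:
    """Split text into chunks of at most max_words words, cutting only after sentences.

    Returns (start_char, chunk_text) pairs so that span offsets reported for a
    chunk can be mapped back to the full document by adding start_char.
    A single sentence longer than max_words is kept whole as its own chunk.
    """
    chunks = []
    chunk_start = None  # offset of the first sentence in the current chunk
    chunk_end = 0
    words = 0
    # Naive boundary: text ending in ., ! or ? followed by whitespace, or end of input.
    for match in re.finditer(r"\S.*?(?:[.!?]+(?=\s|$)|$)", text, flags=re.DOTALL):
        n = len(match.group().split())
        if chunk_start is not None and words + n > max_words:
            chunks.append((chunk_start, text[chunk_start:chunk_end]))
            chunk_start, words = None, 0
        if chunk_start is None:
            chunk_start = match.start()
        chunk_end = match.end()
        words += n
    if chunk_start is not None:
        chunks.append((chunk_start, text[chunk_start:chunk_end]))
    return chunks


# Tiny demonstration with a 6-word limit instead of 500.
demo = "One two three. Four five six! Seven eight nine? Ten."
chunks = chunk_by_sentence(demo, max_words=6)
for start, chunk in chunks:
    print(start, repr(chunk))
```

Each chunk is analysed separately. Adding its `start_char` to the `start_char` / `end_char` of every returned span gives offsets in the original document.
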

---

## Training details

| | |
|---|---|
| **Base model** | [utter-project/EuroLLM-9B-Instruct-2512](https://huggingface.co/utter-project/EuroLLM-9B-Instruct-2512) (Apache 2.0) |
| **Architecture** | Llama-style decoder, 9B parameters, native ChatML chat template |
| **Fine-tuning method** | LoRA in bf16 on top of an int8-quantised base (bitsandbytes) |
| **LoRA rank / alpha** | 64 / 128 |
| **LoRA target modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **Trainable parameters** | 203,685,888 (2.18 % of total) |
| **Framework** | [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) 0.16.1; Liger kernel (fused linear + cross-entropy) |
| **Sample packing** | yes |
| **Sequence length** | 3,072 tokens |
| **Epochs** | 2 (vs 3 for v0.1) |
| **Optimizer steps** | 896 total (448 per epoch) |
| **Batch** | 1 micro × 16 gradient accumulation = 16 effective |
| **Optimizer** | Paged AdamW 8-bit (bitsandbytes) |
| **Learning rate** | 1.5e-4, cosine schedule, 100-step warmup (vs 2e-4 / 50-step for v0.1) |
| **Loss masking** | assistant tokens only (`roles_to_train: ["assistant"]`) |
| **Hardware** | 1× NVIDIA RTX 4090 (24 GB) + 128 GB system RAM |
| **Training time** | 7h 14m wall clock |
| **Final loss** | 0.30 (from 0.65 at step 5, −54 %) |
| **Peak VRAM** | ~21 GB / 24 GB |

The model published in this repository is the **final merge** of the LoRA
adapter into the base model, saved as a single `model.safetensors` file in
**bf16** (about 18 GB). For 8-bit inference, load with
`BitsAndBytesConfig(load_in_8bit=True)` as shown in the Quick start.

The bf16 merge is the "neutral ground": it can be re-quantised post-hoc to
any target format (int8, NF4, GGUF Q4_K_M).
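
The trainable-parameter figure in the table can be sanity-checked from the LoRA rank and the seven target modules. A LoRA adapter on a linear layer adds `rank × (in_features + out_features)` weights. The layer dimensions below are assumptions, not values stated in this card: they are plausible for a Llama-style 9B decoder with grouped-query attention, and were chosen because they reproduce the reported count.

```python
RANK = 64  # LoRA rank, from the table above

# ASSUMED base-model dimensions (not stated in this card).
HIDDEN = 4096          # model width
INTERMEDIATE = 12288   # MLP width
KV_DIM = 1024          # key/value projection width under grouped-query attention
NUM_LAYERS = 42        # number of decoder layers

# (in_features, out_features) of each adapted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total comes to 203,685,888, matching the table.
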

---

## Dataset

The model was trained on **`semplifica.Language v3`**, an internal
**hybrid (synthetic + human-curated)** dataset of **28,410 records**
(23,589 train / 2,194 validation / 2,627 test) covering six European
languages, with the following structure:

### Composition

- **Train mix**:
  - `task_iso24495`: 18,589 records (79 %), the primary compliance task.
  - `euroblocks_instruct`: 5,000 records (21 %), anti-forgetting
    general-purpose instruct conversations to retain broad capability.
- **Origin of the task records** (the `task_iso24495` portion):
  - The first **9,000 or so records** are **fully synthetic**, generated
    with `gemini-2.5-flash` (with `gemini-3.5-flash` recovery passes on
    blocking defects).
  - The remaining records (**about 9,500**) are built on top of
    **human-curated source documents** from selected public/proprietary
    datasets (`text_complexity_de`, `german4all`, `plaba`, `med_easi`,
    `porsimples_sent`, `admin_it`, `simpitiki`), cleaned and normalised,
    then **partially re-annotated with assistance from `gemini-3.5-flash`**
    under human review. This phase brought real-world stylistic variety,
    edge-case clauses, and harder negative examples that pure synthetic
    generation underproduced.
- **Format**: ChatML triples `(system, user, assistant)` with structured
  XML output (matching the schema documented in § Output format).
- **Languages** (task split): IT 43 %, EN 26 %, FR 17 %, PT 16 %, DE 14 %,
  ES 11 %.
- **Document types (10)**: service contracts, privacy notices, general
  terms & conditions, business letters, internal regulations, tender
  notices, insurance policies, consent forms, administrative
  communications, plus an `other` catch-all for stylistic diversity.
- **Difficulty buckets**: easy / medium / hard / very_hard, with target
  word counts and violation density scaled accordingly.
- **Splits**: stratified by `(lang × doc_type × difficulty × verdict)` to
  keep the distribution consistent across train / val / test. Val/test
  preserve the natural distribution; train is balanced for verdict
  (40–60 % conforme per language).

### Generation and curation pipeline

- **Synthetic generation** (the first 9,000 or so records):
  initial bulk generation with `gemini-2.5-flash`, recovery pass with
  `gemini-3.5-flash` for blocking defects.
- **Human-curated phase** (later records):
  source documents from the datasets listed above, cleaned and
  normalised, then passed through `gemini-3.5-flash` for assisted
  re-annotation, with human review of the violation labels and span
  boundaries.
- **Sentence-aware chunking** for long documents (max 500 words per
  chunk, abbreviation-aware for IT/EN/FR/DE/ES/PT).
- **Algorithmic defect scan and repair** across the whole corpus:
  case-insensitive matching, whitespace normalisation, span
  re-localization.
- **Verdict balancing** via positive sample generation (mix 70 % Gemini
  2.5 Flash + 30 % Gemini 3.5 Flash) on the human-curated baselines.

Each record carries provenance metadata: `id`, `lang`, `doc_type`,
`difficulty`, `score`, `verdict`, `source`.

### Distribution

The dataset is **not currently published**. The decision on public release
is being evaluated jointly with the v1.0 model release. For collaboration
or research access requests please use the contact channel below.

---

## Roadmap

| Version | Status | Training set | Notes |
|---|---|---|---|
| **v0.1-base** | ✅ released | ~10 K synthetic records | LoRA bf16 + 8-bit base, 3 epochs |
| **v0.2** (this) | ✅ released | ~28 K hybrid records (synthetic + human-curated) | + verdict balance, + anti-forgetting mix, + sentence-aware chunking, 2 epochs |
| **v1.0** | 🔄 in preparation | ~28 K hybrid records + manually-annotated samples | Domain-expert annotations to capture edge cases (contextual ambiguity, niche jargon, severity nuances) |
| **v1.1 / v2** | 🔜 backlog | DPO post-v1.0 | human-feedback alignment on rewrite preferences |

We are building v1.0 by adding **manually-annotated samples** from
domain experts (plain-language editors, legal reviewers, compliance
officers) to the existing pipeline. The synthetic data has reached
diminishing returns on the structural quality dimension; manual annotation
is what is needed to close the gap on `span_f1` and `checklist_rouge_l`.

A 1.7 B edge-distilled sub-release (`v1.0-mini`) for CPU / laptop
deployment is also planned.

---

## Multilingual prompts

The model accepts system prompts in all six target languages. Examples
optimised to match the training distribution:

```python
SYSTEM_PROMPTS = {
    "it": "Sei un esperto di plain language secondo ISO 24495-1:2023. ...",
    "en": "You are an expert in plain language according to ISO 24495-1:2023. ...",
    "fr": "Vous êtes expert en langage clair selon ISO 24495-1:2023. ...",
    "de": "Sie sind Experte für Verständlichkeit gemäß ISO 24495-1:2023. ...",
    "es": "Eres experto en lenguaje claro según ISO 24495-1:2023. ...",
    "pt": "Você é especialista em linguagem simples segundo a ISO 24495-1:2023. ...",
}
```

The full set of prompts is available in `iso_principles.py` in the
companion training-scripts repository.

---

## License

The **fine-tuned model** (this repository) is released under the
**Creative Commons Attribution-NonCommercial 4.0 International ([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/))**
license.

> Non-commercial use is freely permitted (research, academia, internal
> evaluation). For commercial use, please contact the authors (see § Contact).

The **base model**
([utter-project/EuroLLM-9B-Instruct-2512](https://huggingface.co/utter-project/EuroLLM-9B-Instruct-2512))
is released under the **Apache License 2.0** (© 2024 UTTER project). The
distribution of this derivative work **incorporates and attributes** the
base model as required by Apache 2.0. See [`ATTRIBUTION.md`](ATTRIBUTION.md)
for full details.

---

## Citation

If you use this model in academic publications or research materials,
please cite as:

```bibtex
@misc{semplifica_iso24495_9b_v02_2026,
  title  = {EuroLLM-ISO24495-9b-Instruct (v0.2): A Fine-Tuned EuroLLM-9B
            for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages},
  author = {SemplificaAI},
  year   = {2026},
  url    = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-9b-Instruct},
  note   = {v0.2},
}
```

Please also cite the **base model**:

```bibtex
@misc{eurollm9b_2024,
  title  = {EuroLLM-9B: Open-Weight European LLM},
  author = {UTTER project},
  year   = {2024},
  url    = {https://huggingface.co/utter-project/EuroLLM-9B-Instruct-2512},
}
```

---

## Contact

- **Commercial use or access to v1.0**: [hf@semplifica.ai](mailto:hf@semplifica.ai)
- **Issues, bugs, qualitative feedback**: use the *Community* tab of this HF repository.
- **Academic collaboration**: contact the authors for joint dataset /
  benchmark initiatives.