| language: | |
| - it | |
| - en | |
| - pt | |
| - es | |
| - fr | |
| - de | |
| license: cc-by-nc-4.0 | |
| license_name: cc-by-nc-4.0 | |
| license_link: https://creativecommons.org/licenses/by-nc/4.0/ | |
| base_model: utter-project/EuroLLM-1.7B-Instruct | |
| base_model_relation: finetune | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| tags: | |
| - plain-language | |
| - iso-24495-1 | |
| - compliance | |
| - legal-nlp | |
| - multilingual | |
| - eurollm | |
| - lora | |
| - structured-output | |
| - edge | |
| gated: auto | |
| extra_gated_heading: "Access to EuroLLM-ISO24495-1.7b-Instruct (v0.2)" | |
| extra_gated_description: > | |
| This model is released under CC-BY-NC-4.0 (non-commercial). The form below | |
| helps us understand who is using the model and prioritize improvements. | |
| Approval is automatic once the form is submitted. | |
| extra_gated_prompt: > | |
| By submitting this form you confirm that (1) your intended use complies | |
| with the CC-BY-NC-4.0 license terms (non-commercial), and (2) you have | |
| read the Limitations section of the model card. For commercial use, | |
| please contact hf@semplifica.ai. | |
| extra_gated_fields: | |
| Full name: text | |
| Organization or affiliation: text | |
| Country: country | |
| Intended use: | |
| type: text | |
| description: "Briefly describe how you intend to use the model (1-2 sentences)." | |
| Affiliation type: | |
| type: select | |
| options: | |
| - Academic / Research | |
| - Public administration | |
| - Non-profit | |
| - Industry (non-commercial evaluation only) | |
| - Individual / Personal | |
| I agree to non-commercial use only (CC-BY-NC-4.0): | |
| type: checkbox | |
| extra_gated_button_content: "Request access" | |
| model-index: | |
| - name: EuroLLM-ISO24495-1.7b-Instruct-v0.2 | |
| results: | |
| - task: | |
| type: text-generation | |
| name: ISO 24495-1 Plain Language Compliance Analysis | |
| dataset: | |
| name: semplifica.Language synthetic v3 test set (blind) | |
| type: custom | |
| config: 200_samples_blind | |
| metrics: | |
| - type: mae | |
| value: 3.66 | |
| name: Score MAE (0–100) | |
| verified: false | |
| - type: f1 | |
| value: 0.9396 | |
| name: Verdict F1 (binary) | |
| verified: false | |
| - type: precision | |
| value: 0.9091 | |
| name: Verdict Precision | |
| verified: false | |
| - type: recall | |
| value: 0.9722 | |
| name: Verdict Recall | |
| verified: false | |
| - type: false_positive_rate | |
| value: 0.0547 | |
| name: False Positive Rate | |
| verified: false | |
| - type: f1 | |
| value: 0.2721 | |
| name: Span F1 (IoU ≥ 0.5) | |
| verified: false | |
| - type: rouge | |
| value: 0.2235 | |
| name: Checklist ROUGE-L | |
| verified: false | |
| # EuroLLM-ISO24495-1.7b-Instruct (v0.2) | |
| A fine-tuned [EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct) | |
| specialised in **ISO 24495-1 (Plain Language)** compliance analysis of legal, | |
| administrative and technical texts across **six European languages**: | |
| Italian, English, Portuguese, Spanish, French, German. | |
| Given a document, the model emits a structured XML analysis with: a | |
| compliance score (0–100), a binary verdict, a list of violation spans with | |
| character-level offsets and corrective suggestions, and a prioritised | |
| checklist of corrective actions. | |
| > **Version**: `v0.2`, trained on about 23,600 records (v3 dataset, hybrid | |
| > synthetic + human-curated): about 18,600 task records balanced by verdict | |
| > per language, plus a 21 % anti-forgetting mix (EuroBlocks instruct conversations). | |
| > **Target deployment**: edge / consumer GPU. The model runs in about | |
| > **3.4 GB VRAM in bf16** and about **1.7 GB in 8-bit** on a single | |
| > consumer card. | |
| ## Edge profile | |
| | Aspect | Value | | |
| |---|---| | |
| | Parameters | 1.7 B | | |
| | Architecture | Llama-style decoder, native ChatML chat template | | |
| | VRAM (bf16 inference) | ~3.4 GB | | |
| | VRAM (8-bit inference) | ~1.7 GB | | |
| | Context window | 4,096 tokens | | |
| | Languages | IT, EN, FR, DE, ES, PT | | |
| | Target hardware | single consumer GPU (RTX 3060 12 GB and up), or CPU/laptop in 8-bit | | |
| --- | |
| ## Quick start | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | |
| import torch | |
| REPO = "SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct" | |
| # 8-bit loading → ~1.7 GB VRAM. For bf16 (~3.4 GB) drop the quantization_config. | |
| bnb = BitsAndBytesConfig(load_in_8bit=True) | |
| tokenizer = AutoTokenizer.from_pretrained(REPO) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     REPO, quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16, | |
| ) | |
| model.eval() | |
| SYSTEM = ( | |
| "You are an expert in plain language according to ISO 24495-1:2023. " | |
| "Analyze the provided text and produce: (1) a compliance score 0-100, " | |
| "(2) parts to improve with specific suggestions, " | |
| "(3) an ordered checklist of corrective actions. " | |
| "Reply directly without thinking aloud." | |
| ) | |
| text = """The Parties hereby acknowledge, in light of the foregoing premises | |
| which form an integral and substantive part of this Agreement, that the | |
| Confidential Information shall not include...""" | |
| messages = [ | |
|     {"role": "system", "content": SYSTEM}, | |
|     {"role": "user", "content": f"Analyze this text for ISO 24495-1 plain language compliance:\n\n<TEXT>\n{text}\n</TEXT>"}, | |
| ] | |
| prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) | |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) | |
| # Greedy decoding; inference only, so gradients are disabled. | |
| with torch.no_grad(): | |
|     out = model.generate(**inputs, max_new_tokens=3072, do_sample=False, | |
|                          pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id) | |
| print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)) | |
| ``` | |
| System prompts in the other five languages: see [§ Multilingual prompts](#multilingual-prompts). | |
| --- | |
| ## Output format | |
| The model emits a single XML block with four fields: | |
| ```xml | |
| <ANALYSIS> | |
| <SCORE>42</SCORE> | |
| <VERDICT>non_conforme</VERDICT> | |
| <SPANS> | |
| [ | |
| { | |
| "text_fragment": "The Parties hereby acknowledge, in light of the foregoing premises...", | |
| "violation_type": "legalese_overload", | |
| "suggestion": "Both parties agree, based on the above context, that...", | |
| "start_char": 0, | |
| "end_char": 78, | |
| "severity": "high" | |
| } | |
| ] | |
| </SPANS> | |
| <CHECKLIST> | |
| 1. Replace archaic legal formulas with direct expressions. | |
| 2. Break long sentences into shorter periods. | |
| 3. Define technical terms on first use. | |
| </CHECKLIST> | |
| </ANALYSIS> | |
| ``` | |
| ### Fields | |
| | Field | Value | Notes | | |
| |---|---|---| | |
| | `<SCORE>` | integer `0`–`100` | 100 = fully compliant | | |
| | `<VERDICT>` | `conforme` \| `non_conforme` | internal threshold around 60 | | |
| | `<SPANS>` | inline JSON array | violations with char-level spans | | |
| | `<CHECKLIST>` | numbered list | corrective actions in priority order | | |
| ### `violation_type` vocabulary (10 ISO-aligned categories) | |
| `sentence_too_long`, `passive_voice_overuse`, `undefined_jargon`, | |
| `buried_action`, `nominalization`, `double_negative`, `ambiguous_reference`, | |
| `missing_structure`, `inconsistent_terminology`, `legalese_overload`. | |
| ### `severity` | |
| `low` | `medium` | `high` | |
| ### Reference parser | |
| A tolerant Python parser (handles truncated output and non-standard JSON | |
| escapes) is shipped together with the model code (see `text_utils.py` in | |
| the training-scripts repository). | |
| --- | |
| ## Examples | |
| Qualitative examples (a real Italian NDA and an English safety manual) will | |
| be added in a follow-up commit, with the same per-document structure used | |
| for the 9B model card. | |
| --- | |
| ## Evaluation | |
| Evaluated on **200 blind samples** drawn from the v3 held-out test split, | |
| stratified by `(language × doc_type × difficulty × verdict)`, never seen | |
| during training or validation. | |
| ### Metrics | |
| | Metric | Prod threshold | Acceptable threshold | **v0.2 result** | Status | | |
| |---|---|---|---|---| | |
| | `score_mae` (mean absolute error on 0–100 score) | ≤ 8.0 | ≤ 12.0 | **3.66** | ✅ **PROD** | | |
| | `verdict_f1` (binary F1 conforme / non_conforme) | ≥ 0.88 | ≥ 0.80 | **0.9396** | ✅ **PROD** | | |
| | `verdict_precision` | — | — | **0.9091** | (high) | | |
| | `verdict_recall` | — | — | **0.9722** | (very high) | | |
| | `false_positive_rate` (on `conforme` class) | ≤ 0.08 | ≤ 0.15 | **0.0547** | ✅ **PROD** | | |
| | `span_f1` (IoU char-level ≥ 0.5) | ≥ 0.72 | ≥ 0.62 | 0.2721 | ⚠️ below accept | | |
| | `checklist_rouge_l` | ≥ 0.55 | ≥ 0.45 | 0.2235 | ⚠️ below accept | | |
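The four verdict metrics in the table are linked through a single confusion matrix. The card does not publish the raw counts. The counts below are an inferred assumption: one combination that sums to 200 samples and reproduces the reported figures, with `non_conforme` taken as the positive (flagged) class.

```python
def verdict_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Binary verdict metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Assumed counts (not stated in the card): 70 + 7 + 2 + 121 = 200 samples.
metrics = verdict_metrics(tp=70, fp=7, fn=2, tn=121)
print({name: round(value, 4) for name, value in metrics.items()})
# {'precision': 0.9091, 'recall': 0.9722, 'f1': 0.9396, 'false_positive_rate': 0.0547}
```

Under these assumed counts, 7 of 77 flagged documents are false positives and 2 of 72 non-compliant documents are missed.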
| ### Interpretation | |
| **Strengths** | |
| - **High verdict recall (0.97)**: the model leans towards flagging and | |
| rarely misses a non-compliant document. This suits triage workflows | |
| where false negatives cost more than false positives. | |
| - **Production-grade score calibration**: MAE 3.66 on a 0–100 scale, well | |
| below the production threshold of 8. Quantitative agreement with the | |
| ground truth is tight despite the small footprint. | |
| - **Binary verdict F1 above production threshold**: 0.94 vs the 0.88 | |
| threshold; precision 0.91, recall 0.97 — favouring recall is intentional | |
| in a triage context. | |
| - **False positive rate (5.5 %) under the production cap of 8 %**: the | |
| model does not over-flag compliant texts. | |
| - **Robust XML schema** adherence: canonical tags, canonical violation | |
| vocabulary, coherent character-level offsets across all six languages. | |
| **Measured weaknesses** | |
| - **Span F1 0.27** (below the 0.62 acceptable threshold): on documents | |
| with many violations the model reports fewer spans than the ground | |
| truth, or with offset drift that fails the IoU ≥ 0.5 match. The reduced | |
| parameter count limits memorisation of precise span boundaries — this | |
| is the trade-off accepted in exchange for edge deployability. | |
| - **Checklist ROUGE-L 0.22** (below the 0.45 acceptable threshold): | |
| corrective items are semantically plausible but lexically divergent | |
| from the ground truth (ROUGE penalises paraphrasing). A semantic | |
| metric such as BERTScore would likely reward these outputs more | |
| fairly. | |
| - **Verdict precision (0.91) lower than recall (0.97)**: about 1 in 11 | |
| flagged documents is a false positive. Acceptable for screening, but | |
| if you need high-precision flagging consider a higher-capacity model. | |
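To make the span metric concrete, the sketch below shows how a small offset drift can fail the IoU ≥ 0.5 match. The exact matching procedure used for the evaluation is not described in this card. Greedy one-to-one matching and half-open `(start_char, end_char)` intervals are assumptions made for illustration.

```python
def char_iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two half-open character intervals (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def span_f1(predicted: list, gold: list, threshold: float = 0.5) -> float:
    """F1 over spans, matching each prediction to at most one gold span."""
    unmatched = list(gold)
    true_positives = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: char_iou(p, g), default=None)
        if best is not None and char_iou(p, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 78), (100, 160), (200, 260)]
predicted = [(0, 78), (120, 200)]  # second span drifts: IoU with (100, 160) is 0.4
print(round(span_f1(predicted, gold), 2))  # 0.4
```

The example combines both failure modes named above: one gold span is never reported, and one prediction overlaps its target but falls short of the threshold.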
| ### Test set composition | |
| - **Languages**: IT 50 %, EN 15 %, PT 12 %, ES 10 %, FR 8 %, DE 5 % | |
| (natural distribution preserved in val/test, balanced in train) | |
| - **Document types (10)**: 9 administrative/legal categories plus a | |
| catch-all `other` category for stylistic diversity | |
| - **Difficulty buckets**: easy / medium / hard / very_hard | |
| The aggregate metrics are **averaged across all six languages**. | |
| --- | |
| ## Intended use | |
| **Recommended use cases** | |
| - **Edge / on-device** automated triage of contractual, regulatory and | |
| administrative documents to flag problematic clauses from a | |
| plain-language perspective. | |
| - Decision-support tool for editors, compliance officers, in-house legal | |
| teams on hardware with limited VRAM. | |
| - First-draft generation of accessible rewrites for portions of a document. | |
| - Teaching and research on ISO 24495-1 and plain language across | |
| multilingual corpora. | |
| **Out-of-scope use cases** | |
| - **Fully automated decisions without human review.** Output must always be | |
| validated by an expert, especially for legally consequential implications. | |
| - **Domains outside training scope**: clinical/medical text, purely academic | |
| scientific writing, creative literature. The model is optimised on | |
| administrative/legal document types of the training set. | |
| - **Languages other than the six supported.** Performance outside the EU | |
| language set is not guaranteed. | |
| - **Legal or compliance advice substitute.** The model identifies | |
| *readability* issues, not legal correctness or compliance with other | |
| regulations. | |
| --- | |
| ## Limitations | |
| - **Hybrid training set** (about 18,600 task records): the first about | |
| 9,000 records are fully synthetic (`gemini-2.5-flash` generation with | |
| `gemini-3.5-flash` recovery); the remaining records are built on top of | |
| human-curated source documents with partial assisted re-annotation by | |
| `gemini-3.5-flash` under human review. Generator-side biases have not | |
| been formally measured. | |
| - **Not validated on standard public benchmarks.** The reported metrics | |
| come from an internal blind test set (200 samples) drawn from the same | |
| distribution as the training set. | |
| - **Smaller model capacity than 9B-class variants.** Expect lower | |
| `span_f1` and `checklist_rouge_l` than a 9B fine-tuned on the same | |
| dataset — the trade-off here is **edge deployability** (1.7 GB in 8-bit | |
| vs about 9 GB). | |
| - **Per-language variability.** Italian is the largest single language in | |
| training (about 43 % of task split). Expect slightly better calibration | |
| on Italian than on German (14 %). | |
| - **Short context window — hard 4,096-token limit** (vs 32K of larger | |
| EuroLLM variants). See the dedicated section | |
| [§ Working with the 4K context window](#working-with-the-4k-context-window) | |
| for the recommended input-size policy. | |
| - **Long outputs may be truncated.** On documents with many violations the | |
| generation can exceed 2,048 tokens; we recommend `max_new_tokens=3072` or | |
| more (within the 4,096-token window), combined with a parser tolerant of unclosed XML tags. | |
| --- | |
| ## Working with the 4K context window | |
| The base model has a hard `max_position_embeddings = 4,096` (about | |
| **3,000 input words** as an absolute ceiling). During training we used a | |
| sequence length of 4,096 tokens, **including** the assistant XML output | |
| which itself can consume 1,000 to 2,000 tokens for documents with many | |
| violations. | |
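Because the prompt and the generated XML share the same 4,096-token window, it can help to check the budget before calling `generate`. The helpers below are a minimal sketch with hypothetical names. The prompt length would come from the tokenized prompt, for example `inputs["input_ids"].shape[1]` in the Quick start.

```python
CONTEXT_WINDOW = 4096  # max_position_embeddings of the base model, as stated above


def output_budget(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for generation once the prompt is in the window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, context_window - prompt_tokens)


def fits(prompt_tokens: int, max_new_tokens: int,
         context_window: int = CONTEXT_WINDOW) -> bool:
    """True if prompt plus requested output stays within the window."""
    return prompt_tokens + max_new_tokens <= context_window


# With max_new_tokens=3072 (Quick start), the prompt may use at most 1,024 tokens.
print(output_budget(1024))  # 3072
print(fits(1025, 3072))     # False
```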
| **General rule of thumb** for using this model in production: | |
| > Feed the model **complete sentences and complete sections** of the | |
| > document. Do **not** split mid-sentence. We recommend keeping each | |
| > single request **under 500 words of input text**. | |
| As a reference: | |
| | Input size | Approximate equivalent (A4) | | |
| |---|---| | |
| | **500 words** | ~1 page A4 of dense contract text, or ~1.5 pages of standard administrative prose | | |
| | **1,000 words** | ~2 pages A4 of dense contract text | | |
| | **3,000 words** (≈ hard ceiling) | ~6 pages A4 of dense contract text; leaves little room for the output | |
| If your document exceeds the recommended limit of about 500 words per request: | |
| 1. **Pre-chunk at sentence boundaries** (never split a sentence). Aim | |
| for chunks of 300–500 words each, abbreviation-aware (`art.`, `n.`, | |
| `Sig.`, etc.) for the six supported languages. | |
| 2. **Preserve natural document structure** as chunk boundaries when | |
| possible: article, section, clause. This keeps each request | |
| semantically coherent and produces better-scoped span offsets. | |
| 3. **Run the model once per chunk** and concatenate the resulting | |
| `<SPANS>` arrays. The `start_char` / `end_char` offsets are | |
| chunk-local — remap them to the original document by adding the chunk | |
| offset. | |
| 4. **Do not deduplicate spans across chunks**: if the same violation | |
| appears in two adjacent chunks, both are valid local findings. | |
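The steps above can be sketched as follows. This is not the chunker used to build the dataset: the function names are hypothetical, the sentence splitter is deliberately naive, and the abbreviation list contains only the three examples mentioned above, so it would need extending for each language.

```python
import re

# Only the abbreviations named in the text; extend per language (assumption).
ABBREVIATIONS = ("art.", "n.", "sig.")


def split_sentences(text: str) -> list:
    """Return contiguous (start, end) offsets of sentences, skipping abbreviations."""
    bounds, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        words = text[start:m.start() + 1].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a sentence end
        bounds.append((start, m.end()))
        start = m.end()
    if start < len(text):
        bounds.append((start, len(text)))
    return bounds


def chunk_text(text: str, max_words: int = 500) -> list:
    """Group whole sentences into (chunk_offset, chunk_text) pairs."""
    chunks, cur_start, cur_end, cur_words = [], None, 0, 0
    for s, e in split_sentences(text):
        n = len(text[s:e].split())
        if cur_start is not None and cur_words + n > max_words:
            chunks.append((cur_start, text[cur_start:cur_end]))
            cur_start, cur_words = None, 0
        if cur_start is None:
            cur_start = s
        cur_end, cur_words = e, cur_words + n
    if cur_start is not None:
        chunks.append((cur_start, text[cur_start:cur_end]))
    return chunks


def remap_spans(spans: list, chunk_offset: int) -> list:
    """Shift chunk-local offsets back to positions in the original document."""
    return [
        {**span,
         "start_char": span["start_char"] + chunk_offset,
         "end_char": span["end_char"] + chunk_offset}
        for span in spans
    ]
```

A sentence longer than `max_words` becomes its own chunk rather than being split, in line with the rule of never cutting mid-sentence. Each chunk carries its offset, so `remap_spans(parsed_spans, chunk_offset)` yields document-level positions.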
| Going above the 500-word recommendation generally still works (up to | |
| about 1,000 to 1,500 words), but you trade off: | |
| - Span offset precision drops (the model has fewer training samples in | |
| that input-size bucket). | |
| - Recall on violations late in the document drops (attention spreads | |
| thin). | |
| - Risk of output truncation grows (long input + long output approaches | |
| the 4 K ceiling). | |
| For corpora where most documents are longer than about 2,000 words, | |
| consider the 9B variant (`SemplificaAI/EuroLLM-ISO24495-9b-Instruct`, | |
| 32 K context) instead. | |
| --- | |
| ## Training details | |
| | | | | |
| |---|---| | |
| | **Base model** | [utter-project/EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct) (Apache 2.0) | | |
| | **Architecture** | Llama-style decoder, 1.7B parameters, native ChatML chat template | | |
| | **Fine-tuning method** | LoRA in bf16 on a bf16 (non-quantised) base | | |
| | **LoRA rank / alpha** | 32 / 64 | | |
| | **LoRA target modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | | |
| | **Trainable parameters** | 28,704,768 (1.73 % of total) | | |
| | **Framework** | [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) 0.16.1; Liger kernel (fused linear + cross-entropy) | | |
| | **Sample packing** | yes | | |
| | **Sequence length** | 4,096 tokens | | |
| | **Epochs** | 2 | | |
| | **Optimizer steps** | 700 (350 per epoch) | | |
| | **Batch** | 1 micro × 16 gradient accumulation = 16 effective | | |
| | **Optimizer** | Paged AdamW 8-bit (bitsandbytes) | | |
| | **Learning rate** | 2.0e-4, cosine schedule, 100-step warmup | | |
| | **Loss masking** | assistant tokens only (`roles_to_train: ["assistant"]`) | | |
| | **Hardware** | 1× NVIDIA RTX 4090 (24 GB) | | |
| | **Training time** | ~1h 40m wall clock | | |
| | **Final loss** | ~0.42 (from ~0.91 at step 5, −54 %) | | |
| | **Peak VRAM** | ~8.3 GB / 24 GB | | |
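The trainable-parameter count in the table can be cross-checked with a short calculation. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions taken from the published EuroLLM-1.7B architecture (hidden size 2048, FFN size 5632, 24 layers, 16 attention heads, 8 key/value heads). They are not stated in this card.

```python
# Assumed base-model dimensions (not listed in this card).
HIDDEN = 2048
INTERMEDIATE = 5632
LAYERS = 24
HEADS, KV_HEADS = 16, 8
KV_DIM = KV_HEADS * (HIDDEN // HEADS)  # grouped-query attention: 8 heads x 128 dims

RANK = 32  # LoRA rank from the table above

# (d_in, d_out) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int, shapes: dict, layers: int) -> int:
    """Each adapter adds A (rank x d_in) and B (d_out x rank) matrices."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers


print(lora_trainable_params(RANK, TARGET_SHAPES, LAYERS))  # 28704768
```

Under these assumed dimensions the result matches the 28,704,768 trainable parameters reported above.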
| The model published in this repository is the **final merge** of the LoRA | |
| adapter into the base model, saved as a single `model.safetensors` file in | |
| **bf16** (about 3.4 GB). For 8-bit inference, load with | |
| `BitsAndBytesConfig(load_in_8bit=True)` as shown in the Quick start. | |
| --- | |
| ## Dataset | |
| The model was trained on **`semplifica.Language v3`**, an | |
| internal **hybrid (synthetic + human-curated)** dataset of **28,410 records** | |
| (23,589 train / 2,194 validation / 2,627 test) covering six European languages. | |
| ### Composition | |
| - **Train mix**: | |
| - `task_iso24495`: 18,589 records (79 %) — the primary compliance task. | |
| - `euroblocks_instruct`: 5,000 records (21 %) — anti-forgetting, | |
| general-purpose instruct conversations to retain broad capability. | |
| - **Origin of the task records** (the `task_iso24495` portion): | |
| - First **about 9,000 records**: **fully synthetic**, generated with | |
| `gemini-2.5-flash` (with `gemini-3.5-flash` recovery passes on | |
| blocking defects). | |
| - Remaining **about 9,600 records**: built on top of **human-curated | |
| source documents** from selected public/proprietary datasets, cleaned | |
| and normalised, then **partially re-annotated with assistance from | |
| `gemini-3.5-flash`** under human review. This phase brought | |
| real-world stylistic variety, edge-case clauses, and harder negative | |
| examples that pure synthetic generation underproduced. | |
| - **Format**: ChatML triples `(system, user, assistant)` with structured | |
| XML output (matching the schema documented in § Output format). | |
| - **Languages** (task split): IT 43 %, EN 26 %, FR 17 %, PT 16 %, DE 14 %, | |
| ES 11 %. | |
| - **Document types (10)**: service contracts, privacy notices, general | |
| terms & conditions, business letters, internal regulations, tender | |
| notices, insurance policies, consent forms, administrative | |
| communications, plus an `other` catch-all for stylistic diversity. | |
| - **Difficulty buckets**: easy / medium / hard / very_hard, with target | |
| word counts and violation density scaled accordingly. | |
| - **Splits**: stratified by `(lang × doc_type × difficulty × verdict)` to | |
| keep distribution consistent across train / val / test. Val/test | |
| preserve natural distribution; train is balanced for verdict | |
| (40–60 % conforme per language). | |
| ### Generation and curation pipeline | |
| - **Synthetic generation** (first about 9,000 records): | |
| initial bulk generation with `gemini-2.5-flash`, recovery pass with | |
| `gemini-3.5-flash` for blocking defects. | |
| - **Human-curated phase** (later records): | |
| source documents from selected datasets, cleaned and normalised, then | |
| passed through `gemini-3.5-flash` for assisted re-annotation, with | |
| human review on the violation labels and span boundaries. | |
| - **Sentence-aware chunking** for long documents (max 500 words per | |
| chunk, abbreviation-aware for IT/EN/FR/DE/ES/PT). | |
| - **Algorithmic defect scan and repair** across the whole corpus: | |
| case-insensitive matching, whitespace normalisation, span | |
| re-localization. | |
| Each record carries provenance metadata: `id`, `lang`, `doc_type`, | |
| `difficulty`, `score`, `verdict`, `source`. | |
| ### Distribution | |
| The dataset is **not currently published**. The decision on public release | |
| is being evaluated separately from this model release. | |
| --- | |
| ## Multilingual prompts | |
| The model accepts system prompts in all six target languages. Examples | |
| optimised to match the training distribution: | |
| ```python | |
| SYSTEM_PROMPTS = { | |
| "it": "Sei un esperto di plain language secondo ISO 24495-1:2023. ...", | |
| "en": "You are an expert in plain language according to ISO 24495-1:2023. ...", | |
| "fr": "Vous êtes expert en langage clair selon ISO 24495-1:2023. ...", | |
| "de": "Sie sind Experte für Verständlichkeit gemäß ISO 24495-1:2023. ...", | |
| "es": "Eres experto en lenguaje claro según ISO 24495-1:2023. ...", | |
| "pt": "Você é especialista em linguagem simples segundo a ISO 24495-1:2023. ...", | |
| } | |
| ``` | |
| The full set of prompts is available in `iso_principles.py` in the | |
| companion training-scripts repository. | |
| --- | |
| ## License | |
| The **fine-tuned model** (this repository) is released under the | |
| **Creative Commons Attribution-NonCommercial 4.0 International ([CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/))** | |
| license. | |
| > Non-commercial use is freely permitted (research, academia, internal | |
| > evaluation). For commercial use, please contact the authors (see § Contact). | |
| The **base model** | |
| ([utter-project/EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct)) | |
| is released under the **Apache License 2.0** (© 2024 UTTER project). The | |
| distribution of this derivative work **incorporates and attributes** the | |
| base model as required by Apache 2.0. See [`ATTRIBUTION.md`](ATTRIBUTION.md) | |
| for full details. | |
| --- | |
| ## Citation | |
| If you use this model in academic publications or research materials, | |
| please cite as: | |
| ```bibtex | |
| @misc{semplifica_iso24495_1_7b_v02_2026, | |
| title = {EuroLLM-ISO24495-1.7b-Instruct (v0.2): A Fine-Tuned EuroLLM-1.7B | |
| for ISO 24495-1 Plain Language Compliance Analysis in Six EU Languages}, | |
| author = {SemplificaAI}, | |
| year = {2026}, | |
| url = {https://huggingface.co/SemplificaAI/EuroLLM-ISO24495-1.7b-Instruct}, | |
| note = {v0.2}, | |
| } | |
| ``` | |
| Please also cite the **base model**: | |
| ```bibtex | |
| @misc{eurollm1_7b_2024, | |
| title = {EuroLLM-1.7B: Open-Weight European LLM}, | |
| author = {UTTER project}, | |
| year = {2024}, | |
| url = {https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct}, | |
| } | |
| ``` | |
| --- | |
| ## Contact | |
| - **Commercial use**: [hf@semplifica.ai](mailto:hf@semplifica.ai) | |
| - **Issues, bugs, qualitative feedback**: use the *Community* tab of this HF repository. | |
| - **Academic collaboration**: contact the authors for joint dataset / | |
| benchmark initiatives. | |