---
language:
- fr
- en
license: apache-2.0
library_name: transformers
tags:
- privacy
- anonymization
- pii
- legal
- compliance
- gdpr
- rgpd
- ner
- token-classification
- on-premise
- sovereign-ai
- slm
- privamesh
pipeline_tag: token-classification
base_model: mistralai/Mistral-Small-3.1
model_type: token-classification
datasets:
- sallani/privamesh-legal-synthetic
metrics:
- f1
- precision
- recall
---
| |
| # PrivaMesh Legal — Semantic PII Anonymization for Legal & Compliance Documents |
|
|
| <p align="center"> |
| <a href="https://huggingface.co/sallani/PrivaMesh"><img src="https://img.shields.io/badge/🤗%20HuggingFace-sallani%2FPrivaMesh-FFD21E?style=flat-square" alt="HuggingFace"/></a> |
| <img src="https://img.shields.io/badge/License-Apache%202.0-4B73C4?style=flat-square&logo=opensourceinitiative&logoColor=white" alt="License"/> |
| <img src="https://img.shields.io/badge/Base%20Model-Mistral--Small--3.1-FF6B35?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==&logoColor=white" alt="Mistral"/> |
| <img src="https://img.shields.io/badge/🇫🇷%20Sovereign%20AI-France-1A3A6B?style=flat-square" alt="Sovereign France"/> |
| </p> |
|
|
| <p align="center"> |
| <img src="https://img.shields.io/badge/French--first-Native%20FR%20%7C%20EN-7C3AED?style=flat-square" alt="French-first"/> |
| <img src="https://img.shields.io/badge/RGPD%20%7C%20DORA%20%7C%20NIS2-Compliant-16A34A?style=flat-square" alt="RGPD"/> |
| <img src="https://img.shields.io/badge/Deploy-On--Premise%20%7C%20Sovereign-DC2626?style=flat-square" alt="Deployment"/> |
| <img src="https://img.shields.io/badge/Domain-Legal%20%7C%20Compliance-EA580C?style=flat-square" alt="Domain"/> |
| <img src="https://img.shields.io/badge/Framework-PrivaMesh-6D28D9?style=flat-square" alt="PrivaMesh"/> |
| </p> |
|
|
| <p align="center"> |
| <img src="https://img.shields.io/badge/F1%20Score-97.3%25-7F77DD?style=flat-square" alt="F1"/> |
| <img src="https://img.shields.io/badge/BERTScore-94.1%25-1D9E75?style=flat-square" alt="BERTScore"/> |
| <img src="https://img.shields.io/badge/PII%20Categories-24-D85A30?style=flat-square" alt="Categories"/> |
| <img src="https://img.shields.io/badge/Context-32k%20tokens-378ADD?style=flat-square" alt="Context"/> |
| <img src="https://img.shields.io/badge/License-Apache%202.0-059669?style=flat-square" alt="Apache"/> |
| </p> |
|
|
| --- |
|
|
| <h3 align="center">The first sovereign, French-native SLM framework for semantic PII anonymization</h3> |
|
|
| <p align="center"> |
| <b>PrivaMesh Legal</b> is the first model of the <b>PrivaMesh framework</b> —<br/> |
| a collaborative multi-SLM architecture for semantic data anonymization<br/> |
| in sovereign, on-premise agentic AI pipelines. |
| </p> |
|
|
| <p align="center"> |
| Unlike classical PII masking tools that destroy semantic context,<br/> |
| PrivaMesh Legal <b>preserves the legal meaning</b> of documents<br/> |
| while removing all personally identifiable, confidential, and regulated information —<br/> |
| making legal and compliance documents safely usable by downstream LLMs and agentic systems. |
| </p> |
|
|
| <p align="center"> |
| <b>🇫🇷 Built on Mistral · Apache 2.0 · 100% On-Premise · Zero data exfiltration</b> |
| </p> |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
- [Overview](#overview)
- [Key Differentiators vs. Existing Approaches](#key-differentiators-vs-existing-approaches)
- [The PrivaMesh Framework](#the-privamesh-framework)
- [Supported Privacy Categories](#supported-privacy-categories)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [Model Architecture](#model-architecture)
- [Training Details](#training-details)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Deployment](#deployment)
- [Regulatory Coverage](#regulatory-coverage)
- [Limitations & Risks](#limitations--risks)
- [Citation](#citation)
- [Contributing](#contributing)
- [License](#license)
|
|
| --- |
|
|
| ## Overview |
|
|
| **PrivaMesh Legal** is a fine-tuned Small Language Model (SLM) specialized in semantic PII detection and anonymization for legal, compliance, and regulatory documents in French and English. |
|
|
| It is designed for: |
|
|
| - **Law firms** processing contracts, briefs, and pleadings |
| - **Compliance teams** handling GDPR/RGPD, DORA, NIS2, ISO 27001 documentation |
| - **Banks and financial institutions** managing regulatory submissions |
| - **Healthcare organizations** processing medico-legal files |
| - **Public administrations** handling sensitive administrative records |
| - **MSSPs** automating compliance audits at scale |
|
|
| ### What makes PrivaMesh Legal different |
|
|
| Classical PII tools (regex, NER, classical transformers) detect and mask tokens. They answer: *"Is this token a person's name?"* |
|
|
| PrivaMesh Legal answers a richer question: ***"What is the legal role of this entity in this document, and how do I replace it with a semantically coherent anonymized placeholder that preserves the document's legal structure and reasoning?"*** |
|
|
| ``` |
| Input: |
| "Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris |
| (SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321), |
| représentée par M. Pierre Martin en qualité de Directeur Général, |
| prévoit une indemnité de rupture de 150 000 EUR conformément à l'article L.1237-19 du Code du travail." |
| |
| PrivaMesh Legal output: |
| "Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1] |
| (SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]), |
| représentée par [DIRIGEANT_1] en qualité de [FONCTION_1], |
| prévoit une indemnité de rupture de [MONTANT_1] conformément à l'article L.1237-19 du Code du travail." |
| |
| Semantic preservation: ✅ Legal structure intact |
| PII removed: ✅ All identifiers anonymized |
| Legal reasoning preserved: ✅ Article reference kept |
| ``` |
|
|
| --- |
|
|
| ## Key Differentiators vs. Existing Approaches |
|
|
| | Feature | Regex / Rules | Classical NER | openai/privacy-filter | **PrivaMesh Legal** | |
| |---|:---:|:---:|:---:|:---:| |
| | PII detection | ✅ Basic | ✅ Good | ✅ Good | ✅ **Excellent** | |
| | Semantic preservation | ❌ | ❌ | ⚠️ Partial | ✅ **Full** | |
| | Legal entity typing | ❌ | ⚠️ Generic | ❌ | ✅ **Role-aware** | |
| | French legal domain | ❌ | ⚠️ Limited | ⚠️ EN-primary | ✅ **Native FR+EN** | |
| | Contextual replacement | ❌ | ❌ | ❌ | ✅ **Coherent placeholders** | |
| | On-premise deployment | ✅ | ✅ | ✅ | ✅ **Sovereign** | |
| | Agentic pipeline ready | ❌ | ❌ | ❌ | ✅ **Native** | |
| | RGPD/DORA/NIS2 aware | ❌ | ❌ | ⚠️ | ✅ **Built-in** | |
| | Multi-SLM orchestration | ❌ | ❌ | ❌ | ✅ **PrivaMesh mesh** | |
|
|
| --- |
|
|
| ## The PrivaMesh Framework |
|
|
| PrivaMesh Legal is **one node** in the PrivaMesh collaborative multi-SLM mesh. Each node is a specialized SLM fine-tuned on a specific domain. An orchestrator agent coordinates them at inference time. |
|
|
| <p align="center"> |
| <img src="priva.png" alt="PrivaMesh Framework Architecture — Collaborative Multi-SLM for Semantic Data Anonymization" width="720"/> |
| </p> |
|
|
| <p align="center"> |
| <em>Figure 1 — PrivaMesh Framework: Raw enterprise documents are routed by the Orchestrator to specialized SLMs (Legal, Finance, Medical), validated semantically, and output as anonymized documents with a compliance report.</em> |
| </p> |
|
|
**PrivaMesh models (released and upcoming):**
|
|
| | Model | Domain | Status | |
| |---|---|---| |
| | `sallani/PrivaMesh` | Legal, compliance, RGPD | ✅ **This model** | |
| | `sallani/PrivaMesh-Finance` | Finance, banking, DORA | 🔄 In development | |
| | `sallani/PrivaMesh-Medical` | Healthcare, HIPAA | 🔄 In development | |
| | `sallani/PrivaMesh-HR` | Human resources, employment law | 📋 Planned | |
| | `sallani/PrivaMesh-Orchestrator` | Multi-domain coordination | 📋 Planned | |
|
|
| --- |
|
|
| ## Supported Privacy Categories |
|
|
| PrivaMesh Legal detects and semantically anonymizes **24 privacy categories** specific to legal and compliance documents: |
|
|
| ### Natural Persons |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `PERSON_NAME` | Full name of any natural person | `Jean Dupont` → `[PERSONNE_1]` | |
| | `LEGAL_COUNSEL` | Lawyer, notary, bailiff name | `Maître Sophie Martin` → `[AVOCAT_1]` | |
| | `JUDGE_NAME` | Judge or magistrate name | `M. le Juge Leblanc` → `[MAGISTRAT_1]` | |
| | `SIGNATORY` | Document signatory | `Lu et approuvé, Pierre Durand` → `[SIGNATAIRE_1]` | |
| | `WITNESS` | Witness name | `En présence de Claude Moreau` → `[TEMOIN_1]` | |
|
|
| ### Legal Entities |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `COMPANY_NAME` | Legal entity name | `Nexum SAS` → `[SOCIETE_1]` | |
| `COMPANY_ID` | SIRET, SIREN, RCS | `SIRET 123 456 789 00012` → `[SIRET_1]` |
| | `LEGAL_FORM` | Corporate form in context | preserved contextually | |
| | `COURT_NAME` | Specific court name | `TGI de Paris` → `[JURIDICTION_1]` | |
| | `BAR_ASSOCIATION` | Bar association location | `barreau de Lyon` → `[BARREAU_1]` | |
|
|
| ### Financial & Contractual |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `CONTRACT_AMOUNT` | Monetary amounts in contracts | `150 000 EUR` → `[MONTANT_1]` | |
| | `BANK_ACCOUNT` | IBAN, BIC | `FR76 3000...` → `[IBAN_1]` | |
| | `PENALTY_AMOUNT` | Penalty or indemnity amounts | `50 000 EUR` → `[PENALITE_1]` | |
|
|
| ### Contact & Location |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `PRIVATE_ADDRESS` | Residential or registered address | `12 rue de la Paix, 75001 Paris` → `[ADRESSE_1]` | |
| | `PRIVATE_EMAIL` | Personal or professional email | `j.dupont@cabinet.fr` → `[EMAIL_1]` | |
| | `PRIVATE_PHONE` | Phone number | `+33 6 12 34 56 78` → `[TEL_1]` | |
|
|
| ### Temporal & Reference |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `CONTRACT_DATE` | Specific contract dates | `le 15 mars 2024` → `[DATE_1]` | |
| | `DEADLINE` | Legal deadlines | `avant le 30 juin 2025` → `[ECHEANCE_1]` | |
| | `CASE_NUMBER` | Court case reference | `RG n°24/01234` → `[DOSSIER_1]` | |
|
|
| ### Regulatory & Compliance Specific |
| | Label | Description | Example → Replacement | |
| |---|---|---| |
| | `DATA_SUBJECT` | RGPD data subject reference | `la personne concernée M. Martin` → `[PERSONNE_CONCERNEE_1]` | |
| | `DPO_IDENTITY` | DPO name and contact | `DPO : Claire Dubois` → `[DPO_1]` | |
| | `PROCESSING_PURPOSE` | Specific processing purpose description | anonymized contextually | |
| | `AUDIT_REFERENCE` | Internal audit or control reference | `Audit ISO 27001 ref. AUD-2024-042` → `[AUDIT_REF_1]` | |
| `REGULATORY_BODY` | Specific regulator name in context | `la CNIL` → kept as-is or `[AUTORITE_1]`, depending on the label policy |
|
|
| > **Note on semantic preservation**: PrivaMesh Legal preserves legal article references (e.g., `article L.1237-19 du Code du travail`), legal terminology, document structure, and reasoning chains. Only identifiers and personal data are anonymized. |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ### Installation |
|
|
| ```bash |
| pip install transformers torch privamesh |
| ``` |
|
|
| ### Basic usage — Pipeline API |
|
|
```python
from privamesh import PrivaMeshLegal

# Initialize (runs fully on-premise, no API call)
model = PrivaMeshLegal.from_pretrained("sallani/PrivaMesh")

# Anonymize a legal document
text = """
Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général,
prévoit une indemnité de rupture de 150 000 EUR conformément à
l'article L.1237-19 du Code du travail.
"""

result = model.anonymize(text)

print(result.anonymized_text)
# → Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
# (SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
# représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
# prévoit une indemnité de rupture de [MONTANT_1] conformément à
# l'article L.1237-19 du Code du travail.

print(result.entities)
# → [
# Entity(label="LEGAL_COUNSEL", text="Maître Jean Dupont", start=26, end=44, replacement="[AVOCAT_1]"),
# Entity(label="BAR_ASSOCIATION", text="barreau de Paris", start=57, end=73, replacement="[BARREAU_1]"),
# Entity(label="COMPANY_ID", text="SIRET 123 456 789 00012", start=75, end=98, replacement="[SIRET_1]"),
# ...
# ]

print(result.semantic_score)
# → 0.94 (BERTScore semantic preservation)

print(result.privacy_recall)
# → 0.97 (fraction of PII entities detected)
```
|
|
| ### Using with HuggingFace Transformers directly |
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sallani/PrivaMesh")
model = AutoModelForTokenClassification.from_pretrained(
    "sallani/PrivaMesh",
    device_map="auto",  # requires the `accelerate` package
)

text = "Le contrat signé par Jean Dupont le 15 mars 2024."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# One predicted label per token (special tokens included)
predicted_ids = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
predicted_labels = [model.config.id2label[i.item()] for i in predicted_ids]

for token, label in zip(tokens, predicted_labels):
    print(token, label)
```
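
The per-token labels above still have to be grouped into entity spans before anything can be replaced. The following is a minimal sketch of that grouping step for BIOES tags, not the library's own decoder; the function name `bioes_to_spans` is hypothetical, and inconsistent tag sequences are simply dropped.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIOES tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            spans.append((label, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            # Open a new span
            start, current = i, label
        elif prefix == "I" and current == label:
            # Still inside the open span
            continue
        elif prefix == "E" and current == label:
            # Close the open span
            spans.append((label, start, i + 1))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: discard any unfinished span
            start, current = None, None
    return spans


tags = ["O", "B-PERSON_NAME", "E-PERSON_NAME", "O", "S-CONTRACT_DATE"]
spans = bioes_to_spans(tags)
print(spans)
```

Token indices can then be mapped back to character offsets with the tokenizer's offset mapping if character-level spans are needed.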
|
|
| --- |
|
|
| ## Advanced Usage |
|
|
| ### Batch processing — high throughput |
|
|
```python
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "sallani/PrivaMesh",
    device_map="auto",
    torch_dtype="bfloat16",  # half-precision weights for faster inference
)

# Replace with your own document strings
documents = [
    "Le contrat signé par Jean Dupont le 15 mars 2024.",
    "La société Nexum SAS est représentée par M. Pierre Martin.",
]

results = model.anonymize_batch(
    documents,
    batch_size=16,
    preserve_structure=True,    # keep document layout
    coherent_replacement=True,  # same entity → same placeholder
    language="fr",              # or "en" or "auto"
)
```
|
|
| ### Precision / Recall tuning |
|
|
```python
result = model.anonymize(
    text,
    operating_point="high_recall",  # maximize PII detection (RGPD audit)
    # other values:
    #   "high_precision"  minimize false positives (legal review)
    #   "balanced"        default
)
```
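
This card does not document how operating points are implemented. One common approach, shown here purely as an assumption, is to threshold the probability of the best non-`O` label: a lower threshold flags more tokens (higher recall), a higher one flags fewer (higher precision). The threshold values and the `select_labels` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical thresholds on the probability of the best non-"O" label.
THRESHOLDS: Dict[str, float] = {
    "high_recall": 0.30,
    "balanced": 0.50,
    "high_precision": 0.80,
}


def select_labels(
    token_probs: List[Dict[str, float]], operating_point: str = "balanced"
) -> List[str]:
    """Pick one label per token, falling back to "O" below the threshold."""
    threshold = THRESHOLDS[operating_point]
    labels = []
    for probs in token_probs:
        best_label, best_prob = max(
            ((label, p) for label, p in probs.items() if label != "O"),
            key=lambda item: item[1],
            default=("O", 0.0),
        )
        labels.append(best_label if best_prob >= threshold else "O")
    return labels


token_probs = [
    {"O": 0.95, "S-PERSON_NAME": 0.05},
    {"O": 0.55, "S-PERSON_NAME": 0.45},
    {"O": 0.10, "S-PERSON_NAME": 0.90},
]
for point in THRESHOLDS:
    print(point, select_labels(token_probs, point))
```

In this toy input the borderline second token is flagged only at `high_recall`, which is the behaviour the three settings are meant to trade off.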
|
|
| ### Custom label policy — fine-grained control |
|
|
```python
# Anonymize only specific categories
result = model.anonymize(
    text,
    active_labels=[
        "PERSON_NAME",
        "COMPANY_NAME",
        "COMPANY_ID",
        "CONTRACT_AMOUNT",
    ],
    preserve_labels=[
        "COURT_NAME",       # keep court names for legal indexing
        "REGULATORY_BODY",  # keep CNIL, AMF, etc.
    ],
)
```
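
Since label filtering is applied after detection, its effect can be pictured as a plain filter over detected spans. The sketch below is an illustration under that assumption; `Span` and `filter_spans` are hypothetical names and not part of the `privamesh` package.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Span:
    label: str
    text: str


def filter_spans(
    spans: Iterable[Span],
    active_labels: Optional[Iterable[str]] = None,
    preserve_labels: Iterable[str] = (),
) -> List[Span]:
    """Return the spans to anonymize: active and not explicitly preserved."""
    active = None if active_labels is None else set(active_labels)
    preserved = set(preserve_labels)
    return [
        span
        for span in spans
        if span.label not in preserved and (active is None or span.label in active)
    ]


detected = [
    Span("PERSON_NAME", "Jean Dupont"),
    Span("COURT_NAME", "TGI de Paris"),
    Span("PRIVATE_EMAIL", "j.dupont@cabinet.fr"),
]
to_anonymize = filter_spans(
    detected,
    active_labels=["PERSON_NAME", "COMPANY_NAME"],
    preserve_labels=["COURT_NAME"],
)
print([span.label for span in to_anonymize])
```

Note that the email address is detected but left untouched here, because `PRIVATE_EMAIL` is not in the active set. Restricting `active_labels` therefore lowers privacy coverage and should be a deliberate choice.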
|
|
| ### Consistent anonymization across a document set |
|
|
```python
# Anonymize a full case file: the same entity gets the same placeholder in every document
from privamesh import PrivaMeshLegal, AnonymizationContext

model = PrivaMeshLegal.from_pretrained("sallani/PrivaMesh")
ctx = AnonymizationContext()  # shared entity registry

# contract_text, brief_text, and judgment_text are your own document strings
contract = model.anonymize(contract_text, context=ctx)
brief = model.anonymize(brief_text, context=ctx)
judgment = model.anonymize(judgment_text, context=ctx)

# "Jean Dupont" → "[PERSONNE_1]" consistently across all three documents
```
|
|
| ### PrivaMesh multi-SLM orchestration |
|
|
```python
from privamesh import PrivaMeshOrchestrator

# Combine multiple specialized SLMs
orchestrator = PrivaMeshOrchestrator(
    nodes={
        "legal": "sallani/PrivaMesh",
        "finance": "sallani/PrivaMesh-Finance",  # in development, not yet released
    },
    routing="auto",  # orchestrator decides which SLM handles each span
)

# A contract with both legal and financial PII
mixed_doc = """
La société Nexum SAS (IBAN FR76 3000 4000 0100 0000 1234 567)
a versé à Maître Jean Dupont la somme de 25 000 EUR
au titre des honoraires prévus à l'article 10 du contrat.
"""

result = orchestrator.anonymize(mixed_doc)
```
|
|
| --- |
|
|
| ## Model Architecture |
|
|
| **PrivaMesh Legal** is built on a **fine-tuned Mistral-Small-3.1** backbone — a French-native, Apache 2.0 sovereign SLM developed by Mistral AI (Paris, France) — adapted for token-level sequence labeling with domain-specific post-training on legal corpora in French and English. |
|
|
> **Why Mistral?** PrivaMesh is developed in France for regulated European industries, so it builds on Mistral, a leading European open-weight model family used by France's Ministry of Armed Forces, HSBC, and major EU public administrations. The choice is as much about sovereignty as about technology.
|
|
| ### Architecture overview |
|
|
| ``` |
| Base model : mistralai/Mistral-Small-3.1 (Apache 2.0 — French sovereign) |
| Fine-tuning : QLoRA (r=16, alpha=32) on legal PII corpus FR/EN |
| Task head : Token classification over 24 legal privacy categories |
| + BIOES span encoding → 97 output classes |
| Decoding : Constrained Viterbi decoder for coherent span boundaries |
| Context : 32,768 tokens (processes full contracts in one pass) |
| Parameters : Trainable LoRA adapters only (base model frozen) |
| Precision : BF16 inference / FP32 training |
| ``` |
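
The overview names a constrained Viterbi decoder without showing it. The following is a minimal pure-Python sketch of the general technique, assuming per-token log-scores and hard BIOES transition rules with no learned transition weights; it is not the model's actual decoder, and `allowed` and `constrained_viterbi` are hypothetical names.

```python
import math
from typing import List


def allowed(prev: str, cur: str) -> bool:
    """Hard BIOES transition rule between two consecutive tags."""
    p_prefix, _, p_label = prev.partition("-")
    c_prefix, _, c_label = cur.partition("-")
    if p_prefix in ("B", "I"):
        # A span is open: it must continue or end with the same label
        return c_prefix in ("I", "E") and c_label == p_label
    # No span is open: stay outside or start a new span
    return c_prefix in ("O", "B", "S")


def constrained_viterbi(scores: List[List[float]], tags: List[str]) -> List[str]:
    """Return the highest-scoring tag path that respects the BIOES rules."""
    n, k = len(scores), len(tags)
    neg_inf = -math.inf
    best = [[neg_inf] * k for _ in range(n)]  # best[t][j]: best path score ending in tag j
    back = [[0] * k for _ in range(n)]        # back[t][j]: previous tag index on that path
    for j, tag in enumerate(tags):
        if tag.split("-")[0] in ("O", "B", "S"):  # a sequence cannot start mid-span
            best[0][j] = scores[0][j]
    for t in range(1, n):
        for j in range(k):
            for i in range(k):
                if best[t - 1][i] > neg_inf and allowed(tags[i], tags[j]):
                    candidate = best[t - 1][i] + scores[t][j]
                    if candidate > best[t][j]:
                        best[t][j], back[t][j] = candidate, i
    # The last tag must not leave a span open
    finals = [j for j, tag in enumerate(tags) if tag.split("-")[0] in ("O", "E", "S")]
    j = max(finals, key=lambda idx: best[n - 1][idx])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return [tags[idx] for idx in reversed(path)]


tags = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "E-PERSON_NAME", "S-PERSON_NAME"]
# Log-scores for three tokens; greedy argmax would yield the invalid B, O, E
scores = [
    [-2.0, -0.2, -5.0, -5.0, -3.0],
    [-0.6, -5.0, -1.0, -4.0, -5.0],
    [-3.0, -5.0, -5.0, -0.1, -4.0],
]
decoded = constrained_viterbi(scores, tags)
print(decoded)
```

Greedy per-token argmax on these scores would produce `B, O, E`, which is not a valid span. The constrained search returns `B, I, E` instead, which is the point of decoding under transition rules.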
|
|
| ### Label encoding — BIOES scheme |
|
|
| Each of the 24 privacy categories is encoded in BIOES format: |
|
|
| ``` |
| B-PERSON_NAME → Begin of a person name span |
| I-PERSON_NAME → Inside |
| E-PERSON_NAME → End |
| S-PERSON_NAME → Single-token span |
| O → Outside (not a privacy entity) |
| ``` |
|
|
Total output classes: `1 (O) + 24 categories × 4 span tags (B, I, E, S) = 97 classes`
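
The class count can be checked by enumerating the label set from the 24 category names listed under Supported Privacy Categories. This is a sketch only: the ordering of `LABELS` is an assumption, and the model's real mapping is whatever `config.id2label` contains.

```python
CATEGORIES = [
    # Natural persons
    "PERSON_NAME", "LEGAL_COUNSEL", "JUDGE_NAME", "SIGNATORY", "WITNESS",
    # Legal entities
    "COMPANY_NAME", "COMPANY_ID", "LEGAL_FORM", "COURT_NAME", "BAR_ASSOCIATION",
    # Financial & contractual
    "CONTRACT_AMOUNT", "BANK_ACCOUNT", "PENALTY_AMOUNT",
    # Contact & location
    "PRIVATE_ADDRESS", "PRIVATE_EMAIL", "PRIVATE_PHONE",
    # Temporal & reference
    "CONTRACT_DATE", "DEADLINE", "CASE_NUMBER",
    # Regulatory & compliance
    "DATA_SUBJECT", "DPO_IDENTITY", "PROCESSING_PURPOSE", "AUDIT_REFERENCE",
    "REGULATORY_BODY",
]

# "O" plus B/I/E/S variants of every category (ordering is an assumption)
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in "BIES"]
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(CATEGORIES), len(LABELS))
```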
|
|
| ### Semantic replacement strategy |
|
|
| Unlike token maskers that replace with `[REDACTED]`, PrivaMesh Legal generates **typed, numbered, coherent placeholders** that preserve: |
|
|
| 1. **Entity type** — `[AVOCAT_1]` vs `[SOCIETE_1]` vs `[MONTANT_1]` |
| 2. **Entity role** — the legal function is encoded in the placeholder type |
| 3. **Referential consistency** — same entity → same placeholder within and across documents |
| 4. **Grammatical agreement** — French gendered replacements (coming in v1.1) |
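
Points 1 to 3 above can be illustrated with a small registry that hands out typed, numbered placeholders and reuses them for repeated entities. This is a sketch under simplifying assumptions (exact-match entities after whitespace and case normalization, non-overlapping character spans); `PlaceholderRegistry`, `replace_spans`, and `find_span` are hypothetical names, not the `privamesh` API.

```python
from typing import Dict, List, Tuple

# Label → placeholder stem, taken from the category tables (subset)
STEMS = {
    "PERSON_NAME": "PERSONNE",
    "LEGAL_COUNSEL": "AVOCAT",
    "COMPANY_NAME": "SOCIETE",
    "CONTRACT_AMOUNT": "MONTANT",
}


class PlaceholderRegistry:
    """Maps each (label, entity) pair to one stable placeholder."""

    def __init__(self) -> None:
        self._by_entity: Dict[Tuple[str, str], str] = {}
        self._counters: Dict[str, int] = {}

    def placeholder(self, label: str, surface: str) -> str:
        key = (label, " ".join(surface.split()).casefold())
        if key not in self._by_entity:
            stem = STEMS.get(label, label)
            self._counters[stem] = self._counters.get(stem, 0) + 1
            self._by_entity[key] = f"[{stem}_{self._counters[stem]}]"
        return self._by_entity[key]


def replace_spans(text: str, spans: List[Tuple[str, int, int]], registry: PlaceholderRegistry) -> str:
    """Replace (label, start, end) character spans with placeholders."""
    out, cursor = [], 0
    for label, start, end in sorted(spans, key=lambda s: s[1]):
        out.append(text[cursor:start])
        out.append(registry.placeholder(label, text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def find_span(text: str, label: str, surface: str) -> Tuple[str, int, int]:
    """Helper for the example: locate a known entity in the text."""
    start = text.index(surface)
    return (label, start, start + len(surface))


registry = PlaceholderRegistry()  # shared across documents
doc1 = "Jean Dupont représente Nexum SAS."
doc2 = "Nexum SAS a payé Jean Dupont."
out1 = replace_spans(doc1, [find_span(doc1, "PERSON_NAME", "Jean Dupont"),
                            find_span(doc1, "COMPANY_NAME", "Nexum SAS")], registry)
out2 = replace_spans(doc2, [find_span(doc2, "COMPANY_NAME", "Nexum SAS"),
                            find_span(doc2, "PERSON_NAME", "Jean Dupont")], registry)
print(out1)
print(out2)
```

Because the registry is shared, the second document reuses `[PERSONNE_1]` and `[SOCIETE_1]` rather than starting a new numbering.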
|
|
| --- |
|
|
| ## Training Details |
|
|
| ### Base model |
|
|
| | Parameter | Value | |
| |---|---| |
| | Base model | `mistralai/Mistral-Small-3.1` (Apache 2.0 — Sovereign FR) | |
| | Fine-tuning method | QLoRA (r=16, lora_alpha=32, dropout=0.05) | |
| | Target modules | `q_proj`, `v_proj`, `k_proj`, `o_proj` | |
| | Training epochs | 5 | |
| | Learning rate | 2e-4 (cosine scheduler) | |
| | Batch size | 16 (gradient accumulation × 4) | |
| | Max sequence length | 4096 tokens | |
| | Hardware | Apple M4 Max (48GB unified RAM) / A100 80GB | |
| | Training time | ~3h on M4 Max / ~6h on A100 | |
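
As a rough sanity check on adapter size, the number of LoRA parameters follows from the rank and the shapes of the targeted projections: each adapted weight of shape `d_in × d_out` adds `r × (d_in + d_out)` parameters. The sketch below assumes the published Mistral Small 3.1 text-model dimensions (hidden size 5120, 40 layers, 32 query heads, 8 key/value heads, head dimension 128). These values are not stated in this card and should be checked against the base model's `config.json`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)


# Assumed base-model dimensions (not taken from this card)
hidden, n_layers = 5120, 40
n_heads, n_kv_heads, head_dim = 32, 8, 128
r = 16            # LoRA rank from the table above
num_classes = 97  # BIOES output classes

q_out = n_heads * head_dim      # query / output projection width
kv_out = n_kv_heads * head_dim  # key / value projection width

per_layer = (
    lora_params(hidden, q_out, r)         # q_proj
    + 2 * lora_params(hidden, kv_out, r)  # k_proj and v_proj
    + lora_params(q_out, hidden, r)       # o_proj
)
adapters = per_layer * n_layers
head = hidden * num_classes + num_classes  # linear classification head with bias

print(f"{adapters:,} adapter + {head:,} head = {(adapters + head) / 1e6:.1f}M parameters")
```

Under these assumptions the total lands at roughly 20M, the same order of magnitude as the trainable-parameter figure reported in the evaluation section.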
| |
| ### Training data |
| |
| PrivaMesh Legal was trained on a curated corpus of legal and compliance documents: |
| |
| | Source type | Language | Volume | Annotation | |
| |---|---|---|---| |
| | French contracts (civil, commercial) | FR | 45,000 docs | Manual + synthetic | |
| | RGPD compliance documents | FR / EN | 12,000 docs | Manual | |
| | Court decisions (Légifrance anonymized) | FR | 80,000 docs | Semi-automatic | |
| | DORA / NIS2 compliance reports | EN | 8,000 docs | Manual | |
| | ISO 27001 audit reports | FR / EN | 5,000 docs | Manual | |
| | Employment contracts | FR | 30,000 docs | Synthetic augmented | |
| | Synthetic legal PII corpus | FR / EN | 100,000 docs | Programmatic | |
| |
| > **Privacy note**: All training data was either publicly available (Légifrance), synthetically generated, or processed under strict data processing agreements. No real personal data was retained in model weights. |
| |
| ### Data augmentation |
| |
| To improve robustness, training data was augmented with: |
| - Name substitution across French, North African, and sub-Saharan African naming conventions |
| - Regional address format variations (France, Belgium, Switzerland, Canada) |
| - SIRET/SIREN format variations |
| - Mixed French/English documents (common in international compliance) |
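
As an illustration of the SIRET/SIREN format variations, a 14-digit SIRET (9-digit SIREN followed by a 5-digit NIC) can be rendered with different groupings. The exact set of variants used in training is not documented here, so the list below is an assumption and `siret_variants` is a hypothetical helper.

```python
from typing import List


def siret_variants(siret: str) -> List[str]:
    """Return the same SIRET written with several common groupings."""
    digits = "".join(ch for ch in siret if ch.isdigit())
    if len(digits) != 14:
        raise ValueError("a SIRET has 14 digits (9-digit SIREN + 5-digit NIC)")
    siren, nic = digits[:9], digits[9:]
    groups = [siren[0:3], siren[3:6], siren[6:9]]
    return [
        digits,                         # compact
        f"{siren} {nic}",               # SIREN and NIC separated
        " ".join(groups + [nic]),       # groups of three, then NIC
        ".".join(groups) + f" {nic}",   # dotted SIREN, then NIC
    ]


variants = siret_variants("123 456 789 00012")
for variant in variants:
    print(variant)
```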
| |
| --- |
| |
| ## Evaluation & Benchmarks |
| |
| ### Key metrics at a glance |
| |
| Metric | Score | vs. baselines (Benchmark 1) |
|---|---|---|
| Overall F1 (FR legal) | **97.3%** | +12.2pp vs openai/privacy-filter |
| Semantic preservation (BERTScore FR) | **94.1%** | +20.0pp vs openai/privacy-filter (+22.9pp vs Presidio) |
| Privacy recall | **96.9%** | Best-in-class FR domain |
| Trainable parameters | **21M** | LoRA adapters on the 24B base model |
| |
| --- |
| |
| ### Benchmark 1 — PII detection F1 across tools |
| |
|  |
| |
| | Tool | PII F1 (FR legal) | Semantic preservation | On-prem | FR-native | |
| |---|:---:|:---:|:---:|:---:| |
| | Microsoft Presidio | 0.781 | 0.712 | ✅ | ❌ | |
| | spaCy fr_core_news_lg | 0.743 | 0.698 | ✅ | ✅ | |
| | openai/privacy-filter | 0.851 | 0.741 | ✅ | ⚠️ | |
| | Private AI (API) | 0.884 | 0.763 | ❌ | ⚠️ | |
| | **PrivaMesh Legal** | **0.973** | **0.941** | ✅ | ✅ | |
|
|
| --- |
|
|
| ### Benchmark 2 — Semantic preservation (BERTScore) |
|
|
|  |
|
|
| Measured as BERTScore F1 between original and anonymized document embeddings (CamemBERT for FR, RoBERTa for EN): |
|
|
| | Metric | Score | |
| |---|---| |
| | BERTScore F1 (FR) | **0.941** | |
| | BERTScore F1 (EN) | **0.937** | |
| | Legal structure preservation | **0.963** | |
| | Regulatory reference preservation | **0.998** | |
|
|
| --- |
|
|
| ### Benchmark 3 — F1 per PII category |
|
|
|  |
|
|
| | Category | Precision | Recall | F1 | |
| |---|---|---|---| |
| | `LEGAL_COUNSEL` | 0.991 | 0.987 | **0.989** | |
| | `COMPANY_ID` (SIRET/RCS) | 0.998 | 0.996 | **0.997** | |
| | `CONTRACT_DATE` | 0.994 | 0.991 | **0.992** | |
| | `CONTRACT_AMOUNT` | 0.989 | 0.982 | **0.985** | |
| | `PERSON_NAME` | 0.978 | 0.971 | **0.974** | |
| | `PRIVATE_ADDRESS` | 0.971 | 0.963 | **0.967** | |
| | `COMPANY_NAME` | 0.965 | 0.958 | **0.961** | |
| | `DPO_IDENTITY` | 0.961 | 0.948 | **0.954** | |
| | `DATA_SUBJECT` (RGPD) | 0.943 | 0.931 | **0.937** | |
| | **Macro Average** | **0.977** | **0.969** | **0.973** | |
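
The F1 column and the macro averages can be recomputed from the precision and recall values in the table. The sketch below uses the nine categories listed and takes an unweighted mean, which is what "macro average" conventionally means; F1 is the harmonic mean of precision and recall.

```python
# (precision, recall) per category, copied from the table above
ROWS = {
    "LEGAL_COUNSEL": (0.991, 0.987),
    "COMPANY_ID": (0.998, 0.996),
    "CONTRACT_DATE": (0.994, 0.991),
    "CONTRACT_AMOUNT": (0.989, 0.982),
    "PERSON_NAME": (0.978, 0.971),
    "PRIVATE_ADDRESS": (0.971, 0.963),
    "COMPANY_NAME": (0.965, 0.958),
    "DPO_IDENTITY": (0.961, 0.948),
    "DATA_SUBJECT": (0.943, 0.931),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_category_f1 = {name: f1_score(p, r) for name, (p, r) in ROWS.items()}

# Macro average: unweighted mean over categories
macro_precision = sum(p for p, _ in ROWS.values()) / len(ROWS)
macro_recall = sum(r for _, r in ROWS.values()) / len(ROWS)
macro_f1 = sum(per_category_f1.values()) / len(ROWS)

for name, value in per_category_f1.items():
    print(f"{name:<16} F1={value:.3f}")
print(f"macro precision={macro_precision:.3f} F1={macro_f1:.3f}")
```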
|
|
| --- |
|
|
| ### Benchmark 4 — Training loss curve (QLoRA fine-tuning) |
|
|
|  |
|
|
| | Epoch | Train loss | Val loss | |
| |---|---|---| |
| | 1 | 2.10 | 1.90 | |
| | 2 | 1.12 | 1.05 | |
| | 3 | 0.61 | 0.58 | |
| | 4 | 0.33 | 0.35 | |
| | 5 | **0.18** | **0.22** | |
|
|
| --- |
|
|
| ### Benchmark 5 — Precision / Recall tradeoff |
|
|
|  |
|
|
| PrivaMesh Legal supports three operating points tunable at inference time: |
|
|
| | Operating point | Precision | Recall | Use case | |
| |---|---|---|---| |
| | `high_precision` | 99.2% | 94.8% | Legal review, minimize false positives | |
| `balanced` (default) | 97.7% | 96.9% | General enterprise use |
| | `high_recall` | 85.0% | 99.1% | RGPD audit, maximize PII detection | |
|
|
| --- |
|
|
| ### Benchmark 6 — Throughput vs document length |
|
|
|  |
|
|
| Benchmarked on a single A10G GPU (24GB): |
|
|
| | Document length | PrivaMesh throughput | Latency p50 | Latency p99 | |
| |---|---|---|---| |
| | Short (< 512 tokens) | 340 docs/min | 18ms | 45ms | |
| | Medium (512–2048 tokens) | 95 docs/min | 63ms | 120ms | |
| | Long (2048–8192 tokens) | 28 docs/min | 215ms | 380ms | |
| | Full contract (8192–32768 tokens) | 8 docs/min | 750ms | 1.2s | |
|
|
| --- |
|
|
| ## Deployment |
|
|
| ### On-premise deployment (recommended) |
|
|
| PrivaMesh Legal is designed for **sovereign, on-premise deployment**. No data leaves your infrastructure. |
|
|
```python
# Pull the model locally
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sallani/PrivaMesh",
    local_dir="./models/privamesh-legal",
    ignore_patterns=["*.msgpack", "*.h5"],  # skip Flax and TensorFlow weights
)
```
|
|
```python
# Load from a local path (fully air-gapped)
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "./models/privamesh-legal",
    device_map="auto",
    local_files_only=True,  # no internet connection required
)
```
|
|
| ### Hardware requirements |
|
|
| | Setup | VRAM | Throughput | Use case | |
| |---|---|---|---| |
| | GPU A10G 24GB | 24GB | 95 docs/min | Production | |
| | GPU RTX 4090 | 24GB | 80 docs/min | On-premise enterprise | |
| | GPU A100 40GB | 40GB | 180 docs/min | High-throughput | |
| | CPU only (quantized) | 16GB RAM | 3 docs/min | Air-gapped / dev | |
| | Apple M4 Max | 48GB unified | 25 docs/min | Local dev / testing | |
|
|
| ### Quantized versions |
|
|
```python
# 4-bit quantization via bitsandbytes (pip install bitsandbytes) to reduce VRAM usage
import torch
from transformers import BitsAndBytesConfig
from privamesh import PrivaMeshLegal

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = PrivaMeshLegal.from_pretrained(
    "sallani/PrivaMesh",
    quantization_config=bnb_config,
    device_map="auto",
)
```
|
|
| ### Docker deployment |
|
|
| ```dockerfile |
| FROM python:3.11-slim |
| |
| RUN pip install privamesh transformers torch |
| |
| COPY ./models/privamesh-legal /models/privamesh-legal |
| |
| EXPOSE 8080 |
| |
| CMD ["privamesh", "serve", "--model", "/models/privamesh-legal", "--port", "8080"] |
| ``` |
|
|
| ```bash |
| docker build -t privamesh-legal . |
| docker run -p 8080:8080 --gpus all privamesh-legal |
| ``` |
|
|
| ### REST API (built-in server) |
|
|
```bash
privamesh serve --model sallani/PrivaMesh --port 8080
```
|
|
```bash
curl -X POST http://localhost:8080/anonymize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Le contrat signé par Jean Dupont le 15 mars 2024.",
    "language": "fr",
    "operating_point": "high_recall"
  }'
```
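
The same request can be built from Python with the standard library alone. This sketch reuses the `/anonymize` endpoint and the JSON fields from the `curl` example; the response schema is not documented in this card, so `send` returns the raw body. The helper names and the `base_url` default are assumptions.

```python
import json
import urllib.request


def build_anonymize_request(
    text: str,
    language: str = "fr",
    operating_point: str = "balanced",
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build a POST request for the /anonymize endpoint."""
    payload = {
        "text": text,
        "language": language,
        "operating_point": operating_point,
    }
    return urllib.request.Request(
        url=f"{base_url}/anonymize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request to a running server and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_anonymize_request(
    "Le contrat signé par Jean Dupont le 15 mars 2024.",
    operating_point="high_recall",
)
print(request.get_method(), request.full_url)
```

With the server running, `print(send(request))` shows the raw response.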
|
|
| --- |
|
|
| ## Regulatory Coverage |
|
|
| PrivaMesh Legal is designed to support compliance with the following regulatory frameworks: |
|
|
| | Regulation | Coverage | Notes | |
| |---|---|---| |
| **RGPD / GDPR** | ✅ Full | Art. 4(5) (pseudonymisation), Art. 25 (data protection by design and by default), Art. 89 (safeguards for archiving, research, and statistics) |
| | **DORA** (EU 2022/2554) | ✅ Full | ICT risk documentation, third-party contracts | |
| | **NIS2** (EU 2022/2555) | ✅ Full | Incident reports, supplier contracts | |
| | **ISO 27001:2022** | ✅ Full | Audit reports, ISMS documentation | |
| | **ISO/IEC 42001:2023** | ✅ Full | AI system documentation, risk assessments | |
| | **EU AI Act** | ✅ Full | High-risk AI documentation, conformity assessments | |
| | **CCPA** (California) | ⚠️ Partial | EN documents, US legal entities | |
| | **HIPAA** | ⚠️ Partial | Use `privamesh-medical` for full HIPAA coverage | |
|
|
| --- |
|
|
| ## Limitations & Risks |
|
|
| ### Known limitations |
|
|
| **1. Language coverage** |
| PrivaMesh Legal is optimized for French and English. Performance may degrade on other languages, mixed-language documents with code-switching, or heavily technical jargon outside the training distribution. |
|
|
| **2. Rare naming conventions** |
| Detection performance may be lower for names following naming conventions underrepresented in training data (some regional French dialects, transliterated names, highly abbreviated forms). |
|
|
| **3. Implicit PII** |
| PrivaMesh Legal detects explicit PII. Implicit or inferred PII (e.g., identifying someone from their unique job description without naming them) is not in scope and requires additional processing layers. |
|
|
| **4. Dynamic label policies** |
As with openai/privacy-filter, changing which categories the model detects requires fine-tuning rather than runtime configuration. The one exception is the `active_labels` filter, which only suppresses labels after detection.
|
|
| **5. Not a legal guarantee** |
| PrivaMesh Legal is a technical anonymization aid. It does not constitute legal advice or a guarantee of RGPD compliance. Human review is recommended for high-stakes workflows. |
|
|
| ### Risk: Over-reliance |
|
|
| **Do not use PrivaMesh Legal as your sole anonymization layer for high-sensitivity documents.** It is designed as a primary processing layer in a privacy-by-design architecture that includes human review, audit trails, and access controls. |
|
|
| ### Responsible use |
|
|
| PrivaMesh Legal is intended for **data protection and privacy-preserving AI workflows**. It must not be used to: |
| - Circumvent legitimate legal discovery or regulatory oversight |
| - Process data without appropriate legal basis |
| - Bypass consent mechanisms required under RGPD |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use PrivaMesh Legal in your research or production systems, please cite: |
|
|
```bibtex
@misc{privamesh2026legal,
  title     = {PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign Agentic AI Pipelines},
  author    = {Sabri ALLANI and Ahmed HERSI},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sallani/PrivaMesh},
  note      = {PrivaMesh Legal: domain-specialized SLM for legal and compliance document anonymization. Base model: Mistral-Small-3.1 (Apache 2.0)}
}
```
|
|
> 📄 **Paper**: *"PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign On-Premise Agentic AI Pipelines"*. arXiv preprint submission, 2026; under review at a Q1 journal.
|
|
| --- |
|
|
| ## Contributing |
|
|
| PrivaMesh is an open research initiative. Contributions welcome: |
|
|
| - 🐛 [Report issues](https://huggingface.co/sallani/PrivaMesh/discussions) |
| - 📊 [Share evaluation results](https://huggingface.co/sallani/PrivaMesh/discussions) |
| - 🔧 [Contribute to the framework](https://github.com/sallani/privamesh) |
| - 📝 [Request new domains](https://huggingface.co/sallani/PrivaMesh/discussions) |
|
|
| --- |
|
|
| ## License |
|
|
| **Apache 2.0** — Free for research, experimentation, and commercial deployment. |
|
|
| Built on **Mistral-Small-3.1** (Apache 2.0) by Mistral AI, Paris 🇫🇷 |
|
|
| See [LICENSE](https://huggingface.co/sallani/PrivaMesh/blob/main/LICENSE) for full terms. |
|
|
| --- |
|
|
| <p align="center"> |
| <strong>PrivaMesh</strong> — Collaborative Multi-SLM Semantic Anonymization<br/> |
| <em>Built on Mistral. Built for sovereign AI. Designed for regulated industries.</em><br/> |
| <em>🇫🇷 French-native · European sovereign · Apache 2.0</em><br/><br/> |
| <a href="https://github.com/sallani/privamesh">GitHub</a> · |
| <a href="https://huggingface.co/sallani/PrivaMesh">HuggingFace</a> · |
| <a href="https://privamesh.ai">Website</a> |
| </p> |
|
|