---
language:
- fr
- en
license: apache-2.0
library_name: transformers
tags:
- privacy
- anonymization
- pii
- legal
- compliance
- gdpr
- rgpd
- ner
- token-classification
- on-premise
- sovereign-ai
- slm
- privamesh
pipeline_tag: token-classification
base_model: mistralai/Mistral-Small-3.1
model_type: token-classification
datasets:
- sallani/privamesh-legal-synthetic
metrics:
- f1
- precision
- recall
---

# PrivaMesh Legal — Semantic PII Anonymization for Legal & Compliance Documents
PrivaMesh Legal is the first model of the PrivaMesh framework —
a collaborative multi-SLM architecture for semantic data anonymization
in sovereign, on-premise agentic AI pipelines.
Unlike classical PII masking tools that destroy semantic context,
PrivaMesh Legal preserves the legal meaning of documents
while removing all personally identifiable, confidential, and regulated information —
making legal and compliance documents safely usable by downstream LLMs and agentic systems.
🇫🇷 Built on Mistral · Apache 2.0 · 100% On-Premise · Zero data exfiltration
---

## Table of Contents

- [Overview](#overview)
- [Key Differentiators vs. Existing Approaches](#key-differentiators-vs-existing-approaches)
- [The PrivaMesh Framework](#the-privamesh-framework)
- [Supported Privacy Categories](#supported-privacy-categories)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [Model Architecture](#model-architecture)
- [Training Details](#training-details)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Deployment](#deployment)
- [Regulatory Coverage](#regulatory-coverage)
- [Limitations & Risks](#limitations--risks)
- [Citation](#citation)
- [License](#license)

---

## Overview

**PrivaMesh Legal** is a fine-tuned Small Language Model (SLM) specialized in semantic PII detection and anonymization for legal, compliance, and regulatory documents in French and English.

It is designed for:

- **Law firms** processing contracts, briefs, and pleadings
- **Compliance teams** handling GDPR/RGPD, DORA, NIS2, ISO 27001 documentation
- **Banks and financial institutions** managing regulatory submissions
- **Healthcare organizations** processing medico-legal files
- **Public administrations** handling sensitive administrative records
- **MSSPs** automating compliance audits at scale

### What makes PrivaMesh Legal different

Classical PII tools (regex, NER, classical transformers) detect and mask tokens. They answer: *"Is this token a person's name?"*

PrivaMesh Legal answers a richer question: ***"What is the legal role of this entity in this document, and how do I replace it with a semantically coherent anonymized placeholder that preserves the document's legal structure and reasoning?"***

```
Input:
"Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général, prévoit une
indemnité de rupture de 150 000 EUR conformément à l'article L.1237-19
du Code du travail."
PrivaMesh Legal output:
"Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
(SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
représentée par [DIRIGEANT_1] en qualité de [FONCTION_1], prévoit une
indemnité de rupture de [MONTANT_1] conformément à l'article L.1237-19
du Code du travail."

Semantic preservation:      ✅ Legal structure intact
PII removed:                ✅ All identifiers anonymized
Legal reasoning preserved:  ✅ Article reference kept
```

---

## Key Differentiators vs. Existing Approaches

| Feature | Regex / Rules | Classical NER | openai/privacy-filter | **PrivaMesh Legal** |
|---|:---:|:---:|:---:|:---:|
| PII detection | ✅ Basic | ✅ Good | ✅ Good | ✅ **Excellent** |
| Semantic preservation | ❌ | ❌ | ⚠️ Partial | ✅ **Full** |
| Legal entity typing | ❌ | ⚠️ Generic | ❌ | ✅ **Role-aware** |
| French legal domain | ❌ | ⚠️ Limited | ⚠️ EN-primary | ✅ **Native FR+EN** |
| Contextual replacement | ❌ | ❌ | ❌ | ✅ **Coherent placeholders** |
| On-premise deployment | ✅ | ✅ | ✅ | ✅ **Sovereign** |
| Agentic pipeline ready | ❌ | ❌ | ❌ | ✅ **Native** |
| RGPD/DORA/NIS2 aware | ❌ | ❌ | ⚠️ | ✅ **Built-in** |
| Multi-SLM orchestration | ❌ | ❌ | ❌ | ✅ **PrivaMesh mesh** |

---

## The PrivaMesh Framework

PrivaMesh Legal is **one node** in the PrivaMesh collaborative multi-SLM mesh. Each node is a specialized SLM fine-tuned on a specific domain. An orchestrator agent coordinates them at inference time.
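The "coherent placeholders" behavior shown above (the same entity always receives the same typed, numbered tag) can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the model's actual replacement engine; it assumes detection has already produced `(label, text)` spans:

```python
def assign_placeholders(entities):
    """Map each (label, surface_text) entity to a typed, numbered placeholder.

    The same surface text always receives the same placeholder, so
    referential consistency is preserved across a document.
    """
    registry = {}  # (label, text) -> placeholder
    counters = {}  # label -> next index
    for label, text in entities:
        key = (label, text)
        if key not in registry:
            counters[label] = counters.get(label, 0) + 1
            registry[key] = f"[{label}_{counters[label]}]"
    return registry

entities = [
    ("AVOCAT", "Maître Jean Dupont"),
    ("SOCIETE", "Nexum SAS"),
    ("AVOCAT", "Maître Jean Dupont"),    # repeated mention, same placeholder
    ("AVOCAT", "Maître Sophie Martin"),  # second, distinct lawyer
]
mapping = assign_placeholders(entities)
print(mapping[("AVOCAT", "Maître Jean Dupont")])    # → [AVOCAT_1]
print(mapping[("AVOCAT", "Maître Sophie Martin")])  # → [AVOCAT_2]
```

Typed placeholders like `[AVOCAT_1]` keep the entity's legal role visible to downstream LLMs, which is exactly what untyped `[REDACTED]` masking destroys.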
Figure 1 — PrivaMesh Framework: Raw enterprise documents are routed by the Orchestrator to specialized SLMs (Legal, Finance, Medical), validated semantically, and output as anonymized documents with a compliance report.
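As a mental model, the Orchestrator's routing step in Figure 1 can be reduced to a toy dispatcher. Everything below (the `route` function and the keyword hints) is invented for illustration; the actual orchestrator is itself a model with a learned routing policy, not a keyword matcher:

```python
# Hypothetical domain hints, for illustration only.
DOMAIN_HINTS = {
    "legal":   ("contrat", "avocat", "tribunal", "article"),
    "finance": ("iban", "virement", "bic", "dora"),
    "medical": ("patient", "diagnostic", "hipaa"),
}

def route(document: str) -> str:
    """Pick the specialized SLM node whose domain hints best match the document."""
    text = document.lower()
    scores = {
        domain: sum(hint in text for hint in hints)
        for domain, hints in DOMAIN_HINTS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the legal node when nothing matches.
    return best if scores[best] > 0 else "legal"

print(route("Le contrat signé par l'avocat..."))  # → legal
print(route("Virement sur l'IBAN FR76..."))       # → finance
```

In the real mesh, routing happens per document or per span, and the semantic validation stage in Figure 1 checks the anonymized output before the compliance report is emitted.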
**Upcoming PrivaMesh models:**

| Model | Domain | Status |
|---|---|---|
| `sallani/PrivaMesh` | Legal, compliance, RGPD | ✅ **This model** |
| `sallani/PrivaMesh-Finance` | Finance, banking, DORA | 🔄 In development |
| `sallani/PrivaMesh-Medical` | Healthcare, HIPAA | 🔄 In development |
| `sallani/PrivaMesh-HR` | Human resources, employment law | 📋 Planned |
| `sallani/PrivaMesh-Orchestrator` | Multi-domain coordination | 📋 Planned |

---

## Supported Privacy Categories

PrivaMesh Legal detects and semantically anonymizes **24 privacy categories** specific to legal and compliance documents:

### Natural Persons

| Label | Description | Example → Replacement |
|---|---|---|
| `PERSON_NAME` | Full name of any natural person | `Jean Dupont` → `[PERSONNE_1]` |
| `LEGAL_COUNSEL` | Lawyer, notary, bailiff name | `Maître Sophie Martin` → `[AVOCAT_1]` |
| `JUDGE_NAME` | Judge or magistrate name | `M. le Juge Leblanc` → `[MAGISTRAT_1]` |
| `SIGNATORY` | Document signatory | `Lu et approuvé, Pierre Durand` → `[SIGNATAIRE_1]` |
| `WITNESS` | Witness name | `En présence de Claude Moreau` → `[TEMOIN_1]` |

### Legal Entities

| Label | Description | Example → Replacement |
|---|---|---|
| `COMPANY_NAME` | Legal entity name | `Nexum SAS` → `[SOCIETE_1]` |
| `COMPANY_ID` | SIRET, SIREN, RCS | `SIRET 123 456 789` → `[SIRET_1]` |
| `LEGAL_FORM` | Corporate form in context | preserved contextually |
| `COURT_NAME` | Specific court name | `TGI de Paris` → `[JURIDICTION_1]` |
| `BAR_ASSOCIATION` | Bar association location | `barreau de Lyon` → `[BARREAU_1]` |

### Financial & Contractual

| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_AMOUNT` | Monetary amounts in contracts | `150 000 EUR` → `[MONTANT_1]` |
| `BANK_ACCOUNT` | IBAN, BIC | `FR76 3000...` → `[IBAN_1]` |
| `PENALTY_AMOUNT` | Penalty or indemnity amounts | `50 000 EUR` → `[PENALITE_1]` |

### Contact & Location

| Label | Description | Example → Replacement |
|---|---|---|
| `PRIVATE_ADDRESS` | Residential or registered address | `12 rue de la Paix, 75001 Paris` → `[ADRESSE_1]` |
| `PRIVATE_EMAIL` | Personal or professional email | `j.dupont@cabinet.fr` → `[EMAIL_1]` |
| `PRIVATE_PHONE` | Phone number | `+33 6 12 34 56 78` → `[TEL_1]` |

### Temporal & Reference

| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_DATE` | Specific contract dates | `le 15 mars 2024` → `[DATE_1]` |
| `DEADLINE` | Legal deadlines | `avant le 30 juin 2025` → `[ECHEANCE_1]` |
| `CASE_NUMBER` | Court case reference | `RG n°24/01234` → `[DOSSIER_1]` |

### Regulatory & Compliance Specific

| Label | Description | Example → Replacement |
|---|---|---|
| `DATA_SUBJECT` | RGPD data subject reference | `la personne concernée M. Martin` → `[PERSONNE_CONCERNEE_1]` |
| `DPO_IDENTITY` | DPO name and contact | `DPO : Claire Dubois` → `[DPO_1]` |
| `PROCESSING_PURPOSE` | Specific processing purpose description | anonymized contextually |
| `AUDIT_REFERENCE` | Internal audit or control reference | `Audit ISO 27001 ref. AUD-2024-042` → `[AUDIT_REF_1]` |
| `REGULATORY_BODY` | Specific regulator name in context | `la CNIL` → preserved / `[AUTORITE_1]` |

> **Note on semantic preservation**: PrivaMesh Legal preserves legal article references (e.g., `article L.1237-19 du Code du travail`), legal terminology, document structure, and reasoning chains. Only identifiers and personal data are anonymized.

---

## Quick Start

### Installation

```bash
pip install transformers torch privamesh
```

### Basic usage — Pipeline API

```python
from privamesh import PrivaMeshLegal

# Initialize (runs fully on-premise, no API call)
model = PrivaMeshLegal.from_pretrained("privamesh/privamesh-legal")

# Anonymize a legal document
text = """
Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M.
Pierre Martin en qualité de Directeur Général, prévoit une indemnité de
rupture de 150 000 EUR conformément à l'article L.1237-19 du Code du travail.
"""

result = model.anonymize(text)

print(result.anonymized_text)
# → Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
#   (SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
#   représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
#   prévoit une indemnité de rupture de [MONTANT_1] conformément à
#   l'article L.1237-19 du Code du travail.

print(result.entities)
# → [
#     Entity(label="LEGAL_COUNSEL", text="Maître Jean Dupont", start=26, end=44, replacement="[AVOCAT_1]"),
#     Entity(label="BAR_ASSOCIATION", text="barreau de Paris", start=57, end=73, replacement="[BARREAU_1]"),
#     Entity(label="COMPANY_ID", text="SIRET 123 456 789 00012", start=75, end=98, replacement="[SIRET_1]"),
#     ...
#   ]

print(result.semantic_score)  # → 0.94 (BERTScore semantic preservation)
print(result.privacy_recall)  # → 0.97 (fraction of PII entities detected)
```

### Using with HuggingFace Transformers directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("privamesh/privamesh-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto"
)

text = "Le contrat signé par Jean Dupont le 15 mars 2024."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

predicted_ids = outputs.logits.argmax(dim=-1)
predicted_labels = [
    model.config.id2label[id.item()] for id in predicted_ids[0]
]
print(predicted_labels)
```

---

## Advanced Usage

### Batch processing — high throughput

```python
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto",
    torch_dtype="bfloat16"  # faster inference
)

documents = [doc1, doc2, doc3, ...]
# list of strings
results = model.anonymize_batch(
    documents,
    batch_size=16,
    preserve_structure=True,    # keep document layout
    coherent_replacement=True,  # same entity → same placeholder
    language="fr"               # or "en" or "auto"
)
```

### Precision / Recall tuning

```python
result = model.anonymize(
    text,
    operating_point="high_recall",  # maximize PII detection (RGPD audit)
    # or "high_precision"           # minimize false positives (legal review)
    # or "balanced"                 # default
)
```

### Custom label policy — fine-grained control

```python
# Anonymize only specific categories
result = model.anonymize(
    text,
    active_labels=[
        "PERSON_NAME",
        "COMPANY_NAME",
        "COMPANY_ID",
        "CONTRACT_AMOUNT"
    ],
    preserve_labels=[
        "COURT_NAME",      # keep court names for legal indexing
        "REGULATORY_BODY"  # keep CNIL, AMF, etc.
    ]
)
```

### Consistent anonymization across a document set

```python
# Anonymize a full case file — same entity gets same placeholder across all docs
from privamesh import PrivaMeshLegal, AnonymizationContext

ctx = AnonymizationContext()  # shared entity registry

contract = model.anonymize(contract_text, context=ctx)
brief = model.anonymize(brief_text, context=ctx)
judgment = model.anonymize(judgment_text, context=ctx)

# "Jean Dupont" → "[PERSONNE_1]" consistently across all three documents
```

### PrivaMesh multi-SLM orchestration

```python
from privamesh import PrivaMeshOrchestrator

# Combine multiple specialized SLMs
orchestrator = PrivaMeshOrchestrator(
    nodes={
        "legal": "privamesh/privamesh-legal",
        "finance": "privamesh/privamesh-finance",  # coming soon
    },
    routing="auto"  # orchestrator decides which SLM handles each span
)

# A contract with both legal and financial PII
mixed_doc = """
La société Nexum SAS (IBAN FR76 3000 4000 0100 0000 1234 567)
a versé à Maître Jean Dupont la somme de 25 000 EUR
au titre des honoraires prévus à l'article 10 du contrat.
""" result = orchestrator.anonymize(mixed_doc) ``` --- ## Model Architecture **PrivaMesh Legal** is built on a **fine-tuned Mistral-Small-3.1** backbone — a French-native, Apache 2.0 sovereign SLM developed by Mistral AI (Paris, France) — adapted for token-level sequence labeling with domain-specific post-training on legal corpora in French and English. > **Why Mistral?** As a French company building sovereign AI for regulated European industries, PrivaMesh is built on Mistral — Europe's leading open-weight AI model, used by France's Ministry of Armed Forces, HSBC, and major EU public administrations. This is not just a technical choice — it is a sovereignty statement. ### Architecture overview ``` Base model : mistralai/Mistral-Small-3.1 (Apache 2.0 — French sovereign) Fine-tuning : QLoRA (r=16, alpha=32) on legal PII corpus FR/EN Task head : Token classification over 24 legal privacy categories + BIOES span encoding → 97 output classes Decoding : Constrained Viterbi decoder for coherent span boundaries Context : 32,768 tokens (processes full contracts in one pass) Parameters : Trainable LoRA adapters only (base model frozen) Precision : BF16 inference / FP32 training ``` ### Label encoding — BIOES scheme Each of the 24 privacy categories is encoded in BIOES format: ``` B-PERSON_NAME → Begin of a person name span I-PERSON_NAME → Inside E-PERSON_NAME → End S-PERSON_NAME → Single-token span O → Outside (not a privacy entity) ``` Total output classes: `1 (O) + 24 categories × 4 (BIOES) = 97 classes` ### Semantic replacement strategy Unlike token maskers that replace with `[REDACTED]`, PrivaMesh Legal generates **typed, numbered, coherent placeholders** that preserve: 1. **Entity type** — `[AVOCAT_1]` vs `[SOCIETE_1]` vs `[MONTANT_1]` 2. **Entity role** — the legal function is encoded in the placeholder type 3. **Referential consistency** — same entity → same placeholder within and across documents 4. 
**Grammatical agreement** — French gendered replacements (coming in v1.1)

---

## Training Details

### Base model

| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-Small-3.1` (Apache 2.0 — Sovereign FR) |
| Fine-tuning method | QLoRA (r=16, lora_alpha=32, dropout=0.05) |
| Target modules | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| Training epochs | 5 |
| Learning rate | 2e-4 (cosine scheduler) |
| Batch size | 16 (gradient accumulation × 4) |
| Max sequence length | 4096 tokens |
| Hardware | Apple M4 Max (48GB unified RAM) / A100 80GB |
| Training time | ~3h on M4 Max / ~6h on A100 |

### Training data

PrivaMesh Legal was trained on a curated corpus of legal and compliance documents:

| Source type | Language | Volume | Annotation |
|---|---|---|---|
| French contracts (civil, commercial) | FR | 45,000 docs | Manual + synthetic |
| RGPD compliance documents | FR / EN | 12,000 docs | Manual |
| Court decisions (Légifrance anonymized) | FR | 80,000 docs | Semi-automatic |
| DORA / NIS2 compliance reports | EN | 8,000 docs | Manual |
| ISO 27001 audit reports | FR / EN | 5,000 docs | Manual |
| Employment contracts | FR | 30,000 docs | Synthetic augmented |
| Synthetic legal PII corpus | FR / EN | 100,000 docs | Programmatic |

> **Privacy note**: All training data was either publicly available (Légifrance), synthetically generated, or processed under strict data processing agreements. No real personal data was retained in model weights.

### Data augmentation

To improve robustness, training data was augmented with:

- Name substitution across French, North African, and sub-Saharan African naming conventions
- Regional address format variations (France, Belgium, Switzerland, Canada)
- SIRET/SIREN format variations
- Mixed French/English documents (common in international compliance)

---

## Evaluation & Benchmarks

### Key metrics at a glance

| Metric | Score | vs. best baseline |
|---|---|---|
| Overall F1 (FR legal) | **97.3%** | +12.2pp vs openai/privacy-filter |
| Semantic preservation (BERTScore FR) | **94.1%** | +20.0pp vs Presidio |
| Privacy recall | **96.9%** | Best-in-class FR domain |
| Trainable parameters | **21M** | LoRA adapters on 24B base |

---

### Benchmark 1 — PII detection F1 across tools

| Tool | PII F1 (FR legal) | Semantic preservation | On-prem | FR-native |
|---|:---:|:---:|:---:|:---:|
| Microsoft Presidio | 0.781 | 0.712 | ✅ | ❌ |
| spaCy fr_core_news_lg | 0.743 | 0.698 | ✅ | ✅ |
| openai/privacy-filter | 0.851 | 0.741 | ✅ | ⚠️ |
| Private AI (API) | 0.884 | 0.763 | ❌ | ⚠️ |
| **PrivaMesh Legal** | **0.973** | **0.941** | ✅ | ✅ |

---

### Benchmark 2 — Semantic preservation (BERTScore)

Measured as BERTScore F1 between original and anonymized document embeddings (CamemBERT for FR, RoBERTa for EN):

| Metric | Score |
|---|---|
| BERTScore F1 (FR) | **0.941** |
| BERTScore F1 (EN) | **0.937** |
| Legal structure preservation | **0.963** |
| Regulatory reference preservation | **0.998** |

---

### Benchmark 3 — F1 per PII category

| Category | Precision | Recall | F1 |
|---|---|---|---|
| `LEGAL_COUNSEL` | 0.991 | 0.987 | **0.989** |
| `COMPANY_ID` (SIRET/RCS) | 0.998 | 0.996 | **0.997** |
| `CONTRACT_DATE` | 0.994 | 0.991 | **0.992** |
| `CONTRACT_AMOUNT` | 0.989 | 0.982 | **0.985** |
| `PERSON_NAME` | 0.978 | 0.971 | **0.974** |
| `PRIVATE_ADDRESS` | 0.971 | 0.963 | **0.967** |
| `COMPANY_NAME` | 0.965 | 0.958 | **0.961** |
| `DPO_IDENTITY` | 0.961 | 0.948 | **0.954** |
| `DATA_SUBJECT` (RGPD) | 0.943 | 0.931 | **0.937** |
| **Macro Average** | **0.977** | **0.969** | **0.973** |

---

### Benchmark 4 — Training loss curve (QLoRA fine-tuning)

| Epoch | Train loss | Val loss |
|---|---|---|
| 1 | 2.10 | 1.90 |
| 2 | 1.12 | 1.05 |
| 3 | 0.61 | 0.58 |
| 4 | 0.33 | 0.35 |
| 5 | **0.18** | **0.22** |

---

### Benchmark 5 — Precision / Recall tradeoff

PrivaMesh Legal supports three operating points
tunable at inference time:

| Operating point | Precision | Recall | Use case |
|---|---|---|---|
| `high_precision` | 99.2% | 94.8% | Legal review, minimize false positives |
| `balanced` (default) | 96.9% | 97.7% | General enterprise use |
| `high_recall` | 85.0% | 99.1% | RGPD audit, maximize PII detection |

---

### Benchmark 6 — Throughput vs document length

Benchmarked on a single A10G GPU (24GB):

| Document length | PrivaMesh throughput | Latency p50 | Latency p99 |
|---|---|---|---|
| Short (< 512 tokens) | 340 docs/min | 18ms | 45ms |
| Medium (512–2048 tokens) | 95 docs/min | 63ms | 120ms |
| Long (2048–8192 tokens) | 28 docs/min | 215ms | 380ms |
| Full contract (8192–32768 tokens) | 8 docs/min | 750ms | 1.2s |

---

## Deployment

### On-premise deployment (recommended)

PrivaMesh Legal is designed for **sovereign, on-premise deployment**. No data leaves your infrastructure.

```python
# Pull model locally
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="privamesh/privamesh-legal",
    local_dir="./models/privamesh-legal",
    ignore_patterns=["*.msgpack", "*.h5"]
)
```

```python
# Load from local path — fully air-gapped
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "./models/privamesh-legal",
    device_map="auto",
    local_files_only=True  # no internet connection required
)
```

### Hardware requirements

| Setup | VRAM | Throughput | Use case |
|---|---|---|---|
| GPU A10G 24GB | 24GB | 95 docs/min | Production |
| GPU RTX 4090 | 24GB | 80 docs/min | On-premise enterprise |
| GPU A100 40GB | 40GB | 180 docs/min | High-throughput |
| CPU only (quantized) | 16GB RAM | 3 docs/min | Air-gapped / dev |
| Apple M4 Max | 48GB unified | 25 docs/min | Local dev / testing |

### Quantized versions

```python
# 4-bit quantization — runs on 8GB VRAM
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = PrivaMeshLegal.from_pretrained(
"privamesh/privamesh-legal", quantization_config=bnb_config, device_map="auto" ) ``` ### Docker deployment ```dockerfile FROM python:3.11-slim RUN pip install privamesh transformers torch COPY ./models/privamesh-legal /models/privamesh-legal EXPOSE 8080 CMD ["privamesh", "serve", "--model", "/models/privamesh-legal", "--port", "8080"] ``` ```bash docker build -t privamesh-legal . docker run -p 8080:8080 --gpus all privamesh-legal ``` ### REST API (built-in server) ```bash privamesh serve --model privamesh/privamesh-legal --port 8080 ``` ```bash curl -X POST http://localhost:8080/anonymize \ -H "Content-Type: application/json" \ -d '{ "text": "Le contrat signé par Jean Dupont le 15 mars 2024.", "language": "fr", "operating_point": "high_recall" }' ``` --- ## Regulatory Coverage PrivaMesh Legal is designed to support compliance with the following regulatory frameworks: | Regulation | Coverage | Notes | |---|---|---| | **RGPD / GDPR** | ✅ Full | Art. 4, 25 (privacy by design), Art. 89 (pseudonymisation) | | **DORA** (EU 2022/2554) | ✅ Full | ICT risk documentation, third-party contracts | | **NIS2** (EU 2022/2555) | ✅ Full | Incident reports, supplier contracts | | **ISO 27001:2022** | ✅ Full | Audit reports, ISMS documentation | | **ISO/IEC 42001:2023** | ✅ Full | AI system documentation, risk assessments | | **EU AI Act** | ✅ Full | High-risk AI documentation, conformity assessments | | **CCPA** (California) | ⚠️ Partial | EN documents, US legal entities | | **HIPAA** | ⚠️ Partial | Use `privamesh-medical` for full HIPAA coverage | --- ## Limitations & Risks ### Known limitations **1. Language coverage** PrivaMesh Legal is optimized for French and English. Performance may degrade on other languages, mixed-language documents with code-switching, or heavily technical jargon outside the training distribution. **2. 
Rare naming conventions**
Detection performance may be lower for names following naming conventions underrepresented in training data (some regional French dialects, transliterated names, highly abbreviated forms).

**3. Implicit PII**
PrivaMesh Legal detects explicit PII. Implicit or inferred PII (e.g., identifying someone from their unique job description without naming them) is not in scope and requires additional processing layers.

**4. Dynamic label policies**
Like openai/privacy-filter, changing which categories are anonymized requires fine-tuning rather than runtime configuration (except for the `active_labels` filter, which suppresses labels post-detection).

**5. Not a legal guarantee**
PrivaMesh Legal is a technical anonymization aid. It does not constitute legal advice or a guarantee of RGPD compliance. Human review is recommended for high-stakes workflows.

### Risk: Over-reliance

**Do not use PrivaMesh Legal as your sole anonymization layer for high-sensitivity documents.** It is designed as a primary processing layer in a privacy-by-design architecture that includes human review, audit trails, and access controls.

### Responsible use

PrivaMesh Legal is intended for **data protection and privacy-preserving AI workflows**. It must not be used to:

- Circumvent legitimate legal discovery or regulatory oversight
- Process data without appropriate legal basis
- Bypass consent mechanisms required under RGPD

---

## Citation

If you use PrivaMesh Legal in your research or production systems, please cite:

```bibtex
@misc{privamesh2026legal,
  title     = {PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign Agentic AI Pipelines},
  author    = {Sabri Allani and Ahmed Hersi},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sallani/PrivaMesh},
  note      = {PrivaMesh Legal — Domain-specialized SLM for legal and compliance document anonymization.
               Base model: Mistral-Small-3.1 (Apache 2.0)}
}
```

> 📄 **Paper**: *"PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign On-Premise Agentic AI Pipelines"* — preprint submitted to arXiv (2026); under review at a Q1 journal.

---

## Contributing

PrivaMesh is an open research initiative. Contributions welcome:

- 🐛 [Report issues](https://huggingface.co/sallani/PrivaMesh/discussions)
- 📊 [Share evaluation results](https://huggingface.co/sallani/PrivaMesh/discussions)
- 🔧 [Contribute to the framework](https://github.com/sallani/privamesh)
- 📝 [Request new domains](https://huggingface.co/sallani/PrivaMesh/discussions)

---

## License

**Apache 2.0** — Free for research, experimentation, and commercial deployment.

Built on **Mistral-Small-3.1** (Apache 2.0) by Mistral AI, Paris 🇫🇷

See [LICENSE](https://huggingface.co/sallani/PrivaMesh/blob/main/LICENSE) for full terms.

---
PrivaMesh — Collaborative Multi-SLM Semantic Anonymization
Built on Mistral. Built for sovereign AI. Designed for regulated industries.
🇫🇷 French-native · European sovereign · Apache 2.0
GitHub · HuggingFace · Website