---
language:
- fr
- en
license: apache-2.0
library_name: transformers
tags:
- privacy
- anonymization
- pii
- legal
- compliance
- gdpr
- rgpd
- ner
- token-classification
- on-premise
- sovereign-ai
- slm
- privamesh
pipeline_tag: token-classification
base_model: mistralai/Mistral-Small-3.1
model_type: token-classification
datasets:
- sallani/privamesh-legal-synthetic
metrics:
- f1
- precision
- recall
---
# PrivaMesh Legal — Semantic PII Anonymization for Legal & Compliance Documents
<p align="center">
<a href="https://huggingface.co/sallani/PrivaMesh"><img src="https://img.shields.io/badge/🤗%20HuggingFace-sallani%2FPrivaMesh-FFD21E?style=flat-square" alt="HuggingFace"/></a>
<img src="https://img.shields.io/badge/License-Apache%202.0-4B73C4?style=flat-square&logo=opensourceinitiative&logoColor=white" alt="License"/>
<img src="https://img.shields.io/badge/Base%20Model-Mistral--Small--3.1-FF6B35?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==&logoColor=white" alt="Mistral"/>
<img src="https://img.shields.io/badge/🇫🇷%20Sovereign%20AI-France-1A3A6B?style=flat-square" alt="Sovereign France"/>
</p>
<p align="center">
<img src="https://img.shields.io/badge/French--first-Native%20FR%20%7C%20EN-7C3AED?style=flat-square" alt="French-first"/>
<img src="https://img.shields.io/badge/RGPD%20%7C%20DORA%20%7C%20NIS2-Compliant-16A34A?style=flat-square" alt="RGPD"/>
<img src="https://img.shields.io/badge/Deploy-On--Premise%20%7C%20Sovereign-DC2626?style=flat-square" alt="Deployment"/>
<img src="https://img.shields.io/badge/Domain-Legal%20%7C%20Compliance-EA580C?style=flat-square" alt="Domain"/>
<img src="https://img.shields.io/badge/Framework-PrivaMesh-6D28D9?style=flat-square" alt="PrivaMesh"/>
</p>
<p align="center">
<img src="https://img.shields.io/badge/F1%20Score-97.3%25-7F77DD?style=flat-square" alt="F1"/>
<img src="https://img.shields.io/badge/BERTScore-94.1%25-1D9E75?style=flat-square" alt="BERTScore"/>
<img src="https://img.shields.io/badge/PII%20Categories-24-D85A30?style=flat-square" alt="Categories"/>
<img src="https://img.shields.io/badge/Context-32k%20tokens-378ADD?style=flat-square" alt="Context"/>
<img src="https://img.shields.io/badge/License-Apache%202.0-059669?style=flat-square" alt="Apache"/>
</p>
---
<h3 align="center">The first sovereign, French-native SLM framework for semantic PII anonymization</h3>
<p align="center">
<b>PrivaMesh Legal</b> is the first model of the <b>PrivaMesh framework</b>,<br/>
a collaborative multi-SLM architecture for semantic data anonymization<br/>
in sovereign, on-premise agentic AI pipelines.
</p>
<p align="center">
Unlike classical PII masking tools that destroy semantic context,<br/>
PrivaMesh Legal <b>preserves the legal meaning</b> of documents<br/>
while removing all personally identifiable, confidential, and regulated information —<br/>
making legal and compliance documents safely usable by downstream LLMs and agentic systems.
</p>
<p align="center">
<b>🇫🇷 Built on Mistral &nbsp;·&nbsp; Apache 2.0 &nbsp;·&nbsp; 100% On-Premise &nbsp;·&nbsp; Zero data exfiltration</b>
</p>
---
## Table of Contents
- [Overview](#overview)
- [Key Differentiators vs. Existing Approaches](#key-differentiators-vs-existing-approaches)
- [The PrivaMesh Framework](#the-privamesh-framework)
- [Supported Privacy Categories](#supported-privacy-categories)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [Model Architecture](#model-architecture)
- [Training Details](#training-details)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Deployment](#deployment)
- [Regulatory Coverage](#regulatory-coverage)
- [Limitations & Risks](#limitations--risks)
- [Citation](#citation)
- [License](#license)
---
## Overview
**PrivaMesh Legal** is a fine-tuned Small Language Model (SLM) specialized in semantic PII detection and anonymization for legal, compliance, and regulatory documents in French and English.
It is designed for:
- **Law firms** processing contracts, briefs, and pleadings
- **Compliance teams** handling GDPR/RGPD, DORA, NIS2, ISO 27001 documentation
- **Banks and financial institutions** managing regulatory submissions
- **Healthcare organizations** processing medico-legal files
- **Public administrations** handling sensitive administrative records
- **MSSPs** automating compliance audits at scale
### What makes PrivaMesh Legal different
Classical PII tools (regex, NER, classical transformers) detect and mask tokens. They answer: *"Is this token a person's name?"*
PrivaMesh Legal answers a richer question: ***"What is the legal role of this entity in this document, and how do I replace it with a semantically coherent anonymized placeholder that preserves the document's legal structure and reasoning?"***
```
Input:
"Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général,
prévoit une indemnité de rupture de 150 000 EUR conformément à l'article L.1237-19 du Code du travail."
PrivaMesh Legal output:
"Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
(SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
prévoit une indemnité de rupture de [MONTANT_1] conformément à l'article L.1237-19 du Code du travail."
Semantic preservation: ✅ Legal structure intact
PII removed: ✅ All identifiers anonymized
Legal reasoning preserved: ✅ Article reference kept
```
---
## Key Differentiators vs. Existing Approaches
| Feature | Regex / Rules | Classical NER | openai/privacy-filter | **PrivaMesh Legal** |
|---|:---:|:---:|:---:|:---:|
| PII detection | ✅ Basic | ✅ Good | ✅ Good | ✅ **Excellent** |
| Semantic preservation | ❌ | ❌ | ⚠️ Partial | ✅ **Full** |
| Legal entity typing | ❌ | ⚠️ Generic | ❌ | ✅ **Role-aware** |
| French legal domain | ❌ | ⚠️ Limited | ⚠️ EN-primary | ✅ **Native FR+EN** |
| Contextual replacement | ❌ | ❌ | ❌ | ✅ **Coherent placeholders** |
| On-premise deployment | ✅ | ✅ | ✅ | ✅ **Sovereign** |
| Agentic pipeline ready | ❌ | ❌ | ❌ | ✅ **Native** |
| RGPD/DORA/NIS2 aware | ❌ | ❌ | ⚠️ | ✅ **Built-in** |
| Multi-SLM orchestration | ❌ | ❌ | ❌ | ✅ **PrivaMesh mesh** |
---
## The PrivaMesh Framework
PrivaMesh Legal is **one node** in the PrivaMesh collaborative multi-SLM mesh. Each node is a specialized SLM fine-tuned on a specific domain. An orchestrator agent coordinates them at inference time.
<p align="center">
<img src="priva.png" alt="PrivaMesh Framework Architecture — Collaborative Multi-SLM for Semantic Data Anonymization" width="720"/>
</p>
<p align="center">
<em>Figure 1 — PrivaMesh Framework: Raw enterprise documents are routed by the Orchestrator to specialized SLMs (Legal, Finance, Medical), validated semantically, and output as anonymized documents with a compliance report.</em>
</p>
**Upcoming PrivaMesh models:**
| Model | Domain | Status |
|---|---|---|
| `sallani/PrivaMesh` | Legal, compliance, RGPD | ✅ **This model** |
| `sallani/PrivaMesh-Finance` | Finance, banking, DORA | 🔄 In development |
| `sallani/PrivaMesh-Medical` | Healthcare, HIPAA | 🔄 In development |
| `sallani/PrivaMesh-HR` | Human resources, employment law | 📋 Planned |
| `sallani/PrivaMesh-Orchestrator` | Multi-domain coordination | 📋 Planned |
---
## Supported Privacy Categories
PrivaMesh Legal detects and semantically anonymizes **24 privacy categories** specific to legal and compliance documents:
### Natural Persons
| Label | Description | Example → Replacement |
|---|---|---|
| `PERSON_NAME` | Full name of any natural person | `Jean Dupont` → `[PERSONNE_1]` |
| `LEGAL_COUNSEL` | Lawyer, notary, bailiff name | `Maître Sophie Martin` → `[AVOCAT_1]` |
| `JUDGE_NAME` | Judge or magistrate name | `M. le Juge Leblanc` → `[MAGISTRAT_1]` |
| `SIGNATORY` | Document signatory | `Lu et approuvé, Pierre Durand` → `[SIGNATAIRE_1]` |
| `WITNESS` | Witness name | `En présence de Claude Moreau` → `[TEMOIN_1]` |
### Legal Entities
| Label | Description | Example → Replacement |
|---|---|---|
| `COMPANY_NAME` | Legal entity name | `Nexum SAS` → `[SOCIETE_1]` |
| `COMPANY_ID` | SIRET, SIREN, RCS | `SIRET 123 456 789` → `[SIRET_1]` |
| `LEGAL_FORM` | Corporate form in context | preserved contextually |
| `COURT_NAME` | Specific court name | `TGI de Paris` → `[JURIDICTION_1]` |
| `BAR_ASSOCIATION` | Bar association location | `barreau de Lyon` → `[BARREAU_1]` |
### Financial & Contractual
| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_AMOUNT` | Monetary amounts in contracts | `150 000 EUR` → `[MONTANT_1]` |
| `BANK_ACCOUNT` | IBAN, BIC | `FR76 3000...` → `[IBAN_1]` |
| `PENALTY_AMOUNT` | Penalty or indemnity amounts | `50 000 EUR` → `[PENALITE_1]` |
### Contact & Location
| Label | Description | Example → Replacement |
|---|---|---|
| `PRIVATE_ADDRESS` | Residential or registered address | `12 rue de la Paix, 75001 Paris` → `[ADRESSE_1]` |
| `PRIVATE_EMAIL` | Personal or professional email | `j.dupont@cabinet.fr` → `[EMAIL_1]` |
| `PRIVATE_PHONE` | Phone number | `+33 6 12 34 56 78` → `[TEL_1]` |
### Temporal & Reference
| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_DATE` | Specific contract dates | `le 15 mars 2024` → `[DATE_1]` |
| `DEADLINE` | Legal deadlines | `avant le 30 juin 2025` → `[ECHEANCE_1]` |
| `CASE_NUMBER` | Court case reference | `RG n°24/01234` → `[DOSSIER_1]` |
### Regulatory & Compliance Specific
| Label | Description | Example → Replacement |
|---|---|---|
| `DATA_SUBJECT` | RGPD data subject reference | `la personne concernée M. Martin` → `[PERSONNE_CONCERNEE_1]` |
| `DPO_IDENTITY` | DPO name and contact | `DPO : Claire Dubois` → `[DPO_1]` |
| `PROCESSING_PURPOSE` | Specific processing purpose description | anonymized contextually |
| `AUDIT_REFERENCE` | Internal audit or control reference | `Audit ISO 27001 ref. AUD-2024-042` → `[AUDIT_REF_1]` |
| `REGULATORY_BODY` | Specific regulator name in context | `la CNIL` → preserved / `[AUTORITE_1]` |
> **Note on semantic preservation**: PrivaMesh Legal preserves legal article references (e.g., `article L.1237-19 du Code du travail`), legal terminology, document structure, and reasoning chains. Only identifiers and personal data are anonymized.
---
## Quick Start
### Installation
```bash
pip install transformers torch privamesh
```
### Basic usage — Pipeline API
```python
from privamesh import PrivaMeshLegal
# Initialize (runs fully on-premise, no API call)
model = PrivaMeshLegal.from_pretrained("privamesh/privamesh-legal")
# Anonymize a legal document
text = """
Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général,
prévoit une indemnité de rupture de 150 000 EUR conformément à
l'article L.1237-19 du Code du travail.
"""
result = model.anonymize(text)
print(result.anonymized_text)
# → Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
# (SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
# représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
# prévoit une indemnité de rupture de [MONTANT_1] conformément à
# l'article L.1237-19 du Code du travail.
print(result.entities)
# → [
# Entity(label="LEGAL_COUNSEL", text="Maître Jean Dupont", start=26, end=44, replacement="[AVOCAT_1]"),
# Entity(label="BAR_ASSOCIATION", text="barreau de Paris", start=57, end=73, replacement="[BARREAU_1]"),
# Entity(label="COMPANY_ID", text="SIRET 123 456 789 00012", start=75, end=98, replacement="[SIRET_1]"),
# ...
# ]
print(result.semantic_score)
# → 0.94 (BERTScore semantic preservation)
print(result.privacy_recall)
# → 0.97 (fraction of PII entities detected)
```
### Using with HuggingFace Transformers directly
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("privamesh/privamesh-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto"
)

text = "Le contrat signé par Jean Dupont le 15 mars 2024."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Inference only: disable gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label ID for each token, mapped back to its label name
predicted_ids = outputs.logits.argmax(dim=-1)
predicted_labels = [
    model.config.id2label[label_id.item()]
    for label_id in predicted_ids[0]
]
print(predicted_labels)
```
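The per-token labels printed above still have to be merged into entity spans before anything can be replaced. The following is a minimal sketch of that step, assuming the model emits the BIOES tags described under Model Architecture and that character offsets come from a fast tokenizer called with `return_offsets_mapping=True`. The helper name `bioes_to_spans` and the sample labels and offsets are hand-written for illustration; they are not model output.

```python
def bioes_to_spans(labels, offsets):
    """Merge per-token BIOES labels into (category, char_start, char_end) spans."""
    spans = []
    current = None  # (category, start_char) of the span being built, if any
    for label, (start, end) in zip(labels, offsets):
        if label == "O" or start == end:  # outside tag, or zero-width special token
            current = None
            continue
        prefix, category = label.split("-", 1)
        if prefix == "S":
            spans.append((category, start, end))
            current = None
        elif prefix == "B":
            current = (category, start)
        elif prefix in ("I", "E") and current and current[0] == category:
            if prefix == "E":
                spans.append((category, current[1], end))
                current = None
        else:
            current = None  # malformed sequence: drop the partial span
    return spans


# Hand-written example (word-level tokens for readability)
text = "Contrat conclu par Jean Dupont le 15 mars 2024."
offsets = [(0, 7), (8, 14), (15, 18), (19, 23), (24, 30),
           (31, 33), (34, 36), (37, 41), (42, 46), (46, 47)]
labels = ["O", "O", "O", "B-PERSON_NAME", "E-PERSON_NAME",
          "B-CONTRACT_DATE", "I-CONTRACT_DATE", "I-CONTRACT_DATE",
          "E-CONTRACT_DATE", "O"]

spans = bioes_to_spans(labels, offsets)
for category, start, end in spans:
    print(category, repr(text[start:end]))
```

With real model output, subword tokens simply contribute more `I-` entries per span; the merging logic stays the same.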
---
## Advanced Usage
### Batch processing — high throughput
```python
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto",
    torch_dtype="bfloat16"  # faster inference
)

documents = [doc1, doc2, doc3, ...]  # list of strings
results = model.anonymize_batch(
    documents,
    batch_size=16,
    preserve_structure=True,    # keep document layout
    coherent_replacement=True,  # same entity → same placeholder
    language="fr"               # or "en" or "auto"
)
```
### Precision / Recall tuning
```python
result = model.anonymize(
    text,
    operating_point="high_recall",  # maximize PII detection (RGPD audit)
    # or "high_precision"           # minimize false positives (legal review)
    # or "balanced"                 # default
)
```
### Custom label policy — fine-grained control
```python
# Anonymize only specific categories
result = model.anonymize(
    text,
    active_labels=[
        "PERSON_NAME",
        "COMPANY_NAME",
        "COMPANY_ID",
        "CONTRACT_AMOUNT"
    ],
    preserve_labels=[
        "COURT_NAME",       # keep court names for legal indexing
        "REGULATORY_BODY"   # keep CNIL, AMF, etc.
    ]
)
```
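The Limitations section describes `active_labels` as a filter applied after detection. A minimal sketch of what such a post-detection policy could look like is shown here; the `Detected` dataclass and `apply_label_policy` function are hypothetical illustrations, not part of the `privamesh` package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detected:
    """Hypothetical detection record: label plus character span."""
    label: str
    text: str
    start: int
    end: int


def apply_label_policy(entities, active_labels=None, preserve_labels=()):
    """Return only the detections that should actually be anonymized."""
    preserve = set(preserve_labels)
    active = None if active_labels is None else set(active_labels)
    kept = []
    for ent in entities:
        if ent.label in preserve:
            continue  # explicitly kept in clear text
        if active is not None and ent.label not in active:
            continue  # detected, but not selected for anonymization
        kept.append(ent)
    return kept


detections = [
    Detected("PERSON_NAME", "Jean Dupont", 0, 11),
    Detected("COURT_NAME", "TGI de Paris", 20, 32),
    Detected("PRIVATE_EMAIL", "j.dupont@cabinet.fr", 40, 59),
]
to_anonymize = apply_label_policy(
    detections,
    active_labels=["PERSON_NAME", "COMPANY_NAME", "COURT_NAME"],
    preserve_labels=["COURT_NAME"],
)
print([ent.label for ent in to_anonymize])
```

In this sketch `preserve_labels` wins over `active_labels` when a label appears in both; how the actual library resolves that conflict is not stated in this README.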
### Consistent anonymization across a document set
```python
# Anonymize a full case file: the same entity gets the same placeholder across all docs
from privamesh import PrivaMeshLegal, AnonymizationContext

model = PrivaMeshLegal.from_pretrained("privamesh/privamesh-legal")
ctx = AnonymizationContext()  # shared entity registry

# contract_text, brief_text, judgment_text: document strings loaded beforehand
contract = model.anonymize(contract_text, context=ctx)
brief = model.anonymize(brief_text, context=ctx)
judgment = model.anonymize(judgment_text, context=ctx)
# "Jean Dupont" → "[PERSONNE_1]" consistently across all three documents
```
### PrivaMesh multi-SLM orchestration
```python
from privamesh import PrivaMeshOrchestrator

# Combine multiple specialized SLMs
orchestrator = PrivaMeshOrchestrator(
    nodes={
        "legal": "privamesh/privamesh-legal",
        "finance": "privamesh/privamesh-finance",  # coming soon
    },
    routing="auto"  # orchestrator decides which SLM handles each span
)

# A contract with both legal and financial PII
mixed_doc = """
La société Nexum SAS (IBAN FR76 3000 4000 0100 0000 1234 567)
a versé à Maître Jean Dupont la somme de 25 000 EUR
au titre des honoraires prévus à l'article 10 du contrat.
"""
result = orchestrator.anonymize(mixed_doc)
```
---
## Model Architecture
**PrivaMesh Legal** is built on a **fine-tuned Mistral-Small-3.1** backbone — a French-native, Apache 2.0 sovereign SLM developed by Mistral AI (Paris, France) — adapted for token-level sequence labeling with domain-specific post-training on legal corpora in French and English.
> **Why Mistral?** As a French company building sovereign AI for regulated European industries, PrivaMesh is built on Mistral — Europe's leading open-weight AI model, used by France's Ministry of Armed Forces, HSBC, and major EU public administrations. This is not just a technical choice — it is a sovereignty statement.
### Architecture overview
```
Base model : mistralai/Mistral-Small-3.1 (Apache 2.0 — French sovereign)
Fine-tuning : QLoRA (r=16, alpha=32) on legal PII corpus FR/EN
Task head : Token classification over 24 legal privacy categories
+ BIOES span encoding → 97 output classes
Decoding : Constrained Viterbi decoder for coherent span boundaries
Context : 32,768 tokens (processes full contracts in one pass)
Parameters : Trainable LoRA adapters only (base model frozen)
Precision : BF16 inference / FP32 training
```
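The overview above names a constrained Viterbi decoder but does not detail it. The sketch below shows the textbook idea under stated assumptions: hard BIOES transition constraints (a span must open with `B`/`S` and close with `E`/`S`), no learned transition scores, and per-token log-scores supplied as plain lists. The function names and the toy scores are illustrative only and may differ from the actual implementation.

```python
import math


def allowed(prev, nxt):
    """BIOES rule: after B-X or I-X only I-X or E-X may follow; otherwise O, B-*, S-*."""
    if prev[0] in "BI":  # previous token left a span open
        return nxt[0] in "IE" and nxt[2:] == prev[2:]
    return nxt[0] in "OBS"


def viterbi_decode(scores, labels):
    """scores[t][k] is the log-score of labels[k] at token t; returns the best valid path."""
    if not scores:
        return []
    n = len(labels)
    # The first tag must be reachable from an implicit leading "O".
    best = [scores[0][k] if allowed("O", labels[k]) else -math.inf for k in range(n)]
    back = []
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for k in range(n):
            prev_score, prev_idx = max(
                (best[j], j) for j in range(n) if allowed(labels[j], labels[k])
            )
            new_best.append(prev_score + scores[t][k])
            pointers.append(prev_idx)
        best = new_best
        back.append(pointers)
    # The last tag must not leave a span open.
    _, k = max((best[k], k) for k in range(n) if labels[k][0] in "OES")
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return [labels[k] for k in reversed(path)]


labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "E-PERSON_NAME", "S-PERSON_NAME"]
scores = [
    [3.0, 0.0, -5.0, -5.0, 0.0],
    [0.0, 2.0, -5.0, -5.0, 1.5],
    [2.0, -5.0, -5.0, 1.0, -5.0],
]
greedy = [labels[max(range(len(labels)), key=row.__getitem__)] for row in scores]
decoded = viterbi_decode(scores, labels)
print(greedy)   # per-token argmax, may violate BIOES
print(decoded)  # best path that respects BIOES
```

On these toy scores the per-token argmax yields `B-PERSON_NAME` followed by `O`, which is not a valid span; the constrained decoder picks the best-scoring valid alternative instead.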
### Label encoding — BIOES scheme
Each of the 24 privacy categories is encoded in BIOES format:
```
B-PERSON_NAME → Begin of a person name span
I-PERSON_NAME → Inside
E-PERSON_NAME → End
S-PERSON_NAME → Single-token span
O → Outside (not a privacy entity)
```
Total output classes: `1 (O) + 24 categories × 4 (BIOES) = 97 classes`
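The class count can be checked by building the label set from the 24 categories listed under Supported Privacy Categories. This is a small sketch; the ordering of the labels (and therefore the integer IDs) is an assumption, since the README does not publish the model's actual `id2label` mapping.

```python
CATEGORIES = [
    "PERSON_NAME", "LEGAL_COUNSEL", "JUDGE_NAME", "SIGNATORY", "WITNESS",
    "COMPANY_NAME", "COMPANY_ID", "LEGAL_FORM", "COURT_NAME", "BAR_ASSOCIATION",
    "CONTRACT_AMOUNT", "BANK_ACCOUNT", "PENALTY_AMOUNT",
    "PRIVATE_ADDRESS", "PRIVATE_EMAIL", "PRIVATE_PHONE",
    "CONTRACT_DATE", "DEADLINE", "CASE_NUMBER",
    "DATA_SUBJECT", "DPO_IDENTITY", "PROCESSING_PURPOSE",
    "AUDIT_REFERENCE", "REGULATORY_BODY",
]


def build_bioes_labels(categories):
    """One shared O tag plus B/I/E/S tags for every category."""
    labels = ["O"]
    for category in categories:
        labels.extend(f"{prefix}-{category}" for prefix in "BIES")
    return labels


LABELS = build_bioes_labels(CATEGORIES)
id2label = dict(enumerate(LABELS))  # ID order is illustrative only
label2id = {label: idx for idx, label in id2label.items()}

print(len(CATEGORIES), len(LABELS))
```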
### Semantic replacement strategy
Unlike token maskers that replace with `[REDACTED]`, PrivaMesh Legal generates **typed, numbered, coherent placeholders** that preserve:
1. **Entity type**: `[AVOCAT_1]` vs `[SOCIETE_1]` vs `[MONTANT_1]`
2. **Entity role**: the legal function is encoded in the placeholder type
3. **Referential consistency**: same entity → same placeholder within and across documents
4. **Grammatical agreement**: French gendered replacements (coming in v1.1)
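The first three properties can be illustrated with a small registry that hands out typed, numbered placeholders and remembers them. This is a sketch under assumptions: the `PlaceholderRegistry` class, the `replace_spans` function and the toy `locate` helper (a stand-in for model detections) are hypothetical and do not reflect the internals of the `privamesh` package.

```python
from collections import defaultdict


class PlaceholderRegistry:
    """Maps (placeholder type, normalized surface form) to a stable placeholder."""

    def __init__(self):
        self._by_entity = {}
        self._counters = defaultdict(int)

    def placeholder_for(self, prefix, surface):
        key = (prefix, " ".join(surface.lower().split()))
        if key not in self._by_entity:
            self._counters[prefix] += 1
            self._by_entity[key] = f"[{prefix}_{self._counters[prefix]}]"
        return self._by_entity[key]


def replace_spans(text, spans, registry):
    """Replace non-overlapping (start, end, prefix) spans, in reading order."""
    out, cursor = [], 0
    for start, end, prefix in sorted(spans):
        out.append(text[cursor:start])
        out.append(registry.placeholder_for(prefix, text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def locate(text, surface, prefix):
    """Toy stand-in for model detections: every exact occurrence of `surface`."""
    spans, pos = [], text.find(surface)
    while pos != -1:
        spans.append((pos, pos + len(surface), prefix))
        pos = text.find(surface, pos + 1)
    return spans


registry = PlaceholderRegistry()  # shared across documents
doc1 = "Jean Dupont assiste Nexum SAS. Jean Dupont signe."
doc2 = "Nexum SAS et Marie Curie."

out1 = replace_spans(
    doc1,
    locate(doc1, "Jean Dupont", "PERSONNE") + locate(doc1, "Nexum SAS", "SOCIETE"),
    registry,
)
out2 = replace_spans(
    doc2,
    locate(doc2, "Nexum SAS", "SOCIETE") + locate(doc2, "Marie Curie", "PERSONNE"),
    registry,
)
print(out1)
print(out2)
```

Sharing one registry across calls is what keeps `Nexum SAS` mapped to the same placeholder in both documents, while a new person receives the next free number.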
---
## Training Details
### Base model
| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-Small-3.1` (Apache 2.0 — Sovereign FR) |
| Fine-tuning method | QLoRA (r=16, lora_alpha=32, dropout=0.05) |
| Target modules | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| Training epochs | 5 |
| Learning rate | 2e-4 (cosine scheduler) |
| Batch size | 16 (gradient accumulation × 4) |
| Max sequence length | 4096 tokens |
| Hardware | Apple M4 Max (48GB unified RAM) / A100 80GB |
| Training time | ~3h on M4 Max / ~6h on A100 |
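As a sanity check on the adapter size, the number of LoRA parameters follows directly from the rank and the shapes of the targeted projections: each adapted matrix of shape `d_out × d_in` gains `r × (d_in + d_out)` trainable weights. The sketch below applies this to the four target modules. The layer dimensions are assumptions (this README does not state them) and should be replaced with the values from the base model's `config.json`.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)


# ASSUMED dimensions, for illustration only (not stated in this README)
HIDDEN = 5120      # model hidden size
Q_OUT = 4096       # 32 attention heads x head_dim 128
KV_OUT = 1024      # 8 key/value heads x head_dim 128
LAYERS = 40        # transformer blocks
RANK = 16          # r from the table above
NUM_CLASSES = 97   # BIOES output classes

per_layer = (
    lora_params(HIDDEN, Q_OUT, RANK)     # q_proj
    + lora_params(HIDDEN, KV_OUT, RANK)  # k_proj
    + lora_params(HIDDEN, KV_OUT, RANK)  # v_proj
    + lora_params(Q_OUT, HIDDEN, RANK)   # o_proj
)
adapter_total = per_layer * LAYERS
head_total = HIDDEN * NUM_CLASSES + NUM_CLASSES  # linear classifier weights + bias

print(f"adapters: {adapter_total:,}")
print(f"classification head: {head_total:,}")
print(f"total trainable: {adapter_total + head_total:,}")
```

Under these assumed dimensions the total comes out at roughly 20M, the same order of magnitude as the 21M trainable parameters reported in the evaluation section; the exact figure depends on the real layer shapes.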
### Training data
PrivaMesh Legal was trained on a curated corpus of legal and compliance documents:
| Source type | Language | Volume | Annotation |
|---|---|---|---|
| French contracts (civil, commercial) | FR | 45,000 docs | Manual + synthetic |
| RGPD compliance documents | FR / EN | 12,000 docs | Manual |
| Court decisions (Légifrance anonymized) | FR | 80,000 docs | Semi-automatic |
| DORA / NIS2 compliance reports | EN | 8,000 docs | Manual |
| ISO 27001 audit reports | FR / EN | 5,000 docs | Manual |
| Employment contracts | FR | 30,000 docs | Synthetic augmented |
| Synthetic legal PII corpus | FR / EN | 100,000 docs | Programmatic |
> **Privacy note**: All training data was either publicly available (Légifrance), synthetically generated, or processed under strict data processing agreements. No real personal data was retained in model weights.
### Data augmentation
To improve robustness, training data was augmented with:
- Name substitution across French, North African, and sub-Saharan African naming conventions
- Regional address format variations (France, Belgium, Switzerland, Canada)
- SIRET/SIREN format variations
- Mixed French/English documents (common in international compliance)
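To make the SIRET/SIREN item concrete, here is a minimal sketch of one way to generate surface variants of the same identifier for augmentation. It relies only on the fact that a SIRET is the 9-digit SIREN followed by a 5-digit establishment code (NIC); the function name and the particular set of variants are illustrative assumptions, not the project's actual augmentation code.

```python
def siret_variants(siret):
    """Return common spacing variants of a 14-digit SIRET."""
    digits = "".join(ch for ch in siret if ch.isdigit())
    if len(digits) != 14:
        raise ValueError("a SIRET has exactly 14 digits")
    siren, nic = digits[:9], digits[9:]
    grouped_siren = " ".join(siren[i:i + 3] for i in range(0, 9, 3))
    return [
        digits,                      # 12345678900012
        f"{siren} {nic}",            # 123456789 00012
        f"{grouped_siren} {nic}",    # 123 456 789 00012
    ]


variants = siret_variants("123 456 789 00012")
for variant in variants:
    print(variant)
```

Each variant can then be substituted into a training sentence with the same `COMPANY_ID` annotation, so the model sees the identifier in every layout.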
---
## Evaluation & Benchmarks
### Key metrics at a glance
| Metric | Score | vs. best on-premise baseline |
|---|---|---|
| Overall F1 (FR legal) | **97.3%** | +12.2pp vs openai/privacy-filter |
| Semantic preservation (BERTScore FR) | **94.1%** | +20.0pp vs openai/privacy-filter |
| Privacy recall | **96.9%** | Best-in-class FR domain |
| Trainable parameters | **21M** | LoRA adapters on the 24B-parameter base |
---
### Benchmark 1 — PII detection F1 across tools
![Benchmark 1 — PII detection F1 comparison](benchmark_f1_comparison.png)
| Tool | PII F1 (FR legal) | Semantic preservation | On-prem | FR-native |
|---|:---:|:---:|:---:|:---:|
| Microsoft Presidio | 0.781 | 0.712 | ✅ | ❌ |
| spaCy fr_core_news_lg | 0.743 | 0.698 | ✅ | ✅ |
| openai/privacy-filter | 0.851 | 0.741 | ✅ | ⚠️ |
| Private AI (API) | 0.884 | 0.763 | ❌ | ⚠️ |
| **PrivaMesh Legal** | **0.973** | **0.941** | ✅ | ✅ |
---
### Benchmark 2 — Semantic preservation (BERTScore)
![Benchmark 2 — BERTScore semantic preservation](benchmark_bertscore.png)
Measured as BERTScore F1 between original and anonymized document embeddings (CamemBERT for FR, RoBERTa for EN):
| Metric | Score |
|---|---|
| BERTScore F1 (FR) | **0.941** |
| BERTScore F1 (EN) | **0.937** |
| Legal structure preservation | **0.963** |
| Regulatory reference preservation | **0.998** |
---
### Benchmark 3 — F1 per PII category
![Benchmark 3 — Per-category F1 scores](benchmark_per_category.png)
| Category | Precision | Recall | F1 |
|---|---|---|---|
| `LEGAL_COUNSEL` | 0.991 | 0.987 | **0.989** |
| `COMPANY_ID` (SIRET/RCS) | 0.998 | 0.996 | **0.997** |
| `CONTRACT_DATE` | 0.994 | 0.991 | **0.992** |
| `CONTRACT_AMOUNT` | 0.989 | 0.982 | **0.985** |
| `PERSON_NAME` | 0.978 | 0.971 | **0.974** |
| `PRIVATE_ADDRESS` | 0.971 | 0.963 | **0.967** |
| `COMPANY_NAME` | 0.965 | 0.958 | **0.961** |
| `DPO_IDENTITY` | 0.961 | 0.948 | **0.954** |
| `DATA_SUBJECT` (RGPD) | 0.943 | 0.931 | **0.937** |
| **Macro Average** | **0.977** | **0.969** | **0.973** |
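For readers who want to reproduce this kind of table on their own data, the sketch below computes entity-level precision, recall and F1 per category and then macro-averages them. It assumes exact-match scoring (a prediction counts only if category, start and end all match a gold span); this README does not state which matching criterion was used, and the function names and toy spans are illustrative.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (category, start, end)."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(gold, predicted):
    """Unweighted mean of per-category F1 over all categories present in gold."""
    categories = sorted({span[0] for span in gold})
    scores = [
        prf([s for s in gold if s[0] == cat], [s for s in predicted if s[0] == cat])[2]
        for cat in categories
    ]
    return sum(scores) / len(scores) if scores else 0.0


gold = [("PERSON_NAME", 19, 30), ("CONTRACT_DATE", 31, 46), ("COMPANY_NAME", 60, 69)]
pred = [("PERSON_NAME", 19, 30), ("CONTRACT_DATE", 34, 46)]  # date span starts too late

precision, recall, f1 = prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
print(round(macro_f1(gold, pred), 3))
```

Exact matching is strict: the partially overlapping date span above earns no credit, which is why boundary quality matters as much as detection.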
---
### Benchmark 4 — Training loss curve (QLoRA fine-tuning)
![Benchmark 4 — Training and validation loss over 5 epochs](benchmark_loss_curve.png)
| Epoch | Train loss | Val loss |
|---|---|---|
| 1 | 2.10 | 1.90 |
| 2 | 1.12 | 1.05 |
| 3 | 0.61 | 0.58 |
| 4 | 0.33 | 0.35 |
| 5 | **0.18** | **0.22** |
---
### Benchmark 5 — Precision / Recall tradeoff
![Benchmark 5 — Precision recall tradeoff at different operating points](benchmark_precision_recall.png)
PrivaMesh Legal supports three operating points tunable at inference time:
| Operating point | Precision | Recall | Use case |
|---|---|---|---|
| `high_precision` | 99.2% | 94.8% | Legal review, minimize false positives |
| `balanced` (default) | 97.7% | 96.9% | General enterprise use |
| `high_recall` | 85.0% | 99.1% | RGPD audit, maximize PII detection |
---
### Benchmark 6 — Throughput vs document length
![Benchmark 6 — Throughput comparison across document lengths](benchmark_throughput.png)
Benchmarked on a single A10G GPU (24GB):
| Document length | PrivaMesh throughput | Latency p50 | Latency p99 |
|---|---|---|---|
| Short (< 512 tokens) | 340 docs/min | 18ms | 45ms |
| Medium (512–2048 tokens) | 95 docs/min | 63ms | 120ms |
| Long (2048–8192 tokens) | 28 docs/min | 215ms | 380ms |
| Full contract (8192–32768 tokens) | 8 docs/min | 750ms | 1.2s |
---
## Deployment
### On-premise deployment (recommended)
PrivaMesh Legal is designed for **sovereign, on-premise deployment**. No data leaves your infrastructure.
```python
# Download the model once, then serve it from local disk
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="privamesh/privamesh-legal",
    local_dir="./models/privamesh-legal",
    ignore_patterns=["*.msgpack", "*.h5"]
)
```
```python
# Load from local path (fully air-gapped)
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "./models/privamesh-legal",
    device_map="auto",
    local_files_only=True  # no internet connection required
)
```
### Hardware requirements
| Setup | VRAM | Throughput | Use case |
|---|---|---|---|
| GPU A10G 24GB | 24GB | 95 docs/min | Production |
| GPU RTX 4090 | 24GB | 80 docs/min | On-premise enterprise |
| GPU A100 40GB | 40GB | 180 docs/min | High-throughput |
| CPU only (quantized) | 16GB RAM | 3 docs/min | Air-gapped / dev |
| Apple M4 Max | 48GB unified | 25 docs/min | Local dev / testing |
### Quantized versions
```python
# 4-bit quantization (runs on 8GB VRAM)
import torch
from transformers import BitsAndBytesConfig
from privamesh import PrivaMeshLegal

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = PrivaMeshLegal.from_pretrained(
    "privamesh/privamesh-legal",
    quantization_config=bnb_config,
    device_map="auto"
)
```
### Docker deployment
```dockerfile
FROM python:3.11-slim
RUN pip install privamesh transformers torch
COPY ./models/privamesh-legal /models/privamesh-legal
EXPOSE 8080
CMD ["privamesh", "serve", "--model", "/models/privamesh-legal", "--port", "8080"]
```
```bash
docker build -t privamesh-legal .
docker run -p 8080:8080 --gpus all privamesh-legal
```
### REST API (built-in server)
```bash
privamesh serve --model privamesh/privamesh-legal --port 8080
```
```bash
curl -X POST http://localhost:8080/anonymize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Le contrat signé par Jean Dupont le 15 mars 2024.",
    "language": "fr",
    "operating_point": "high_recall"
  }'
```
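The same request can be sent from Python using only the standard library. This is a sketch that mirrors the `curl` call above: the `/anonymize` path and the three JSON fields are taken from that example, while the helper names and the assumption that the server answers with a JSON body are illustrative, since the response schema is not documented here.

```python
import json
import urllib.request


def build_anonymize_request(text, base_url="http://localhost:8080",
                            language="fr", operating_point="balanced"):
    """Build the POST request for the /anonymize endpoint shown above."""
    payload = {
        "text": text,
        "language": language,
        "operating_point": operating_point,
    }
    return urllib.request.Request(
        f"{base_url}/anonymize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anonymize(text, **kwargs):
    """Send the request and parse the reply (assumed to be JSON)."""
    request = build_anonymize_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_anonymize_request(
    "Le contrat signé par Jean Dupont le 15 mars 2024.",
    operating_point="high_recall",
)
print(request.get_method(), request.full_url)
```

Calling `anonymize(...)` requires the server started with `privamesh serve` to be running locally; building the request does not.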
---
## Regulatory Coverage
PrivaMesh Legal is designed to support compliance with the following regulatory frameworks:
| Regulation | Coverage | Notes |
|---|---|---|
| **RGPD / GDPR** | ✅ Full | Art. 4(5) (definition of pseudonymisation), Art. 25 (data protection by design), Art. 89 (safeguards for archiving and research) |
| **DORA** (EU 2022/2554) | ✅ Full | ICT risk documentation, third-party contracts |
| **NIS2** (EU 2022/2555) | ✅ Full | Incident reports, supplier contracts |
| **ISO 27001:2022** | ✅ Full | Audit reports, ISMS documentation |
| **ISO/IEC 42001:2023** | ✅ Full | AI system documentation, risk assessments |
| **EU AI Act** | ✅ Full | High-risk AI documentation, conformity assessments |
| **CCPA** (California) | ⚠️ Partial | EN documents, US legal entities |
| **HIPAA** | ⚠️ Partial | Use `privamesh-medical` for full HIPAA coverage |
---
## Limitations & Risks
### Known limitations
**1. Language coverage**
PrivaMesh Legal is optimized for French and English. Performance may degrade on other languages, mixed-language documents with code-switching, or heavily technical jargon outside the training distribution.
**2. Rare naming conventions**
Detection performance may be lower for names following naming conventions underrepresented in training data (some regional French dialects, transliterated names, highly abbreviated forms).
**3. Implicit PII**
PrivaMesh Legal detects explicit PII. Implicit or inferred PII (e.g., identifying someone from their unique job description without naming them) is not in scope and requires additional processing layers.
**4. Dynamic label policies**
As with openai/privacy-filter, the set of detectable categories is fixed at training time: adding or redefining a category requires fine-tuning. At runtime, the `active_labels` filter only selects which already-detected labels are anonymized; it is applied after detection.
**5. Not a legal guarantee**
PrivaMesh Legal is a technical anonymization aid. It does not constitute legal advice or a guarantee of RGPD compliance. Human review is recommended for high-stakes workflows.
### Risk: Over-reliance
**Do not use PrivaMesh Legal as your sole anonymization layer for high-sensitivity documents.** It is designed as a primary processing layer in a privacy-by-design architecture that includes human review, audit trails, and access controls.
### Responsible use
PrivaMesh Legal is intended for **data protection and privacy-preserving AI workflows**. It must not be used to:
- Circumvent legitimate legal discovery or regulatory oversight
- Process data without appropriate legal basis
- Bypass consent mechanisms required under RGPD
---
## Citation
If you use PrivaMesh Legal in your research or production systems, please cite:
```bibtex
@misc{privamesh2026legal,
title = {PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign Agentic AI Pipelines},
author = {Sabri ALLANI and Ahmed HERSI},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/sallani/PrivaMesh},
note = {PrivaMesh Legal — Domain-specialized SLM for legal and compliance document anonymization. Base model: Mistral-Small-3.1 (Apache 2.0)}
}
```
> 📄 **Paper**: *"PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign On-Premise Agentic AI Pipelines"*, submitted to arXiv as a preprint (2026) and under review at a Q1 journal.
---
## Contributing
PrivaMesh is an open research initiative. Contributions welcome:
- 🐛 [Report issues](https://huggingface.co/sallani/PrivaMesh/discussions)
- 📊 [Share evaluation results](https://huggingface.co/sallani/PrivaMesh/discussions)
- 🔧 [Contribute to the framework](https://github.com/sallani/privamesh)
- 📝 [Request new domains](https://huggingface.co/sallani/PrivaMesh/discussions)
---
## License
**Apache 2.0** — Free for research, experimentation, and commercial deployment.
Built on **Mistral-Small-3.1** (Apache 2.0) by Mistral AI, Paris 🇫🇷
See [LICENSE](https://huggingface.co/sallani/PrivaMesh/blob/main/LICENSE) for full terms.
---
<p align="center">
<strong>PrivaMesh</strong> — Collaborative Multi-SLM Semantic Anonymization<br/>
<em>Built on Mistral. Built for sovereign AI. Designed for regulated industries.</em><br/>
<em>🇫🇷 French-native · European sovereign · Apache 2.0</em><br/><br/>
<a href="https://github.com/sallani/privamesh">GitHub</a> ·
<a href="https://huggingface.co/sallani/PrivaMesh">HuggingFace</a> ·
<a href="https://privamesh.ai">Website</a>
</p>