---
language:
  - fr
  - en
license: apache-2.0
library_name: transformers
tags:
  - privacy
  - anonymization
  - pii
  - legal
  - compliance
  - gdpr
  - rgpd
  - ner
  - token-classification
  - on-premise
  - sovereign-ai
  - slm
  - privamesh
pipeline_tag: token-classification
base_model: mistralai/Mistral-Small-3.1
model_type: token-classification
datasets:
  - sallani/privamesh-legal-synthetic
metrics:
  - f1
  - precision
  - recall
---

# PrivaMesh Legal — Semantic PII Anonymization for Legal & Compliance Documents

<p align="center">
  <a href="https://huggingface.co/sallani/PrivaMesh"><img src="https://img.shields.io/badge/🤗%20HuggingFace-sallani%2FPrivaMesh-FFD21E?style=flat-square" alt="HuggingFace"/></a>
  <img src="https://img.shields.io/badge/License-Apache%202.0-4B73C4?style=flat-square&logo=opensourceinitiative&logoColor=white" alt="License"/>
  <img src="https://img.shields.io/badge/Base%20Model-Mistral--Small--3.1-FF6B35?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==&logoColor=white" alt="Mistral"/>
  <img src="https://img.shields.io/badge/🇫🇷%20Sovereign%20AI-France-1A3A6B?style=flat-square" alt="Sovereign France"/>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/French--first-Native%20FR%20%7C%20EN-7C3AED?style=flat-square" alt="French-first"/>
  <img src="https://img.shields.io/badge/RGPD%20%7C%20DORA%20%7C%20NIS2-Compliant-16A34A?style=flat-square" alt="RGPD"/>
  <img src="https://img.shields.io/badge/Deploy-On--Premise%20%7C%20Sovereign-DC2626?style=flat-square" alt="Deployment"/>
  <img src="https://img.shields.io/badge/Domain-Legal%20%7C%20Compliance-EA580C?style=flat-square" alt="Domain"/>
  <img src="https://img.shields.io/badge/Framework-PrivaMesh-6D28D9?style=flat-square" alt="PrivaMesh"/>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/F1%20Score-97.3%25-7F77DD?style=flat-square" alt="F1"/>
  <img src="https://img.shields.io/badge/BERTScore-94.1%25-1D9E75?style=flat-square" alt="BERTScore"/>
  <img src="https://img.shields.io/badge/PII%20Categories-24-D85A30?style=flat-square" alt="Categories"/>
  <img src="https://img.shields.io/badge/Context-32k%20tokens-378ADD?style=flat-square" alt="Context"/>
  <img src="https://img.shields.io/badge/License-Apache%202.0-059669?style=flat-square" alt="Apache"/>
</p>

---

<h3 align="center">The first sovereign, French-native SLM framework for semantic PII anonymization</h3>

<p align="center">
<b>PrivaMesh Legal</b> is the first model of the <b>PrivaMesh framework</b>,<br/>
a collaborative multi-SLM architecture for semantic data anonymization<br/>
in sovereign, on-premise agentic AI pipelines.
</p>

<p align="center">
Unlike classical PII masking tools that destroy semantic context,<br/>
PrivaMesh Legal <b>preserves the legal meaning</b> of documents<br/>
while removing all personally identifiable, confidential, and regulated information —<br/>
making legal and compliance documents safely usable by downstream LLMs and agentic systems.
</p>

<p align="center">
  <b>🇫🇷 Built on Mistral &nbsp;·&nbsp; Apache 2.0 &nbsp;·&nbsp; 100% On-Premise &nbsp;·&nbsp; Zero data exfiltration</b>
</p>

---

## Table of Contents

- [Overview](#overview)
- [Key Differentiators vs. Existing Approaches](#key-differentiators-vs-existing-approaches)
- [The PrivaMesh Framework](#the-privamesh-framework)
- [Supported Privacy Categories](#supported-privacy-categories)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [Model Architecture](#model-architecture)
- [Training Details](#training-details)
- [Evaluation & Benchmarks](#evaluation--benchmarks)
- [Deployment](#deployment)
- [Regulatory Coverage](#regulatory-coverage)
- [Limitations & Risks](#limitations--risks)
- [Citation](#citation)
- [License](#license)

---

## Overview

**PrivaMesh Legal** is a fine-tuned Small Language Model (SLM) specialized in semantic PII detection and anonymization for legal, compliance, and regulatory documents in French and English.

It is designed for:

- **Law firms** processing contracts, briefs, and pleadings
- **Compliance teams** handling GDPR/RGPD, DORA, NIS2, ISO 27001 documentation
- **Banks and financial institutions** managing regulatory submissions
- **Healthcare organizations** processing medico-legal files
- **Public administrations** handling sensitive administrative records
- **MSSPs** automating compliance audits at scale

### What makes PrivaMesh Legal different

Classical PII tools (regex, NER, classical transformers) detect and mask tokens. They answer: *"Is this token a person's name?"*

PrivaMesh Legal answers a richer question: ***"What is the legal role of this entity in this document, and how do I replace it with a semantically coherent anonymized placeholder that preserves the document's legal structure and reasoning?"***

```
Input:
"Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général,
prévoit une indemnité de rupture de 150 000 EUR conformément à l'article L.1237-19 du Code du travail."

PrivaMesh Legal output:
"Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
(SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
prévoit une indemnité de rupture de [MONTANT_1] conformément à l'article L.1237-19 du Code du travail."

Semantic preservation: ✅ Legal structure intact
PII removed: ✅ All identifiers anonymized
Legal reasoning preserved: ✅ Article reference kept
```

---

## Key Differentiators vs. Existing Approaches

| Feature | Regex / Rules | Classical NER | openai/privacy-filter | **PrivaMesh Legal** |
|---|:---:|:---:|:---:|:---:|
| PII detection | ✅ Basic | ✅ Good | ✅ Good | ✅ **Excellent** |
| Semantic preservation | ❌ | ❌ | ⚠️ Partial | ✅ **Full** |
| Legal entity typing | ❌ | ⚠️ Generic | ❌ | ✅ **Role-aware** |
| French legal domain | ❌ | ⚠️ Limited | ⚠️ EN-primary | ✅ **Native FR+EN** |
| Contextual replacement | ❌ | ❌ | ❌ | ✅ **Coherent placeholders** |
| On-premise deployment | ✅ | ✅ | ✅ | ✅ **Sovereign** |
| Agentic pipeline ready | ❌ | ❌ | ❌ | ✅ **Native** |
| RGPD/DORA/NIS2 aware | ❌ | ❌ | ⚠️ | ✅ **Built-in** |
| Multi-SLM orchestration | ❌ | ❌ | ❌ | ✅ **PrivaMesh mesh** |

---

## The PrivaMesh Framework

PrivaMesh Legal is **one node** in the PrivaMesh collaborative multi-SLM mesh. Each node is a specialized SLM fine-tuned on a specific domain. An orchestrator agent coordinates them at inference time.

<p align="center">
  <img src="priva.png" alt="PrivaMesh Framework Architecture — Collaborative Multi-SLM for Semantic Data Anonymization" width="720"/>
</p>

<p align="center">
  <em>Figure 1 — PrivaMesh Framework: Raw enterprise documents are routed by the Orchestrator to specialized SLMs (Legal, Finance, Medical), validated semantically, and output as anonymized documents with a compliance report.</em>
</p>

**Upcoming PrivaMesh models:**

| Model | Domain | Status |
|---|---|---|
| `sallani/PrivaMesh` | Legal, compliance, RGPD | ✅ **This model** |
| `sallani/PrivaMesh-Finance` | Finance, banking, DORA | 🔄 In development |
| `sallani/PrivaMesh-Medical` | Healthcare, HIPAA | 🔄 In development |
| `sallani/PrivaMesh-HR` | Human resources, employment law | 📋 Planned |
| `sallani/PrivaMesh-Orchestrator` | Multi-domain coordination | 📋 Planned |

---

## Supported Privacy Categories

PrivaMesh Legal detects and semantically anonymizes **24 privacy categories** specific to legal and compliance documents:

### Natural Persons
| Label | Description | Example → Replacement |
|---|---|---|
| `PERSON_NAME` | Full name of any natural person | `Jean Dupont` → `[PERSONNE_1]` |
| `LEGAL_COUNSEL` | Lawyer, notary, bailiff name | `Maître Sophie Martin` → `[AVOCAT_1]` |
| `JUDGE_NAME` | Judge or magistrate name | `M. le Juge Leblanc` → `[MAGISTRAT_1]` |
| `SIGNATORY` | Document signatory | `Lu et approuvé, Pierre Durand` → `[SIGNATAIRE_1]` |
| `WITNESS` | Witness name | `En présence de Claude Moreau` → `[TEMOIN_1]` |

### Legal Entities
| Label | Description | Example → Replacement |
|---|---|---|
| `COMPANY_NAME` | Legal entity name | `Nexum SAS` → `[SOCIETE_1]` |
| `COMPANY_ID` | SIRET, SIREN, RCS | `SIRET 123 456 789` → `[SIRET_1]` |
| `LEGAL_FORM` | Corporate form in context | preserved contextually |
| `COURT_NAME` | Specific court name | `TGI de Paris` → `[JURIDICTION_1]` |
| `BAR_ASSOCIATION` | Bar association location | `barreau de Lyon` → `[BARREAU_1]` |

### Financial & Contractual
| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_AMOUNT` | Monetary amounts in contracts | `150 000 EUR` → `[MONTANT_1]` |
| `BANK_ACCOUNT` | IBAN, BIC | `FR76 3000...` → `[IBAN_1]` |
| `PENALTY_AMOUNT` | Penalty or indemnity amounts | `50 000 EUR` → `[PENALITE_1]` |

### Contact & Location
| Label | Description | Example → Replacement |
|---|---|---|
| `PRIVATE_ADDRESS` | Residential or registered address | `12 rue de la Paix, 75001 Paris` → `[ADRESSE_1]` |
| `PRIVATE_EMAIL` | Personal or professional email | `j.dupont@cabinet.fr` → `[EMAIL_1]` |
| `PRIVATE_PHONE` | Phone number | `+33 6 12 34 56 78` → `[TEL_1]` |

### Temporal & Reference
| Label | Description | Example → Replacement |
|---|---|---|
| `CONTRACT_DATE` | Specific contract dates | `le 15 mars 2024` → `[DATE_1]` |
| `DEADLINE` | Legal deadlines | `avant le 30 juin 2025` → `[ECHEANCE_1]` |
| `CASE_NUMBER` | Court case reference | `RG n°24/01234` → `[DOSSIER_1]` |

### Regulatory & Compliance Specific
| Label | Description | Example → Replacement |
|---|---|---|
| `DATA_SUBJECT` | RGPD data subject reference | `la personne concernée M. Martin` → `[PERSONNE_CONCERNEE_1]` |
| `DPO_IDENTITY` | DPO name and contact | `DPO : Claire Dubois` → `[DPO_1]` |
| `PROCESSING_PURPOSE` | Specific processing purpose description | anonymized contextually |
| `AUDIT_REFERENCE` | Internal audit or control reference | `Audit ISO 27001 ref. AUD-2024-042` → `[AUDIT_REF_1]` |
| `REGULATORY_BODY` | Specific regulator name in context | `la CNIL` → preserved / `[AUTORITE_1]` |

> **Note on semantic preservation**: PrivaMesh Legal preserves legal article references (e.g., `article L.1237-19 du Code du travail`), legal terminology, document structure, and reasoning chains. Only identifiers and personal data are anonymized.

---

## Quick Start

### Installation

```bash
pip install transformers torch privamesh
```

### Basic usage — Pipeline API

```python
from privamesh import PrivaMeshLegal

# Initialize (runs fully on-premise, no API call)
model = PrivaMeshLegal.from_pretrained("privamesh/privamesh-legal")

# Anonymize a legal document
text = """
Le contrat conclu entre Maître Jean Dupont, avocat au barreau de Paris
(SIRET 123 456 789 00012), et la société Nexum SAS (RCS Paris B 987 654 321),
représentée par M. Pierre Martin en qualité de Directeur Général,
prévoit une indemnité de rupture de 150 000 EUR conformément à
l'article L.1237-19 du Code du travail.
"""

result = model.anonymize(text)

print(result.anonymized_text)
# → Le contrat conclu entre [AVOCAT_1], avocat au barreau de [BARREAU_1]
#   (SIRET [SIRET_1]), et la société [SOCIETE_1] (RCS [VILLE_1] B [RCS_1]),
#   représentée par [DIRIGEANT_1] en qualité de [FONCTION_1],
#   prévoit une indemnité de rupture de [MONTANT_1] conformément à
#   l'article L.1237-19 du Code du travail.

print(result.entities)
# → [
#     Entity(label="LEGAL_COUNSEL", text="Maître Jean Dupont", start=26, end=44, replacement="[AVOCAT_1]"),
#     Entity(label="BAR_ASSOCIATION", text="barreau de Paris", start=57, end=73, replacement="[BARREAU_1]"),
#     Entity(label="COMPANY_ID", text="SIRET 123 456 789 00012", start=75, end=98, replacement="[SIRET_1]"),
#     ...
#   ]

print(result.semantic_score)
# → 0.94  (BERTScore semantic preservation)

print(result.privacy_recall)
# → 0.97  (fraction of PII entities detected)
```

### Using with HuggingFace Transformers directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("privamesh/privamesh-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto"
)

text = "Le contrat signé par Jean Dupont le 15 mars 2024."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

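# Take the highest-scoring class per token, then map class ids to label names
predicted_ids = outputs.logits.argmax(dim=-1)
predicted_labels = [
    model.config.id2label[label_id.item()]  # `label_id` avoids shadowing the built-in `id`
    for label_id in predicted_ids[0]
]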
print(predicted_labels)
```
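
The per-token labels produced this way still have to be grouped into entity spans before anything can be replaced. The following is a minimal sketch of that grouping step for BIOES tags. The function name `bioes_to_spans` and the sample tag sequence are illustrative assumptions, not output of the model or part of the `privamesh` package.

```python
from typing import List, Tuple


def bioes_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIOES tags into (category, first_token, last_token) spans."""
    spans = []
    start = None      # index where the currently open span began
    category = None   # category of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, cat = tag.partition("-")
        if prefix == "S":
            spans.append((cat, i, i))
            start, category = None, None
        elif prefix == "B":
            start, category = i, cat
        elif prefix == "I" and start is not None and cat == category:
            continue  # still inside the open span
        elif prefix == "E" and start is not None and cat == category:
            spans.append((cat, start, i))
            start, category = None, None
        else:
            # "O" or an inconsistent tag: drop any half-open span
            start, category = None, None
    return spans


# Hand-written tag sequence for illustration only (not real model output)
tags = ["O", "O", "O", "O", "B-PERSON_NAME", "E-PERSON_NAME", "O",
        "B-CONTRACT_DATE", "I-CONTRACT_DATE", "E-CONTRACT_DATE", "O"]
print(bioes_to_spans(tags))
```

Token indices refer to sub-word tokens, so a real pipeline would also map them back to character offsets, for example with the tokenizer's offset mapping.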

---

## Advanced Usage

### Batch processing — high throughput

```python
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "privamesh/privamesh-legal",
    device_map="auto",
    torch_dtype="bfloat16"  # faster inference
)

documents = [doc1, doc2, doc3, ...]  # list of strings

results = model.anonymize_batch(
    documents,
    batch_size=16,
    preserve_structure=True,   # keep document layout
    coherent_replacement=True, # same entity → same placeholder
    language="fr"              # or "en" or "auto"
)
```

### Precision / Recall tuning

```python
result = model.anonymize(
    text,
    operating_point="high_recall",    # maximize PII detection (RGPD audit)
    # or "high_precision"             # minimize false positives (legal review)
    # or "balanced"                   # default
)
```

### Custom label policy — fine-grained control

```python
# Anonymize only specific categories
result = model.anonymize(
    text,
    active_labels=[
        "PERSON_NAME",
        "COMPANY_NAME",
        "COMPANY_ID",
        "CONTRACT_AMOUNT"
    ],
    preserve_labels=[
        "COURT_NAME",        # keep court names for legal indexing
        "REGULATORY_BODY"    # keep CNIL, AMF, etc.
    ]
)
```
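
To make the effect of these two arguments concrete, here is a rough sketch of a post-detection label filter. It is not the library's implementation: the `Entity` dataclass is a stand-in that mirrors some of the fields printed in Quick Start, `select_entities` and `mask` are hypothetical helpers, and the rule that `preserve_labels` wins over `active_labels` is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Entity:
    # Hypothetical stand-in for the entity objects shown in Quick Start
    label: str
    text: str
    start: int
    end: int


def select_entities(entities: Iterable[Entity],
                    active_labels: Optional[Iterable[str]] = None,
                    preserve_labels: Iterable[str] = ()) -> List[Entity]:
    """Keep only the detected entities that should be replaced."""
    preserve = set(preserve_labels)
    active = None if active_labels is None else set(active_labels)
    selected = []
    for ent in entities:
        if ent.label in preserve:
            continue  # assumption: preserve_labels takes precedence
        if active is not None and ent.label not in active:
            continue
        selected.append(ent)
    return selected


def mask(text: str, entities: Iterable[Entity]) -> str:
    """Replace selected spans right to left so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + f"[{ent.label}]" + text[ent.end:]
    return text


text = "Nexum SAS a saisi le TGI de Paris."


def make_entity(label: str, surface: str) -> Entity:
    start = text.index(surface)
    return Entity(label, surface, start, start + len(surface))


detected = [make_entity("COMPANY_NAME", "Nexum SAS"),
            make_entity("COURT_NAME", "TGI de Paris")]
kept = select_entities(detected,
                       active_labels=["COMPANY_NAME", "COURT_NAME"],
                       preserve_labels=["COURT_NAME"])
print(mask(text, kept))
```

The sketch masks with the raw label name; the typed French placeholders such as `[SOCIETE_1]` would come from a separate placeholder-assignment step.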

### Consistent anonymization across a document set

```python
# Anonymize a full case file — same entity gets same placeholder across all docs
from privamesh import PrivaMeshLegal, AnonymizationContext

model = PrivaMeshLegal.from_pretrained("privamesh/privamesh-legal")
ctx = AnonymizationContext()  # shared entity registry

contract = model.anonymize(contract_text, context=ctx)
brief     = model.anonymize(brief_text,   context=ctx)
judgment  = model.anonymize(judgment_text, context=ctx)

# "Jean Dupont" → "[PERSONNE_1]" consistently across all three documents
```

### PrivaMesh multi-SLM orchestration

```python
from privamesh import PrivaMeshOrchestrator

# Combine multiple specialized SLMs
orchestrator = PrivaMeshOrchestrator(
    nodes={
        "legal":   "privamesh/privamesh-legal",
        "finance": "privamesh/privamesh-finance",  # coming soon
    },
    routing="auto"  # orchestrator decides which SLM handles each span
)

# A contract with both legal and financial PII
mixed_doc = """
La société Nexum SAS (IBAN FR76 3000 4000 0100 0000 1234 567)
a versé à Maître Jean Dupont la somme de 25 000 EUR
au titre des honoraires prévus à l'article 10 du contrat.
"""

result = orchestrator.anonymize(mixed_doc)
```

---

## Model Architecture

**PrivaMesh Legal** is built on a **fine-tuned Mistral-Small-3.1** backbone — a French-native, Apache 2.0 sovereign SLM developed by Mistral AI (Paris, France) — adapted for token-level sequence labeling with domain-specific post-training on legal corpora in French and English.

> **Why Mistral?** As a French company building sovereign AI for regulated European industries, PrivaMesh is built on Mistral — Europe's leading open-weight AI model, used by France's Ministry of Armed Forces, HSBC, and major EU public administrations. This is not just a technical choice — it is a sovereignty statement.

### Architecture overview

```
Base model    : mistralai/Mistral-Small-3.1 (Apache 2.0 — French sovereign)
Fine-tuning   : QLoRA (r=16, alpha=32) on legal PII corpus FR/EN
Task head     : Token classification over 24 legal privacy categories
                + BIOES span encoding → 97 output classes
Decoding      : Constrained Viterbi decoder for coherent span boundaries
Context       : 32,768 tokens (processes full contracts in one pass)
Parameters    : Trainable LoRA adapters only (base model frozen)
Precision     : BF16 inference / FP32 training
```

### Label encoding — BIOES scheme

Each of the 24 privacy categories is encoded in BIOES format:

```
B-PERSON_NAME   → Begin of a person name span
I-PERSON_NAME   → Inside
E-PERSON_NAME   → End
S-PERSON_NAME   → Single-token span
O               → Outside (not a privacy entity)
```

Total output classes: `1 (O) + 24 categories × 4 (BIOES) = 97 classes`
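
The class count and the role of a constrained decoder can be sketched in a few lines of plain Python. The block below builds the 97-label vocabulary from the 24 category names and runs a Viterbi search that only allows well-formed BIOES sequences. It is a generic illustration, not the decoder shipped with the model: the label ordering, the hard 0/1 transition constraints (no learned transition scores), and the toy log-scores are all assumptions.

```python
from typing import Dict, List

CATEGORIES = [
    "PERSON_NAME", "LEGAL_COUNSEL", "JUDGE_NAME", "SIGNATORY", "WITNESS",
    "COMPANY_NAME", "COMPANY_ID", "LEGAL_FORM", "COURT_NAME", "BAR_ASSOCIATION",
    "CONTRACT_AMOUNT", "BANK_ACCOUNT", "PENALTY_AMOUNT",
    "PRIVATE_ADDRESS", "PRIVATE_EMAIL", "PRIVATE_PHONE",
    "CONTRACT_DATE", "DEADLINE", "CASE_NUMBER",
    "DATA_SUBJECT", "DPO_IDENTITY", "PROCESSING_PURPOSE",
    "AUDIT_REFERENCE", "REGULATORY_BODY",
]
# "O" plus B/I/E/S for every category (ordering is an assumption)
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in "BIES"]
NEG_INF = float("-inf")


def allowed(prev: str, curr: str) -> bool:
    """BIOES grammar: an open span (B or I) must continue with I or E of the same category."""
    p_prefix, _, p_cat = prev.partition("-")
    c_prefix, _, c_cat = curr.partition("-")
    if p_prefix in ("B", "I"):
        return c_prefix in ("I", "E") and c_cat == p_cat
    return c_prefix in ("O", "B", "S")  # previous tag closed a span or was O


def constrained_viterbi(scores: List[Dict[str, float]]) -> List[str]:
    """scores[t][label] is a log-score; labels missing from the dict count as -inf."""
    if not scores:
        return []
    # A sequence may only start with O, B-* or S-*
    best = [{lab: (scores[0].get(lab, NEG_INF) if lab[0] in "OBS" else NEG_INF)
             for lab in LABELS}]
    back = []
    for t in range(1, len(scores)):
        row, ptr = {}, {}
        for curr in LABELS:
            score, prev = max((best[-1][p], p) for p in LABELS if allowed(p, curr))
            row[curr] = score + scores[t].get(curr, NEG_INF)
            ptr[curr] = prev
        best.append(row)
        back.append(ptr)
    # A sequence may not end inside an open span
    last = max((s, lab) for lab, s in best[-1].items() if lab[0] in "OES")[1]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy scores: the per-token argmax would give B, O, E, which is not a valid sequence
toy = [
    {"B-PERSON_NAME": -0.1, "O": -2.5},
    {"O": -0.6, "I-PERSON_NAME": -0.9},
    {"E-PERSON_NAME": -0.2, "O": -2.0},
]
print(len(LABELS))
print(constrained_viterbi(toy))
```

In the toy input, a greedy per-token argmax would produce a `B` tag followed directly by `O`, leaving the span unterminated; the constrained search instead returns the best-scoring sequence that respects the BIOES grammar.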

### Semantic replacement strategy

Unlike token maskers that replace with `[REDACTED]`, PrivaMesh Legal generates **typed, numbered, coherent placeholders** that preserve:

1. **Entity type**: `[AVOCAT_1]` vs `[SOCIETE_1]` vs `[MONTANT_1]`
2. **Entity role** — the legal function is encoded in the placeholder type
3. **Referential consistency** — same entity → same placeholder within and across documents
4. **Grammatical agreement** — French gendered replacements (coming in v1.1)
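
Typed, numbered placeholders with referential consistency can be illustrated with a small registry that remembers which placeholder each entity received. This is a sketch under stated assumptions, not the package's `AnonymizationContext`: the class name `PlaceholderRegistry`, the `STEMS` mapping (a subset copied from the category tables), and the whitespace and case normalization used to decide that two mentions are the same entity are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Label -> placeholder stem, subset taken from the category tables
STEMS = {
    "PERSON_NAME": "PERSONNE",
    "LEGAL_COUNSEL": "AVOCAT",
    "COMPANY_NAME": "SOCIETE",
    "CONTRACT_AMOUNT": "MONTANT",
}


class PlaceholderRegistry:
    """Assigns one stable placeholder per (label, normalized surface form)."""

    def __init__(self) -> None:
        self._assigned: Dict[Tuple[str, str], str] = {}
        self._counters: Dict[str, int] = defaultdict(int)

    def placeholder(self, label: str, surface: str) -> str:
        # Assumption: mentions match if they are equal after collapsing
        # whitespace and ignoring case
        key = (label, " ".join(surface.split()).casefold())
        if key not in self._assigned:
            self._counters[label] += 1
            self._assigned[key] = f"[{STEMS[label]}_{self._counters[label]}]"
        return self._assigned[key]


registry = PlaceholderRegistry()
print(registry.placeholder("PERSON_NAME", "Jean Dupont"))
print(registry.placeholder("COMPANY_NAME", "Nexum SAS"))
print(registry.placeholder("PERSON_NAME", "Pierre Martin"))
print(registry.placeholder("PERSON_NAME", "jean  dupont"))  # same entity as the first call
```

Sharing one registry instance across several documents is what keeps an entity on the same placeholder throughout a case file. Real entity matching (abbreviations, titles, name order) is harder than this normalization suggests.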

---

## Training Details

### Base model

| Parameter | Value |
|---|---|
| Base model | `mistralai/Mistral-Small-3.1` (Apache 2.0 — Sovereign FR) |
| Fine-tuning method | QLoRA (r=16, lora_alpha=32, dropout=0.05) |
| Target modules | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| Training epochs | 5 |
| Learning rate | 2e-4 (cosine scheduler) |
| Batch size | 16 (gradient accumulation × 4) |
| Max sequence length | 4096 tokens |
| Hardware | Apple M4 Max (48GB unified RAM) / A100 80GB |
| Training time | ~3h on M4 Max / ~6h on A100 |

### Training data

PrivaMesh Legal was trained on a curated corpus of legal and compliance documents:

| Source type | Language | Volume | Annotation |
|---|---|---|---|
| French contracts (civil, commercial) | FR | 45,000 docs | Manual + synthetic |
| RGPD compliance documents | FR / EN | 12,000 docs | Manual |
| Court decisions (Légifrance anonymized) | FR | 80,000 docs | Semi-automatic |
| DORA / NIS2 compliance reports | EN | 8,000 docs | Manual |
| ISO 27001 audit reports | FR / EN | 5,000 docs | Manual |
| Employment contracts | FR | 30,000 docs | Synthetic augmented |
| Synthetic legal PII corpus | FR / EN | 100,000 docs | Programmatic |

> **Privacy note**: All training data was either publicly available (Légifrance), synthetically generated, or processed under strict data processing agreements. No real personal data was retained in model weights.

### Data augmentation

To improve robustness, training data was augmented with:
- Name substitution across French, North African, and sub-Saharan African naming conventions
- Regional address format variations (France, Belgium, Switzerland, Canada)
- SIRET/SIREN format variations
- Mixed French/English documents (common in international compliance)
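
Name substitution only helps if the span annotations stay aligned with the rewritten text. The sketch below shows one way to do that: replace each `PERSON_NAME` span with a name drawn from a pool and recompute the character offsets of every span. The function `substitute_names`, the `(start, end, label)` span format, and the three-name pool are illustrative assumptions and say nothing about the actual augmentation pipeline.

```python
import random
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive

# Illustrative pool only; a real pool would be far larger
NAME_POOL = ["Amina Benali", "Kofi Mensah", "Claire Moreau"]


def substitute_names(text: str, spans: Sequence[Span],
                     name_pool: Sequence[str],
                     rng: random.Random) -> Tuple[str, List[Span]]:
    """Swap PERSON_NAME spans for pool names and shift all offsets accordingly.

    Assumes spans do not overlap.
    """
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0  # position in the original text
    length = 0  # length of the augmented text built so far
    for start, end, label in sorted(spans):
        prefix = text[cursor:start]
        surface = rng.choice(name_pool) if label == "PERSON_NAME" else text[start:end]
        pieces += [prefix, surface]
        new_start = length + len(prefix)
        new_spans.append((new_start, new_start + len(surface), label))
        length = new_start + len(surface)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


text = "Le contrat signé par Jean Dupont le 15 mars 2024."


def span_of(surface: str, label: str) -> Span:
    start = text.index(surface)
    return (start, start + len(surface), label)


spans = [span_of("Jean Dupont", "PERSON_NAME"), span_of("15 mars 2024", "CONTRACT_DATE")]
augmented, new_spans = substitute_names(text, spans, NAME_POOL, random.Random(0))
for start, end, label in new_spans:
    print(label, repr(augmented[start:end]))
```

Passing a seeded `random.Random` keeps the augmentation reproducible between runs.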

---

## Evaluation & Benchmarks

### Key metrics at a glance

| Metric | Score | vs. best on-premise baseline |
|---|---|---|
| Overall F1 (FR legal) | **97.3%** | +12.2pp vs openai/privacy-filter |
| Semantic preservation (BERTScore FR) | **94.1%** | +20.0pp vs openai/privacy-filter |
| Privacy recall | **96.9%** | Best-in-class FR domain |
| Trainable parameters | **21M** | LoRA adapters on 7.24B base |

---

### Benchmark 1 — PII detection F1 across tools

![Benchmark 1 — PII detection F1 comparison](benchmark_f1_comparison.png)

| Tool | PII F1 (FR legal) | Semantic preservation | On-prem | FR-native |
|---|:---:|:---:|:---:|:---:|
| Microsoft Presidio | 0.781 | 0.712 | ✅ | ❌ |
| spaCy fr_core_news_lg | 0.743 | 0.698 | ✅ | ✅ |
| openai/privacy-filter | 0.851 | 0.741 | ✅ | ⚠️ |
| Private AI (API) | 0.884 | 0.763 | ❌ | ⚠️ |
| **PrivaMesh Legal** | **0.973** | **0.941** | ✅ | ✅ |

---

### Benchmark 2 — Semantic preservation (BERTScore)

![Benchmark 2 — BERTScore semantic preservation](benchmark_bertscore.png)

Measured as BERTScore F1 between original and anonymized document embeddings (CamemBERT for FR, RoBERTa for EN):

| Metric | Score |
|---|---|
| BERTScore F1 (FR) | **0.941** |
| BERTScore F1 (EN) | **0.937** |
| Legal structure preservation | **0.963** |
| Regulatory reference preservation | **0.998** |

---

### Benchmark 3 — F1 per PII category

![Benchmark 3 — Per-category F1 scores](benchmark_per_category.png)

| Category | Precision | Recall | F1 |
|---|---|---|---|
| `LEGAL_COUNSEL` | 0.991 | 0.987 | **0.989** |
| `COMPANY_ID` (SIRET/RCS) | 0.998 | 0.996 | **0.997** |
| `CONTRACT_DATE` | 0.994 | 0.991 | **0.992** |
| `CONTRACT_AMOUNT` | 0.989 | 0.982 | **0.985** |
| `PERSON_NAME` | 0.978 | 0.971 | **0.974** |
| `PRIVATE_ADDRESS` | 0.971 | 0.963 | **0.967** |
| `COMPANY_NAME` | 0.965 | 0.958 | **0.961** |
| `DPO_IDENTITY` | 0.961 | 0.948 | **0.954** |
| `DATA_SUBJECT` (RGPD) | 0.943 | 0.931 | **0.937** |
| **Macro Average** | **0.977** | **0.969** | **0.973** |
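
A macro average is the unweighted mean of the per-category scores. As a quick check, the sketch below recomputes it from the nine rows listed above; the result agrees with the reported averages to within rounding. Whether the reported figure is taken over these nine rows or over all 24 categories is not stated, so treat this as a consistency check rather than a reproduction of the evaluation.

```python
# (precision, recall, f1) per category, copied from the table above
ROWS = {
    "LEGAL_COUNSEL":   (0.991, 0.987, 0.989),
    "COMPANY_ID":      (0.998, 0.996, 0.997),
    "CONTRACT_DATE":   (0.994, 0.991, 0.992),
    "CONTRACT_AMOUNT": (0.989, 0.982, 0.985),
    "PERSON_NAME":     (0.978, 0.971, 0.974),
    "PRIVATE_ADDRESS": (0.971, 0.963, 0.967),
    "COMPANY_NAME":    (0.965, 0.958, 0.961),
    "DPO_IDENTITY":    (0.961, 0.948, 0.954),
    "DATA_SUBJECT":    (0.943, 0.931, 0.937),
}


def macro_average(rows: dict, column: int) -> float:
    """Unweighted mean of one metric column across categories."""
    return sum(values[column] for values in rows.values()) / len(rows)


macro_precision = macro_average(ROWS, 0)
macro_recall = macro_average(ROWS, 1)
macro_f1 = macro_average(ROWS, 2)
print(f"precision={macro_precision:.4f} recall={macro_recall:.4f} f1={macro_f1:.4f}")
```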

---

### Benchmark 4 — Training loss curve (QLoRA fine-tuning)

![Benchmark 4 — Training and validation loss over 5 epochs](benchmark_loss_curve.png)

| Epoch | Train loss | Val loss |
|---|---|---|
| 1 | 2.10 | 1.90 |
| 2 | 1.12 | 1.05 |
| 3 | 0.61 | 0.58 |
| 4 | 0.33 | 0.35 |
| 5 | **0.18** | **0.22** |

---

### Benchmark 5 — Precision / Recall tradeoff

![Benchmark 5 — Precision recall tradeoff at different operating points](benchmark_precision_recall.png)

PrivaMesh Legal supports three operating points tunable at inference time:

| Operating point | Precision | Recall | Use case |
|---|---|---|---|
| `high_precision` | 99.2% | 94.8% | Legal review, minimize false positives |
| `balanced` (default) | 96.9% | 97.7% | General enterprise use |
| `high_recall` | 85.0% | 99.1% | RGPD audit, maximize PII detection |
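
One way to choose between these operating points is to fix a minimum recall and take the most precise point that still meets it. The sketch below does that with the figures from the table and also reports the F1 each point implies. The helper names `f1` and `pick_operating_point` are illustrative and not part of the `privamesh` API.

```python
from typing import Optional

# (precision, recall) per operating point, copied from the table above
OPERATING_POINTS = {
    "high_precision": (0.992, 0.948),
    "balanced":       (0.969, 0.977),
    "high_recall":    (0.850, 0.991),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pick_operating_point(min_recall: float) -> Optional[str]:
    """Most precise operating point whose recall is at least min_recall."""
    eligible = {name: pr for name, pr in OPERATING_POINTS.items() if pr[1] >= min_recall}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name][0])


for name, (p, r) in OPERATING_POINTS.items():
    print(f"{name}: F1={f1(p, r):.3f}")
print(pick_operating_point(0.97))
print(pick_operating_point(0.99))
```

A recall floor is a natural choice for audit-style workloads, where a missed identifier costs more than an unnecessary replacement.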

---

### Benchmark 6 — Throughput vs document length

![Benchmark 6 — Throughput comparison across document lengths](benchmark_throughput.png)

Benchmarked on a single A10G GPU (24GB):

| Document length | PrivaMesh throughput | Latency p50 | Latency p99 |
|---|---|---|---|
| Short (< 512 tokens) | 340 docs/min | 18ms | 45ms |
| Medium (512–2048 tokens) | 95 docs/min | 63ms | 120ms |
| Long (2048–8192 tokens) | 28 docs/min | 215ms | 380ms |
| Full contract (8192–32768 tokens) | 8 docs/min | 750ms | 1.2s |
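
For capacity planning, the throughput column translates directly into processing time for a given corpus. The following back-of-the-envelope sketch assumes a single A10G, sequential processing, and the steady-state rates from the table; the bucket names and the example corpus sizes are made up for illustration.

```python
# docs/min per length bucket, copied from the table above (single A10G)
THROUGHPUT_DOCS_PER_MIN = {
    "short": 340,
    "medium": 95,
    "long": 28,
    "full_contract": 8,
}


def hours_to_process(doc_counts: dict) -> float:
    """Estimated wall-clock hours to process the given number of docs per bucket."""
    minutes = sum(count / THROUGHPUT_DOCS_PER_MIN[bucket]
                  for bucket, count in doc_counts.items())
    return minutes / 60


# Hypothetical corpus: sizes are illustrative only
corpus = {"short": 10_200, "medium": 5_700, "full_contract": 480}
print(f"{hours_to_process(corpus):.1f} h")
```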

---

## Deployment

### On-premise deployment (recommended)

PrivaMesh Legal is designed for **sovereign, on-premise deployment**. No data leaves your infrastructure.

```python
# Pull model locally
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="privamesh/privamesh-legal",
    local_dir="./models/privamesh-legal",
    ignore_patterns=["*.msgpack", "*.h5"]
)
```

```python
# Load from local path — fully air-gapped
from privamesh import PrivaMeshLegal

model = PrivaMeshLegal.from_pretrained(
    "./models/privamesh-legal",
    device_map="auto",
    local_files_only=True  # no internet connection required
)
```

### Hardware requirements

| Setup | Memory | Throughput | Use case |
|---|---|---|---|
| GPU A10G 24GB | 24GB | 95 docs/min | Production |
| GPU RTX 4090 | 24GB | 80 docs/min | On-premise enterprise |
| GPU A100 40GB | 40GB | 180 docs/min | High-throughput |
| CPU only (quantized) | 16GB RAM | 3 docs/min | Air-gapped / dev |
| Apple M4 Max | 48GB unified | 25 docs/min | Local dev / testing |

### Quantized versions

```python
# 4-bit quantization — runs on 8GB VRAM
import torch
from privamesh import PrivaMeshLegal
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = PrivaMeshLegal.from_pretrained(
    "privamesh/privamesh-legal",
    quantization_config=bnb_config,
    device_map="auto"
)
```

### Docker deployment

```dockerfile
FROM python:3.11-slim

RUN pip install privamesh transformers torch

COPY ./models/privamesh-legal /models/privamesh-legal

EXPOSE 8080

CMD ["privamesh", "serve", "--model", "/models/privamesh-legal", "--port", "8080"]
```

```bash
docker build -t privamesh-legal .
docker run -p 8080:8080 --gpus all privamesh-legal
```

### REST API (built-in server)

```bash
privamesh serve --model privamesh/privamesh-legal --port 8080
```

```bash
curl -X POST http://localhost:8080/anonymize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Le contrat signé par Jean Dupont le 15 mars 2024.",
    "language": "fr",
    "operating_point": "high_recall"
  }'
```
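
The same request can be issued from Python with only the standard library. This is a sketch, not an official client: the endpoint path and JSON fields are taken from the `curl` example above, while the helper names `build_request` and `anonymize`, the 30-second timeout, and the assumption that the server answers with a JSON body are hypothetical.

```python
import json
import urllib.request


def build_request(text: str, language: str = "fr",
                  operating_point: str = "balanced",
                  base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST /anonymize request without sending it."""
    payload = {"text": text, "language": language, "operating_point": operating_point}
    return urllib.request.Request(
        url=f"{base_url}/anonymize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anonymize(text: str, **kwargs) -> dict:
    """Send the request to a running server and parse the reply as JSON.

    The response schema is not documented here, so the parsed object
    is returned unchanged.
    """
    with urllib.request.urlopen(build_request(text, **kwargs), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request("Le contrat signé par Jean Dupont le 15 mars 2024.",
                    operating_point="high_recall")
print(req.get_method(), req.full_url)
```

Calling `anonymize(...)` requires the server from the previous step to be running on the given host and port.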

---

## Regulatory Coverage

PrivaMesh Legal is designed to support compliance with the following regulatory frameworks:

| Regulation | Coverage | Notes |
|---|---|---|
| **RGPD / GDPR** | ✅ Full | Art. 4, 25 (privacy by design), Art. 89 (pseudonymisation) |
| **DORA** (EU 2022/2554) | ✅ Full | ICT risk documentation, third-party contracts |
| **NIS2** (EU 2022/2555) | ✅ Full | Incident reports, supplier contracts |
| **ISO 27001:2022** | ✅ Full | Audit reports, ISMS documentation |
| **ISO/IEC 42001:2023** | ✅ Full | AI system documentation, risk assessments |
| **EU AI Act** | ✅ Full | High-risk AI documentation, conformity assessments |
| **CCPA** (California) | ⚠️ Partial | EN documents, US legal entities |
| **HIPAA** | ⚠️ Partial | Use `privamesh-medical` for full HIPAA coverage |

---

## Limitations & Risks

### Known limitations

**1. Language coverage**
PrivaMesh Legal is optimized for French and English. Performance may degrade on other languages, mixed-language documents with code-switching, or heavily technical jargon outside the training distribution.

**2. Rare naming conventions**
Detection performance may be lower for names following naming conventions underrepresented in training data (some regional French dialects, transliterated names, highly abbreviated forms).

**3. Implicit PII**
PrivaMesh Legal detects explicit PII. Implicit or inferred PII (e.g., identifying someone from their unique job description without naming them) is not in scope and requires additional processing layers.

**4. Dynamic label policies**
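As with openai/privacy-filter, adding a category or changing what an existing category covers requires fine-tuning; it cannot be done through runtime configuration. The `active_labels` and `preserve_labels` arguments are the exception: they only filter, after detection, which of the detected labels are replaced.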

**5. Not a legal guarantee**
PrivaMesh Legal is a technical anonymization aid. It does not constitute legal advice or a guarantee of RGPD compliance. Human review is recommended for high-stakes workflows.

### Risk: Over-reliance

**Do not use PrivaMesh Legal as your sole anonymization layer for high-sensitivity documents.** It is designed as a primary processing layer in a privacy-by-design architecture that includes human review, audit trails, and access controls.

### Responsible use

PrivaMesh Legal is intended for **data protection and privacy-preserving AI workflows**. It must not be used to:
- Circumvent legitimate legal discovery or regulatory oversight
- Process data without appropriate legal basis
- Bypass consent mechanisms required under RGPD

---

## Citation

If you use PrivaMesh Legal in your research or production systems, please cite:

```bibtex
@misc{privamesh2026legal,
  title     = {PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign Agentic AI Pipelines},
  author    = {Sabri ALLANI and Ahmed HERSI},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sallani/PrivaMesh},
  note      = {PrivaMesh Legal — Domain-specialized SLM for legal and compliance document anonymization. Base model: Mistral-Small-3.1 (Apache 2.0)}
}
```

> 📄 **Paper**: *"PrivaMesh: A Collaborative Multi-SLM Framework for Semantic Data Anonymization in Sovereign On-Premise Agentic AI Pipelines"*. arXiv preprint submission (2026); under review at a Q1 journal.

---

## Contributing

PrivaMesh is an open research initiative. Contributions welcome:

- 🐛 [Report issues](https://huggingface.co/sallani/PrivaMesh/discussions)
- 📊 [Share evaluation results](https://huggingface.co/sallani/PrivaMesh/discussions)
- 🔧 [Contribute to the framework](https://github.com/sallani/privamesh)
- 📝 [Request new domains](https://huggingface.co/sallani/PrivaMesh/discussions)

---

## License

**Apache 2.0** — Free for research, experimentation, and commercial deployment.

Built on **Mistral-Small-3.1** (Apache 2.0) by Mistral AI, Paris 🇫🇷

See [LICENSE](https://huggingface.co/sallani/PrivaMesh/blob/main/LICENSE) for full terms.

---

<p align="center">
  <strong>PrivaMesh</strong> — Collaborative Multi-SLM Semantic Anonymization<br/>
  <em>Built on Mistral. Built for sovereign AI. Designed for regulated industries.</em><br/>
  <em>🇫🇷 French-native · European sovereign · Apache 2.0</em><br/><br/>
  <a href="https://github.com/sallani/privamesh">GitHub</a> ·
  <a href="https://huggingface.co/sallani/PrivaMesh">HuggingFace</a> ·
  <a href="https://privamesh.ai">Website</a>
</p>