---
license: apache-2.0
language: en
tags:
  - ner
  - pii
  - privacy
  - token-classification
  - deberta
  - onnx
library_name: onnxruntime
pipeline_tag: token-classification
---

# Shade V5 — On-Device PII Detection

Shade V5 is a fast, accurate PII (personally identifiable information) detection model for privacy-preserving AI pipelines. It detects 12 entity types with a 97.6% F1 score in-distribution and 97.3% out-of-distribution.

## Quick Start

```bash
pip install veil-phantom
```

```python
from veil_phantom import VeilClient

veil = VeilClient()  # auto-downloads this model
result = veil.redact("John Smith sent $5M to john@acme.com")
result.sanitized  # "[PERSON_1] sent [AMOUNT_1] to [EMAIL_1]"
```
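
To make the placeholder format in the output above concrete, here is a minimal sketch of how numbered tokens such as `[PERSON_1]` could be produced from detected spans. It is not VeilPhantom's implementation: `redact_spans`, the `(start, end, type)` span tuples, and the `aliases` map (used here to turn the model's `MONEY` type into the `AMOUNT` token shown above) are all assumptions made for illustration.

```python
def redact_spans(text, spans, aliases=None):
    """Replace each (start, end, entity_type) span with a numbered placeholder.

    Hypothetical helper: assumes spans are character offsets and do not overlap.
    """
    aliases = aliases or {}
    counters = {}  # placeholder name -> how many distinct values seen so far
    seen = {}      # (name, surface text) -> placeholder, so repeats share a token
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        name = aliases.get(entity_type, entity_type)
        key = (name, text[start:end])
        if key not in seen:
            counters[name] = counters.get(name, 0) + 1
            seen[key] = f"[{name}_{counters[name]}]"
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(seen[key])           # placeholder instead of the PII
        cursor = end
    pieces.append(text[cursor:])           # trailing text after the last span
    return "".join(pieces)


text = "John Smith sent $5M to john@acme.com"
# Character offsets written by hand for this example sentence.
spans = [(0, 10, "PERSON"), (16, 19, "MONEY"), (23, 36, "EMAIL")]
redacted = redact_spans(text, spans, aliases={"MONEY": "AMOUNT"})
print(redacted)  # [PERSON_1] sent [AMOUNT_1] to [EMAIL_1]
```

Keying the lookup on the surface text means a name that appears twice maps to the same placeholder, which keeps the sanitized text internally consistent.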

## Model Details

| Property | Value |
|----------|-------|
| Architecture | DeBERTa-v3-xsmall |
| Parameters | 22M backbone (the 128K-vocab embedding layer adds about 48M) |
| Format | ONNX |
| Size | 270 MB |
| Inference | < 50 ms on CPU |
| F1 Score (in-distribution) | 97.6% |
| F1 Score (out-of-distribution) | 97.3% |
| Task | BIO Token Classification |
| Labels | 25 (12 entity types × B/I + O) |
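
The label count follows directly from the BIO scheme: each entity type gets a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag for non-entity tokens. The sketch below reproduces that arithmetic using the type names from the Entity Types table. The `B-`/`I-` string format and the ordering are assumptions; the authoritative mapping is the one shipped in `shade_label_map.json`.

```python
# Entity type names as listed in the Entity Types table.
ENTITY_TYPES = [
    "PERSON", "ORG", "EMAIL", "PHONE", "MONEY", "DATE",
    "ADDRESS", "GOVID", "BANKACCT", "CARD", "IPADDR", "CASE",
]


def build_bio_labels(entity_types):
    """Return the BIO tag set: one O tag plus B-/I- tags per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # continuation token of an entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))  # 25 = 12 types x 2 tags + 1 O tag
```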

## Entity Types

| Type | F1 | Examples |
|------|-----|----------|
| PERSON | 96.3% | Names (Western, African, Asian, South African) |
| ORG | 97.6% | Companies, institutions |
| EMAIL | 100% | Email addresses |
| PHONE | 98.4% | Phone numbers (international formats) |
| MONEY | 99.6% | Monetary amounts |
| DATE | 97.8% | Dates, times, schedules |
| ADDRESS | 99.4% | Street addresses |
| GOVID | 97.7% | SSN, SA ID, passport |
| BANKACCT | 92.9% | Bank account numbers, IBAN |
| CARD | 100% | Credit/debit card numbers |
| IPADDR | 100% | IP addresses |
| CASE | 97.8% | Legal case numbers |

## Training

- **Base model**: microsoft/deberta-v3-xsmall
- **Training data**: 116K examples drawn from business meetings, legal proceedings, and financial transactions
- **Tokenizer**: Unigram (128K vocab)
- **OOD gap**: 0.3 percentage points (97.6% in-distribution → 97.3% out-of-distribution)

## Files

- `ShadeV5.onnx` — ONNX model (270 MB)
- `tokenizer.json` — HuggingFace fast tokenizer
- `tokenizer_config.json` — Tokenizer configuration
- `shade_label_map.json` — BIO label → entity type mapping
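
When the model is run directly from these files rather than through the SDK, the per-token predictions still have to be merged into entity spans. The sketch below shows one common way to do that, assuming the predicted label ids have already been mapped to BIO tag strings and that the tokenizer supplies a `(start, end)` character offset for each token. The function name, the tag string format, and the word-level tokenization in the example are illustrative assumptions, not the SDK's actual post-processing.

```python
def bio_to_spans(tags, offsets):
    """Merge per-token BIO tags into (start, end, entity_type) character spans."""
    spans = []
    current = None  # [start, end, entity_type] of the span being built
    for tag, (start, end) in zip(tags, offsets):
        if tag == "O":
            if current is not None:
                spans.append(tuple(current))
                current = None
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current is not None and current[2] == entity_type:
            current[1] = end  # extend the open span
        else:
            # A B- tag, or a stray I- tag, starts a new span.
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, entity_type]
    if current is not None:
        spans.append(tuple(current))
    return spans


# Hypothetical word-level tokens for "John Smith sent $5M to john@acme.com".
tags = ["B-PERSON", "I-PERSON", "O", "B-MONEY", "O", "B-EMAIL"]
offsets = [(0, 4), (5, 10), (11, 15), (16, 19), (20, 22), (23, 36)]
spans = bio_to_spans(tags, offsets)
print(spans)  # [(0, 10, 'PERSON'), (16, 19, 'MONEY'), (23, 36, 'EMAIL')]
```

Treating a stray `I-` tag as the start of a new span is a lenient choice; a stricter decoder could drop such tokens instead.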

## License

Apache 2.0

## Part of VeilPhantom

This model powers [VeilPhantom](https://github.com/veil-privacy/veil-phantom), an open-source PII redaction SDK for agentic AI pipelines.