## Model description

SensiGuard-PII is a token-classification model fine-tuned to detect common PII/PCI/PHI fields (e.g., names, emails, phone numbers, SSNs, card numbers, bank details, IPs, and API keys). It fine-tunes the microsoft/deberta-v3-base encoder on a mixture of synthetic, weak-labeled, and public PII datasets, using BIO tagging with class weighting to handle label imbalance.

Sample Usage:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "your_namespace/SensiGuard-PII"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

nlp = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
print(nlp(text))
# [{'entity_group': 'SSN', 'score': 0.99, 'word': '123-45-6789', 'start': 10, 'end': 21}, ...]
```

## Intended uses & limitations

### Intended Uses

- Ingress/egress scanning for applications or LLM systems to identify sensitive spans.
- Redaction or logging workflows where you need start/end offsets and label types.
- Semi-supervised bootstrapping: weak-label new corpora with this model and fine-tune further.
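
The start/end offsets make the redaction workflow straightforward. A minimal sketch, assuming entity dicts in the shape the token-classification pipeline emits with `aggregation_strategy="simple"` (the entity list and the `CARD` label name here are hand-written for illustration, not real model output):

```python
# Minimal redaction sketch using start/end offsets from pipeline output.
# The entity list below is hand-written for illustration; in practice it
# would come from nlp(text).
def redact(text: str, entities: list) -> str:
    # Replace spans right-to-left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
entities = [
    {"entity_group": "SSN", "start": 10, "end": 21},
    {"entity_group": "CARD", "start": 37, "end": 56},
]
print(redact(text, entities))  # My SSN is [SSN] and my card is [CARD].
```

Replacing right-to-left matters: substituting a span changes the offsets of everything after it, so processing in descending `start` order keeps the remaining offsets valid.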

### Limitations

- Not a silver bullet: precision/recall can vary by domain, language (primarily English), and formatting.
- PCI: card-number coverage varies across formats; pair with regex + Luhn validation and post-processing thresholds.
- May miss edge cases or yield false positives on lookalike numbers/strings; test on your own data.
- No safety/ethical filtering beyond PII detection; downstream policy is your responsibility.
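
The regex + Luhn pairing mentioned above can look like the sketch below; the `CARD_CANDIDATE` pattern and `luhn_valid` helper are illustrative, not part of SensiGuard-PII itself:

```python
import re

# Candidate scan: 13-19 digits, optionally separated by spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if not 13 <= len(digits) <= 19:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

text = "card is 4111 1111 1111 1111, ref 1234 5678 9012 3456"
for m in CARD_CANDIDATE.finditer(text):
    print(m.group(), luhn_valid(m.group()))
# 4111 1111 1111 1111 True
# 1234 5678 9012 3456 False
```

Intersecting model spans with Luhn-valid regex hits is a cheap way to cut false positives on lookalike digit strings (order numbers, reference IDs) that happen to be card-length.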

## Training and evaluation data