## Model description

SensiGuard-PII is a token-classification model fine-tuned to detect common PII/PCI/PHI fields (e.g., names, emails, phone numbers, SSNs, card numbers, bank details, IPs, and API keys). It fine-tunes the microsoft/deberta-v3-base encoder on a mixture of synthetic, weak-labeled, and public PII datasets, using BIO tagging with class weighting to handle label imbalance.

Sample Usage:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "your_namespace/SensiGuard-PII"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

nlp = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
print(nlp(text))
# [{'entity_group': 'SSN', 'score': 0.99, 'word': '123-45-6789', 'start': 10, 'end': 21}, ...]
```

## Intended uses & limitations

### Intended Uses

- Ingress/egress scanning for applications or LLM systems to identify sensitive spans.
- Redaction or logging workflows where you need start/end offsets and label types.
- Semi-supervised bootstrapping: weak-label new corpora with this model and fine-tune further.
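
The start/end offsets make the redaction workflow straightforward. A minimal sketch, assuming entity dicts in the shape the token-classification pipeline emits with `aggregation_strategy="simple"` (the entity list and the `CARD` label name here are hand-written for illustration, not real model output):

```python
# Minimal redaction sketch using start/end offsets from pipeline output.
# The entity list below is hand-written for illustration; in practice it
# would come from nlp(text).
def redact(text: str, entities: list) -> str:
    # Replace spans right-to-left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
entities = [
    {"entity_group": "SSN", "start": 10, "end": 21},
    {"entity_group": "CARD", "start": 37, "end": 56},
]
print(redact(text, entities))  # My SSN is [SSN] and my card is [CARD].
```

Replacing right-to-left matters: substituting a span changes the offsets of everything after it, so processing in descending `start` order keeps the remaining offsets valid.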

### Limitations

- Not a silver bullet: precision/recall can vary by domain, language (primarily English), and formatting.
- PCI: card-number coverage varies across formats; pair with regex + Luhn validation and post-processing thresholds.
- May miss edge cases or yield false positives on lookalike numbers/strings; test on your own data.
- No safety/ethical filtering beyond PII detection; downstream policy is your responsibility.
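
The regex + Luhn pairing mentioned above can look like the sketch below; the `CARD_CANDIDATE` pattern and `luhn_valid` helper are illustrative, not part of SensiGuard-PII itself:

```python
import re

# Candidate scan: 13-19 digits, optionally separated by spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if not 13 <= len(digits) <= 19:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

text = "card is 4111 1111 1111 1111, ref 1234 5678 9012 3456"
for m in CARD_CANDIDATE.finditer(text):
    print(m.group(), luhn_valid(m.group()))
# 4111 1111 1111 1111 True
# 1234 5678 9012 3456 False
```

Intersecting model spans with Luhn-valid regex hits is a cheap way to cut false positives on lookalike digit strings (order numbers, reference IDs) that happen to be card-length.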

## Training and evaluation data