TensorGreed committed on
Commit
ccf904d
·
verified ·
1 Parent(s): e41bdd4

Update README.md

Files changed (1)
  1. README.md +26 -2
README.md CHANGED
@@ -27,11 +27,35 @@ It achieves the following results on the evaluation set:
27
 
28
  ## Model description
29
 
30
- More information needed
31
 
32
  ## Intended uses & limitations
33
 
34
- More information needed
35
 
36
  ## Training and evaluation data
37
 
 
27
 
28
  ## Model description
29
 
30
+ SensiGuard-PII is a token-classification model fine-tuned to detect common PII/PCI/PHI fields (e.g., names, email addresses, phone numbers, SSNs, card numbers, bank details, IP addresses, API keys). The base encoder is microsoft/deberta-v3-base, fine-tuned on a mixture of synthetic, weakly labeled, and public PII datasets using BIO tagging, with class weighting to handle label imbalance.
31
+ Sample Usage:
32
+ ```python
33
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
34
+
35
+ model_id = "your_namespace/SensiGuard-PII"
36
+ tok = AutoTokenizer.from_pretrained(model_id)
37
+ model = AutoModelForTokenClassification.from_pretrained(model_id)
38
+
39
+ nlp = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
40
+ text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
41
+ print(nlp(text))
42
+ # [{'entity_group': 'SSN', 'score': 0.99, 'word': '123-45-6789', 'start': 10, 'end': 21}, ...]
43
+ ```
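The model description mentions BIO tagging. As a rough illustration only (this is not the author's code), the sketch below collapses per-token BIO tags into labeled spans, which is similar in spirit to what an aggregation step does with the raw per-token predictions. The helper name `bio_to_spans` and the `CARD` label are assumptions made for the example; the model's actual label set is not listed in this diff.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-token BIO tags into (label, start, end) token spans.

    `end` is exclusive. A stray I- tag with no matching open span is
    treated leniently as the start of a new span.
    """
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    # The trailing "O" is a sentinel that flushes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return spans


# Hypothetical tag sequence; "CARD" is an assumed label name.
tags = ["O", "B-SSN", "I-SSN", "O", "B-CARD", "I-CARD", "I-CARD"]
print(bio_to_spans(tags))
# [('SSN', 1, 3), ('CARD', 4, 7)]
```

The spans here are token indices; mapping them back to character offsets would rely on the tokenizer's offset mapping, which the pipeline in the sample usage handles for you.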
44
 
45
  ## Intended uses & limitations
46
 
47
+ ### Intended Uses
48
+
49
+ - Ingress/egress scanning for applications or LLM systems to identify sensitive spans.
50
+ - Redaction or logging workflows where you need start/end offsets and label types.
51
+ - Semi-supervised bootstrapping: weak-label new corpora with this model and fine-tune further.
52
+
53
+ ### Limitations
54
+
55
+ - Not a silver bullet: precision/recall can vary by domain, language (primarily English), and formatting.
56
+ - PCI: card-number detection needs broader coverage of diverse card formats; pair the model with regex + Luhn validation and post-processing score thresholds.
57
+ - May miss edge cases or yield false positives on lookalike numbers/strings; test on your own data.
58
+ - No safety/ethical filtering beyond PII detection; downstream policy is your responsibility.
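The regex + Luhn pairing suggested in the limitations could look roughly like the sketch below. The pattern `CARD_RE`, the 13 to 19 digit range, and the helper names are assumptions for illustration, not part of the model or its repository; real deployments would likely need issuer-specific patterns as well.

```python
import re
from typing import List, Tuple

# Assumed pattern: 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Double every second digit counting from the right; subtract 9 if above 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of Luhn-valid card-like numbers."""
    return [(m.start(), m.end()) for m in CARD_RE.finditer(text) if luhn_ok(m.group())]


text = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
print(find_card_spans(text))
# [(37, 56)]
```

One way to combine this with the model is to keep a card prediction only when its span also passes `luhn_ok`, and to treat Luhn-valid regex hits the model missed as extra candidates for review.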
59
 
60
  ## Training and evaluation data
61