Initial release: compliance classifier v1 (134M params, 99.2% accuracy)

Files changed (3)

README.md +175 -0
deycoding.compliance-classifier-in-1-0.json +0 -0
deycoding.compliance-classifier-in-1-0.pt +3 -0

README.md CHANGED

@@ -1,3 +1,178 @@
 ---
 license: cc-by-nc-4.0
+language:
+- en
+- hi
+tags:
+- bert
+- classifier
+- compliance
+- pii-detection
+- fsi
+- query-routing
+- financial-services
+library_name: pytorch
+pipeline_tag: text-classification
+model-index:
+- name: deycoding.compliance-classifier-in-1-0
+  results:
+  - task:
+      type: text-classification
+    metrics:
+    - name: Accuracy
+      type: accuracy
+      value: 99.2
 ---
+# BERT Compliance Classifier Router
+A 134M-parameter BERT encoder trained from scratch to classify Financial Services (FSI) queries, combining PII detection with compliance-aware routing.
+## Model Description
+This model classifies incoming user queries into 4 routing categories for cost-optimized, compliance-aware LLM serving in regulated industries:
+| Label | Complexity | PII | Routing Action |
+|-------|-----------|-----|----------------|
+| `simple_no_pii` | Low | No | Small model, cross-region allowed |
+| `simple_pii` | Low | Yes | Small model, local only (data residency) |
+| `complex_no_pii` | High | No | Large model, cross-region allowed |
+| `complex_pii` | High | Yes | Large model, local only (data residency) |
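The table above maps directly onto a lookup. A minimal sketch follows; the `Route` type, the tier names `small` and `large`, and the `cross_region_allowed` flag are illustrative names chosen here, not part of the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_tier: str             # "small" or "large" (illustrative tier names)
    cross_region_allowed: bool  # False means the query must stay in-region


# One entry per classifier label, mirroring the routing table.
ROUTES = {
    "simple_no_pii": Route("small", True),
    "simple_pii": Route("small", False),
    "complex_no_pii": Route("large", True),
    "complex_pii": Route("large", False),
}


def route_for(label: str) -> Route:
    """Return the routing action for a classifier label."""
    try:
        return ROUTES[label]
    except KeyError:
        raise ValueError(f"unknown label: {label!r}") from None
```

Rejecting unknown labels with an error, rather than falling back to a default route, keeps a mislabeled query from silently leaving the region.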
+## Key Results
+- **Accuracy:** 99.2%
+- **PII Recall:** ~100%
+- **Latency:** ~7ms (GPU) / ~72ms (CPU)
+- **Throughput:** ~130 queries/sec per GPU
+- **Model Size:** 134M parameters / ~530 MB
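The card does not say how PII recall is computed. One plausible reading, assumed in the sketch below, is the fraction of true PII examples (`simple_pii` or `complex_pii`) that are predicted as either PII class, regardless of complexity. The function names are illustrative.

```python
PII_LABELS = {"simple_pii", "complex_pii"}


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pii_recall(y_true, y_pred):
    """Fraction of true-PII examples predicted as any PII class (assumed definition)."""
    preds_for_pii = [p for t, p in zip(y_true, y_pred) if t in PII_LABELS]
    if not preds_for_pii:
        raise ValueError("no PII examples in y_true")
    return sum(p in PII_LABELS for p in preds_for_pii) / len(preds_for_pii)


y_true = ["simple_pii", "complex_pii", "simple_no_pii", "complex_no_pii"]
y_pred = ["complex_pii", "complex_pii", "simple_no_pii", "simple_no_pii"]
print(accuracy(y_true, y_pred), pii_recall(y_true, y_pred))  # 0.5 1.0
```

Under this definition a complexity mistake on a PII query lowers accuracy but not PII recall, since the query would still be kept local.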
+## Files
+| File | Description |
+|------|-------------|
+| `deycoding.compliance-classifier-in-1-0.pt` | Model weights (PyTorch state_dict) |
+| `deycoding.compliance-classifier-in-1-0.json` | BPE Tokenizer (32K vocab) |
+## Architecture
+- **Type:** BERT Encoder (bidirectional transformer, no causal mask)
+- **Dimensions:** 768
+- **Layers:** 12
+- **Attention Heads:** 12
+- **FFN Dimension:** 3072
+- **Max Sequence Length:** 128 tokens (inference) / 512 tokens (pre-training)
+- **Vocabulary:** 32,000 (BPE, includes `<mask>` token)
+- **Activation:** GELU
+- **Normalization:** LayerNorm
+- **Classification Head:** Linear(768→768) → Tanh → Dropout → Linear(768→4)
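The dimensions above are enough for a rough parameter estimate. The sketch below assumes a standard BERT layout that the card does not confirm: learned position embeddings for 512 positions, biases on every linear layer, two LayerNorms per encoder layer plus one after the embeddings, no segment embeddings, and no MLM output layer.

```python
def estimate_params(vocab=32_000, d_model=768, n_layers=12, d_ffn=3072,
                    max_pos=512, n_classes=4):
    """Rough parameter count under the assumptions stated in the lead-in."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * d_model  # scale and shift
    embeddings = vocab * d_model + max_pos * d_model + layer_norm
    attention = 4 * linear(d_model, d_model)  # Q, K, V, output projections
    ffn = linear(d_model, d_ffn) + linear(d_ffn, d_model)
    encoder = n_layers * (attention + ffn + 2 * layer_norm)
    head = linear(d_model, d_model) + linear(d_model, n_classes)
    return {"embeddings": embeddings, "encoder": encoder, "head": head}


parts = estimate_params()
total = sum(parts.values())
print(parts, total)
```

Under these assumptions the estimate is about 110.6M parameters. At 4 bytes per value (assuming float32 storage) that is about 442 MB, close to the 441,368,485-byte weights file listed in this commit. The reported 134M figure may count components not modeled here; the sketch makes no claim about which.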
+## Training
+### Pre-training
+- **Objective:** Masked Language Model (MLM), 15% masking (80/10/10)
+- **Data:** English Wikipedia (2B tokens, 500K steps)
+- **Batch size:** 8, sequence length: 512
+- **LR:** 1e-4 → 1e-5 (cosine schedule, warmup 2000 steps)
+- **Hardware:** NVIDIA L4 (24 GB), ~48 hours
+- **Final Loss:** 1.815
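The 80/10/10 masking rule and the learning-rate schedule above can be sketched in plain Python. This is an illustration, not the training code: `mask_id` is a parameter because the id of `<mask>` is not given here, special tokens are not excluded from masking, and the linear shape of the warmup is an assumption (the card only states its length).

```python
import math
import random


def mask_tokens(ids, mask_id, vocab_size, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs, labels = list(ids), [None] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mask_prob:
            continue                  # position not selected for prediction
        labels[i] = tok               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id       # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


def lr_at(step, total_steps=500_000, warmup=2_000, peak=1e-4, floor=1e-5):
    """Linear warmup (assumed) to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

The stated batch size, sequence length, and step count are consistent with the reported token budget: 8 × 512 × 500,000 = 2,048,000,000 tokens.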
+### Fine-tuning
+- **Data:** 50,000 synthetic FSI examples (balanced, 12,500 per class)
+- **PII Types:** 14, including PAN, Aadhaar, phone, email, UPI, account number, card, DOB, DL, voter ID, passport, address, and IFSC
+- **Input Formats:** Structured + unstructured (human-typed messy input)
+- **Languages:** English + Hinglish (15%)
+- **Steps:** 8,000, batch=32, LR=2e-5 → 2e-6
+- **Hardware:** NVIDIA L4, ~15 minutes
+- **Final Accuracy:** 99.2%
+## Usage
+```python
+import torch
+import torch.nn.functional as F
+from tokenizers import Tokenizer
+# Load tokenizer
+tokenizer = Tokenizer.from_file("deycoding.compliance-classifier-in-1-0.json")
+# Load weights. `model` must already be an instance of the architecture class,
+# which is not included in these files (see repository)
+model.load_state_dict(torch.load("deycoding.compliance-classifier-in-1-0.pt", map_location="cpu"))
+model.eval()
+# Classify a query
+text = "Check balance for Amit Patel account 4532-8876-1234"
+ids = tokenizer.encode(text).ids[:128]
+pad_len = 128 - len(ids)
+input_ids = torch.tensor([ids + [0] * pad_len])
+attn_mask = torch.tensor([[1] * len(ids) + [0] * pad_len])
+with torch.no_grad():
+    probs = F.softmax(model(input_ids, pad_mask=attn_mask), dim=-1)
+labels = ["simple_no_pii", "simple_pii", "complex_no_pii", "complex_pii"]
+prediction = labels[probs.argmax().item()]
+confidence = probs.max().item() * 100
+print(f"{prediction} ({confidence:.1f}%)")
+# Output: simple_pii (100.0%)
+```
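The snippet above pads a single query by hand. For several queries at once, the same truncate-and-pad logic can be factored into a helper. This is a sketch: `pad_batch` is a hypothetical name, and the pad id of 0 and the 1/0 mask convention are taken from the snippet rather than verified against the tokenizer.

```python
def pad_batch(id_lists, max_len=128, pad_id=0):
    """Truncate each sequence to max_len and right-pad with pad_id.

    Returns (input_ids, attention_mask) as nested lists of equal-length rows.
    """
    input_ids, attention_mask = [], []
    for ids in id_lists:
        ids = list(ids)[:max_len]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = real token
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[5, 6, 7], [9] * 200], max_len=8)
print(batch_ids[0], batch_mask[0])
```

Both lists can be wrapped in `torch.tensor(...)` and passed to the model as in the snippet. With more than one row, use `probs.argmax(dim=-1)` to get one prediction per query, since `probs.argmax()` without `dim` indexes into the flattened tensor.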
+## PII Detection Capabilities
+Detects personally identifiable information (PII) in both structured and unstructured (human-typed) formats:
+| PII Type | Structured | Unstructured |
+|----------|-----------|--------------|
+| PAN | ABCDE1234F | pan abcde1234f |
+| Aadhaar | 1234 5678 9012 | aadhar no 123456789012 |
+| Phone | +91-98765-43210 | my number is 9876543210 |
+| Email | name@gmail.com | name at gmail dot com |
+| UPI | name@oksbi | my upi is 9876@paytm |
+| Account | 1234-5678-9012 | a/c 12345678 |
+| Card | XXXX-XXXX-XXXX-1234 | card ending 1234 |
+| DOB | 15/03/1990 | born on 15 march 1990 |
+| DL | MH-0120190012345 | dl number mh01 2019 0012345 |
+| Passport | J1234567 | passport J1234567 |
+| Voter ID | ABC1234567 | voter id ABC1234567 |
+| Address | Flat 4B, Tower 2, Koramangala | flat 4b tower 2 koramangala bangalore 560034 |
+| IFSC | SBIN0123456 | ifsc SBIN0123456 |
+## Intended Use
+- Query routing in multi-tier LLM serving architectures
+- PII detection for data residency compliance (GDPR, RBI, DPDP Act)
+- Cost optimization — route simple queries to cheaper models (65-73% savings)
+- Financial services, healthcare, legal — any regulated industry
+## Limitations
+- Trained on synthetic data — fine-tune on real queries for production
+- English + Hinglish only — other languages not covered
+- Max 128 tokens — very long queries get truncated
+- PII detection is learned (not regex) — may miss novel PII formats not in training data
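Because detection is learned, a deployment might pair the classifier with a small pattern check for identifiers whose format is fixed. The sketch below is a supplement suggested here, not part of the model. It covers only three structured formats (PAN, Aadhaar, IFSC), and the patterns are simplified: they check shape only, not checksums or issuer validity.

```python
import re

# Simplified shape checks; all names here are illustrative.
STRUCTURED_PATTERNS = {
    # PAN: five letters, four digits, one letter
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", re.IGNORECASE),
    # Aadhaar: twelve digits, optionally grouped in fours with spaces
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    # IFSC: four letters, a zero, six alphanumeric characters
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b", re.IGNORECASE),
}


def find_structured_pii(text):
    """Return the set of pattern names that match anywhere in the text."""
    return {name for name, pattern in STRUCTURED_PATTERNS.items()
            if pattern.search(text)}


print(find_structured_pii("pan abcde1234f, ifsc SBIN0123456"))
```

A match can be used to force a PII route even when the classifier predicts a non-PII class. An empty result says nothing about unstructured PII such as names or addresses.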
+## Ethical Considerations
+- Model makes routing decisions, not content decisions
+- PII detection is conservative (prefers false positive over false negative)
+- Data residency enforcement is architectural — PII queries physically cannot reach cross-region infrastructure
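The card does not describe how the preference for false positives is achieved. One way to bias decisions in that direction at inference time, shown here as an assumption rather than the author's method, is to sum the probabilities of the two PII classes and treat the query as PII whenever that sum passes a low threshold. The threshold value 0.2 is hypothetical and would need tuning on real traffic.

```python
# Label order follows the usage snippet:
# index 0 simple_no_pii, 1 simple_pii, 2 complex_no_pii, 3 complex_pii


def conservative_label(probs, pii_threshold=0.2):
    """Pick a label from 4 class probabilities, favoring PII when in doubt.

    PII and complexity are decided separately from marginal probabilities.
    """
    if len(probs) != 4:
        raise ValueError("expected 4 class probabilities")
    pii_mass = probs[1] + probs[3]        # P(query contains PII)
    complex_mass = probs[2] + probs[3]    # P(query is complex)
    has_pii = pii_mass >= pii_threshold   # low bar: prefer false positives
    is_complex = complex_mass >= 0.5
    complexity = "complex" if is_complex else "simple"
    return f"{complexity}_{'pii' if has_pii else 'no_pii'}"


print(conservative_label([0.70, 0.15, 0.05, 0.10]))  # simple_pii
```

In the example the top class is `simple_no_pii` at 0.70, but the combined PII probability of 0.25 exceeds the threshold, so the query is kept local.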
+## Citation
+```bibtex
+@misc{dey2026classifier,
+  title={Classifier-Gated Multi-Tier LLM Routing for Cost-Optimized Serving in Regulated Industries},
+  author={Abhishek Dey},
+  year={2026},
+  url={https://huggingface.co/deycoding/bert-compliance-classifier-router}
+}
+```
+## Author
+**Abhishek Dey**
+- HuggingFace: [deycoding](https://huggingface.co/deycoding)
+## License
+CC-BY-NC-4.0 — Non-commercial use permitted with attribution. Commercial licensing available upon request. Contact author for commercial inquiries.

deycoding.compliance-classifier-in-1-0.json ADDED

The diff for this file is too large to render. See raw diff

deycoding.compliance-classifier-in-1-0.pt ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4b8d98aff65ef2cdf39b8476c53341ed7308a53bfaa08289874f1e81ddd6554e
+size 441368485
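The pointer above records the SHA-256 digest and byte size of the real weights file, so a downloaded copy can be checked against it. A minimal sketch using only the standard library; the function name is illustrative.

```python
import hashlib
import os


def matches_lfs_pointer(path, expected_sha256, expected_size, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 from the pointer."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first; avoids hashing a truncated download
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

For this commit the call would be `matches_lfs_pointer("deycoding.compliance-classifier-in-1-0.pt", "4b8d98aff65ef2cdf39b8476c53341ed7308a53bfaa08289874f1e81ddd6554e", 441368485)`.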