Abhishek Dey committed on
Commit 5a9e457 · 1 Parent(s): de60867

Update model card: Indian Compliance Classifier Router

Files changed (1)
  1. README.md +28 -27
README.md CHANGED
@@ -4,13 +4,14 @@ language:
4
  - en
5
  - hi
6
  tags:
7
- - bert
8
  - classifier
9
  - compliance
10
  - pii-detection
11
  - fsi
12
  - query-routing
13
  - financial-services
 
14
  library_name: pytorch
15
  pipeline_tag: text-classification
16
  model-index:
@@ -24,15 +25,15 @@ model-index:
24
  value: 99.2
25
  ---
26
 
27
- # BERT Compliance Classifier Router
28
 
29
- A 134M parameter BERT encoder model built entirely from scratch — from pre-training through fine-tuning — designed specifically for regulated industries such as Financial Services, Healthcare, and Legal. The model serves as an intelligent query router that classifies incoming prompts based on two critical dimensions: query complexity and the presence of Personally Identifiable Information (PII). This enables cost-optimized, compliance-aware LLM serving where sensitive data never leaves local infrastructure.
30
 
31
  Unlike traditional keyword-based or regex-based PII detection, this model learns contextual patterns from training data, enabling it to detect PII in both structured formats (like PAN: ABCDE1234F) and unstructured human-typed input (like "my pan is abcde1234f" or "Mera PAN number batao"). The model supports English and Hinglish (Hindi-English code-mixed) queries commonly seen in Indian financial services.
32
 
33
  ## Model Description
34
 
35
- The core purpose of this model is to sit at the entry point of a multi-tier LLM serving architecture. When a user query arrives, this classifier makes a routing decision in under 10 milliseconds — determining whether the query should go to a small or large language model, and whether the data must stay on local infrastructure or can be processed cross-region. This architectural pattern delivers 65-73% cost savings compared to routing all queries to a single large model, while simultaneously enforcing data residency compliance that is architecturally impossible with Mixture-of-Experts (MoE) approaches.
36
 
37
  The model outputs one of four labels, each mapping directly to a routing action:
38
 
@@ -48,15 +49,15 @@ The model outputs one of four labels, each mapping directly to a routing action:
48
  The model achieves production-ready performance across all metrics. The 99.2% accuracy on a held-out test set demonstrates strong generalization, while the sub-10ms latency ensures the routing decision adds negligible overhead to the overall request lifecycle. The throughput of 130+ queries per second on a single GPU means a single instance can serve millions of classification requests per day.
49
 
50
  - **Accuracy:** 99.2%
51
- - **PII Recall:** ~100% (conservative — prefers false positive over missed PII)
52
  - **Latency:** ~7ms (GPU) / ~72ms (CPU)
53
  - **Throughput:** ~130 queries/sec per GPU instance
54
- - **Model Size:** 134M parameters / ~530 MB on disk
55
  - **Inference Memory:** ~530 MB VRAM (fits on any GPU including T4 16GB)
56
 
57
  ## Files
58
 
59
- The repository contains two files required for inference. The model weights file contains the trained parameters of the BERT encoder plus the classification head. The tokenizer file contains the BPE vocabulary and merge rules trained on English Wikipedia, including special tokens required for the model.
60
 
61
  | File | Size | Description |
62
  |------|------|-------------|
@@ -65,31 +66,31 @@ The repository contains two files required for inference. The model weights file
65
 
66
  ## Architecture
67
 
68
- The model uses a standard BERT encoder architecture — a bidirectional transformer that attends to all tokens simultaneously (no causal mask). This is fundamentally different from decoder-only models like GPT which can only see past tokens. The bidirectional attention is critical for classification tasks because the model needs to understand the full context of a query before making a routing decision. The classification is performed on the first token's hidden state, which aggregates information from the entire input sequence through self-attention.
69
 
70
- - **Type:** BERT Encoder (bidirectional transformer, no causal mask)
71
  - **Dimensions:** 768
72
  - **Layers:** 12
73
  - **Attention Heads:** 12
74
  - **FFN Dimension:** 3072
75
  - **Max Sequence Length:** 128 tokens (inference) / 512 tokens (pre-training)
76
- - **Vocabulary:** 32,000 (BPE, includes `<mask>` token at ID=4)
77
- - **Activation:** GELU (standard for BERT-family models)
78
  - **Normalization:** LayerNorm (pre-norm variant)
79
- - **Classification Head:** Linear(768→768) → Tanh → Dropout → Linear(768→4)
80
 
81
  ## Training
82
 
83
- The model was trained in two phases following the standard pre-train then fine-tune paradigm. Pre-training teaches the model general English language understanding through the Masked Language Model (MLM) objective — predicting randomly masked tokens from context. Fine-tuning then specializes this general understanding into the specific 4-class classification task using labeled FSI examples.
84
 
85
  ### Pre-training
86
 
87
- Pre-training was conducted on English Wikipedia (2B tokens) using the Masked Language Model objective. During each training step, 15% of input tokens are randomly masked (80% replaced with [MASK], 10% replaced with random token, 10% kept unchanged), and the model learns to predict the original tokens. This teaches deep contextual understanding of English language patterns, grammar, and world knowledge — providing a strong foundation for downstream classification.
88
 
89
  - **Objective:** Masked Language Model (MLM), 15% masking (80/10/10)
90
  - **Data:** English Wikipedia (2B tokens processed over 500K steps)
91
  - **Batch size:** 8, sequence length: 512
92
- - **Learning Rate:** 1e-4 → 1e-5 (cosine schedule, warmup 2000 steps)
93
  - **Optimizer:** AdamW (betas=0.9/0.999, weight_decay=0.01)
94
  - **Hardware:** NVIDIA L4 (24 GB), ~48 hours
95
  - **Precision:** BF16 mixed precision with gradient checkpointing
@@ -104,13 +105,13 @@ Fine-tuning was performed on 50,000 synthetic FSI examples carefully designed to
104
  - **Input Formats:** Structured + unstructured (human-typed messy input)
105
  - **Languages:** English + Hinglish (15% code-mixed)
106
  - **Steps:** 8,000, batch=32, sequence length=128
107
- - **Learning Rate:** 2e-5 → 2e-6 (cosine schedule)
108
  - **Hardware:** NVIDIA L4, ~15 minutes
109
  - **Final Accuracy:** 99.2%
110
 
111
  ## Usage
112
 
113
- Using this model requires defining the BERT architecture (as the weights are stored as a PyTorch state_dict without the architecture). The model accepts tokenized input with an attention mask and returns logits for 4 classes. A softmax converts logits to probabilities, and the argmax gives the predicted label.
114
 
115
  ```python
116
  import torch
@@ -120,7 +121,7 @@ from tokenizers import Tokenizer
120
  # Load tokenizer
121
  tokenizer = Tokenizer.from_file("deycoding.compliance-classifier-in-1-0.json")
122
 
123
- # Load model weights (requires architecture definition — see documentation)
124
  model.load_state_dict(torch.load("deycoding.compliance-classifier-in-1-0.pt", map_location="cpu"))
125
  model.eval()
126
 
@@ -168,7 +169,7 @@ This model is designed for deployment as a lightweight, low-latency routing laye
168
 
169
  - Query routing in multi-tier LLM serving architectures
170
  - PII detection for data residency compliance (GDPR, RBI Data Localization, India DPDP Act)
171
- - Cost optimization — route simple queries to cheaper/smaller models (65-73% savings)
172
  - Financial services (banking, insurance, capital markets)
173
  - Healthcare (patient data routing)
174
  - Legal (privileged information detection)
@@ -177,20 +178,20 @@ This model is designed for deployment as a lightweight, low-latency routing laye
177
 
178
  While the model achieves high accuracy on the test set, there are important limitations to consider before production deployment. The training data is synthetic (template-generated with variations), which means the model may not generalize perfectly to all real-world query patterns. Production deployment should include a feedback loop where misclassified queries are collected and used to improve subsequent training iterations.
179
 
180
- - Trained on synthetic data — recommend fine-tuning on real customer queries for production
181
- - English + Hinglish only — other Indian languages (Tamil, Telugu, Bengali, etc.) not covered
182
- - Max 128 tokens — very long queries get truncated, potentially losing PII at the end
183
- - PII detection is learned (not regex) — may miss novel PII formats not represented in training data
184
- - No entity extraction — model detects presence of PII but doesn't extract or mask specific values
185
- - Confidence calibration not verified — high confidence doesn't guarantee correctness on out-of-distribution inputs
186
 
187
  ## Ethical Considerations
188
 
189
- This model is designed to enhance privacy and compliance, not to circumvent it. The routing decisions enforce data residency by ensuring PII-containing queries physically cannot reach cross-region infrastructure. The model makes conservative decisions — when uncertain, it defaults to the most restrictive routing (local processing), preferring false positives (unnecessary local routing) over false negatives (PII leaking to cross-region).
190
 
191
  - Model makes routing decisions, not content decisions
192
  - PII detection is conservative (prefers false positive over missed PII)
193
- - Data residency enforcement is architectural — PII queries physically cannot reach cross-region infrastructure
194
  - No user data is stored or logged by the model itself
195
  - Model should be part of a defense-in-depth strategy, not the sole PII control
196
 
 
4
  - en
5
  - hi
6
  tags:
7
+ - encoder
8
  - classifier
9
  - compliance
10
  - pii-detection
11
  - fsi
12
  - query-routing
13
  - financial-services
14
+ - india
15
  library_name: pytorch
16
  pipeline_tag: text-classification
17
  model-index:
 
25
  value: 99.2
26
  ---
27
 
28
+ # Indian Compliance Classifier Router
29
 
30
+ A ~140M parameter encoder model built entirely from scratch - from pre-training through fine-tuning - designed specifically for regulated industries such as Financial Services, Healthcare, and Legal. The model serves as an intelligent query router that classifies incoming prompts based on two critical dimensions: query complexity and the presence of Personally Identifiable Information (PII). This enables cost-optimized, compliance-aware LLM serving where sensitive data never leaves local infrastructure.
31
 
32
  Unlike traditional keyword-based or regex-based PII detection, this model learns contextual patterns from training data, enabling it to detect PII in both structured formats (like PAN: ABCDE1234F) and unstructured human-typed input (like "my pan is abcde1234f" or "Mera PAN number batao"). The model supports English and Hinglish (Hindi-English code-mixed) queries commonly seen in Indian financial services.
33
 
34
  ## Model Description
35
 
36
+ The core purpose of this model is to sit at the entry point of a multi-tier LLM serving architecture. When a user query arrives, this classifier makes a routing decision in under 10 milliseconds - determining whether the query should go to a small or large language model, and whether the data must stay on local infrastructure or can be processed cross-region. This architectural pattern delivers 65-73% cost savings compared to routing all queries to a single large model, while simultaneously enforcing data residency compliance that is architecturally impossible with Mixture-of-Experts (MoE) approaches.
37
 
38
  The model outputs one of four labels, each mapping directly to a routing action:
39
 
 
49
  The model achieves production-ready performance across all metrics. The 99.2% accuracy on a held-out test set demonstrates strong generalization, while the sub-10ms latency ensures the routing decision adds negligible overhead to the overall request lifecycle. The throughput of 130+ queries per second on a single GPU means a single instance can serve millions of classification requests per day.
50
 
51
  - **Accuracy:** 99.2%
52
+ - **PII Recall:** ~100% (conservative - prefers false positive over missed PII)
53
  - **Latency:** ~7ms (GPU) / ~72ms (CPU)
54
  - **Throughput:** ~130 queries/sec per GPU instance
55
+ - **Model Size:** ~140M parameters / ~530 MB on disk
56
  - **Inference Memory:** ~530 MB VRAM (fits on any GPU including T4 16GB)
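
The "millions of requests per day" and "~530 MB" figures above can be sanity-checked with simple arithmetic. This is only a rough sketch: the 4 bytes per parameter assumes FP32 weights, which the card does not state, and the constant names are illustrative.

```python
# Back-of-the-envelope check of the reported throughput and size figures.
QUERIES_PER_SECOND = 130      # reported throughput per GPU instance
SECONDS_PER_DAY = 24 * 60 * 60
PARAMETERS = 140_000_000      # "~140M" as stated in the card
BYTES_PER_PARAMETER = 4       # assumption: weights stored as FP32

queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
weights_mib = PARAMETERS * BYTES_PER_PARAMETER / 2**20

print(f"{queries_per_day:,} queries/day")   # 11,232,000 queries/day
print(f"{weights_mib:.0f} MiB of weights")  # 534 MiB of weights
```

Under that assumption the weight footprint lands close to the reported ~530 MB, and sustained peak throughput gives roughly 11 million classifications per day per instance.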
57
 
58
  ## Files
59
 
60
+ The repository contains two files required for inference. The model weights file contains the trained parameters of the encoder plus the classification head. The tokenizer file contains the BPE vocabulary and merge rules trained on English Wikipedia, including special tokens required for the model.
61
 
62
  | File | Size | Description |
63
  |------|------|-------------|
 
66
 
67
  ## Architecture
68
 
69
+ The model uses a bidirectional transformer encoder architecture that attends to all tokens simultaneously (no causal mask). This is fundamentally different from decoder-only models like GPT which can only see past tokens. The bidirectional attention is critical for classification tasks because the model needs to understand the full context of a query before making a routing decision. The classification is performed on the first token's hidden state, which aggregates information from the entire input sequence through self-attention.
70
 
71
+ - **Type:** Bidirectional Transformer Encoder (no causal mask)
72
  - **Dimensions:** 768
73
  - **Layers:** 12
74
  - **Attention Heads:** 12
75
  - **FFN Dimension:** 3072
76
  - **Max Sequence Length:** 128 tokens (inference) / 512 tokens (pre-training)
77
+ - **Vocabulary:** 32,000 (BPE, includes special tokens)
78
+ - **Activation:** GELU
79
  - **Normalization:** LayerNorm (pre-norm variant)
80
+ - **Classification Head:** Linear(768-768) - Tanh - Dropout - Linear(768-4)
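
The classification head listed above might be written in PyTorch roughly as follows, reading the hyphens in the head description as "feeds into". This is a minimal sketch, not the repository's code: the class name `ClassificationHead`, the attribute names, and the dropout rate of 0.1 are assumptions, and the real state_dict keys may differ.

```python
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    """Linear(768, 768), Tanh, Dropout, Linear(768, 4) on the first token."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()
        self.dropout = nn.Dropout(dropout)  # assumed rate; not given in the card
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) produced by the encoder.
        first_token = hidden_states[:, 0, :]
        x = self.dropout(self.activation(self.dense(first_token)))
        return self.out_proj(x)  # (batch, num_labels) logits


# Random tensor standing in for encoder output at the 128-token inference length.
head = ClassificationHead().eval()
logits = head(torch.randn(2, 128, 768))
print(logits.shape)  # torch.Size([2, 4])
```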
81
 
82
  ## Training
83
 
84
+ The model was trained in two phases following the standard pre-train then fine-tune paradigm. Pre-training teaches the model general English language understanding through the Masked Language Model (MLM) objective - predicting randomly masked tokens from context. Fine-tuning then specializes this general understanding into the specific 4-class classification task using labeled FSI examples.
85
 
86
  ### Pre-training
87
 
88
+ Pre-training was conducted on English Wikipedia (2B tokens) using the Masked Language Model objective. During each training step, 15% of input tokens are randomly masked (80% replaced with mask token, 10% replaced with random token, 10% kept unchanged), and the model learns to predict the original tokens. This teaches deep contextual understanding of English language patterns, grammar, and world knowledge - providing a strong foundation for downstream classification.
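
The 80/10/10 masking rule described above can be sketched in plain Python. The function name `mask_tokens`, the `-100` "ignore" label (a common PyTorch convention), and `mask_id=4` (taken from the previous revision of this card) are assumptions; a real pipeline would also skip special tokens, which this sketch does not.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=random):
    """Return (inputs, labels); labels are -100 where no prediction is scored."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                       # ~85% of positions are left alone
        labels[i] = tok                    # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id            # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


rng = random.Random(0)
token_ids = list(range(100, 120))
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=32_000, rng=rng)
```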
89
 
90
  - **Objective:** Masked Language Model (MLM), 15% masking (80/10/10)
91
  - **Data:** English Wikipedia (2B tokens processed over 500K steps)
92
  - **Batch size:** 8, sequence length: 512
93
+ - **Learning Rate:** 1e-4 to 1e-5 (cosine schedule, warmup 2000 steps)
94
  - **Optimizer:** AdamW (betas=0.9/0.999, weight_decay=0.01)
95
  - **Hardware:** NVIDIA L4 (24 GB), ~48 hours
96
  - **Precision:** BF16 mixed precision with gradient checkpointing
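
The learning-rate schedule listed above (1e-4 decaying to 1e-5 on a cosine curve, with 2000 warmup steps over 500K steps) could look like the sketch below. The linear shape of the warmup and the function name `learning_rate` are assumptions; the card only says that a warmup was used.

```python
import math


def learning_rate(step, peak_lr=1e-4, final_lr=1e-5,
                  warmup_steps=2000, total_steps=500_000):
    """Warmup (assumed linear) to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


for step in (0, 2_000, 251_000, 500_000):
    print(step, f"{learning_rate(step):.2e}")
```

The same function would describe the fine-tuning schedule with `peak_lr=2e-5`, `final_lr=2e-6`, and `total_steps=8_000`, given whatever warmup (if any) was used there.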
 
105
  - **Input Formats:** Structured + unstructured (human-typed messy input)
106
  - **Languages:** English + Hinglish (15% code-mixed)
107
  - **Steps:** 8,000, batch=32, sequence length=128
108
+ - **Learning Rate:** 2e-5 to 2e-6 (cosine schedule)
109
  - **Hardware:** NVIDIA L4, ~15 minutes
110
  - **Final Accuracy:** 99.2%
111
 
112
  ## Usage
113
 
114
+ Using this model requires defining the encoder architecture (as the weights are stored as a PyTorch state_dict without the architecture). The model accepts tokenized input with an attention mask and returns logits for 4 classes. A softmax converts logits to probabilities, and the argmax gives the predicted label.
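
The softmax and argmax step described above is small enough to show without any framework. This is a sketch only: the `LABELS` entries are hypothetical placeholders rather than the model's real label names, and the logits are made-up numbers standing in for model output.

```python
import math

# Hypothetical placeholder names; substitute the four labels defined by the model card.
LABELS = ["label_0", "label_1", "label_2", "label_3"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = predict([0.2, 3.1, -1.0, 0.5])
print(label, round(confidence, 3))
```

Because confidence calibration has not been verified for this model, the returned probability is better treated as a ranking score than as a trustworthy likelihood.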
115
 
116
  ```python
117
  import torch
 
121
  # Load tokenizer
122
  tokenizer = Tokenizer.from_file("deycoding.compliance-classifier-in-1-0.json")
123
 
124
+ # Load model weights (requires architecture definition - see documentation)
125
  model.load_state_dict(torch.load("deycoding.compliance-classifier-in-1-0.pt", map_location="cpu"))
126
  model.eval()
127
 
 
169
 
170
  - Query routing in multi-tier LLM serving architectures
171
  - PII detection for data residency compliance (GDPR, RBI Data Localization, India DPDP Act)
172
+ - Cost optimization - route simple queries to cheaper/smaller models (65-73% savings)
173
  - Financial services (banking, insurance, capital markets)
174
  - Healthcare (patient data routing)
175
  - Legal (privileged information detection)
 
178
 
179
  While the model achieves high accuracy on the test set, there are important limitations to consider before production deployment. The training data is synthetic (template-generated with variations), which means the model may not generalize perfectly to all real-world query patterns. Production deployment should include a feedback loop where misclassified queries are collected and used to improve subsequent training iterations.
180
 
181
+ - Trained on synthetic data - recommend fine-tuning on real customer queries for production
182
+ - English + Hinglish only - other Indian languages (Tamil, Telugu, Bengali, etc.) not covered
183
+ - Max 128 tokens - very long queries get truncated, potentially losing PII at the end
184
+ - PII detection is learned (not regex) - may miss novel PII formats not represented in training data
185
+ - No entity extraction - model detects presence of PII but does not extract or mask specific values
186
+ - Confidence calibration not verified - high confidence does not guarantee correctness on out-of-distribution inputs
187
 
188
  ## Ethical Considerations
189
 
190
+ This model is designed to enhance privacy and compliance, not to circumvent it. The routing decisions enforce data residency by ensuring PII-containing queries physically cannot reach cross-region infrastructure. The model makes conservative decisions - when uncertain, it defaults to the most restrictive routing (local processing), preferring false positives (unnecessary local routing) over false negatives (PII leaking to cross-region).
191
 
192
  - Model makes routing decisions, not content decisions
193
  - PII detection is conservative (prefers false positive over missed PII)
194
+ - Data residency enforcement is architectural - PII queries physically cannot reach cross-region infrastructure
195
  - No user data is stored or logged by the model itself
196
  - Model should be part of a defense-in-depth strategy, not the sole PII control
197