---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---

# bert-tiny-pii-router: The Semantic Gatekeeper

> *In the era of massive LLMs, sometimes the smartest solution is the smallest one.*

This is a **42 MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" to predict *which* entities are present before extraction begins.

It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining 96% of the teacher's accuracy.

## The Problem: The "Blind Pipeline"
Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:
* **Regex** is fast but generates false positives (e.g., confusing dates `12/05/2024` for IP addresses).
* **NER** models are accurate but heavy and slow.
* **LLMs** are perfect extractors but expensive, high-latency, and overkill for simple tasks.

## The Solution: The Router Pattern
Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

**Example Decision:**
```json
Input:  "Schedule a meeting with John in London on Friday"
Output: {
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (save compute)
}
```
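Downstream, the dispatch itself is a simple threshold gate over this probability vector. A minimal sketch follows; the `engines` callables and their stub outputs are illustrative placeholders, not part of this repo:

```python
# Threshold-gated dispatch: run only the specialist engines whose routing
# probability clears the cutoff. Engine callables here are stubs.
THRESHOLD = 0.5

def dispatch(probs: dict, text: str, engines: dict) -> dict:
    """Run each engine whose routing probability is >= THRESHOLD."""
    results = {}
    for label, prob in probs.items():
        if prob >= THRESHOLD:
            results[label] = engines[label](text)
    return results

# Illustrative usage with stub engines:
engines = {
    "NER": lambda t: ["John"],
    "ADDRESS": lambda t: ["London"],
    "TEMPORAL": lambda t: ["Friday"],
    "REGEX": lambda t: [],
}
probs = {"NER": 0.99, "ADDRESS": 0.98, "TEMPORAL": 0.95, "REGEX": 0.01}
print(dispatch(probs, "Schedule a meeting with John in London on Friday", engines))
# REGEX never runs, so its compute is saved entirely.
```

The saving comes from the `if` guard: low-probability engines are skipped before any of their work begins.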

## Tag Purification: Solving Class Overlap

A major challenge in training this router was **Class Overlap**: standard models struggle when class definitions are ambiguous.

> **The Ambiguity:**
> * **Is a Date a Regex?** A date like `12/05/2024` fits a regex pattern. But semantically, it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a State a Name?** "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.

To fix this, the training data underwent a **Tag Purification** layer. We mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries:

| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |
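In code, purification is a many-to-one remap from granular tags to engine labels. The sketch below covers only the rows shown in the table above; the full 38-tag mapping is not reproduced here:

```python
# Sketch of the tag-purification remap: granular source tags collapse into
# the 4 routing engines. Only the tags from the table above are included;
# the real mapping covers 38 granular tags.
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIP": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "PERSON": "NER", "ORG": "NER",
}

def purify(tags: set) -> set:
    """Collapse a sample's granular PII tags into routing-engine labels."""
    return {TAG_TO_ENGINE[t] for t in tags if t in TAG_TO_ENGINE}

print(sorted(purify({"DATE", "CITY", "PERSON"})))
# ['ADDRESS', 'NER', 'TEMPORAL']
```

Because the map is many-to-one, each granular tag lands in exactly one engine, which is what enforces the strict semantic boundaries described above.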

## Student vs. Teacher (Distillation Results)

The model was trained using **Knowledge Distillation**. The student (`TinyBERT`) was forced to mimic the "soft targets" (thought process) of the teacher (`XLM-RoBERTa`), not just the final labels.

**Hardware:** Apple M4 Max
**Speedup:** 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |

## Evaluation Results

The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).

![Confusion Matrix](confusion_matrix.png)

## Usage

This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]
    
    # Map IDs to Labels
    active_routes = [
        model.config.id2label[i] 
        for i, score in enumerate(probs) 
        if score > threshold
    ]
    
    if not active_routes:
        return "⚡ Direct to LLM (No PII)"
        
    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS

```

## Training & Distillation Details

* **Teacher Model:** `xlm-roberta-base`
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
* **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified)
* **Loss Function:** Soft-target distillation (the student is trained to match the teacher's output probabilities in addition to the hard labels)

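The card does not spell out the exact objective, so the sketch below shows one common multi-label distillation formulation; `alpha`, `temperature`, and the MSE soft-target term are assumptions for illustration, not the published recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=2.0):
    """Illustrative multi-label KD loss (hyperparameters are assumptions):
    BCE against the hard labels, plus MSE between the temperature-scaled
    sigmoid probabilities of student and teacher (the soft targets)."""
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    soft_student = torch.sigmoid(student_logits / temperature)
    soft_teacher = torch.sigmoid(teacher_logits / temperature)
    soft = F.mse_loss(soft_student, soft_teacher)
    return alpha * hard + (1 - alpha) * soft

# Illustrative shapes: batch of 2 samples, 4 routing labels
student = torch.randn(2, 4)
teacher = torch.randn(2, 4)
labels = torch.randint(0, 2, (2, 4)).float()
print(distillation_loss(student, teacher, labels).item())
```

The soft-target term is what lets the student absorb the teacher's "thought process" (near-miss probabilities), rather than only the final 0/1 labels.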
## License

This project is licensed under the **Apache License 2.0**.
Fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` checkpoint:
```bibtex
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```


## Links

* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)