---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---
# bert-tiny-pii-router: The Semantic Gatekeeper
> *In the era of massive LLMs, sometimes the smartest solution is the smallest one.*
This is a **42MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" to predict *which* entities are present before extraction begins.
It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining 96% of the teacher's accuracy.
## The Problem: The "Blind Pipeline"
Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:
* **Regex** is fast but generates false positives (e.g., confusing dates `12/05/2024` for IP addresses).
* **NER** models are accurate but heavy and slow.
* **LLMs** are perfect extractors but expensive, high-latency, and overkill for simple tasks.
## The Solution: The Router Pattern
Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.
**Example Decision:**
```json
Input:  "Schedule a meeting with John in London on Friday"
Output: {
  "NER":      0.99,  // -> Trigger Name Extractor
  "ADDRESS":  0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95,  // -> Trigger Date Parser
  "REGEX":    0.01   // -> Skip Regex Engine (Save Compute)
}
```
## Tag Purification: Solving Class Overlap
A major challenge in training this router was **Class Overlap**. Standard models struggle with ambiguous definitions.
> **The Ambiguity:**
> * **Is a Date a Regex?** A date like `12/05/2024` fits a regex pattern. But semantically, it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a State a Name?** "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.
To fix this, the training data passed through a **Tag Purification** step: we mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries:
| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |
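The purification step above can be sketched as a simple mapping. The tag names below are the examples from the table; the full 38-tag mapping used in training is not reproduced here, so treat this as an illustrative subset:

```python
# Sketch of the Tag Purification mapping described above.
# Only the tags shown in the table are included; the full
# 38-tag mapping is an assumption of this sketch.
TAG_TO_ENGINE = {
    # Removed from Regex, routed to the temporal parser
    "DATE": "TEMPORAL",
    "TIME": "TEMPORAL",
    # Removed from NER, routed to the geocoder
    "CITY": "ADDRESS",
    "STATE": "ADDRESS",
    "ZIPCODE": "ADDRESS",
    # Kept with their original engines
    "IP": "REGEX",
    "EMAIL": "REGEX",
    "IBAN": "REGEX",
    "PERSON": "NER",
    "ORGANIZATION": "NER",
}

def purify_tags(granular_tags):
    """Collapse granular PII tags into the 4 routing-engine labels."""
    return sorted({TAG_TO_ENGINE[t] for t in granular_tags if t in TAG_TO_ENGINE})

print(purify_tags(["PERSON", "CITY", "DATE"]))
# -> ['ADDRESS', 'NER', 'TEMPORAL']
```

Each training sample then carries a 4-dimensional multi-hot label (one bit per engine) instead of the original granular annotation.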
## Student vs. Teacher (Distillation Results)
The model was trained using **Knowledge Distillation**. The student (`TinyBERT`) was forced to mimic the "soft targets" (thought process) of the teacher (`XLM-RoBERTa`), not just the final labels.
**Hardware:** Apple M4 Max
**Speedup:** 9.1x
| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |
## Evaluation Results
The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).

## Usage
This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]

    # Map IDs to labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"
    return f"🚦 Route to Engines: {', '.join(active_routes)}"
# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL
# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```
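Downstream, the returned routes decide which engines actually run. A minimal dispatch sketch (the handler functions here are hypothetical placeholders for the real NER model, geocoder, date parser, and regex scanner, not part of this repository):

```python
# Hypothetical engine handlers -- stand-ins for the real extractors.
def run_ner(text):      return f"NER({text})"
def run_geocoder(text): return f"ADDRESS({text})"
def run_temporal(text): return f"TEMPORAL({text})"
def run_regex(text):    return f"REGEX({text})"

ENGINES = {
    "NER": run_ner,
    "ADDRESS": run_geocoder,
    "TEMPORAL": run_temporal,
    "REGEX": run_regex,
}

def dispatch(text, active_routes):
    """Run only the engines the router selected; skip the rest."""
    return {route: ENGINES[route](text) for route in active_routes}

results = dispatch("Ship to 123 Main St, New York, NY", ["ADDRESS"])
print(results)  # only the geocoder ran; the other three engines were skipped
```

The compute saving comes precisely from the engines that are *not* in `active_routes`: for a pure-address query, the NER model, date parser, and regex scanner never execute.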
## Training & Distillation Details
* **Teacher Model:** `xlm-roberta-base`
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
* **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified)
* **Loss Function:** combined soft-target distillation loss (matching the teacher's logits) and hard-label multi-label classification loss
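A minimal sketch of such a multi-label distillation objective, assuming a temperature-scaled soft-target term mixed with the standard hard-label BCE (the temperature `T` and mixing weight `alpha` are illustrative choices, not values from this training run):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Multi-label KD sketch: the student mimics the teacher's sigmoid
    soft targets and also fits the ground-truth labels.
    T and alpha are illustrative assumptions."""
    # Soft targets: teacher probabilities at temperature T
    soft_teacher = torch.sigmoid(teacher_logits / T)
    soft_loss = F.binary_cross_entropy_with_logits(student_logits / T, soft_teacher)
    # Hard targets: ground-truth multi-label BCE
    hard_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy batch: 2 samples x 4 routing labels
s = torch.zeros(2, 4, requires_grad=True)
t = torch.tensor([[4.0, -4.0, 4.0, -4.0], [-4.0, 4.0, -4.0, 4.0]])
y = (t > 0).float()
loss = distillation_loss(s, t, y)
loss.backward()  # gradients flow back into the student
```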
## License
This project is licensed under the **Apache License 2.0**.
The student was fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` model:
```bibtex
@article{turc2019,
title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1908.08962v2},
year={2019}
}
```
## Links
* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)