---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---
|
|
|
|
|
# bert-tiny-pii-router: The Semantic Gatekeeper |
|
|
|
|
|
> *In the era of massive LLMs, sometimes the smartest solution is the smallest one.* |
|
|
|
|
|
This is a **42 MB Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running every heavy extraction engine (NER models, regex scanners, address parsers) on every user query, this model acts as a "Gatekeeper" that predicts *which* entity types are present before extraction begins.
|
|
|
|
|
It was distilled from an **XLM-RoBERTa** teacher into a **TinyBERT** student, achieving a **9.1x speedup** while retaining **96%** of the teacher's F1 score.
|
|
|
|
|
## The Problem: The "Blind Pipeline" |
|
|
Traditional PII extraction pipelines are wasteful. They run every specialist model on every input: |
|
|
* **Regex** is fast but generates false positives (e.g., mistaking dates like `12/05/2024` for IP addresses).
|
|
* **NER** models are accurate but heavy and slow. |
|
|
* **LLMs** are perfect extractors but expensive, high-latency, and overkill for simple tasks. |
|
|
|
|
|
## The Solution: The Router Pattern |
|
|
Instead of a linear pipeline, we use a **Multi-Label Classifier** at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed. |
|
|
|
|
|
**Example Decision:** |
|
|
```json
Input:  "Schedule a meeting with John in London on Friday"
Output: {
  "NER": 0.99,       // -> Trigger Name Extractor
  "ADDRESS": 0.98,   // -> Trigger Geocoder
  "TEMPORAL": 0.95,  // -> Trigger Date Parser
  "REGEX": 0.01      // -> Skip Regex Engine (Save Compute)
}
```
|
|
|
|
|
## Tag Purification: Solving Class Overlap |
|
|
|
|
|
A major challenge in training this router was **Class Overlap**: standard models struggle when the label definitions themselves are ambiguous.
|
|
|
|
|
> **The Ambiguity:**
> * **Is a Date a Regex?** A date like `12/05/2024` fits a regex pattern. But semantically, it belongs to the **TEMPORAL** engine, not the REGEX scanner.
> * **Is a State a Name?** "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (**ADDRESS**), not the Person/Org extractor.
|
|
|
|
|
To fix this, the training data was passed through a **Tag Purification** step: we mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries (a code sketch of the mapping follows the table):
|
|
|
|
|
| Source Tag | Action | Destination Engine (Label) |
| --- | --- | --- |
| **DATE / TIME** | **Removed** from Regex | `TEMPORAL` |
| **CITY / STATE / ZIP** | **Removed** from NER | `ADDRESS` |
| **IP / EMAIL / IBAN** | **Kept** in Regex | `REGEX` |
| **PERSON / ORG** | **Kept** in NER | `NER` |
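
Conceptually, the purification step is a deterministic re-mapping applied to each sample before training. A minimal sketch, assuming a tag-to-engine lookup table (the source tag names below are illustrative, not the dataset's exact label set):

```python
# Hypothetical tag -> routing-engine mapping used to build the 4 training labels.
# The source tag names are illustrative; the real dataset uses its own granular tags.
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIPCODE": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "FIRSTNAME": "NER", "LASTNAME": "NER", "ORGANIZATION": "NER",
}

ENGINES = ["NER", "ADDRESS", "TEMPORAL", "REGEX"]

def purify(sample_tags):
    """Collapse granular PII tags into a 4-dim multi-hot routing label."""
    engines = {TAG_TO_ENGINE[t] for t in sample_tags if t in TAG_TO_ENGINE}
    return [1.0 if e in engines else 0.0 for e in ENGINES]

# e.g. a sample annotated with DATE + CITY + FIRSTNAME
print(purify(["DATE", "CITY", "FIRSTNAME"]))  # [1.0, 1.0, 1.0, 0.0]
```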
|
|
|
|
|
## Student vs. Teacher (Distillation Results) |
|
|
|
|
|
The model was trained using **Knowledge Distillation**: the student (`TinyBERT`) was trained to mimic the teacher's (`XLM-RoBERTa`) "soft targets" (its full output probabilities), not just the final hard labels.
|
|
|
|
|
**Hardware:** Apple M4 Max |
|
|
**Speedup:** 9.1x |
|
|
|
|
|
| Model | Parameters | Size | Throughput | F1 Retention |
| --- | --- | --- | --- | --- |
| **Teacher** (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| **Student** (TinyBERT) | **11M** | **42 MB** | **~3,300 samples/sec** | **96%** |
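
Throughput is hardware- and batch-size-dependent, so treat these numbers as indicative. A rough sketch of how such a comparison could be reproduced (the batch size and the helper below are illustrative, not the original benchmark script):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def throughput(model_name, texts, batch_size=64):
    """Rough samples/sec measurement for a sequence-classification checkpoint."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

    start = time.perf_counter()
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tok(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True, max_length=128)
            model(**batch)
    return len(texts) / (time.perf_counter() - start)

texts = ["Schedule a meeting with John in London on Friday"] * 1024
print("student:", throughput("pinialt/bert-tiny-pii-router", texts))
# print("teacher:", throughput("xlm-roberta-base", texts))  # swap in your fine-tuned teacher checkpoint
```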
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex). |
|
|
|
|
|
 |
|
|
|
|
|
## Usage |
|
|
|
|
|
This model outputs **Multi-Label Probabilities** for the 4 engines. We recommend a threshold of **0.5** to trigger a route. |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]

    # Map IDs to Labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"

    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```
|
|
|
|
|
## Training & Distillation Details |
|
|
|
|
|
* **Teacher Model:** `xlm-roberta-base` |
|
|
* **Student Model:** `google/bert_uncased_L-4_H-256_A-4` |
|
|
* **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified) |
|
|
* **Loss Function:** a combination of a soft-target distillation term (matching the teacher's probabilities) and binary cross-entropy against the hard labels (see the sketch below)
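
A minimal sketch of such a multi-label distillation objective: a weighted sum of BCE against the teacher's sigmoid outputs and BCE against the ground-truth multi-hot labels. The weight `alpha` and the exact formulation are assumptions, not the published training configuration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5):
    """Hypothetical multi-label distillation objective (alpha is an assumption)."""
    # Soft-target term: match the teacher's per-engine probabilities.
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits)
    )
    # Hard-label term: ordinary multi-label BCE against the ground truth.
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Shapes: (batch, 4) logits for the four routing engines.
s = torch.randn(8, 4)
t = torch.randn(8, 4)
y = torch.randint(0, 2, (8, 4)).float()
print(distillation_loss(s, t, y))
```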
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the **Apache License 2.0**. |
|
|
This model was fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` checkpoint:
|
|
```bibtex
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```
|
|
|
|
|
|
|
|
## Links |
|
|
|
|
|
* **Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k) |
|
|
* **Training Code:** [pinialt/pii-router](https://github.com/pinialt/pii-router)