---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
- pii
- privacy
- routing
- text-classification
- knowledge-distillation
- tinybert
- traffic-control
datasets:
- ai4privacy/pii-masking-65k
metrics:
- f1
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: bert-tiny-pii-router
  results:
  - task:
      type: text-classification
      name: PII Routing
    dataset:
      name: ai4privacy/pii-masking-65k
      type: ai4privacy/pii-masking-65k
    metrics:
    - name: F1
      type: f1
      value: 0.96
---

# bert-tiny-pii-router: The Semantic Gatekeeper
In the era of massive LLMs, sometimes the smartest solution is the smallest one.
This is a 42 MB **Traffic Controller** designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" that predicts which entities are present before extraction begins.
It was distilled from an XLM-RoBERTa teacher into a TinyBERT student, achieving a 9.1x speedup while retaining 96% of the teacher's accuracy.
## The Problem: The "Blind Pipeline"
Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:

- **Regex** is fast but generates false positives (e.g., confusing dates like `12/05/2024` for IP addresses; see the sketch after this list).
- **NER models** are accurate but heavy and slow.
- **LLMs** are perfect extractors but expensive, high-latency, and overkill for simple tasks.
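To see the regex failure mode concretely, here is a tiny sketch. The pattern below is deliberately naive and purely illustrative (it is not a pattern used by this project), and it uses a dotted-date variant for the demonstration:

```python
import re

# Hypothetical, deliberately naive IPv4-style pattern -- illustration only.
# Without anchors or octet range checks, it also bites into dotted dates.
naive_ip = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}")

text = "Release on 12.05.2024 from host 10.0.0.1"
print(naive_ip.findall(text))
# ['12.05.202', '10.0.0'] -- the date fragment is a false positive
```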
## The Solution: The Router Pattern
Instead of a linear pipeline, we use a Multi-Label Classifier at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.
**Example Decision:**

Input: `"Schedule a meeting with John in London on Friday"`

Output:

```jsonc
{
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
```
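To make the compute saving concrete, here is a minimal dispatch sketch. The engine registry and the stub extractors are hypothetical, not part of this repository:

```python
# Hypothetical specialist engines, stubbed out for illustration only.
ENGINES = {
    "NER": lambda text: f"<names extracted from {text!r}>",
    "ADDRESS": lambda text: f"<locations geocoded from {text!r}>",
    "TEMPORAL": lambda text: f"<dates parsed from {text!r}>",
    "REGEX": lambda text: f"<pattern hits in {text!r}>",
}

def dispatch(text, route_probs, threshold=0.5):
    # Invoke only the engines the router flagged; skipped engines cost nothing.
    return {name: ENGINES[name](text)
            for name, prob in route_probs.items() if prob >= threshold}

probs = {"NER": 0.99, "ADDRESS": 0.98, "TEMPORAL": 0.95, "REGEX": 0.01}
print(dispatch("Schedule a meeting with John in London on Friday", probs))
# The REGEX engine never runs -- that is the saved compute.
```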
## Tag Purification: Solving Class Overlap
A major challenge in training this router was Class Overlap. Standard models struggle with ambiguous definitions.
**The Ambiguity:**

- **Is a Date a Regex?** A date like `12/05/2024` fits a regex pattern. But semantically, it belongs to the TEMPORAL engine, not the REGEX scanner.
- **Is a State a Name?** "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (ADDRESS), not the Person/Org extractor.
To fix this, the training data was passed through a **Tag Purification** layer. We mapped 38 granular tags into 4 distinct "Routing Engines" to enforce strict semantic boundaries (a code sketch follows the table):
| Source Tag | Action | Destination Engine (Label) |
|---|---|---|
| DATE / TIME | Removed from Regex | TEMPORAL |
| CITY / STATE / ZIP | Removed from NER | ADDRESS |
| IP / EMAIL / IBAN | Kept in Regex | REGEX |
| PERSON / ORG | Kept in NER | NER |
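A minimal sketch of what this purification pass can look like, assuming a simple dictionary-based mapping (the tag names below are an illustrative subset; the full 38-tag mapping is not reproduced in this card):

```python
# Illustrative subset of the source-tag -> routing-engine mapping.
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIPCODE": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "FIRSTNAME": "NER", "LASTNAME": "NER", "ORGANIZATION": "NER",
}

def purify(source_tags):
    """Collapse granular PII tags into the 4 routing-engine labels."""
    return {TAG_TO_ENGINE[t] for t in source_tags if t in TAG_TO_ENGINE}

print(purify(["FIRSTNAME", "CITY", "DATE"]))
# {'NER', 'ADDRESS', 'TEMPORAL'} -- each tag lands in exactly one engine
```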
## Student vs. Teacher (Distillation Results)
The model was trained using Knowledge Distillation. The student (TinyBERT) was forced to mimic the "soft targets" (thought process) of the teacher (XLM-RoBERTa), not just the final labels.
**Hardware:** Apple M4 Max · **Speedup:** 9.1x
| Model | Parameters | Size | Throughput | F1 Retention |
|---|---|---|---|---|
| Teacher (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| Student (TinyBERT) | 11M | 42 MB | ~3,300 samples/sec | 96% |
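The exact training objective is not spelled out here. A common recipe for multi-label distillation is binary cross-entropy against the teacher's sigmoid soft targets, blended with BCE on the hard labels; the sketch below assumes that recipe, and `alpha` and `T` are illustrative hyperparameters, not values from this training run:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    # Soft targets: the teacher's per-label probabilities. Multi-label routing
    # means sigmoid (independent labels), not softmax; temperature T softens them.
    soft_targets = torch.sigmoid(teacher_logits / T)
    soft_loss = F.binary_cross_entropy_with_logits(student_logits / T, soft_targets)

    # Hard targets: the purified multi-hot engine labels.
    hard_loss = F.binary_cross_entropy_with_logits(student_logits, hard_labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```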
## Evaluation Results
The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).
## Usage
This model outputs Multi-Label Probabilities for the 4 engines. We recommend a threshold of 0.5 to trigger a route.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]

    # Map IDs to labels
    active_routes = [
        model.config.id2label[i]
        for i, score in enumerate(probs)
        if score > threshold
    ]

    if not active_routes:
        return "⚡ Direct to LLM (No PII)"
    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===

# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS
```
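Throughput figures like those in the table above are typically reached with batched inference rather than one call per query. A minimal batched variant of `route_query`, reusing the `model` and `tokenizer` loaded above:

```python
def route_batch(texts, threshold=0.5):
    # Tokenize the whole batch at once (padding to the longest sequence)
    # and run a single forward pass.
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=128, padding=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)
    return [
        [model.config.id2label[i] for i, p in enumerate(row) if p > threshold]
        for row in probs
    ]

print(route_batch([
    "Schedule a meeting with John in London on Friday",
    "Ship to 123 Main St, New York, NY",
]))
```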
## Training & Distillation Details
- **Teacher Model:** `xlm-roberta-base`
- **Student Model:** `google/bert_uncased_L-4_H-256_A-4`
- **Dataset:** `ai4privacy/pii-masking-65k` (Filtered & Purified)
- **Loss Function:**
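As a starting point for reproducing the data pipeline, the source dataset can be pulled straight from the Hub. A sketch, assuming the default `train` split (field names depend on the dataset's published schema):

```python
from datasets import load_dataset

# Load the source corpus; the fine-grained PII tags it carries are what the
# purification step collapses into the 4 engine labels (see the table above).
ds = load_dataset("ai4privacy/pii-masking-65k", split="train")
print(ds[0])
```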
## License

This project is licensed under the Apache License 2.0. It was fine-tuned from Google's `bert_uncased_L-4_H-256_A-4` model:
```bibtex
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```
## Links

- Dataset: [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k)
- Training Code: `pinialt/pii-router`
