---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
  - pii
  - privacy
  - routing
  - text-classification
  - knowledge-distillation
  - tinybert
  - traffic-control
datasets:
  - ai4privacy/pii-masking-65k
metrics:
  - f1
library_name: transformers
pipeline_tag: text-classification
model-index:
  - name: bert-tiny-pii-router
    results:
      - task:
          type: text-classification
          name: PII Routing
        dataset:
          name: ai4privacy/pii-masking-65k
          type: ai4privacy/pii-masking-65k
        metrics:
          - name: F1
            type: f1
            value: 0.96
---

bert-tiny-pii-router: The Semantic Gatekeeper

In the era of massive LLMs, sometimes the smartest solution is the smallest one.

This is a 42MB Traffic Controller designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" to predict which entities are present before extraction begins.

It was distilled from an XLM-RoBERTa teacher into a TinyBERT student, achieving a 9.1x speedup while retaining 96% of the teacher's F1 score.

The Problem: The "Blind Pipeline"

Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:

  • Regex is fast but generates false positives (e.g., mistaking dates like 12/05/2024 for IP addresses).
  • NER models are accurate but heavy and slow.
  • LLMs are perfect extractors but expensive, high-latency, and overkill for simple tasks.

The Solution: The Router Pattern

Instead of a linear pipeline, we use a Multi-Label Classifier at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

Example Decision:

Input: "Schedule a meeting with John in London on Friday"
Output: {
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
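
A thin dispatch layer on top of this output only invokes the engines whose probability clears a threshold. The sketch below illustrates the idea; the engine functions are hypothetical stubs standing in for the downstream specialists, not part of this repository.

# Hypothetical engine stubs -- replace with the real NER model, geocoder,
# date parser, and regex scanner in a production pipeline.
def run_ner(text):         return f"[NER stub] {text}"
def run_geocoder(text):    return f"[ADDRESS stub] {text}"
def run_date_parser(text): return f"[TEMPORAL stub] {text}"
def run_regex_scan(text):  return f"[REGEX stub] {text}"

ENGINES = {
    "NER": run_ner,
    "ADDRESS": run_geocoder,
    "TEMPORAL": run_date_parser,
    "REGEX": run_regex_scan,
}

def dispatch(text, route_probs, threshold=0.5):
    """Run only the engines whose routing probability clears the threshold."""
    return {
        label: ENGINES[label](text)
        for label, prob in route_probs.items()
        if prob > threshold
    }

# With the example output above, REGEX is skipped and only three engines run:
dispatch("Schedule a meeting with John in London on Friday",
         {"NER": 0.99, "ADDRESS": 0.98, "TEMPORAL": 0.95, "REGEX": 0.01})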

Tag Purification: Solving Class Overlap

A major challenge in training this router was Class Overlap. Standard models struggle with ambiguous definitions.

The Ambiguity:

  • Is a Date a Regex? A date like 12/05/2024 fits a regex pattern. But semantically, it belongs to the TEMPORAL engine, not the REGEX scanner.
  • Is a State a Name? "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (ADDRESS), not the Person/Org extractor.

To fix this, the training data was passed through a Tag Purification step: we mapped 38 granular tags onto 4 distinct "Routing Engines" to enforce strict semantic boundaries (a code sketch of the mapping follows the table):

| Source Tag | Action | Destination Engine (Label) |
|---|---|---|
| DATE / TIME | Removed from Regex | TEMPORAL |
| CITY / STATE / ZIP | Removed from NER | ADDRESS |
| IP / EMAIL / IBAN | Kept in Regex | REGEX |
| PERSON / ORG | Kept in NER | NER |
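
In code, Tag Purification is a simple tag-to-engine lookup applied before the multi-hot training labels are built. The sketch below uses an illustrative subset of tag names (assumed for the example; the project maps the full set of 38 granular tags):

# Illustrative subset of the tag -> engine mapping (assumed tag names).
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIPCODE": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "FIRSTNAME": "NER", "LASTNAME": "NER", "ORGANIZATION": "NER",
}

ENGINE_LABELS = ["NER", "ADDRESS", "TEMPORAL", "REGEX"]

def purify(sample_tags):
    """Collapse a sample's granular PII tags into a 4-dim multi-hot label."""
    engines = {TAG_TO_ENGINE[t] for t in sample_tags if t in TAG_TO_ENGINE}
    return [1.0 if label in engines else 0.0 for label in ENGINE_LABELS]

# A sample tagged DATE + CITY is routed to TEMPORAL and ADDRESS, never REGEX:
purify(["DATE", "CITY"])   # -> [0.0, 1.0, 1.0, 0.0]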

Student vs. Teacher (Distillation Results)

The model was trained using Knowledge Distillation. The student (TinyBERT) was forced to mimic the "soft targets" (thought process) of the teacher (XLM-RoBERTa), not just the final labels.

Hardware: Apple M4 Max
Speedup: 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
|---|---|---|---|---|
| Teacher (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| Student (TinyBERT) | 11M | 42 MB | ~3,300 samples/sec | 96% |
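
Throughput depends heavily on hardware, batch size, and sequence length. A rough sketch of the kind of timing loop that produces a samples/sec figure (batch size and test sentence are assumptions here, not the original benchmark configuration):

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def measure_throughput(model_name, batch_size=64, n_batches=20, max_length=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

    # Repeat one sentence to form a batch; real benchmarks would use varied text.
    texts = ["Schedule a meeting with John in London on Friday"] * batch_size
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       padding=True, max_length=max_length)

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_batches):
            model(**inputs)
    elapsed = time.perf_counter() - start
    return (batch_size * n_batches) / elapsed  # samples per second

print(measure_throughput("pinialt/bert-tiny-pii-router"))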

Evaluation Results

The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).

Confusion Matrix
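
Per-label confusion matrices for a multi-label router can be computed with scikit-learn's multilabel_confusion_matrix. A minimal sketch with dummy arrays (not the model's actual evaluation results):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

LABELS = ["NER", "ADDRESS", "TEMPORAL", "REGEX"]

# Dummy ground truth and thresholded predictions (one row per sample,
# one column per engine), purely to illustrate the call.
y_true = np.array([[1, 1, 1, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 1, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1]])

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
for label, cm in zip(LABELS, multilabel_confusion_matrix(y_true, y_pred)):
    print(label, cm.tolist())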

Usage

This model outputs Multi-Label Probabilities for the 4 engines. We recommend a threshold of 0.5 to trigger a route.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]
    
    # Map IDs to Labels
    active_routes = [
        model.config.id2label[i] 
        for i, score in enumerate(probs) 
        if score > threshold
    ]
    
    if not active_routes:
        return "⚡ Direct to LLM (No PII)"
        
    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS

Training & Distillation Details

  • Teacher Model: xlm-roberta-base
  • Student Model: google/bert_uncased_L-4_H-256_A-4
  • Dataset: ai4privacy/pii-masking-65k (Filtered & Purified)
  • Loss Function:
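
The loss function is not spelled out above. A common objective for multi-label distillation blends the hard-label BCE loss with a soft-target term against the teacher's probabilities; the sketch below assumes that form and is not the project's exact training code.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=2.0):
    """Assumed objective: BCE on the gold multi-hot labels, blended with an
    MSE term pulling the student's probabilities toward the teacher's
    temperature-softened probabilities."""
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    soft = F.mse_loss(torch.sigmoid(student_logits / temperature),
                      torch.sigmoid(teacher_logits / temperature))
    return alpha * hard + (1 - alpha) * soft

# Toy example with 2 samples x 4 engines:
student = torch.randn(2, 4)
teacher = torch.randn(2, 4)
labels  = torch.tensor([[1., 1., 1., 0.], [0., 0., 0., 1.]])
print(distillation_loss(student, teacher, labels))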

License

This project is licensed under the Apache License 2.0. The model is fine-tuned from Google's bert_uncased_L-4_H-256_A-4 checkpoint (Turc et al., 2019):

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}

Links