---
language: en
license: apache-2.0
base_model: google/bert_uncased_L-4_H-256_A-4
tags:
  - pii
  - privacy
  - routing
  - text-classification
  - knowledge-distillation
  - tinybert
  - traffic-control
datasets:
  - ai4privacy/pii-masking-65k
metrics:
  - f1
library_name: transformers
pipeline_tag: text-classification
model-index:
  - name: bert-tiny-pii-router
    results:
      - task:
          type: text-classification
          name: PII Routing
        dataset:
          name: ai4privacy/pii-masking-65k
          type: ai4privacy/pii-masking-65k
        metrics:
          - name: F1
            type: f1
            value: 0.96
---

bert-tiny-pii-router: The Semantic Gatekeeper

In the era of massive LLMs, sometimes the smartest solution is the smallest one.

This is a 42MB Traffic Controller designed for high-performance NLP pipelines. Instead of blindly running heavy extraction models (NER, Regex, Address Parsers) on every user query, this model acts as a "Gatekeeper" to predict which entities are present before extraction begins.

It was distilled from an XLM-RoBERTa teacher into a TinyBERT student, achieving a 9.1x speedup while retaining 96% of the teacher's F1 score.

The Problem: The "Blind Pipeline"

Traditional PII extraction pipelines are wasteful. They run every specialist model on every input:

  • Regex is fast but generates false positives (e.g., mistaking dates like 12/05/2024 for IP addresses).
  • NER models are accurate but heavy and slow.
  • LLMs are perfect extractors but expensive, high-latency, and overkill for simple tasks.

The Solution: The Router Pattern

Instead of a linear pipeline, we use a Multi-Label Classifier at the very front. It reads the text once and outputs a probability vector indicating which specialized engines are needed.

Example Decision:

Input: "Schedule a meeting with John in London on Friday"
Output: {
  "NER": 0.99,      // -> Trigger Name Extractor
  "ADDRESS": 0.98,  // -> Trigger Geocoder
  "TEMPORAL": 0.95, // -> Trigger Date Parser
  "REGEX": 0.01     // -> Skip Regex Engine (Save Compute)
}
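
A thin dispatch layer on top of this output only invokes the engines whose probability clears a threshold. The sketch below illustrates the idea; the engine functions are hypothetical stubs standing in for the downstream specialists, not part of this repository.

# Hypothetical engine stubs -- replace with the real NER model, geocoder,
# date parser, and regex scanner in a production pipeline.
def run_ner(text):         return f"[NER stub] {text}"
def run_geocoder(text):    return f"[ADDRESS stub] {text}"
def run_date_parser(text): return f"[TEMPORAL stub] {text}"
def run_regex_scan(text):  return f"[REGEX stub] {text}"

ENGINES = {
    "NER": run_ner,
    "ADDRESS": run_geocoder,
    "TEMPORAL": run_date_parser,
    "REGEX": run_regex_scan,
}

def dispatch(text, route_probs, threshold=0.5):
    """Run only the engines whose routing probability clears the threshold."""
    return {
        label: ENGINES[label](text)
        for label, prob in route_probs.items()
        if prob > threshold
    }

# With the example output above, REGEX is skipped and only three engines run:
dispatch("Schedule a meeting with John in London on Friday",
         {"NER": 0.99, "ADDRESS": 0.98, "TEMPORAL": 0.95, "REGEX": 0.01})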

Tag Purification: Solving Class Overlap

A major challenge in training this router was Class Overlap. Standard models struggle with ambiguous definitions.

The Ambiguity:

  • Is a Date a Regex? A date like 12/05/2024 fits a regex pattern. But semantically, it belongs to the TEMPORAL engine, not the REGEX scanner.
  • Is a State a Name? "California" is technically a Named Entity (NER). But for routing purposes, it must be sent to the Geocoder (ADDRESS), not the Person/Org extractor.

To fix this, the training data was passed through a Tag Purification step: we mapped 38 granular tags onto 4 distinct "Routing Engines" to enforce strict semantic boundaries (a code sketch of the mapping follows the table):

| Source Tag | Action | Destination Engine (Label) |
|---|---|---|
| DATE / TIME | Removed from Regex | TEMPORAL |
| CITY / STATE / ZIP | Removed from NER | ADDRESS |
| IP / EMAIL / IBAN | Kept in Regex | REGEX |
| PERSON / ORG | Kept in NER | NER |
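
In code, Tag Purification is a simple tag-to-engine lookup applied before the multi-hot training labels are built. The sketch below uses an illustrative subset of tag names (assumed for the example; the project maps the full set of 38 granular tags):

# Illustrative subset of the tag -> engine mapping (assumed tag names).
TAG_TO_ENGINE = {
    "DATE": "TEMPORAL", "TIME": "TEMPORAL",
    "CITY": "ADDRESS", "STATE": "ADDRESS", "ZIPCODE": "ADDRESS",
    "IP": "REGEX", "EMAIL": "REGEX", "IBAN": "REGEX",
    "FIRSTNAME": "NER", "LASTNAME": "NER", "ORGANIZATION": "NER",
}

ENGINE_LABELS = ["NER", "ADDRESS", "TEMPORAL", "REGEX"]

def purify(sample_tags):
    """Collapse a sample's granular PII tags into a 4-dim multi-hot label."""
    engines = {TAG_TO_ENGINE[t] for t in sample_tags if t in TAG_TO_ENGINE}
    return [1.0 if label in engines else 0.0 for label in ENGINE_LABELS]

# A sample tagged DATE + CITY is routed to TEMPORAL and ADDRESS, never REGEX:
purify(["DATE", "CITY"])   # -> [0.0, 1.0, 1.0, 0.0]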

Student vs. Teacher (Distillation Results)

The model was trained using Knowledge Distillation. The student (TinyBERT) was forced to mimic the "soft targets" (thought process) of the teacher (XLM-RoBERTa), not just the final labels.

Hardware: Apple M4 Max
Speedup: 9.1x

| Model | Parameters | Size | Throughput | F1 Retention |
|---|---|---|---|---|
| Teacher (XLM-R) | 278M | ~1 GB | ~360 samples/sec | 100% (Baseline) |
| Student (TinyBERT) | 11M | 42 MB | ~3,300 samples/sec | 96% |
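
Throughput depends heavily on hardware, batch size, and sequence length. A rough sketch of the kind of timing loop that produces a samples/sec figure (batch size and test sentence are assumptions here, not the original benchmark configuration):

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def measure_throughput(model_name, batch_size=64, n_batches=20, max_length=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

    # Repeat one sentence to form a batch; real benchmarks would use varied text.
    texts = ["Schedule a meeting with John in London on Friday"] * batch_size
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       padding=True, max_length=max_length)

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_batches):
            model(**inputs)
    elapsed = time.perf_counter() - start
    return (batch_size * n_batches) / elapsed  # samples per second

print(measure_throughput("pinialt/bert-tiny-pii-router"))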

Evaluation Results

The model shows strong separation between the routing categories, with minimal confusion between semantically distinct classes (e.g., Address vs. Regex).

Confusion Matrix
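
Per-label confusion matrices for a multi-label router can be computed with scikit-learn's multilabel_confusion_matrix. A minimal sketch with dummy arrays (not the model's actual evaluation results):

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

LABELS = ["NER", "ADDRESS", "TEMPORAL", "REGEX"]

# Dummy ground truth and thresholded predictions (one row per sample,
# one column per engine), purely to illustrate the call.
y_true = np.array([[1, 1, 1, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[1, 1, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 0, 1]])

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
for label, cm in zip(LABELS, multilabel_confusion_matrix(y_true, y_pred)):
    print(label, cm.tolist())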

Usage

This model outputs Multi-Label Probabilities for the 4 engines. We recommend a threshold of 0.5 to trigger a route.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "pinialt/bert-tiny-pii-router"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def route_query(text, threshold=0.5):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    
    with torch.no_grad():
        logits = model(**inputs).logits
    
    # Sigmoid for multi-label (independent probabilities)
    probs = torch.sigmoid(logits)[0]
    
    # Map IDs to Labels
    active_routes = [
        model.config.id2label[i] 
        for i, score in enumerate(probs) 
        if score > threshold
    ]
    
    if not active_routes:
        return "⚡ Direct to LLM (No PII)"
        
    return f"🚦 Route to Engines: {', '.join(active_routes)}"

# === EXAMPLES ===
# 1. Complex Multi-Entity Request
print(route_query("Schedule a meeting with John in London on Friday"))
# Output: 🚦 Route to Engines: NER, ADDRESS, TEMPORAL

# 2. Pure Address
print(route_query("Ship to 123 Main St, New York, NY"))
# Output: 🚦 Route to Engines: ADDRESS

Training & Distillation Details

  • Teacher Model: xlm-roberta-base
  • Student Model: google/bert_uncased_L-4_H-256_A-4
  • Dataset: ai4privacy/pii-masking-65k (Filtered & Purified)
  • Loss Function:
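
The loss function is not spelled out above. A common objective for multi-label distillation blends the hard-label BCE loss with a soft-target term against the teacher's probabilities; the sketch below assumes that form and is not the project's exact training code.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=2.0):
    """Assumed objective: BCE on the gold multi-hot labels, blended with an
    MSE term pulling the student's probabilities toward the teacher's
    temperature-softened probabilities."""
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_labels)
    soft = F.mse_loss(torch.sigmoid(student_logits / temperature),
                      torch.sigmoid(teacher_logits / temperature))
    return alpha * hard + (1 - alpha) * soft

# Toy example with 2 samples x 4 engines:
student = torch.randn(2, 4)
teacher = torch.randn(2, 4)
labels  = torch.tensor([[1., 1., 1., 0.], [0., 0., 0., 1.]])
print(distillation_loss(student, teacher, labels))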

License

This project is licensed under the Apache License 2.0. The model is fine-tuned from Google's bert_uncased_L-4_H-256_A-4 checkpoint (Turc et al., 2019):

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}

Links