---
license: other
license_name: 8f-ai-license-v1.0
license_link: https://huggingface.co/8Fai/license
tags:
- privacy
- pii-detection
- pii-redaction
- token-classification
- sliding-window-attention
- rope
- swiglu
language:
- en
library_name: transformers
---
# Context-Filter
> Context-Filter is a compact, purpose-built privacy filtering model for real-time PII detection and redaction. At ~61M parameters it runs comfortably on CPU or any consumer GPU, and it supports sequences up to 32,768 tokens via Sliding Window Attention. A built-in regex hybrid layer ensures near-zero false negatives on structured formats such as emails, IPs, and social security numbers.
---
## Highlights
- **Custom Architecture — Not a Fine-Tune**: Context-Filter is trained from scratch using a purpose-designed encoder: Grouped Query Attention (8Q / 4KV heads), RMSNorm, RoPE with θ = 500,000, and SwiGLU FFNs. No base model weights are reused.
- **32K Context via Sliding Window Attention**: Each token attends to a local window of ±512 tokens. Memory scales as O(n · w) rather than O(n²), making long-document redaction practical on commodity hardware.
- **12 PII Entity Classes**: Covers personal identity, financial, network, and government-issued identifiers across a single BIO tagging head.
- **Focal Loss Training**: Trained with focal loss (γ = 2.0) to suppress the dominant O-label class and sharpen precision on rare entity spans (a minimal sketch follows this list).
- **Dual Output Modes**: Returns either semantic labels (`private_email`) or bracketed redaction tags (`[EMAIL]`), selectable per call.
- **Per-Entity Confidence Scores**: Every detected span carries a softmax confidence value, enabling downstream threshold filtering.
- **Regex Hybrid Layer**: A built-in post-processing pass applies deterministic regex patterns for structured PII formats, guaranteeing recall on well-defined identifiers regardless of model uncertainty.
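For reference, here is a minimal sketch of the focal-loss objective mentioned above. The γ = 2.0 value and the 12-class BIO scheme come from this card; the function itself is an illustrative re-implementation, not the model's actual training code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t) for token classification.

    Down-weights easy, high-confidence tokens (mostly the dominant O label)
    so rare entity spans contribute more to the gradient.
    """
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    )
    pt = torch.exp(-ce)                        # probability of the true class
    mask = targets.view(-1) != ignore_index    # drop padding positions
    return ((1.0 - pt) ** gamma * ce)[mask].mean()

# 12 entity classes in BIO tagging -> 25 labels (B-/I- per class, plus O).
logits = torch.randn(2, 16, 25)                # (batch, seq_len, num_labels)
targets = torch.randint(0, 25, (2, 16))
print(focal_loss(logits, targets))
```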
---
## Model Overview
| Property | Value |
|---|---|
| **Type** | Token Classification (BIO NER) |
| **Architecture** | Custom Encoder (Context-Filter) |
| **Training** | From scratch — synthetic data only |
| **Parameters** | ~61M |
| **Context Length** | 32,768 tokens |
| **VRAM (bfloat16)** | ~152 MB |
| **VRAM (int8)** | ~76 MB |
| **Tokenizer** | GPT-2 BPE (50,257 vocabulary) |
### Architecture Specification
| Component | Value |
|---|---|
| Hidden Dimension | 512 |
| Number of Layers | 10 |
| Attention Heads (Q / KV) | 8 / 4 (GQA) |
| Head Dimension | 64 |
| FFN Intermediate Dimension | 1,792 |
| FFN Activation | SwiGLU |
| Attention Pattern | Sliding Window (window = 512) |
| Position Encoding | RoPE (θ = 500,000) |
| Normalisation | RMSNorm (ε = 1e-6) |
| Vocabulary Size | 50,257 |
| Context Length | 32,768 tokens |
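To make the attention pattern concrete, here is an illustrative sketch of a bidirectional sliding-window mask using the window size from the table. A real implementation computes attention band-wise and never materialises the full n × n matrix; this is a toy demonstration, not the model's internal code:

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 512) -> torch.Tensor:
    """Boolean mask: True where token i may attend to token j (|i - j| <= window)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# At 2,048 tokens only a narrow band of the attention matrix is active,
# which is why memory scales as O(n * w) rather than O(n^2).
mask = sliding_window_mask(2048)
print(mask.shape, f"{mask.float().mean():.2%} of pairs active")
```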
### Entity Classes
| Label | Type | Examples |
|---|---|---|
| `PERSON` | Full names | *Jane Smith*, *Dr. Erik Larsson* |
| `EMAIL` | Email addresses | *user@domain.com* |
| `PHONE` | Phone numbers | *+1-555-234-5678*, *07700 900123* |
| `ADDRESS` | Postal addresses | *42 Baker Street, London* |
| `SSN` | Social security numbers | *452-78-9012* |
| `CREDITCARD` | Payment card numbers | *4111-1111-1111-1111* |
| `IP` | IPv4 addresses | *192.168.1.104* |
| `DATE` | Dates of birth and event dates | *1990-07-12*, *March 15, 2024* |
| `ORG` | Organisation names | *Acme Corp*, *St. Mary's Hospital* |
| `USERNAME` | Handles and usernames | *john_doe*, *@alice_m* |
| `PASSPORT` | Passport numbers | *A7843921* |
| `DRIVERSLICENSE` | Driver's licence numbers | *K482910* |
---
## Quickstart
### Installation
```bash
pip install torch transformers
```
### Load the Model
```python
import torch
from context_filter_v2_train import ContextFilterInference
engine = ContextFilterInference("./context_filter_v2")
```
### Redact Mode — `[ENTITY]` brackets
```python
result = engine.filter(
    "My name is Andrew and my Gmail is Andrew@gmail.com and I live in Sweden",
    mode="redact",
)
print(result["filtered"])
# My name is [PERSON] and my Gmail is [EMAIL] and I live in Sweden
```
### Label Mode — semantic placeholders
```python
result = engine.filter(
    "My name is Andrew and my Gmail is Andrew@gmail.com and I live in Sweden",
    mode="label",
)
print(result["filtered"])
# My name is private_person and my Gmail is private_email and I live in Sweden
```
### Entity Spans with Confidence
```python
for entity in result["entities"]:
    print(entity)
# {'type': 'PERSON', 'start': 11, 'end': 17, 'text': 'Andrew', 'confidence': 0.987}
# {'type': 'EMAIL', 'start': 34, 'end': 50, 'text': 'Andrew@gmail.com', 'confidence': 0.995}
```
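Because every span carries a confidence value, low-confidence detections can be filtered out before redaction. A short sketch, using the 0.8 threshold suggested in the Limitations section:

```python
THRESHOLD = 0.8  # tune for your precision/recall trade-off

confident = [e for e in result["entities"] if e["confidence"] >= THRESHOLD]
for entity in confident:
    print(f"{entity['type']:>8}  {entity['text']}  ({entity['confidence']:.3f})")
```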
### Batch Processing
```python
texts = [
    "Call Sarah at +1-555-234-5678.",
    "Server 192.168.1.1 accessed by john_doe on 2024-03-15.",
    "Account: Michael Chen, SSN: 452-78-9012.",
]
results = engine.filter_batch(texts, mode="redact")
for r in results:
    print(r["filtered"])
# Call Sarah at [PHONE].
# Server [IP] accessed by [USERNAME] on [DATE].
# Account: [PERSON], SSN: [SSN].
```
### Disable Regex Hybrid (model-only predictions)
```python
result = engine.filter(text, mode="redact", regex_hybrid=False)
```
---
## Output Format Reference
### `filter()` return value
```python
{
    "filtered": str,             # processed text with PII replaced
    "entities": [
        {
            "type": str,         # entity class name (e.g. "EMAIL")
            "start": int,        # character start offset in the original text
            "end": int,          # character end offset (exclusive)
            "text": str,         # original PII span
            "confidence": float, # softmax confidence [0.0 – 1.0]
        },
        ...
    ],
}
```
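As an illustration of consuming this structure, here is a hypothetical helper that turns a `filter()` result into an audit record without persisting the raw PII spans:

```python
import json
from collections import Counter

def audit_record(result: dict, source_id: str) -> str:
    """Summarise a filter() result for audit logs without storing raw PII."""
    counts = Counter(e["type"] for e in result["entities"])
    return json.dumps({
        "source_id": source_id,
        "entity_counts": dict(counts),   # e.g. {"PERSON": 1, "EMAIL": 1}
        "filtered_text": result["filtered"],
    })

print(audit_record(result, source_id="ticket-1042"))
```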
### Mode comparison
| Input | `mode="label"` | `mode="redact"` |
|---|---|---|
| `Andrew@gmail.com` | `private_email` | `[EMAIL]` |
| `Jane Smith` | `private_person` | `[PERSON]` |
| `+1-555-234-5678` | `private_phone` | `[PHONE]` |
| `452-78-9012` | `private_ssn` | `[SSN]` |
| `192.168.1.104` | `private_ip` | `[IP]` |
| `A7843921` | `private_passport` | `[PASSPORT]` |
---
## Performance Characteristics
| Hardware | Throughput | Latency (512 tok) |
|---|---|---|
| A100 40GB (bfloat16) | ~85,000 tok/s | ~6 ms |
| RTX 4090 (bfloat16) | ~52,000 tok/s | ~10 ms |
| RTX 3080 (bfloat16) | ~28,000 tok/s | ~18 ms |
| CPU (int8, 16 cores) | ~4,200 tok/s | ~120 ms |
*Throughput measured at batch size 32. Latency measured at batch size 1.*
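To estimate throughput on your own hardware, a rough timing sketch built on the `engine` object from the Quickstart (run it once to warm up before timing; numbers vary with batch size and sequence length):

```python
import time

docs = ["Contact Jane Smith at jane.smith@example.com."] * 32  # batch of 32

start = time.perf_counter()
engine.filter_batch(docs, mode="redact")
elapsed = time.perf_counter() - start

print(f"{len(docs) / elapsed:.1f} docs/s ({elapsed * 1000:.1f} ms total)")
```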
### Memory Footprint
| Precision | VRAM |
|---|---|
| bfloat16 (default) | ~152 MB |
| float32 | ~304 MB |
| int8 quantised | ~76 MB |
---
## Intended Use Cases
| Use Case | Description |
|---|---|
| **Log sanitisation** | Strip PII from server logs, audit trails, and telemetry pipelines before storage |
| **Document redaction** | Redact legal, medical, or HR documents before sharing or archival |
| **Data anonymisation** | Pre-process training datasets to remove personal identifiers |
| **API response filtering** | Inline filter for LLM or API outputs before they reach end users |
| **Compliance pipelines** | GDPR / CCPA / HIPAA pre-processing layer |
| **Chat moderation** | Real-time PII removal in messaging or support platforms |
| **IDE / copilot integration** | Client-side PII guard before code or prompts are sent to remote APIs |
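As a concrete illustration of the log-sanitisation use case, a minimal sketch that redacts a log file line by line (the file names are placeholders):

```python
from pathlib import Path

raw_lines = Path("app.log").read_text().splitlines()

# Batch the lines to amortise per-call overhead.
sanitised = [r["filtered"] for r in engine.filter_batch(raw_lines, mode="redact")]

Path("app.sanitised.log").write_text("\n".join(sanitised) + "\n")
```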
---
## Hybrid Detection Strategy
Context-Filter uses a two-layer detection approach for maximum recall:
**Layer 1 — Neural Model**: The transformer encoder reads full sentence context to detect ambiguous PII such as person names, organisation names, and contextual dates that regex cannot identify.
**Layer 2 — Regex Safety Net**: A deterministic pass using compiled regular expressions guarantees recall on structurally defined formats (email, IPv4, SSN, credit card, phone, passport, driver's licence) regardless of model confidence.
The two layers are merged with entity-level deduplication: spans already found by the model are not double-tagged. This combination eliminates the false-negative failure mode of pure-neural approaches while maintaining the contextual understanding that regex-only tools cannot provide.
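A schematic sketch of how such a merge can work. The patterns below are simplified stand-ins, not the model's shipped regex set, and the merge helper is illustrative:

```python
import re

# Simplified illustrative patterns -- the shipped regex layer is more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP":    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def regex_safety_net(text: str, model_entities: list) -> list:
    """Add regex hits that do not overlap any span the model already found."""
    taken = [(e["start"], e["end"]) for e in model_entities]
    merged = list(model_entities)
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Two half-open spans overlap iff each starts before the other ends.
            if not any(m.start() < end and start < m.end() for start, end in taken):
                merged.append({"type": label, "start": m.start(), "end": m.end(),
                               "text": m.group(), "confidence": 1.0})
    return sorted(merged, key=lambda e: e["start"])
```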
---
## Limitations
- **English-Primary**: Training templates are predominantly English-language. Names and organisation names in non-Latin scripts may have reduced recall.
- **Highly Nested PII**: Overlapping or recursively nested PII spans (e.g., an email containing a person's name as the local part) are resolved to the outermost detected entity.
- **Synthetic Training Data**: The model was trained entirely on procedurally generated examples. Domain-specific PII formats not covered by the synthetic generator (e.g., jurisdiction-specific ID numbers) may have lower recall until fine-tuned on real-world samples.
- **Contextual Dates**: Generic dates (e.g., publication dates, historical dates) may occasionally be tagged as DATE. Post-filter confidence thresholding (e.g., `confidence > 0.8`) can reduce these false positives.
- **No Document Structure Awareness**: The model operates on raw token sequences without awareness of HTML, Markdown, or JSON structure. Strip formatting before passing structured documents (see the sketch after this list).
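To work around the last limitation, plain text can be extracted from HTML before filtering. A minimal standard-library sketch (note that the returned offsets will then refer to the stripped text, not the original markup):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and drops tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup: str) -> str:
    extractor = TextExtractor()
    extractor.feed(markup)
    return " ".join("".join(extractor.chunks).split())

plain = strip_html("<p>Contact <b>Jane Smith</b> at jane@example.com</p>")
result = engine.filter(plain, mode="redact")
```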
---
## License
Context-Filter is released under the **8F AI License v1.0**. See the [license terms](https://huggingface.co/8Fai/license) for details.
---
<div align="center">
<sub>Context-Filter — purpose-built for privacy, not adapted for it.</sub>
</div> |