DistilBERT Secret Masker (Fast Expert)


Fast Expert model for the SecMask MoE system, specialized for rapid secret detection in short to medium-length texts (≤512 tokens).


🎯 Overview

Fine-tuned DistilBERT model for detecting and classifying secrets (API keys, tokens, credentials) in text using Named Entity Recognition (NER). Serves as the Fast Expert in the SecMask Mixture of Experts architecture, handling 92.7% of inference requests with ~6ms latency.

Key Features

✅ High Speed: 11ms P50 latency on CPU
✅ High Precision: 82% (NER-only), 92.3% with filters
✅ Production Ready: Handles 92.7% of real-world cases
✅ Lightweight: 265MB (66M parameters)
✅ Multiple Secret Types: AWS keys, GitHub tokens, JWTs, API keys, PEM blocks, K8s secrets

Production Performance: When combined with post-processing filters (PEM blocks, K8s secrets, pattern matching), achieves 92.3% precision, 80% recall, F1: 0.857. The NER model alone achieves 82% precision and 38% recall. See comprehensive benchmarks for details.
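
The post-processing filters ship with the SecMask repository (filters.py); the snippet below is only a minimal sketch of the idea, with simplified patterns and helper names that are not part of the released code. It unions deterministic matches with NER predictions above the threshold τ, using the start/end character offsets returned by the token-classification pipeline.

import re

# Simplified illustrative patterns -- the real filters in SecMask's filters.py are more thorough.
PEM_BLOCK = re.compile(r"-----BEGIN [A-Z ]*KEY-----.*?-----END [A-Z ]*KEY-----", re.S)
AWS_ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
GITHUB_TOKEN = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,255}\b")

def filter_spans(text):
    """Character spans matched by the deterministic patterns."""
    spans = []
    for pattern in (PEM_BLOCK, AWS_ACCESS_KEY, GITHUB_TOKEN):
        spans.extend(m.span() for m in pattern.finditer(text))
    return spans

def combined_spans(text, ner_results, tau=0.80):
    """Union of NER predictions above tau and deterministic filter matches."""
    spans = [(r["start"], r["end"]) for r in ner_results if r["score"] >= tau]
    return sorted(set(spans + filter_spans(text)))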

Recommended Configuration: Fast Expert + Filters (this model with post-processing) is the recommended production setup, outperforming Full MoE configurations. See Configuration Guide for usage recommendations.

Detected Secret Types

| Secret Type | Example Pattern | F1 Score |
|---|---|---|
| AWS Access Keys | AKIA... | 0.92 |
| GitHub Personal Tokens | ghp_..., gho_... | 0.88 |
| JWT Tokens | eyJ0eXAiOiJKV1QiLCJhbGc... | 0.85 |
| Generic API Keys | sk-proj-..., api_key=... | 0.79 |
| PEM Certificate Blocks | -----BEGIN PRIVATE KEY----- | 0.95 |
| Kubernetes Secrets (data) | kind: Secret → data: values | 0.81 |
| Database Credentials | Connection strings with passwords | 0.74 |

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

Standalone (Direct):

from transformers import pipeline

# Load model
classifier = pipeline(
    "token-classification",
    model="AndrewAndrewsen/distilbert-secret-masker",
    aggregation_strategy="simple"
)

# Detect secrets
text = "My API key is sk-proj-1234567890abcdefghijklmnopqrstuvwxyz"
results = classifier(text)

print(results)
# [{'entity_group': 'SECRET', 'score': 0.95, 'word': 'sk-proj-1234567890abcdefghijklmnopqrstuvwxyz', ...}]
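
To mask the detections rather than just list them, the start/end character offsets in the pipeline output can be overwritten with a placeholder. The helper below is not part of the model's API, just a minimal sketch:

def mask_secrets(text, results, placeholder="[SECRET]"):
    """Replace each detected span with a placeholder, working right to left
    so that earlier character offsets stay valid."""
    for r in sorted(results, key=lambda r: r["start"], reverse=True):
        text = text[:r["start"]] + placeholder + text[r["end"]:]
    return text

print(mask_secrets(text, results))
# "My API key is [SECRET]"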

Recommended (via SecMask MoE):

# Clone SecMask repo
# git clone https://github.com/AndrewAndrewsen/secmask.git

from infer_moe import mask_text_moe

masked = mask_text_moe(
    "My GitHub token is ghp_1234567890abcdefghijklmnopqrstuvwxyz",
    fast_model_dir="AndrewAndrewsen/distilbert-secret-masker",
    tau=0.80,
    routing_mode="heuristic"
)

print(masked)
# "My GitHub token is [SECRET]"

Command Line (via SecMask)

# Clone repo
git clone https://github.com/AndrewAndrewsen/secmask.git
cd secmask

# Mask secrets
python infer_moe.py \
    --text "AWS key: AKIAIOSFODNN7EXAMPLE" \
    --fast-model AndrewAndrewsen/distilbert-secret-masker \
    --routing-mode heuristic \
    --tau 0.80

# Output: AWS key: [SECRET]

📊 Performance

Secret Detection Metrics

| Metric | NER Only | With Filters (Recommended) |
|---|---|---|
| F1 Score | 0.52 | 0.857 |
| Precision | 82% | 92.3% |
| Recall | 38% | 80.0% |
| P50 Latency | 11ms | 11ms |
| P90 Latency | 14ms | 14ms |
| P99 Latency | 17ms | 17ms |
| Throughput (CPU) | 84 req/s | 84 req/s |

Note: NER-only metrics measured at τ=0.80. Production systems combine NER with post-processing filters (PEM blocks, K8s secrets, pattern matching) to achieve 92.3% precision and 80% recall. Post-processing adds no latency overhead. See BENCHMARK_RESULTS.md for comprehensive benchmarks.

When This Model Is Used (MoE Routing)

The router selects this Fast Expert when (a simplified sketch of the heuristic follows the list):

  • Token count ≤ 512
  • No multi-line structures (PEM blocks, K8s YAML)
  • Simple text patterns
  • Coverage: 92.7% of real-world requests
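
The actual router ships with the SecMask repository; the function below is only a rough sketch of such a heuristic, with names and cut-offs chosen for illustration:

def route(text, tokenizer, max_fast_tokens=512):
    """Return 'fast' for the DistilBERT expert or 'long' for the Longformer expert."""
    has_multiline_structure = (
        "-----BEGIN" in text                                 # PEM blocks
        or ("kind: Secret" in text and "data:" in text)      # Kubernetes Secret manifests
    )
    n_tokens = len(tokenizer(text)["input_ids"])
    return "fast" if n_tokens <= max_fast_tokens and not has_multiline_structure else "long"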

Note: The recommended production configuration is Fast Expert + Filters alone (without the Long Expert). This achieves better results than Full MoE. See Configuration Guide for details.


πŸ—οΈ Model Details

Architecture

  • Base Model: distilbert-base-uncased (66M params, Apache 2.0)
  • Task: Token Classification (NER)
  • Max Sequence Length: 512 tokens
  • Labels: B-SECRET, I-SECRET, O (BIO tagging; illustrated below)
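
As a quick illustration of the BIO scheme (the tokenization shown is approximate, not the exact WordPiece output):

# Token        Label
# my           O
# api          O
# key          O
# is           O
# sk           B-SECRET
# -            I-SECRET
# proj         I-SECRET
# -            I-SECRET
# 1234         I-SECRET
# ...          ...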

Training Details

  • Dataset: Custom SecretMask v2 (6,000 training examples)
  • Optimizer: AdamW (lr=5e-5)
  • Epochs: 3
  • Batch Size: 16
  • Hardware: GPU (NVIDIA A100 or equivalent)
  • Training Time: ~30 minutes

Evaluation

Evaluated on 600 held-out examples from SecretMask v2 test set:

Precision: 0.82
Recall: 0.38
F1: 0.52
Support: 1,021 secret tokens
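
Entity-level scores like these are typically computed with the seqeval library; the toy example below illustrates the metric only, and is not the project's evaluation script:

# pip install seqeval
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["O", "B-SECRET", "I-SECRET", "O", "B-SECRET"]]
y_pred = [["O", "B-SECRET", "I-SECRET", "O", "O"]]

# One of the two true entities is found exactly:
# precision = 1.0, recall = 0.5, F1 ≈ 0.67
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))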

Key Insights:

  • High Precision (82%): Very low false positive rate - safe for production
  • Lower Recall (38%): Misses some secrets when used standalone
  • Production Strategy: Combine with deterministic filters (see filters.py) for PEM blocks, K8s secrets, and AWS patterns to achieve >90% coverage
  • Threshold Tuning: Lower τ from 0.80 to 0.50 for higher recall (trade-off: more false positives); see the sketch below
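
In pipeline terms, τ is simply a cutoff on the score field of each prediction. A minimal sketch, where results is the pipeline output from the Quick Start example:

def apply_threshold(results, tau=0.80):
    """Keep only predictions whose confidence is at least tau."""
    return [r for r in results if r["score"] >= tau]

high_precision = apply_threshold(results, tau=0.80)  # default: fewer, more reliable detections
high_recall = apply_threshold(results, tau=0.50)     # catches more secrets, more false positives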

💡 Use Cases

Production Applications

  1. Pre-Commit Hooks - Prevent secrets in git commits
  2. CI/CD Pipelines - Scan code before deployment
  3. Log Sanitization - Remove secrets from application logs
  4. API Response Filtering - Mask secrets in debug output
  5. Documentation Cleanup - Sanitize before open-sourcing
  6. Security Audits - Scan codebases for exposed credentials

Example: Pre-Commit Hook

#!/usr/bin/env python3
# .git/hooks/pre-commit
import subprocess
import sys

from transformers import pipeline

classifier = pipeline("token-classification", model="AndrewAndrewsen/distilbert-secret-masker")

# Files staged for this commit
staged_files = subprocess.check_output(
    ["git", "diff", "--cached", "--name-only"], text=True
).splitlines()

for file in staged_files:
    with open(file, errors="ignore") as f:
        content = f.read()  # note: files longer than 512 tokens need chunking (see Limitations)
    if classifier(content):
        print(f"❌ Secret detected in {file}!")
        sys.exit(1)

See SecMask Examples for more.


⚠️ Limitations

Known Issues

  1. Token Limit: Cannot handle texts >512 tokens (use the Longformer expert, or chunk the input as sketched after this list)
  2. English Only: Trained on English text
  3. False Negatives: 38% standalone recall means many secrets are missed without the post-processing filters (even with filters, ~20% may be missed)
  4. Context Sensitivity: May struggle with unusual formatting
  5. Novel Patterns: May miss new secret types not in training data
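
Where the Longformer expert is not available, one workaround for the 512-token limit is to scan a long document in overlapping word-based windows. This is a rough sketch, not part of SecMask, and the returned offsets are relative to each chunk:

def scan_long_text(classifier, text, window=350, overlap=50):
    """Run the classifier over overlapping word-based chunks of a long text.
    window/overlap are in words, chosen conservatively to stay under 512 tokens."""
    words = text.split()
    findings = []
    step = window - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + window])
        findings.extend(classifier(chunk))  # offsets here are chunk-local
    return findings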

Not Suitable For

❌ Non-English text
❌ Binary data or encrypted content
❌ Images/PDFs (extract text first)
❌ Very long documents (use longformer-secret-masker)
❌ Real-time streaming (consider batching)

Recommended Mitigations

  • Combine with filters: Use deterministic filters for PEM blocks, K8s secrets (see SecMask filters)
  • Adjust threshold: Lower tau for higher recall (more false positives)
  • Use MoE system: Automatic routing to appropriate expert
  • Add regex patterns: Supplement with custom patterns for your use case

📜 License & Attribution

Model License

Apache 2.0 (inherited from distilbert-base-uncased)

Base Model Attribution

This model is fine-tuned from:

  • Model: distilbert-base-uncased
  • Authors: Hugging Face
  • License: Apache 2.0
  • Citation:
    @inproceedings{sanh2019distilbert,
      title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
      author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
      booktitle={NeurIPS EMC^2 Workshop},
      year={2019}
    }
    

SecMask Code License

The SecMask inference code and training scripts are licensed under MIT. See GitHub repo.


🔗 Related Models

| Model | Size | Max Tokens | Latency | Use Case |
|---|---|---|---|---|
| distilbert-secret-masker (this model) | 265MB | 512 | 6ms | Short texts, fast routing |
| longformer-secret-masker | 592MB | 2048 | 12ms | Long documents, configs |
| secretmask-gate | 12KB | N/A | +0.2ms | Learned MoE routing |


🤝 Contributing

Issues and contributions welcome! See CONTRIBUTING.md.


Developed by: Anders Andersson (@AndrewAndrewsen)
Part of: SecMask MoE System
