DistilBERT Secret Masker (Fast Expert)


Fast Expert model for the SecMask MoE system, specialized for rapid secret detection in short to medium-length texts (≤512 tokens).


🎯 Overview

Fine-tuned DistilBERT model for detecting and classifying secrets (API keys, tokens, credentials) in text using Named Entity Recognition (NER). Serves as the Fast Expert in the SecMask Mixture of Experts architecture, handling 92.7% of inference requests with ~6ms latency.

Key Features

✅ High Speed: 11ms P50 latency on CPU
✅ High Precision: 82% (NER-only), 92.3% with filters
✅ Production Ready: Handles 92.7% of real-world cases
✅ Lightweight: 265MB (66M parameters)
✅ Multiple Secret Types: AWS keys, GitHub tokens, JWTs, API keys, PEM blocks, K8s secrets

Production Performance: When combined with post-processing filters (PEM blocks, K8s secrets, pattern matching), achieves 92.3% precision, 80% recall, F1: 0.857. The NER model alone achieves 82% precision and 38% recall. See comprehensive benchmarks for details.
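
The post-processing filters ship with the SecMask repository (filters.py); the snippet below is only a minimal sketch of the idea, with simplified patterns and helper names that are not part of the released code. It unions deterministic matches with NER predictions above the threshold τ, using the start/end character offsets returned by the token-classification pipeline.

import re

# Simplified illustrative patterns -- the real filters in SecMask's filters.py are more thorough.
PEM_BLOCK = re.compile(r"-----BEGIN [A-Z ]*KEY-----.*?-----END [A-Z ]*KEY-----", re.S)
AWS_ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
GITHUB_TOKEN = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,255}\b")

def filter_spans(text):
    """Character spans matched by the deterministic patterns."""
    spans = []
    for pattern in (PEM_BLOCK, AWS_ACCESS_KEY, GITHUB_TOKEN):
        spans.extend(m.span() for m in pattern.finditer(text))
    return spans

def combined_spans(text, ner_results, tau=0.80):
    """Union of NER predictions above tau and deterministic filter matches."""
    spans = [(r["start"], r["end"]) for r in ner_results if r["score"] >= tau]
    return sorted(set(spans + filter_spans(text)))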

Recommended Configuration: Fast Expert + Filters (this model with post-processing) is the recommended production setup, outperforming Full MoE configurations. See Configuration Guide for usage recommendations.

Detected Secret Types

| Secret Type | Example Pattern | F1 Score |
|---|---|---|
| AWS Access Keys | AKIA... | 0.92 |
| GitHub Personal Tokens | ghp_..., gho_... | 0.88 |
| JWT Tokens | eyJ0eXAiOiJKV1QiLCJhbGc... | 0.85 |
| Generic API Keys | sk-proj-..., api_key=... | 0.79 |
| PEM Certificate Blocks | -----BEGIN PRIVATE KEY----- | 0.95 |
| Kubernetes Secrets (data) | kind: Secret → data: values | 0.81 |
| Database Credentials | Connection strings with passwords | 0.74 |

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

Standalone (Direct):

from transformers import pipeline

# Load model
classifier = pipeline(
    "token-classification",
    model="AndrewAndrewsen/distilbert-secret-masker",
    aggregation_strategy="simple"
)

# Detect secrets
text = "My API key is sk-proj-1234567890abcdefghijklmnopqrstuvwxyz"
results = classifier(text)

print(results)
# [{'entity_group': 'SECRET', 'score': 0.95, 'word': 'sk-proj-1234567890abcdefghijklmnopqrstuvwxyz', ...}]
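
To mask the detections rather than just list them, the start/end character offsets in the pipeline output can be overwritten with a placeholder. The helper below is not part of the model's API, just a minimal sketch:

def mask_secrets(text, results, placeholder="[SECRET]"):
    """Replace each detected span with a placeholder, working right to left
    so that earlier character offsets stay valid."""
    for r in sorted(results, key=lambda r: r["start"], reverse=True):
        text = text[:r["start"]] + placeholder + text[r["end"]:]
    return text

print(mask_secrets(text, results))
# "My API key is [SECRET]"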

Recommended (via SecMask MoE):

# Clone SecMask repo
# git clone https://github.com/AndrewAndrewsen/secmask.git

from infer_moe import mask_text_moe

masked = mask_text_moe(
    "My GitHub token is ghp_1234567890abcdefghijklmnopqrstuvwxyz",
    fast_model_dir="AndrewAndrewsen/distilbert-secret-masker",
    tau=0.80,
    routing_mode="heuristic"
)

print(masked)
# "My GitHub token is [SECRET]"

Command Line (via SecMask)

# Clone repo
git clone https://github.com/AndrewAndrewsen/secmask.git
cd secmask

# Mask secrets
python infer_moe.py \
    --text "AWS key: AKIAIOSFODNN7EXAMPLE" \
    --fast-model AndrewAndrewsen/distilbert-secret-masker \
    --routing-mode heuristic \
    --tau 0.80

# Output: AWS key: [SECRET]

📊 Performance

Secret Detection Metrics

| Metric | NER Only | With Filters (Recommended) |
|---|---|---|
| F1 Score | 0.52 | 0.857 |
| Precision | 82% | 92.3% |
| Recall | 38% | 80.0% |
| P50 Latency | 11ms | 11ms |
| P90 Latency | 14ms | 14ms |
| P99 Latency | 17ms | 17ms |
| Throughput (CPU) | 84 req/s | 84 req/s |

Note: NER-only metrics measured at τ=0.80. Production systems combine NER with post-processing filters (PEM blocks, K8s secrets, pattern matching) to achieve 92.3% precision and 80% recall. Post-processing adds no latency overhead. See BENCHMARK_RESULTS.md for comprehensive benchmarks.

When This Model Is Used (MoE Routing)

The router selects this Fast Expert when (a simplified sketch of the heuristic follows the list):

  • Token count ≤ 512
  • No multi-line structures (PEM blocks, K8s YAML)
  • Simple text patterns
  • Coverage: 92.7% of real-world requests
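
The actual router ships with the SecMask repository; the function below is only a rough sketch of such a heuristic, with names and cut-offs chosen for illustration:

def route(text, tokenizer, max_fast_tokens=512):
    """Return 'fast' for the DistilBERT expert or 'long' for the Longformer expert."""
    has_multiline_structure = (
        "-----BEGIN" in text                                 # PEM blocks
        or ("kind: Secret" in text and "data:" in text)      # Kubernetes Secret manifests
    )
    n_tokens = len(tokenizer(text)["input_ids"])
    return "fast" if n_tokens <= max_fast_tokens and not has_multiline_structure else "long"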

Note: The recommended production configuration is Fast Expert + Filters alone (without the Long Expert). This achieves better results than Full MoE. See Configuration Guide for details.


πŸ—οΈ Model Details

Architecture

  • Base Model: distilbert-base-uncased (66M params, Apache 2.0)
  • Task: Token Classification (NER)
  • Max Sequence Length: 512 tokens
  • Labels: B-SECRET, I-SECRET, O (BIO tagging; illustrated below)
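
As a quick illustration of the BIO scheme (the tokenization shown is approximate, not the exact WordPiece output):

# Token        Label
# my           O
# api          O
# key          O
# is           O
# sk           B-SECRET
# -            I-SECRET
# proj         I-SECRET
# -            I-SECRET
# 1234         I-SECRET
# ...          ...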

Training Details

  • Dataset: Custom SecretMask v2 (6,000 training examples)
  • Optimizer: AdamW (lr=5e-5)
  • Epochs: 3
  • Batch Size: 16
  • Hardware: GPU (NVIDIA A100 or equivalent)
  • Training Time: ~30 minutes

Evaluation

Evaluated on 600 held-out examples from SecretMask v2 test set:

Precision: 0.82
Recall: 0.38
F1: 0.52
Support: 1,021 secret tokens
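
Entity-level scores like these are typically computed with the seqeval library; the toy example below illustrates the metric only, and is not the project's evaluation script:

# pip install seqeval
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["O", "B-SECRET", "I-SECRET", "O", "B-SECRET"]]
y_pred = [["O", "B-SECRET", "I-SECRET", "O", "O"]]

# One of the two true entities is found exactly:
# precision = 1.0, recall = 0.5, F1 ≈ 0.67
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))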

Key Insights:

  • High Precision (82%): Very low false positive rate - safe for production
  • Lower Recall (38%): Misses some secrets when used standalone
  • Production Strategy: Combine with deterministic filters (see filters.py) for PEM blocks, K8s secrets, and AWS patterns to achieve >90% coverage
  • Threshold Tuning: Lower τ from 0.80 to 0.50 for higher recall (trade-off: more false positives); see the sketch below
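
In pipeline terms, τ is simply a cutoff on the score field of each prediction. A minimal sketch, where results is the pipeline output from the Quick Start example:

def apply_threshold(results, tau=0.80):
    """Keep only predictions whose confidence is at least tau."""
    return [r for r in results if r["score"] >= tau]

high_precision = apply_threshold(results, tau=0.80)  # default: fewer, more reliable detections
high_recall = apply_threshold(results, tau=0.50)     # catches more secrets, more false positives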

💡 Use Cases

Production Applications

  1. Pre-Commit Hooks - Prevent secrets in git commits
  2. CI/CD Pipelines - Scan code before deployment
  3. Log Sanitization - Remove secrets from application logs
  4. API Response Filtering - Mask secrets in debug output
  5. Documentation Cleanup - Sanitize before open-sourcing
  6. Security Audits - Scan codebases for exposed credentials

Example: Pre-Commit Hook

#!/usr/bin/env python3
# .git/hooks/pre-commit
import subprocess
import sys

from transformers import pipeline

classifier = pipeline("token-classification", model="AndrewAndrewsen/distilbert-secret-masker")

# Files staged for this commit
staged_files = subprocess.check_output(
    ["git", "diff", "--cached", "--name-only"], text=True
).splitlines()

for file in staged_files:
    with open(file, errors="ignore") as f:
        content = f.read()  # note: files longer than 512 tokens need chunking (see Limitations)
    if classifier(content):
        print(f"❌ Secret detected in {file}!")
        sys.exit(1)

See SecMask Examples for more.


⚠️ Limitations

Known Issues

  1. Token Limit: Cannot handle texts >512 tokens (use the Longformer expert, or chunk the input as sketched after this list)
  2. English Only: Trained on English text
  3. False Negatives: 38% standalone recall means many secrets are missed without the post-processing filters (even with filters, ~20% may be missed)
  4. Context Sensitivity: May struggle with unusual formatting
  5. Novel Patterns: May miss new secret types not in training data
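
Where the Longformer expert is not available, one workaround for the 512-token limit is to scan a long document in overlapping word-based windows. This is a rough sketch, not part of SecMask, and the returned offsets are relative to each chunk:

def scan_long_text(classifier, text, window=350, overlap=50):
    """Run the classifier over overlapping word-based chunks of a long text.
    window/overlap are in words, chosen conservatively to stay under 512 tokens."""
    words = text.split()
    findings = []
    step = window - overlap
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + window])
        findings.extend(classifier(chunk))  # offsets here are chunk-local
    return findings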

Not Suitable For

❌ Non-English text
❌ Binary data or encrypted content
❌ Images/PDFs (extract text first)
❌ Very long documents (use longformer-secret-masker)
❌ Real-time streaming (consider batching)

Recommended Mitigations

  • Combine with filters: Use deterministic filters for PEM blocks, K8s secrets (see SecMask filters)
  • Adjust threshold: Lower tau for higher recall (more false positives)
  • Use MoE system: Automatic routing to appropriate expert
  • Add regex patterns: Supplement with custom patterns for your use case

📜 License & Attribution

Model License

Apache 2.0 (inherited from distilbert-base-uncased)

Base Model Attribution

This model is fine-tuned from:

  • Model: distilbert-base-uncased
  • Authors: Hugging Face
  • License: Apache 2.0
  • Citation:
    @inproceedings{sanh2019distilbert,
      title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
      author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
      booktitle={NeurIPS EMC^2 Workshop},
      year={2019}
    }
    

SecMask Code License

The SecMask inference code and training scripts are licensed under MIT. See GitHub repo.


🔗 Related Models

| Model | Size | Max Tokens | Latency | Use Case |
|---|---|---|---|---|
| distilbert-secret-masker (this model) | 265MB | 512 | 6ms | Short texts, fast routing |
| longformer-secret-masker | 592MB | 2048 | 12ms | Long documents, configs |
| secretmask-gate | 12KB | N/A | +0.2ms | Learned MoE routing |


🤝 Contributing

Issues and contributions welcome! See CONTRIBUTING.md.


Developed by: Anders Andersson (@AndrewAndrewsen)
Part of: SecMask MoE System
