---
license: apache-2.0
language:
- en
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- security
- cybersecurity
---
# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection
## Model Description
DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.
- **Developed by:** Neeraj Gupta (SpecterOps)
- **Model type:** Token Classification (Sequence Labeling)
- **Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Language(s):** English
- **License:** apache-2.0
- **Fine-tuned with:** LoRA (Low-Rank Adaptation) via Unsloth
- **Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
## Model Architecture
### Base Model
- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa)
- **Parameters:** ~278M (base model)
- **Max sequence length:** 512 tokens
- **Hidden size:** 768
- **Number of layers:** 12
- **Number of attention heads:** 12
### LoRA Configuration
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                  # Rank of the low-rank update matrices
    lora_alpha=128,        # Scaling parameter
    lora_dropout=0.05,     # Dropout probability on LoRA layers
    bias="none",
    target_modules=["query", "key", "value", "dense"],
)
```
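The adapter size implied by this configuration can be estimated with simple arithmetic: each adapted `d_out × d_in` weight gains `r · (d_in + d_out)` trainable parameters. A back-of-the-envelope sketch, assuming the four 768×768 attention projections are adapted in all 12 layers (the exact count depends on which `dense` modules PEFT matches):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA learns a low-rank update B @ A for a frozen d_out x d_in weight,
    # where A is r x d_in and B is d_out x r.
    return r * d_in + d_out * r

hidden = 768   # xlm-roberta-base hidden size
r = 64
layers = 12

per_matrix = lora_params(hidden, hidden, r)   # 98,304 per adapted matrix
per_layer = 4 * per_matrix                    # query, key, value, dense
total = layers * per_layer                    # ~4.7M trainable params

print(f"{per_matrix:,} params per matrix, ~{total:,} total")
```

That is under 2% of the base model's ~278M parameters, which is why the model trains comfortably on consumer hardware.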
## Intended Use
This model is the BERT-style token-classification model described in the DeepPass2 blog post.
### Primary Use Case
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
- **Security Auditing:** Scan documents for potential credential leaks
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing
### Input
- Full DeepPass2 tool: text documents of any length, automatically chunked into 300-400 token segments
- Direct model inference: a single text string of up to 512 tokens
### Output
- Token-level binary classification:
- `0`: Non-credential token
- `1`: Credential/password token
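The binary labels are most useful once contiguous runs of label-1 tokens are merged back into candidate secrets. A minimal post-processing sketch (the function name and list-based interface are illustrative, not part of the model API):

```python
def group_credential_spans(tokens, labels):
    """Merge consecutive label-1 tokens into candidate secret strings."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:
            # SentencePiece marks word starts with "▁"; strip it when joining
            current.append(token.lstrip("\u2581"))
        elif current:
            spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans

tokens = ["▁password", ":", "▁Tr", "0ub", "4dor", "&3", "▁thanks"]
labels = [0, 0, 1, 1, 1, 1, 0]
print(group_credential_spans(tokens, labels))  # ['Tr0ub4dor&3']
```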
## Training Data
### Dataset Composition
- **Total examples:** 23,000 (20,800 training, 2,200 testing)
- **Document types:** Synthetic Emails, technical documents, logs, configuration files
- **Password sources:**
- Real breached passwords from CrackStation's "real human" dump
- Synthetic passwords generated by LLMs
- Structured tokens (API keys, JWTs, etc.)
### Data Generation Process
1. **Base Documents:** 2,000 long documents (2000+ tokens each) generated using LLMs
- 50% containing passwords, 50% without
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
3. **Password Injection:** Real passwords inserted using skeleton sentences:
```
"Your account has been created with username: {user} and password: {pass}"
```
4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution)
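The injection step above can be sketched as a template fill that also records the character span of the inserted password; that span later becomes the token-level label target. The helper and template list are illustrative (the actual generation used many more skeleton sentences):

```python
import random

SKELETONS = [
    "Your account has been created with username: {user} and password: {pw}",
    "Temporary credentials - login: {user} / pass: {pw}",
]

def inject_password(chunk: str, user: str, password: str, rng: random.Random):
    """Append a skeleton sentence; return the text and the password's char span."""
    sentence = rng.choice(SKELETONS).format(user=user, pw=password)
    start = len(chunk) + 1 + sentence.index(password)  # +1 for the joining space
    text = chunk + " " + sentence
    return text, (start, start + len(password))

rng = random.Random(2)  # same seed used for training
text, (s, e) = inject_password("Welcome to the team.", "alice", "hunter2", rng)
assert text[s:e] == "hunter2"
```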
## Training Procedure
### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM
### Hyperparameters
- **Epochs:** 4
- **Batch size:** 8 (per device)
- **Weight decay:** 0.01
- **Optimizer:** AdamW (default in Trainer)
- **Learning rate:** Default (5e-5)
- **Max sequence length:** 512 tokens
- **Random seed:** 2
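These hyperparameters map directly onto Hugging Face `TrainingArguments`; a configuration sketch (the output path is illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="deeppass2-checkpoints",  # illustrative path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    learning_rate=5e-5,   # Trainer default, stated explicitly
    seed=2,
)
```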
### Training Process
**Preprocessing**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

**Fine-tuning**
- LoRA adapters applied to attention layers
- Cross-entropy loss over the two token classes
- Token-level classification head
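The label-generation step can be sketched using offset mappings: a token gets label 1 when its character range overlaps a known credential span, and special tokens (whose offsets are `(0, 0)`) get `-100` so the loss ignores them. A simplified stand-in for the training preprocessing:

```python
def align_labels(offset_mapping, credential_spans):
    """Map character-level credential spans to token-level 0/1 labels.

    offset_mapping: list of (start, end) character offsets per token, as
    returned by tokenizer(..., return_offsets_mapping=True).
    """
    labels = []
    for start, end in offset_mapping:
        if start == end:                     # special tokens like <s>, </s>
            labels.append(-100)              # ignored by the loss
        elif any(start < ce and cs < end for cs, ce in credential_spans):
            labels.append(1)                 # token overlaps a credential
        else:
            labels.append(0)
    return labels

# "pass: hunter2" where the credential occupies characters 6..13
offsets = [(0, 0), (0, 4), (4, 5), (6, 9), (9, 13), (0, 0)]
print(align_labels(offsets, [(6, 13)]))  # [-100, 0, 0, 1, 1, -100]
```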
## Performance Metrics
### Chunk-Level Metrics
| Metric | Score |
|--------|-------|
| **Strict Accuracy** | 86.67% |
| **Overlap Accuracy** | 97.72% |
### Password-Level Metrics
| Metric | Count/Rate |
|--------|------------|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| **Recall** | 89.7% |
### Definitions
- **Strict Accuracy:** All passwords in chunk detected with 100% accuracy
- **Overlap Accuracy:** At least one password detected with >30% overlap with ground truth
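The >30% overlap criterion can be made concrete with a small helper; this is a sketch of the metric as defined above, not the actual evaluation code:

```python
def overlap_ratio(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span."""
    ps, pe = pred
    ts, te = truth
    intersection = max(0, min(pe, te) - max(ps, ts))
    return intersection / (te - ts) if te > ts else 0.0

def chunk_overlap_hit(predicted_spans, truth_spans, threshold=0.3):
    """A chunk counts toward overlap accuracy if any prediction covers
    more than `threshold` of some ground-truth password span."""
    return any(
        overlap_ratio(p, t) > threshold
        for p in predicted_spans
        for t in truth_spans
    )

# prediction (10, 16) covers 6 of the 10 characters of truth (8, 18) -> 60%
assert chunk_overlap_hit([(10, 16)], [(8, 18)])
assert not chunk_overlap_hit([(0, 2)], [(8, 18)])
```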
## Limitations and Biases
### Known Limitations
1. **Context window:** Limited to 512 tokens per chunk
2. **Training data:** Primarily trained on LLM-generated documents which may not fully represent real-world documents
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection
### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords
## Ethical Considerations
### Responsible Use
- **Privacy:** This model should only be used on documents you have permission to scan
- **Security:** Detected credentials should be handled securely and not logged or stored insecurely
- **False Positives:** Always verify detected credentials before taking action
### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # Extract tokens predicted as credentials (label 1)
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1
    ]
    return password_tokens
```
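For documents longer than the 512-token window, the full pipeline first splits input into 300-400 token segments. A whitespace-word approximation of that chunking (the real tool chunks on model tokens; this sketch uses words and the randomized boundaries described in the training-data section):

```python
import random

def chunk_document(text, min_len=300, max_len=400, seed=2):
    """Split text into ~300-400 word chunks with randomized boundaries."""
    rng = random.Random(seed)
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        size = rng.randint(min_len, max_len)
        chunks.append(" ".join(words[i:i + size]))
        i += size
    return chunks

doc = " ".join(f"word{n}" for n in range(1000))
chunks = chunk_document(doc)
# every chunk except possibly the last falls in the 300-400 word range
assert all(300 <= len(c.split()) <= 400 for c in chunks[:-1])
```

Each chunk can then be passed to `detect_passwords` independently.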
### Integration with DeepPass2
For production use, integrate with the full DeepPass2 pipeline:
1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false positive reduction
See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation.
## Citation
```bibtex
@software{gupta2025deeppass2,
author = {Gupta, Neeraj},
title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
year = {2025},
organization = {SpecterOps},
url = {https://huggingface.co/deeppass2-bert},
note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```
## Additional Information
### Model Versions
- **v6.0-BERT**: Current production version with LoRA adapters
- **merged-model**: LoRA weights merged with base model for easier deployment
### Related Links
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
- [NoseyParker](https://github.com/praetorian-inc/noseyparker)
### Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).