---
license: apache-2.0
language:
- en
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- security
- cybersecurity
---

# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

## Model Description

DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.

**Developed by:** Neeraj Gupta (SpecterOps)  
**Model type:** Token Classification (Sequence Labeling)  
**Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)  
**Language(s):** English  
**License:** Apache 2.0 (same as base model)  
**Fine-tuned with:** LoRA (Low-Rank Adaptation) via Unsloth  
**Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)

## Model Architecture

### Base Model
- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa)
- **Parameters:** ~278M (base model)
- **Max sequence length:** 512 tokens
- **Hidden size:** 768
- **Number of layers:** 12
- **Number of attention heads:** 12

### LoRA Configuration
```python
LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                    # Rank
    lora_alpha=128,          # Scaling parameter
    lora_dropout=0.05,       # Dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"]
)
```

## Intended Use

This model is the BERT-based component of the DeepPass2 pipeline described in the blog post.

### Primary Use Case
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
- **Security Auditing:** Scan documents for potential credential leaks
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing

### Input
- The full DeepPass2 tool accepts documents of any length and automatically splits them into 300-400 token chunks
- A single model input is a text sequence of up to 512 tokens

### Output
- Token-level binary classification:
  - `0`: Non-credential token
  - `1`: Credential/password token

## Training Data

### Dataset Composition
- **Total examples:** 23,000 (20,800 training, 2,200 testing)
- **Document types:** Synthetic Emails, technical documents, logs, configuration files
- **Password sources:**
  - Real breached passwords from CrackStation's "real human" dump
  - Synthetic passwords generated by LLMs
  - Structured tokens (API keys, JWTs, etc.)

### Data Generation Process
1. **Base Documents:** 2,000 long documents (2000+ tokens each) generated using LLMs
   - 50% containing passwords, 50% without
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
3. **Password Injection:** Real passwords inserted using skeleton sentences:
   ```
   "Your account has been created with username: {user} and password: {pass}"
   ```
4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution)
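The chunking step above can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the uniform random chunk size in [300, 400] is an assumption drawn from the description:

```python
import random

def chunk_tokens(tokens, min_len=300, max_len=400, seed=2):
    """Split a token sequence into chunks with random boundaries:
    each chunk's size is drawn uniformly from [min_len, max_len];
    the final chunk may be shorter."""
    rng = random.Random(seed)
    chunks = []
    start = 0
    while start < len(tokens):
        size = rng.randint(min_len, max_len)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks
```

Randomizing the boundary keeps the model from learning that passwords always sit at a fixed position within a chunk.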

## Training Procedure

### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM

### Hyperparameters
- **Epochs:** 4
- **Batch size:** 8 (per device)
- **Weight decay:** 0.01
- **Optimizer:** AdamW (default in Trainer)
- **Learning rate:** Default (5e-5)
- **Max sequence length:** 512 tokens
- **Random seed:** 2

### Training Process
**Preprocessing:**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

**Fine-tuning:**
- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
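The label-generation step can be illustrated with a small, library-free sketch: given the tokenizer's `offset_mapping` (a character span per token) and the character spans of injected credentials, any token that overlaps a credential span gets label `1`. The `-100` sentinel for special tokens follows the usual Hugging Face convention for positions ignored by the loss; the exact details here are assumptions:

```python
def align_labels(offset_mapping, credential_spans):
    """Assign label 1 to any token whose character span overlaps a
    credential span, 0 otherwise; special tokens get -100 (ignored)."""
    labels = []
    for tok_start, tok_end in offset_mapping:
        if tok_start == tok_end:  # special tokens / padding have empty spans
            labels.append(-100)
            continue
        hit = any(tok_start < c_end and c_start < tok_end
                  for c_start, c_end in credential_spans)
        labels.append(1 if hit else 0)
    return labels
```

Overlap-based matching (rather than exact span equality) matters because SentencePiece can fragment a password across several subword tokens.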

## Performance Metrics

### Chunk-Level Metrics
| Metric | Score |
|--------|-------|
| **Strict Accuracy** | 86.67% |
| **Overlap Accuracy** | 97.72% |

### Password-Level Metrics
| Metric | Count/Rate |
|--------|------------|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| **Recall** | 89.7% |

### Definitions
- **Strict Accuracy:** every password in the chunk is detected with an exact span match
- **Overlap Accuracy:** at least one password is detected with >30% overlap with the ground-truth span
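As a sketch, the >30% overlap criterion can be computed from spans as follows (a hypothetical helper for illustration, not the evaluation code behind the numbers above):

```python
def span_overlap_fraction(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span.
    Spans are (start, end) pairs with end exclusive."""
    start = max(pred[0], truth[0])
    end = min(pred[1], truth[1])
    intersection = max(0, end - start)
    return intersection / (truth[1] - truth[0])

def overlap_hit(pred, truth, threshold=0.3):
    """True if the prediction covers more than `threshold` of the truth."""
    return span_overlap_fraction(pred, truth) > threshold
```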

## Limitations and Biases

### Known Limitations
1. **Context window:** Limited to 512 tokens per chunk
2. **Training data:** Primarily trained on LLM-generated documents which may not fully represent real-world documents
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection

### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords

## Ethical Considerations

### Responsible Use
- **Privacy:** This model should only be used on documents you have permission to scan
- **Security:** Detected credentials should be handled securely and not logged or stored insecurely
- **False Positives:** Always verify detected credentials before taking action

### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes

## Usage

### Installation
```bash
pip install transformers torch
```

### Quick Start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    # Collect tokens labeled as credentials; note these are SentencePiece
    # subword pieces and may need merging back into full strings
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label.item() == 1
    ]
    
    return password_tokens
```

### Integration with DeepPass2
For production use, integrate with the full DeepPass2 pipeline:
1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false positive reduction

See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation.
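The three stages can be sketched as a simple composition. Every stage function below is a hypothetical stand-in, not the DeepPass2 API: stage 1 uses a crude keyword regex where NoseyParker applies far richer rules, and stages 2 and 3 are pass-throughs where the real pipeline would run this model and an LLM validator:

```python
import re

def regex_prefilter(text):
    # Stage 1 stand-in: naive "password:" keyword match; NoseyParker
    # uses hundreds of tuned patterns and entropy heuristics
    return [m.span(1) for m in re.finditer(r"(?i)password:\s*(\S+)", text)]

def model_classify(text, spans):
    # Stage 2 stand-in: in production this runs the token-classification
    # model over 300-400 token chunks to confirm/refine candidate spans
    return spans

def llm_validate(text, spans):
    # Stage 3 stand-in: an LLM pass would drop likely false positives here
    return spans

def scan_document(text):
    return llm_validate(text, model_classify(text, regex_prefilter(text)))
```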

## Citation

```bibtex
@software{gupta2025deeppass2,
  author = {Gupta, Neeraj},
  title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year = {2025},
  organization = {SpecterOps},
  url = {https://huggingface.co/deeppass2-bert},
  note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```

## Additional Information

### Model Versions
- **v6.0-BERT**: Current production version with LoRA adapters
- **merged-model**: LoRA weights merged with base model for easier deployment

### Related Links
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
- [NoseyParker](https://github.com/praetorian-inc/noseyparker)

### Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).