---
license: apache-2.0
language:
- en
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- security
- cybersecurity
---

# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

## Model Description

DeepPass2 is a fine-tuned version of `xlm-roberta-base` designed to detect passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, the model uses surrounding context to identify both structured tokens (API keys, JWTs) and free-form passwords.

**Developed by:** Neeraj Gupta (SpecterOps)
**Model type:** Token Classification (Sequence Labeling)
**Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
**Language(s):** English
**License:** Apache 2.0
**Fine-tuned with:** LoRA (Low-Rank Adaptation) through Unsloth
**Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)

## Model Architecture

### Base Model
- **Architecture:** XLM-RoBERTa-base (cross-lingual RoBERTa)
- **Parameters:** ~278M (base model)
- **Max sequence length:** 512 tokens
- **Hidden size:** 768
- **Number of layers:** 12
- **Number of attention heads:** 12

### LoRA Configuration
```python
from peft import LoraConfig, TaskType

LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,               # rank
    lora_alpha=128,     # scaling parameter
    lora_dropout=0.05,  # dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"]
)
```

## Intended Use

This model is the BERT-based token classifier used in the DeepPass2 blog post.

### Primary Use Case
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
- **Security Auditing:** Scan documents for potential credential leaks
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing

### Input
- The full DeepPass2 tool accepts text documents of any length, automatically chunked into 300-400 token segments
- The model itself accepts a single text string of up to 512 tokens per inference call

### Output
- Token-level binary classification:
  - `0`: Non-credential token
  - `1`: Credential/password token

## Training Data

### Dataset Composition
- **Total examples:** 23,000 (20,800 training, 2,200 testing)
- **Document types:** Synthetic emails, technical documents, logs, configuration files
- **Password sources:**
  - Real breached passwords from CrackStation's "real human" dump
  - Synthetic passwords generated by LLMs
  - Structured tokens (API keys, JWTs, etc.)

### Data Generation Process
1. **Base Documents:** 2,000 long documents (2,000+ tokens each) generated using LLMs
   - 50% containing passwords, 50% without
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
3. **Password Injection:** Real passwords inserted using skeleton sentences:
   ```
   "Your account has been created with username: {user} and password: {pass}"
   ```
4. **Class Balance:** <0.3% of tokens are passwords (maintaining a realistic real-world distribution)
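
The injection step above can be sketched as follows; the skeleton list and helper are illustrative, not the actual DeepPass2 generation code:

```python
import random

# Illustrative skeleton sentences; the real dataset uses many more variants.
SKELETONS = [
    "Your account has been created with username: {user} and password: {password}",
    "Temporary login for {user} -- password: {password}",
]

def inject_password(chunk: str, user: str, password: str):
    """Append a credential sentence to a chunk and return the new text
    plus the (start, end) character span of the password for labeling."""
    sentence = random.choice(SKELETONS).format(user=user, password=password)
    text = chunk.rstrip() + " " + sentence
    start = text.rindex(password)
    return text, (start, start + len(password))

text, (s, e) = inject_password("Meeting notes from Tuesday.", "alice", "tr0ub4dor&3")
assert text[s:e] == "tr0ub4dor&3"
```

Tracking the character span of each injected password is what later allows token-level labels to be derived from the tokenizer's offset mapping.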

## Training Procedure

### Hardware
- Trained on a MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM

### Hyperparameters
- **Epochs:** 4
- **Batch size:** 8 (per device)
- **Weight decay:** 0.01
- **Optimizer:** AdamW (default in Trainer)
- **Learning rate:** Default (5e-5)
- **Max sequence length:** 512 tokens
- **Random seed:** 2

### Training Process

**Preprocessing**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

**Fine-tuning**
- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
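
A sketch of the label-generation step, assuming credential spans are given as character offsets (the helper is illustrative; `-100` is the conventional ignore index for special tokens):

```python
def labels_from_spans(offset_mapping, spans):
    """Assign label 1 to any token whose character range overlaps a
    credential span, 0 otherwise; special tokens get the ignore index -100."""
    labels = []
    for start, end in offset_mapping:
        if start == end:  # special tokens ([CLS]/[SEP]/padding) have empty ranges
            labels.append(-100)
        elif any(s < end and start < e for s, e in spans):
            labels.append(1)
        else:
            labels.append(0)
    return labels

# "password: hunter2" with the password at characters 10-17
offsets = [(0, 0), (0, 8), (8, 9), (10, 17), (0, 0)]
assert labels_from_spans(offsets, [(10, 17)]) == [-100, 0, 0, 1, -100]
```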

## Performance Metrics

### Chunk-Level Metrics
| Metric | Score |
|--------|-------|
| **Strict Accuracy** | 86.67% |
| **Overlap Accuracy** | 97.72% |

### Password-Level Metrics
| Metric | Count/Rate |
|--------|------------|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| **Recall** | 89.7% |

### Definitions
- **Strict Accuracy:** Every password in a chunk is detected with exact boundaries
- **Overlap Accuracy:** At least one password is detected with >30% overlap with the ground truth
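
The >30% overlap criterion can be made concrete with a small helper (illustrative; the actual evaluation code may differ):

```python
def overlap_ratio(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span."""
    (ps, pe), (ts, te) = pred, truth
    intersection = max(0, min(pe, te) - max(ps, ts))
    return intersection / (te - ts)

# Prediction covers 5 of the 10 ground-truth characters -> 0.5 overlap
assert overlap_ratio((15, 25), (10, 20)) == 0.5
# Disjoint spans -> 0.0 overlap
assert overlap_ratio((0, 3), (10, 20)) == 0.0
```

Under this definition, a detection counts toward Overlap Accuracy whenever `overlap_ratio` exceeds 0.3 for some ground-truth password in the chunk.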

## Limitations and Biases

### Known Limitations
1. **Context window:** Limited to 512 tokens per chunk
2. **Training data:** Primarily LLM-generated documents, which may not fully represent real-world documents
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection

### Potential Biases
- May over-detect in technical documentation due to the training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords

## Ethical Considerations

### Responsible Use
- **Privacy:** Use this model only on documents you have permission to scan
- **Security:** Handle detected credentials securely; never log or store them in plaintext
- **False Positives:** Always verify detected credentials before taking action

### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or other malicious purposes

## Usage

### Installation
```bash
pip install transformers torch
```

### Quick Start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Extract tokens labeled as credentials
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1
    ]

    return password_tokens
```

### Integration with DeepPass2
For production use, integrate this model with the full DeepPass2 pipeline:
1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false-positive reduction

See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for the complete implementation.
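
The three stages compose as a simple filter chain; all function names below are hypothetical stand-ins for the respective tools, shown only to illustrate the control flow:

```python
from typing import Callable

def scan_document(
    text: str,
    regex_scan: Callable[[str], list],       # stage 1: NoseyParker-style regex prefilter
    bert_detect: Callable[[str], list],      # stage 2: this model's token classification
    llm_validate: Callable[[str, str], bool],  # stage 3: LLM check to cut false positives
) -> list:
    candidates = set(regex_scan(text)) | set(bert_detect(text))
    return sorted(c for c in candidates if llm_validate(text, c))

# Toy stand-ins demonstrating the flow; a pretend validator rejects pure numbers
hits = scan_document(
    "password: hunter2 and build id 12345",
    regex_scan=lambda t: ["12345"],
    bert_detect=lambda t: ["hunter2"],
    llm_validate=lambda t, c: not c.isdigit(),
)
assert hits == ["hunter2"]
```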

## Citation

```bibtex
@software{gupta2025deeppass2,
  author = {Gupta, Neeraj},
  title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year = {2025},
  organization = {SpecterOps},
  url = {https://huggingface.co/deeppass2-bert},
  note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```

## Additional Information

### Model Versions
- **v6.0-BERT**: Current production version with LoRA adapters
- **merged-model**: LoRA weights merged into the base model for easier deployment

### Related Links
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
- [NoseyParker](https://github.com/praetorian-inc/noseyparker)

### Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).