---
license: apache-2.0
language:
- en
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- security
- cybersecurity
---
# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection
## Model Description
DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.
- **Developed by:** Neeraj Gupta (SpecterOps)
- **Model type:** Token Classification (Sequence Labeling)
- **Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
- **Language(s):** English
- **License:** Apache 2.0
- **Fine-tuned with:** LoRA (Low-Rank Adaptation) via Unsloth
- **Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
## Model Architecture
### Base Model
- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa)
- **Parameters:** ~278M (base model)
- **Max sequence length:** 512 tokens
- **Hidden size:** 768
- **Number of layers:** 12
- **Number of attention heads:** 12
### LoRA Configuration
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                    # Rank
    lora_alpha=128,          # Scaling parameter
    lora_dropout=0.05,       # Dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"],
)
```
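With the same config, the adapters can be attached using `peft`'s `get_peft_model`; a minimal sketch (the actual training used Unsloth, so the exact setup may differ):

```python
from peft import get_peft_model
from transformers import AutoModelForTokenClassification

# Base model with a 2-label token-classification head (0 = non-credential, 1 = credential)
base_model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=2
)

# `lora_config` is the LoraConfig defined above
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapters and the head remain trainable
```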
## Intended Use
This model is the BERT-based detection stage used in the DeepPass2 pipeline described in the blog post above.
### Primary Use Case
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
- **Security Auditing:** Scan documents for potential credential leaks
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing
### Input
- Text documents of any length (automatically chunked into 300-400 token segments) for DeepPass2 complete tool
- Text string of 512 tokens for particular instance of input to the Model
### Output
- Token-level binary classification:
- `0`: Non-credential token
- `1`: Credential/password token
## Training Data
### Dataset Composition
- **Total examples:** 23,000 (20,800 training, 2,200 testing)
- **Document types:** Synthetic Emails, technical documents, logs, configuration files
- **Password sources:**
- Real breached passwords from CrackStation's "real human" dump
- Synthetic passwords generated by LLMs
- Structured tokens (API keys, JWTs, etc.)
### Data Generation Process
1. **Base Documents:** 2,000 long documents (2000+ tokens each) generated using LLMs
- 50% containing passwords, 50% without
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
3. **Password Injection:** Real passwords inserted using skeleton sentences (sketched after this list):
```
"Your account has been created with username: {user} and password: {pass}"
```
4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution)
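A minimal sketch of the chunking and injection steps above; the skeleton list, helper names, and whitespace-based token handling are illustrative stand-ins for the actual generation pipeline:

```python
import random

# Illustrative skeletons; the real pipeline uses more templates
SKELETONS = [
    "Your account has been created with username: {user} and password: {password}",
    "Temporary login - user: {user}, password: {password}",
]

def chunk_document(tokens, lo=300, hi=400):
    """Split a long document into chunks with random 300-400 token boundaries."""
    chunks, i = [], 0
    while i < len(tokens):
        size = random.randint(lo, hi)
        chunks.append(tokens[i:i + size])
        i += size
    return chunks

def inject_password(chunk_tokens, user, password):
    """Insert a skeleton sentence carrying a real password at a random position
    and return the chunk text plus the password's character span."""
    sentence = random.choice(SKELETONS).format(user=user, password=password)
    pos = random.randrange(len(chunk_tokens) + 1)
    text = " ".join(chunk_tokens[:pos] + sentence.split() + chunk_tokens[pos:])
    start = text.find(password)
    return text, (start, start + len(password))
```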
## Training Procedure
### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM
### Hyperparameters
- **Epochs:** 4
- **Batch size:** 8 (per device)
- **Weight decay:** 0.01
- **Optimizer:** AdamW (default in Trainer)
- **Learning rate:** Default (5e-5)
- **Max sequence length:** 512 tokens
- **Random seed:** 2
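These settings map directly onto `transformers.TrainingArguments`; a minimal sketch, with the output directory as a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./deeppass2-checkpoints",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    learning_rate=5e-5,  # Trainer's default, stated explicitly
    seed=2,
)
```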
### Training Process
**Preprocessing**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to `max_length` with truncation

**Fine-tuning**
- LoRA adapters applied to the attention layers
- Binary cross-entropy loss over token labels
- Token-level classification head
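A minimal sketch of the preprocessing above, assuming each example carries character-level credential spans (`secret_spans` is an illustrative name):

```python
def make_labels(text, secret_spans, tokenizer, max_length=512):
    """Convert character-level credential spans into per-token 0/1 labels
    using the tokenizer's offset mapping."""
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_offsets_mapping=True,
    )
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:                              # special or padding token
            labels.append(-100)                       # ignored by the loss
        elif any(s < end and start < e for s, e in secret_spans):
            labels.append(1)                          # token overlaps a credential span
        else:
            labels.append(0)
    enc["labels"] = labels
    return enc
```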
## Performance Metrics
### Chunk-Level Metrics
| Metric | Score |
|--------|-------|
| **Strict Accuracy** | 86.67% |
| **Overlap Accuracy** | 97.72% |
### Password-Level Metrics
| Metric | Count/Rate |
|--------|------------|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| **Recall** | 89.7% |
### Definitions
- **Strict Accuracy:** All passwords in chunk detected with 100% accuracy
- **Overlap Accuracy:** At least one password detected with >30% overlap with ground truth
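A minimal sketch of the overlap test behind these definitions; measuring the overlap against the ground-truth span length is an assumption:

```python
def overlap_ratio(pred_span, true_span):
    """Fraction of the ground-truth character span covered by the prediction."""
    (ps, pe), (ts, te) = pred_span, true_span
    inter = max(0, min(pe, te) - max(ps, ts))
    return inter / (te - ts) if te > ts else 0.0

def overlap_hit(pred_spans, true_span, threshold=0.30):
    """True if any predicted span covers more than 30% of the ground truth."""
    return any(overlap_ratio(p, true_span) > threshold for p in pred_spans)
```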
## Limitations and Biases
### Known Limitations
1. **Context window:** Limited to 512 tokens per chunk
2. **Training data:** Trained primarily on LLM-generated documents, which may not fully represent real-world text
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection
### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords
## Ethical Considerations
### Responsible Use
- **Privacy:** This model should only be used on documents you have permission to scan
- **Security:** Detected credentials should be handled securely and not logged or stored insecurely
- **False Positives:** Always verify detected credentials before taking action
### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Extract password tokens (label 1 = credential)
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1
    ]
    return password_tokens
```
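Example invocation, reusing the skeleton sentence from the training-data section (the password is made up):

```python
text = "Your account has been created with username: alice and password: xK9#mPa2!"
print(detect_passwords(text))
# Prints the subword pieces the model flags as credential tokens
```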
### Integration with DeepPass2
For production use, integrate with the full DeepPass2 pipeline:
1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false positive reduction
See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation.
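For orientation, a rough sketch of how the three stages chain together; the regex stage here is a simple stand-in for NoseyParker, and `chunk_text` and `llm_validate` are hypothetical hooks rather than the actual DeepPass2 code:

```python
import re

# Stage 1 stand-in: cheap keyword prefilter (NoseyParker does real regex scanning)
CANDIDATE_RE = re.compile(r"password|passwd|secret|api[_-]?key|token", re.IGNORECASE)

def scan_document(text):
    findings = []
    for chunk in chunk_text(text):                      # hypothetical 300-400 token chunker
        if not CANDIDATE_RE.search(chunk):              # stage 1: drop obvious negatives
            continue
        hits = detect_passwords(chunk)                  # stage 2: this model (see Quick Start)
        if hits:
            findings.extend(llm_validate(chunk, hits))  # stage 3: hypothetical LLM check
    return findings
```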
## Citation
```bibtex
@software{gupta2025deeppass2,
  author       = {Gupta, Neeraj},
  title        = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year         = {2025},
  organization = {SpecterOps},
  url          = {https://huggingface.co/deeppass2-bert},
  note         = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```
## Additional Information
### Model Versions
- **v6.0-BERT**: Current production version with LoRA adapters
- **merged-model**: LoRA weights merged with base model for easier deployment
### Related Links
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
- [NoseyParker](https://github.com/praetorian-inc/noseyparker)
### Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).