|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- FacebookAI/xlm-roberta-base |
|
|
pipeline_tag: token-classification |
|
|
tags: |
|
|
- security |
|
|
- cybersecurity |
|
|
--- |
|
|
|
|
|
# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection |
|
|
|
|
|
## Model Description |
|
|
|
|
|
DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords. |
|
|
|
|
|
**Developed by:** Neeraj Gupta (SpecterOps) |
|
|
**Model type:** Token Classification (Sequence Labeling) |
|
|
**Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) |
|
|
**Language(s):** English |
|
|
**License:** Apache 2.0
|
|
**Fine-tuned with:** LoRA (Low-Rank Adaptation) through Unsloth |
|
|
**Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/) |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
### Base Model |
|
|
- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa) |
|
|
- **Parameters:** ~278M (base model) |
|
|
- **Max sequence length:** 512 tokens |
|
|
- **Hidden size:** 768 |
|
|
- **Number of layers:** 12 |
|
|
- **Number of attention heads:** 12 |
|
|
|
|
|
### LoRA Configuration |
|
|
```python |
|
|
LoraConfig( |
|
|
task_type=TaskType.TOKEN_CLS, |
|
|
r=64, # Rank |
|
|
lora_alpha=128, # Scaling parameter |
|
|
lora_dropout=0.05, # Dropout probability |
|
|
bias="none", |
|
|
target_modules=["query", "key", "value", "dense"] |
|
|
) |
|
|
``` |
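The `r` and `lora_alpha` values above set the scale of the low-rank update (`lora_alpha / r`). A minimal plain-Python sketch of that arithmetic (an illustration only, not the actual peft implementation):

```python
# LoRA replaces a full weight update with a low-rank product:
#   W' = W + (alpha / r) * (B @ A), where A is r x d_in and B is d_out x r.
def lora_update(W, A, B, r, alpha):
    """Apply a scaled low-rank LoRA update to a weight matrix (lists of lists)."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    out = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            out[i][j] += scale * delta
    return out

# With r=64 and lora_alpha=128 as configured above, the update is scaled by 128/64 = 2.0.
```
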
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is the BERT-based token-classification model used in the DeepPass2 pipeline described in the blog post.
|
|
|
|
|
### Primary Use Case |
|
|
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents |
|
|
- **Security Auditing:** Scan documents for potential credential leaks |
|
|
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing |
|
|
|
|
|
### Input |
|
|
- Text documents of any length when run through the complete DeepPass2 tool (automatically chunked into 300-400 token segments)


- A single text string of up to 512 tokens when the model is called directly
|
|
|
|
|
### Output |
|
|
- Token-level binary classification: |
|
|
- `0`: Non-credential token |
|
|
- `1`: Credential/password token |
|
|
|
|
|
## Training Data |
|
|
|
|
|
### Dataset Composition |
|
|
- **Total examples:** 23,000 (20,800 training, 2,200 testing) |
|
|
- **Document types:** Synthetic Emails, technical documents, logs, configuration files |
|
|
- **Password sources:** |
|
|
- Real breached passwords from CrackStation's "real human" dump |
|
|
- Synthetic passwords generated by LLMs |
|
|
- Structured tokens (API keys, JWTs, etc.) |
|
|
|
|
|
### Data Generation Process |
|
|
1. **Base Documents:** 2,000 long documents (2000+ tokens each) generated using LLMs |
|
|
- 50% containing passwords, 50% without |
|
|
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries |
|
|
3. **Password Injection:** Real passwords inserted using skeleton sentences: |
|
|
``` |
|
|
"Your account has been created with username: {user} and password: {pass}" |
|
|
``` |
|
|
4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution) |
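The chunking and injection steps above can be sketched as follows (a simplified illustration: whitespace tokenization and the random-boundary logic here are assumptions, not the exact generation code):

```python
import random

SKELETON = "Your account has been created with username: {user} and password: {pw}"

def chunk_words(words, lo=300, hi=400, rng=random):
    """Split a word list into chunks of random length between lo and hi."""
    chunks, i = [], 0
    while i < len(words):
        size = rng.randint(lo, hi)
        chunks.append(words[i:i + size])
        i += size
    return chunks

def inject_password(chunk, user, pw, rng=random):
    """Insert a skeleton sentence carrying a real password at a random position."""
    pos = rng.randint(0, len(chunk))
    return chunk[:pos] + SKELETON.format(user=user, pw=pw).split() + chunk[pos:]
```
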
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hardware |
|
|
- Trained on MacBook Pro (64GB RAM) with MPS acceleration |
|
|
- Can be trained on systems with 8-16GB RAM |
|
|
|
|
|
### Hyperparameters |
|
|
- **Epochs:** 4 |
|
|
- **Batch size:** 8 (per device) |
|
|
- **Weight decay:** 0.01 |
|
|
- **Optimizer:** AdamW (default in Trainer) |
|
|
- **Learning rate:** Default (5e-5) |
|
|
- **Max sequence length:** 512 tokens |
|
|
- **Random seed:** 2 |
|
|
|
|
|
### Training Process |
|
|
**Preprocessing:**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to `max_length` with truncation

**Fine-tuning:**
- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
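The label-generation step can be illustrated with plain character offsets (a simplified sketch; the real preprocessing works from the tokenizer's `offset_mapping`, and the helper name is an assumption):

```python
def labels_from_spans(offsets, secret_spans):
    """Assign label 1 to any token whose character span overlaps a credential span.

    offsets: list of (start, end) character offsets per token; (0, 0) marks
    special/padding tokens and receives the ignore index -100.
    secret_spans: list of (start, end) character spans of known credentials.
    """
    labels = []
    for start, end in offsets:
        if start == end:               # special token or padding
            labels.append(-100)
        elif any(start < s_end and end > s_start
                 for s_start, s_end in secret_spans):
            labels.append(1)           # credential token
        else:
            labels.append(0)           # non-credential token
    return labels
```
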
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
### Chunk-Level Metrics |
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| **Strict Accuracy** | 86.67% | |
|
|
| **Overlap Accuracy** | 97.72% | |
|
|
|
|
|
### Password-Level Metrics |
|
|
| Metric | Count/Rate | |
|
|
|--------|------------| |
|
|
| True Positives | 1,201 | |
|
|
| True Negatives | 1,112 | |
|
|
| False Positives | 49 (3.9%) | |
|
|
| False Negatives | 138 | |
|
|
| Overlap True Positives | 456 | |
|
|
| **Recall** | 89.7% | |
|
|
|
|
|
### Definitions |
|
|
- **Strict Accuracy:** Every password in the chunk is detected with exact boundaries
|
|
- **Overlap Accuracy:** At least one password detected with >30% overlap with ground truth |
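These two definitions can be computed from per-chunk predicted and ground-truth spans; a hedged sketch (the exact-set-match reading of "strict" and the span conventions are assumptions):

```python
def overlap_ratio(pred, truth):
    """Fraction of the ground-truth span covered by a predicted span."""
    start = max(pred[0], truth[0])
    end = min(pred[1], truth[1])
    return max(0, end - start) / (truth[1] - truth[0])

def chunk_strict(pred_spans, true_spans):
    """Strict: predicted spans match the ground-truth spans exactly."""
    return set(pred_spans) == set(true_spans)

def chunk_overlap(pred_spans, true_spans, threshold=0.3):
    """Overlap: at least one truth span is covered >threshold by a prediction."""
    return any(overlap_ratio(p, t) > threshold
               for t in true_spans for p in pred_spans)

# Sanity check against the table above: recall = TP / (TP + FN)
#   1201 / (1201 + 138) ~= 0.897, i.e. 89.7%
```
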
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
1. **Context window:** Limited to 512 tokens per chunk |
|
|
2. **Training data:** Primarily trained on LLM-generated documents which may not fully represent real-world documents |
|
|
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words |
|
|
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection |
|
|
|
|
|
### Potential Biases |
|
|
- May over-detect in technical documentation due to training distribution |
|
|
- Tends to flag alphanumeric strings more readily than common words used as passwords |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
### Responsible Use |
|
|
- **Privacy:** This model should only be used on documents you have permission to scan |
|
|
- **Security:** Detected credentials should be handled securely and not logged or stored insecurely |
|
|
- **False Positives:** Always verify detected credentials before taking action |
|
|
|
|
|
### Misuse Potential |
|
|
- Should not be used to scan documents without authorization |
|
|
- Not intended for credential harvesting or malicious purposes |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
```bash |
|
|
pip install transformers torch |
|
|
``` |
|
|
|
|
|
### Quick Start |
|
|
```python |
|
|
from transformers import AutoModelForTokenClassification, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "path/to/deeppass2-xlm-roberta" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForTokenClassification.from_pretrained(model_name) |
|
|
|
|
|
# Classify tokens |
|
|
def detect_passwords(text): |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
predictions = torch.argmax(outputs.logits, dim=-1) |
|
|
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) |
|
|
|
|
|
# Extract password tokens |
|
|
password_tokens = [ |
|
|
token for token, label in zip(tokens, predictions[0]) |
|
|
if label == 1 |
|
|
] |
|
|
|
|
|
return password_tokens |
|
|
``` |
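Because XLM-RoBERTa uses SentencePiece, the tokens returned above are subword pieces (a `▁` prefix marks a word start). A small helper can merge consecutive flagged pieces back into readable strings (a sketch; the merging convention is an assumption):

```python
def merge_pieces(tokens, labels):
    """Merge consecutive label-1 SentencePiece pieces into whole strings."""
    secrets, current = [], []
    for token, label in zip(tokens, labels):
        if label == 1:
            if token.startswith("\u2581") and current:  # '▁' starts a new word
                secrets.append("".join(current))
                current = []
            current.append(token.lstrip("\u2581"))
        elif current:
            secrets.append("".join(current))
            current = []
    if current:
        secrets.append("".join(current))
    return secrets
```
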
|
|
|
|
|
### Integration with DeepPass2 |
|
|
For production use, integrate with the full DeepPass2 pipeline: |
|
|
1. NoseyParker regex filtering |
|
|
2. BERT token classification (this model) |
|
|
3. LLM validation for false positive reduction |
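The three stages above can be sketched as a single pass (an illustration only: the regex is a toy stand-in for NoseyParker's rule set, and `model_detect`/`llm_validate` are hypothetical hooks, not real APIs):

```python
import re

# Toy high-entropy candidate pattern; NoseyParker ships far richer rules.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9_\-]{12,}")

def scan(text, model_detect, llm_validate):
    """Regex prefilter -> token-classifier detection -> LLM validation."""
    if not CANDIDATE_RE.search(text):      # stage 1: cheap regex gate
        return []
    candidates = model_detect(text)        # stage 2: this model
    return [c for c in candidates if llm_validate(text, c)]  # stage 3
```
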
|
|
|
|
|
See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@software{gupta2025deeppass2, |
|
|
author = {Gupta, Neeraj}, |
|
|
title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection}, |
|
|
year = {2025}, |
|
|
organization = {SpecterOps}, |
|
|
url = {https://huggingface.co/deeppass2-bert}, |
|
|
note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Additional Information |
|
|
|
|
|
### Model Versions |
|
|
- **v6.0-BERT**: Current production version with LoRA adapters |
|
|
- **merged-model**: LoRA weights merged with base model for easier deployment |
|
|
|
|
|
### Related Links |
|
|
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/) |
|
|
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00) |
|
|
- [NoseyParker](https://github.com/praetorian-inc/noseyparker) |
|
|
|
|
|
### Contact |
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).