---
license: apache-2.0
language:
- en
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- security
- cybersecurity
---

# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

## Model Description

DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.

**Developed by:** Neeraj Gupta (SpecterOps)  
**Model type:** Token Classification (Sequence Labeling)  
**Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)  
**Language(s):** English  
**License:** Apache 2.0 (same as base model)  
**Fine-tuned with:** LoRA (Low-Rank Adaptation) via Unsloth  
**Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)

## Model Architecture

### Base Model
- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa)
- **Parameters:** ~278M (base model)
- **Max sequence length:** 512 tokens
- **Hidden size:** 768
- **Number of layers:** 12
- **Number of attention heads:** 12

### LoRA Configuration
```python
LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                    # Rank
    lora_alpha=128,          # Scaling parameter
    lora_dropout=0.05,       # Dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"]
)
```

## Intended Use

This model is the BERT-based component of the DeepPass2 pipeline described in the blog post.

### Primary Use Case
- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
- **Security Auditing:** Scan documents for potential credential leaks
- **Data Loss Prevention:** Pre-screen documents before sharing or publishing

### Input
- The full DeepPass2 tool accepts documents of any length and automatically splits them into 300-400 token chunks
- A single model input is a text sequence of up to 512 tokens

### Output
- Token-level binary classification:
  - `0`: Non-credential token
  - `1`: Credential/password token

## Training Data

### Dataset Composition
- **Total examples:** 23,000 (20,800 training, 2,200 testing)
- **Document types:** Synthetic Emails, technical documents, logs, configuration files
- **Password sources:**
  - Real breached passwords from CrackStation's "real human" dump
  - Synthetic passwords generated by LLMs
  - Structured tokens (API keys, JWTs, etc.)

### Data Generation Process
1. **Base Documents:** 2,000 long documents (2000+ tokens each) generated using LLMs
   - 50% containing passwords, 50% without
2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
3. **Password Injection:** Real passwords inserted using skeleton sentences:
   ```
   "Your account has been created with username: {user} and password: {pass}"
   ```
4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution)
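The chunking step above can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the uniform random chunk size in [300, 400] is an assumption drawn from the description:

```python
import random

def chunk_tokens(tokens, min_len=300, max_len=400, seed=2):
    """Split a token sequence into chunks with random boundaries:
    each chunk's size is drawn uniformly from [min_len, max_len];
    the final chunk may be shorter."""
    rng = random.Random(seed)
    chunks = []
    start = 0
    while start < len(tokens):
        size = rng.randint(min_len, max_len)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks
```

Randomizing the boundary keeps the model from learning that passwords always sit at a fixed position within a chunk.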

## Training Procedure

### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM

### Hyperparameters
- **Epochs:** 4
- **Batch size:** 8 (per device)
- **Weight decay:** 0.01
- **Optimizer:** AdamW (default in Trainer)
- **Learning rate:** Default (5e-5)
- **Max sequence length:** 512 tokens
- **Random seed:** 2

### Training Process
**Preprocessing:**
- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

**Fine-tuning:**
- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
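The label-generation step can be illustrated with a small, library-free sketch: given the tokenizer's `offset_mapping` (a character span per token) and the character spans of injected credentials, any token that overlaps a credential span gets label `1`. The `-100` sentinel for special tokens follows the usual Hugging Face convention for positions ignored by the loss; the exact details here are assumptions:

```python
def align_labels(offset_mapping, credential_spans):
    """Assign label 1 to any token whose character span overlaps a
    credential span, 0 otherwise; special tokens get -100 (ignored)."""
    labels = []
    for tok_start, tok_end in offset_mapping:
        if tok_start == tok_end:  # special tokens / padding have empty spans
            labels.append(-100)
            continue
        hit = any(tok_start < c_end and c_start < tok_end
                  for c_start, c_end in credential_spans)
        labels.append(1 if hit else 0)
    return labels
```

Overlap-based matching (rather than exact span equality) matters because SentencePiece can fragment a password across several subword tokens.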

## Performance Metrics

### Chunk-Level Metrics
| Metric | Score |
|--------|-------|
| **Strict Accuracy** | 86.67% |
| **Overlap Accuracy** | 97.72% |

### Password-Level Metrics
| Metric | Count/Rate |
|--------|------------|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| **Recall** | 89.7% |

### Definitions
- **Strict Accuracy:** every password in the chunk is detected with an exact span match
- **Overlap Accuracy:** at least one password is detected with >30% overlap with the ground-truth span
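As a sketch, the >30% overlap criterion can be computed from spans as follows (a hypothetical helper for illustration, not the evaluation code behind the numbers above):

```python
def span_overlap_fraction(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span.
    Spans are (start, end) pairs with end exclusive."""
    start = max(pred[0], truth[0])
    end = min(pred[1], truth[1])
    intersection = max(0, end - start)
    return intersection / (truth[1] - truth[0])

def overlap_hit(pred, truth, threshold=0.3):
    """True if the prediction covers more than `threshold` of the truth."""
    return span_overlap_fraction(pred, truth) > threshold
```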

## Limitations and Biases

### Known Limitations
1. **Context window:** Limited to 512 tokens per chunk
2. **Training data:** Primarily trained on LLM-generated documents which may not fully represent real-world documents
3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection

### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords

## Ethical Considerations

### Responsible Use
- **Privacy:** This model should only be used on documents you have permission to scan
- **Security:** Detected credentials should be handled securely and not logged or stored insecurely
- **False Positives:** Always verify detected credentials before taking action

### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes

## Usage

### Installation
```bash
pip install transformers torch
```

### Quick Start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    # Collect tokens labeled as credentials; note these are SentencePiece
    # subword pieces and may need merging back into full strings
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label.item() == 1
    ]
    
    return password_tokens
```

### Integration with DeepPass2
For production use, integrate with the full DeepPass2 pipeline:
1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false positive reduction

See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation.
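The three stages can be sketched as a simple composition. Every stage function below is a hypothetical stand-in, not the DeepPass2 API: stage 1 uses a crude keyword regex where NoseyParker applies far richer rules, and stages 2 and 3 are pass-throughs where the real pipeline would run this model and an LLM validator:

```python
import re

def regex_prefilter(text):
    # Stage 1 stand-in: naive "password:" keyword match; NoseyParker
    # uses hundreds of tuned patterns and entropy heuristics
    return [m.span(1) for m in re.finditer(r"(?i)password:\s*(\S+)", text)]

def model_classify(text, spans):
    # Stage 2 stand-in: in production this runs the token-classification
    # model over 300-400 token chunks to confirm/refine candidate spans
    return spans

def llm_validate(text, spans):
    # Stage 3 stand-in: an LLM pass would drop likely false positives here
    return spans

def scan_document(text):
    return llm_validate(text, model_classify(text, regex_prefilter(text)))
```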

## Citation

```bibtex
@software{gupta2025deeppass2,
  author = {Gupta, Neeraj},
  title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year = {2025},
  organization = {SpecterOps},
  url = {https://huggingface.co/deeppass2-bert},
  note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```

## Additional Information

### Model Versions
- **v6.0-BERT**: Current production version with LoRA adapters
- **merged-model**: LoRA weights merged with base model for easier deployment

### Related Links
- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
- [NoseyParker](https://github.com/praetorian-inc/noseyparker)

### Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).