gneeraj/deeppass2-bert

+---
+license: apache-2.0
+language:
+- en
+base_model:
+- FacebookAI/xlm-roberta-base
+pipeline_tag: token-classification
+tags:
+- security
+- cybersecurity
+---
+# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection
+## Model Description
+DeepPass2 is a fine-tuned version of `xlm-roberta-base` specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.
+**Developed by:** Neeraj Gupta (SpecterOps)
+**Model type:** Token Classification (Sequence Labeling)
+**Base model:** [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
+**Language(s):** English
+**License:** Apache-2.0
+**Fine-tuned with:** LoRA (Low-Rank Adaptation) through Unsloth
+**Blog post:** [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
+## Model Architecture
+### Base Model
+- **Architecture:** XLM-RoBERTa-base (Cross-lingual RoBERTa)
+- **Parameters:** ~278M (base model)
+- **Max sequence length:** 512 tokens
+- **Hidden size:** 768
+- **Number of layers:** 12
+- **Number of attention heads:** 12
+### LoRA Configuration
+```python
+from peft import LoraConfig, TaskType
+# LoRA adapter settings used for fine-tuning
+lora_config = LoraConfig(
+    task_type=TaskType.TOKEN_CLS,
+    r=64,                    # Rank
+    lora_alpha=128,          # Scaling parameter
+    lora_dropout=0.05,       # Dropout probability
+    bias="none",
+    target_modules=["query", "key", "value", "dense"]
+)
+```
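For intuition about what these values imply, the sketch below works out the adapter scaling factor and the number of weights LoRA adds to a single 768×768 projection. It assumes the standard LoRA formulation (a low-rank update `B @ A` scaled by `lora_alpha / r`) and uses the hidden size listed under Base Model; the helper name `lora_params` is illustrative only.

```python
# Values taken from the LoRA configuration and base-model specs in this card.
r = 64
lora_alpha = 128
hidden_size = 768

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

scaling = lora_alpha / r                      # factor applied to the low-rank update
adapter = lora_params(hidden_size, hidden_size, r)
full = hidden_size * hidden_size              # weights in the frozen projection itself

print(f"scaling factor: {scaling}")                              # 2.0
print(f"adapter weights per 768x768 projection: {adapter:,}")    # 98,304
print(f"frozen weights in that projection: {full:,}")            # 589,824
```

The same formula applies to the rectangular feed-forward projections that the `"dense"` target also matches; only `d_in` and `d_out` change.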
+## Intended Use
+This is the BERT-based token classifier described in the DeepPass2 blog post.
+### Primary Use Case
+- **Secret Detection:** Identify passwords, API keys, tokens, and other sensitive credentials in documents
+- **Security Auditing:** Scan documents for potential credential leaks
+- **Data Loss Prevention:** Pre-screen documents before sharing or publishing
+### Input
+- **Full DeepPass2 tool:** text documents of any length, automatically chunked into 300-400 token segments
+- **This model on its own:** a single text sequence of up to 512 tokens per call
+### Output
+- Token-level binary classification:
+  - `0`: Non-credential token
+  - `1`: Credential/password token
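To turn these per-token labels back into readable strings, adjacent tokens labelled `1` can be merged into character spans. The sketch below assumes you already have the `(start, end)` character offsets for each token (for example from a fast tokenizer called with `return_offsets_mapping=True`). The function name `labels_to_spans` and the toy offsets are illustrative and not part of DeepPass2.

```python
from typing import List, Tuple

def labels_to_spans(
    offsets: List[Tuple[int, int]], labels: List[int]
) -> List[Tuple[int, int]]:
    """Merge runs of consecutive label-1 tokens into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    current = None  # span being built, or None when outside a credential
    for (start, end), label in zip(offsets, labels):
        if label == 1 and end > start:       # ignore zero-width special tokens
            if current is None:
                current = [start, end]
            else:
                current[1] = end             # extend the open span
        elif current is not None:
            spans.append((current[0], current[1]))
            current = None
    if current is not None:                  # close a span that reaches the end
        spans.append((current[0], current[1]))
    return spans

text = "password: hunter2!"
# Hand-written toy offsets standing in for real tokenizer output.
offsets = [(0, 0), (0, 8), (8, 9), (10, 14), (14, 16), (16, 18), (0, 0)]
labels = [0, 0, 0, 1, 1, 1, 0]
spans = labels_to_spans(offsets, labels)
print([text[s:e] for s, e in spans])  # ['hunter2!']
```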
+## Training Data
+### Dataset Composition
+- **Total examples:** 23,000 (20,800 training, 2,200 testing)
+- **Document types:** Synthetic emails, technical documents, logs, configuration files
+- **Password sources:**
+  - Real breached passwords from CrackStation's "real human" dump
+  - Synthetic passwords generated by LLMs
+  - Structured tokens (API keys, JWTs, etc.)
+### Data Generation Process
+1. **Base Documents:** 2,000 long documents (2,000+ tokens each) generated using LLMs
+   - 50% containing passwords, 50% without
+2. **Chunking:** Documents split into 300-400 token chunks with random boundaries
+3. **Password Injection:** Real passwords inserted using skeleton sentences:
+   ```
+   "Your account has been created with username: {user} and password: {pass}"
+   ```
+4. **Class Balance:** <0.3% of tokens are passwords (maintaining real-world distribution)
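Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generation script: whitespace-separated words stand in for tokenizer tokens, the sentence is appended at the end of the chunk (the real insertion position is not described), and `random_chunks` and `inject_password` are hypothetical helper names. The placeholder is renamed from `{pass}` to `{pw}` because `pass` is a Python keyword.

```python
import random
from typing import List, Tuple

def random_chunks(tokens: List[str], rng: random.Random,
                  lo: int = 300, hi: int = 400) -> List[List[str]]:
    """Split tokens into consecutive chunks of random length in [lo, hi]."""
    chunks, i = [], 0
    while i < len(tokens):
        size = rng.randint(lo, hi)           # random boundary for this chunk
        chunks.append(tokens[i:i + size])    # the final chunk may be shorter
        i += size
    return chunks

SKELETON = "Your account has been created with username: {user} and password: {pw}"

def inject_password(chunk_text: str, user: str, pw: str) -> Tuple[str, Tuple[int, int]]:
    """Append a filled skeleton sentence; return the text and the password's char span."""
    sentence = SKELETON.format(user=user, pw=pw)
    new_text = chunk_text + " " + sentence
    start = len(new_text) - len(pw)          # the password ends the sentence
    return new_text, (start, len(new_text))

rng = random.Random(2)
doc_tokens = [f"w{i}" for i in range(2000)]  # placeholder for a 2,000-token document
chunks = random_chunks(doc_tokens, rng)
text, (s, e) = inject_password(" ".join(chunks[0]), "jdoe", "Tr0ub4dor&3")
print(len(chunks), text[s:e])
```

The returned character span is what a later labelling step would use to mark password tokens.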
+## Training Procedure
+### Hardware
+- Trained on MacBook Pro (64GB RAM) with MPS acceleration
+- Can be trained on systems with 8-16GB RAM
+### Hyperparameters
+- **Epochs:** 4
+- **Batch size:** 8 (per device)
+- **Weight decay:** 0.01
+- **Optimizer:** AdamW (default in Trainer)
+- **Learning rate:** Default (5e-5)
+- **Max sequence length:** 512 tokens
+- **Random seed:** 2
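As a rough sanity check on training length, these settings imply the following number of optimizer steps, assuming a single device and no gradient accumulation (neither is stated in this card).

```python
import math

train_examples = 20_800   # training split from "Dataset Composition"
batch_size = 8            # per device
epochs = 4

# Assumption: one device and gradient_accumulation_steps = 1.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 2600 10400
```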
+### Training Process
+```
+# Preprocessing
+- Tokenization with offset mapping
+- Label generation based on credential spans
+- Padding to max_length with truncation
+# Fine-tuning
+- LoRA adapters applied to the query, key, value, and dense projections
+- Binary cross-entropy loss
+- Token-level classification head
+```
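The label-generation step can be sketched without any model dependency. The example assumes per-token character offsets like those a fast tokenizer returns with `return_offsets_mapping=True`, marks a token as `1` if it overlaps any credential span, and gives zero-width special tokens the label `-100` so that a PyTorch cross-entropy loss would ignore them. The `-100` convention and the name `make_labels` are assumptions for illustration; this card does not say how special tokens were labelled.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumption: common convention for positions excluded from the loss

def make_labels(offsets: List[Tuple[int, int]],
                credential_spans: List[Tuple[int, int]]) -> List[int]:
    """Label each token 1 if it overlaps a credential span, else 0."""
    labels = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:             # special or padding token
            labels.append(IGNORE_INDEX)
            continue
        overlaps = any(tok_start < c_end and c_start < tok_end
                       for c_start, c_end in credential_spans)
        labels.append(1 if overlaps else 0)
    return labels

text = "password: hunter2!"
credential_spans = [(10, 18)]                # character span of "hunter2!"
# Hand-written toy offsets standing in for real tokenizer output.
offsets = [(0, 0), (0, 8), (8, 9), (10, 14), (14, 16), (16, 18), (0, 0)]
labels = make_labels(offsets, credential_spans)
print(labels)  # [-100, 0, 0, 1, 1, 1, -100]
```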
+## Performance Metrics
+### Chunk-Level Metrics
+| Metric | Score |
+|--------|-------|
+| **Strict Accuracy** | 86.67% |
+| **Overlap Accuracy** | 97.72% |
+### Password-Level Metrics
+| Metric | Count/Rate |
+|--------|------------|
+| True Positives | 1,201 |
+| True Negatives | 1,112 |
+| False Positives | 49 (3.9%) |
+| False Negatives | 138 |
+| Overlap True Positives | 456 |
+| **Recall** | 89.7% |
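
The recall figure can be reproduced from the counts in the table, and the 3.9% next to the false positives is consistent with 49 / (1,201 + 49), i.e. the share of predicted positives that were wrong. The precision value below is derived from the same counts and is not reported in the table.

```python
# Counts from the password-level table above.
tp, fp, fn = 1201, 49, 138

recall = tp / (tp + fn)                  # detected / all true passwords
precision = tp / (tp + fp)               # correct / all predicted passwords
false_discovery_rate = fp / (tp + fp)    # wrong / all predicted passwords

print(f"recall    = {recall:.1%}")                  # 89.7%
print(f"precision = {precision:.1%}")               # 96.1%
print(f"FDR       = {false_discovery_rate:.1%}")    # 3.9%
```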
+### Definitions
+- **Strict Accuracy:** Share of chunks in which every password is detected with 100% accuracy
+- **Overlap Accuracy:** Share of chunks in which at least one password is detected with more than 30% overlap with the ground truth
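One way to make the overlap criterion concrete is shown below. It assumes overlap is measured as the fraction of the ground-truth password's characters covered by the predicted span; this card does not specify the exact formula, so treat `overlap_fraction` as an illustrative definition.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

def overlap_fraction(pred: Span, truth: Span) -> float:
    """Fraction of the ground-truth span covered by the predicted span."""
    intersection = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    return intersection / (truth[1] - truth[0])

truth = (10, 18)                              # e.g. an 8-character password
partial = overlap_fraction((12, 18), truth)   # prediction misses the first 2 characters
exact = overlap_fraction((10, 18), truth)     # prediction matches the password exactly

print(partial, partial > 0.30)  # 0.75 True
print(exact)                    # 1.0
```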
+## Limitations and Biases
+### Known Limitations
+1. **Context window:** Limited to 512 tokens per chunk
+2. **Training data:** Primarily trained on LLM-generated documents, which may not fully represent real-world documents
+3. **Password types:** Better at detecting structured/complex passwords than simple dictionary words
+4. **Tokenization boundaries:** SentencePiece tokenization can fragment passwords, affecting boundary detection
+### Potential Biases
+- May over-detect in technical documentation due to training distribution
+- Tends to flag alphanumeric strings more readily than common words used as passwords
+## Ethical Considerations
+### Responsible Use
+- **Privacy:** This model should only be used on documents you have permission to scan
+- **Security:** Detected credentials should be handled securely and not logged or stored insecurely
+- **False Positives:** Always verify detected credentials before taking action
+### Misuse Potential
+- Should not be used to scan documents without authorization
+- Not intended for credential harvesting or malicious purposes
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Quick Start
+```python
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+import torch
+# Load model and tokenizer (replace the placeholder with your local path or Hub ID)
+model_name = "path/to/deeppass2-xlm-roberta"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+model.eval()  # disable dropout for inference
+def detect_passwords(text):
+    """Return the subword tokens the model labels as credentials (label 1)."""
+    # Inputs longer than 512 tokens are truncated; chunk long documents first
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # Highest-scoring class per token, as a plain list of ints
+    predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    # Keep tokens predicted as label 1 (credential/password)
+    return [token for token, label in zip(tokens, predictions) if label == 1]
+```
+### Integration with DeepPass2
+For production use, integrate with the full DeepPass2 pipeline:
+1. NoseyParker regex filtering
+2. BERT token classification (this model)
+3. LLM validation for false positive reduction
+See the [DeepPass2 repository](https://github.com/SpecterOps/DeepPass2) for complete implementation.
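The three stages can be wired together roughly as follows. This is a structural sketch only: the stage functions are hypothetical stand-ins (a toy regex instead of NoseyParker, a keyword heuristic instead of this model, a length check instead of an LLM), and the way candidates are combined is an assumption rather than the DeepPass2 implementation.

```python
import re
from typing import Callable, List

def regex_stage(text: str) -> List[str]:
    """Stand-in for NoseyParker: match one structured token pattern."""
    return re.findall(r"\bAKIA[0-9A-Z]{16}\b", text)

def classifier_stage(text: str) -> List[str]:
    """Stand-in for the token classifier: grab the word after 'password:'."""
    return re.findall(r"password:\s*(\S+)", text)

def validator_stage(candidate: str) -> bool:
    """Stand-in for LLM validation: reject very short candidates."""
    return len(candidate) >= 6

def scan(text: str,
         stages: List[Callable[[str], List[str]]],
         validate: Callable[[str], bool]) -> List[str]:
    """Collect candidates from each stage in order, then keep validated ones."""
    candidates: List[str] = []
    for stage in stages:
        for cand in stage(text):
            if cand not in candidates:       # de-duplicate, preserve order
                candidates.append(cand)
    return [c for c in candidates if validate(c)]

# Obviously fake sample values, for illustration only.
sample = "key AKIAABCDEFGHIJKLMNOP, password: hunter2! and password: abc"
found = scan(sample, [regex_stage, classifier_stage], validator_stage)
print(found)  # ['AKIAABCDEFGHIJKLMNOP', 'hunter2!']
```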
+## Citation
+```bibtex
+@software{gupta2025deeppass2,
+  author = {Gupta, Neeraj},
+  title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
+  year = {2025},
+  organization = {SpecterOps},
+  url = {https://huggingface.co/gneeraj/deeppass2-bert},
+  note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
+}
+```
+## Additional Information
+### Model Versions
+- **v6.0-BERT**: Current production version with LoRA adapters
+- **merged-model**: LoRA weights merged with base model for easier deployment
+### Related Links
+- [DeepPass2 Blog Post](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
+- [Original DeepPass (2022)](https://posts.specterops.io/deeppass-finding-passwords-with-deep-learning-4d31c534cd00)
+- [NoseyParker](https://github.com/praetorian-inc/noseyparker)
+### Contact
+For questions or issues, please open an issue on the [GitHub repository](https://github.com/SpecterOps/DeepPass2).