---
license: mit
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
library_name: transformers
tags:
- deberta-v2
- secrets
- secret-detection
- secure-coding
- github-pipeline
- deberta
- cybersecurity
- code-analysis
---
# Secrets Sentinel
## The Problem
Secrets pushed to repositories like GitHub create a critical security vulnerability. Once exposed:
- **Difficult to Remove**: Requires coordination across multiple teams
- **Wide Impact**: Secret rotation becomes mandatory and expensive
- **Persistent Risk**: History can be exploited even after deletion
## The Solution
A **fast, accurate Small Language Model (SLM)** that detects secrets in code before they reach your repository. Designed to run in **pre-receive hooks and CI/CD pipelines** with a hard 5-second time limit.
## Why This Model?
| Approach | Speed | Accuracy | Cost | Generic Secrets |
|----------|-------|----------|------|-----------------|
| ✗ Regex Tools | Fast | Low | Free | ✗ Poor |
| ✗ Large LLMs | Slow (>30s) | High | Expensive | ✓ Great |
| ✓ **This SLM** | **Ultra-Fast (<500ms)** | **High** | **Cheap** | **✓ Excellent** |
**Key Advantage**: Detects *generic secrets* (not just patterns) using context-aware AI, unlike regex tools that rely on predefined patterns.
## Model Details
- **Architecture**: DeBERTa v3 Base (86M parameters)
- **Task**: Binary sequence classification
- **Detection Labels**:
- `LABEL_0`: Normal code
- `LABEL_1`: Secret detected
- **Inference Speed**: ~100-200ms per line (GPU), ~500ms (CPU)
- **Fine-tuned for**: Git diff lines and code snippets
## Training Configuration
- **Loss**: Weighted cross-entropy (handles class imbalance)
- **Optimization Metric**: F1-score
- **Training Tech**: BF16 precision, gradient checkpointing
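The weighted cross-entropy above is what lets the model learn from an imbalanced corpus (far more normal lines than secrets). As a minimal sketch of the mechanic, here is the loss computed by hand in pure Python; the logits, labels, and weight values are illustrative, and the averaging follows PyTorch's convention of dividing by the sum of the applied class weights:

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Per-class weighted CE: weighted sum of per-example negative
    log-likelihoods, divided by the sum of the applied weights."""
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        nll = log_z - row[y]  # -log softmax(row)[y]
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum

# Illustrative logits for two lines: one "normal" (label 0), one "secret" (label 1)
logits = [[2.0, 0.5], [0.2, 1.5]]
labels = [0, 1]
print(weighted_cross_entropy(logits, labels, [1.0, 1.0]))  # unweighted baseline
print(weighted_cross_entropy(logits, labels, [1.0, 5.0]))  # secret class upweighted
```

Raising the weight on the rare secret class makes each missed secret cost more during training, pushing the model toward recall on the minority class.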
## Quick Start
### Simple Pipeline Usage
```python
from transformers import pipeline

# Load the pipeline with the secrets detection model
classifier = pipeline(
    "text-classification",
    model="hypn05/secrets-sentinel"
)

# Define the input examples
inputs = [
    "password='supersecret123'",     # Expected: Secret
    "api_key = 'sk-1234567890abc'",  # Expected: Secret
    "print('Hello, world!')",        # Expected: Safe
    "def calculate_sum(a, b):",      # Expected: Safe
]

# Run the classifier on the inputs
results = classifier(inputs)

# Print each input with "Secret" if LABEL_1, else "Safe"
for input_text, result in zip(inputs, results):
    label = "Secret" if result["label"] == "LABEL_1" else "Safe"
    print(f"{label} | {input_text}")
```
## Advanced Usage
### Production-Ready Integration
Perfect for **pre-receive hooks** with strict time constraints:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List, Tuple

class SecretDetector:
    def __init__(self, device: str = None, compile: bool = True):
        """Initialize the secret detector with optional compilation."""
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading model on {self.device.upper()}...")

        model_name = "hypn05/secrets-sentinel"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            attn_implementation="eager"
        ).to(self.device).eval()

        # ⚡ Compile for max speed (first run slower, subsequent runs 2-3x faster)
        if compile and self.device == "cuda":
            try:
                self.model = torch.compile(self.model)
                print("✓ Model compiled for production speed")
            except Exception as e:
                print(f"Compilation skipped: {e}")

    def detect_secrets(self, texts: List[str], threshold: float = 0.7) -> List[Tuple[str, str, float]]:
        """Detect secrets in a batch of code snippets.

        A snippet is labeled "Secret" when the LABEL_1 probability
        reaches `threshold`; otherwise it is labeled "Safe".
        """
        inputs = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            scores = torch.softmax(outputs.logits, dim=-1)

        results = []
        for text, score in zip(texts, scores[:, 1]):
            confidence = score.item()
            label = "Secret" if confidence >= threshold else "Safe"
            results.append((text, label, confidence))
        return results

# Example: Pre-receive hook integration
def pre_receive_hook(git_diff: str, threshold: float = 0.5) -> bool:
    """
    Check git diff for secrets.
    Returns True if safe to push, False if secrets detected.
    """
    detector = SecretDetector(compile=True)  # Model load dominates the first call

    # Keep added/removed lines, dropping the +/- prefix and the +++/--- file headers
    diff_lines = [
        line[1:] for line in git_diff.split('\n')
        if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))
    ]
    results = detector.detect_secrets(diff_lines, threshold=threshold)

    secrets_found = [(text, conf) for text, label, conf in results if label == "Secret"]
    if secrets_found:
        print(f"\nPUSH REJECTED: {len(secrets_found)} secret(s) detected!")
        for text, conf in secrets_found:
            print(f"  [{conf:.1%}] {text[:80]}")
        return False

    print("✓ No secrets detected. Push allowed.")
    return True

if __name__ == "__main__":
    # Example usage
    detector = SecretDetector()
    test_cases = [
        "password='secret123'",
        "api_token = 'ghp_abcd1234efgh5678ijkl'",  # GitHub token
        "db_password = os.environ.get('DB_PASS')",
        "print('Hello, world!')",
        "def authenticate(username, password):",
        "AWS_SECRET_ACCESS_KEY = b'abc123xyz789'",
        "The weather is nice today",
    ]

    print("\n" + "=" * 70)
    print("SECRET DETECTION RESULTS")
    print("=" * 70)
    results = detector.detect_secrets(test_cases)
    for text, label, confidence in results:
        status = "⚠" if label == "Secret" else "✓"
        display_text = (text[:55] + "...") if len(text) > 55 else text
        print(f"{status} [{confidence:>6.1%}] {label:>8} | {display_text}")
    print("=" * 70)
```
## Use Cases & Integration Scenarios
### 1. **Git Pre-Receive Hook** (5-second limit)
Prevent secrets from ever reaching your repository:
```bash
#!/bin/bash
# .git/hooks/pre-receive
# NOTE: a real pre-receive hook receives "<old-sha> <new-sha> <ref>" lines on
# stdin rather than a diff; generate the diff with `git diff <old-sha> <new-sha>`.
python3 << 'EOF'
import sys
from secret_detector import pre_receive_hook

git_diff = sys.stdin.read()
if not pre_receive_hook(git_diff, threshold=0.5):
    sys.exit(1)  # Reject push
EOF
```
**Impact**: Stops secrets at the source, zero rotation overhead
### 2. **GitHub Actions / GitLab CI Pipeline**
Scan pull requests before merge:
```yaml
# .github/workflows/secret-check.yml
name: Secret Detection
on: [pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Scan for secrets
        run: |
          pip install transformers torch
          python3 scan_secrets.py
```
**Impact**: Catch secrets in code review, before they hit main branch
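The workflow above runs a `scan_secrets.py` that isn't shown in this card. One possible sketch follows; the script name, the `origin/main...HEAD` diff range, and the exit-code convention are assumptions to adapt to your pipeline, while the `pipeline(...)` call matches the Quick Start section:

```python
import subprocess

def changed_lines(diff_text: str) -> list:
    """Keep only added lines from a unified diff, stripping the '+' prefix
    and skipping '+++' file headers."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]

def scan(base: str = "origin/main") -> int:
    """Return a CI exit code: 1 if any added line looks like a secret."""
    from transformers import pipeline  # imported lazily; the helper above stays light

    diff = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True, check=True
    ).stdout
    lines = changed_lines(diff)
    if not lines:
        return 0

    classifier = pipeline("text-classification", model="hypn05/secrets-sentinel")
    flagged = [l for l, r in zip(lines, classifier(lines)) if r["label"] == "LABEL_1"]
    for l in flagged:
        print(f"potential secret: {l[:80]}")
    return 1 if flagged else 0
```

In the workflow step you would call `scan()` and pass its return value to `sys.exit` so the job fails when a secret is detected.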
### 3. **Splunk/SIEM Integration**
Monitor for secrets in historical codebase scans:
```python
from typing import List

def splunk_integration(code_snippets: List[str]):
    """Log detected secrets to Splunk for compliance."""
    detector = SecretDetector()
    results = detector.detect_secrets(code_snippets)

    for text, label, confidence in results:
        if label == "Secret":
            log_to_splunk({  # log_to_splunk: your Splunk HEC client wrapper
                "event": "secret_detected",
                "confidence": confidence,
                "snippet": text[:100],
                "severity": "HIGH" if confidence > 0.9 else "MEDIUM"
            })
```
### 4. **Confluence / Document Scanning**
Scan shared documents for accidentally pasted secrets:
```python
from typing import List, Tuple

def scan_confluence_pages(pages: List[Tuple[str, str]]) -> List[dict]:
    """Identify secrets in (page_id, content) pairs."""
    detector = SecretDetector()
    findings = []

    for page_id, content in pages:
        results = detector.detect_secrets([content])
        for text, label, confidence in results:
            if label == "Secret" and confidence > 0.7:
                findings.append({
                    "page_id": page_id,
                    "secret": text,
                    "confidence": confidence
                })
    return findings
```
### 5. **Local Development**
Developer-friendly warning before commits:
```bash
# install-hook.sh
pip install transformers torch
huggingface-cli login  # only needed if the model repo is gated

# Copy pre-commit hook
cp ./pre_commit_secret_check.py .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```
**Impact**: Developers get instant feedback on typos/mistakes
## Performance Benchmarks
| Scenario | Latency | Throughput | Notes |
|----------|---------|-----------|-------|
| Single line (CPU) | ~500ms | - | First run includes model load |
| Single line (CPU, cached) | ~50ms | - | Subsequent runs |
| Single line (GPU) | ~100ms | - | With torch.compile |
| Batch 32 lines (GPU) | ~150ms | 213 lines/sec | Optimal for CI/CD |
| Batch 128 lines (GPU) | ~400ms | 320 lines/sec | Maximum throughput |
**Model Architecture**: DeBERTa-v3-base with 86M parameters
**Model Size**: 750MB (safetensors format)
**GPU Memory**: ~800MB (inference only)
**RAM Usage**: ~800MB (CPU inference)
**Vocabulary**: 128K tokens
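Per the table above, batches of ~32 lines hit the best latency/throughput trade-off for CI/CD. A trivial helper for feeding `detect_secrets` in fixed-size chunks (the batch size of 32 is an assumption tied to the benchmark hardware; tune it for yours):

```python
def batch_lines(lines, batch_size=32):
    """Yield fixed-size batches of lines for the classifier.
    The final batch may be smaller than batch_size."""
    for i in range(0, len(lines), batch_size):
        yield lines[i:i + batch_size]

# Example: 70 diff lines become batches of 32, 32, and 6
for batch in batch_lines([f"line {i}" for i in range(70)]):
    print(len(batch))
```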
## What Secrets Does It Detect?
The model is trained on **generic secret patterns** and can identify:
✓ Hardcoded passwords
✓ API keys (AWS, GitHub, OpenAI, etc.)
✓ Database connection strings
✓ OAuth tokens
✓ Private keys
✓ Authentication credentials (any context)
✓ Generic "secret" assignments
**Unlike regex tools**, it understands *context* - words like "password123" in comments or documentation won't trigger false positives.
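To see why context matters, here is a deliberately naive regex rule of the kind pattern-based scanners use. It fires on a documentation comment just as readily as on a real assignment, which is exactly the false-positive class a context-aware model is meant to avoid (the rule is illustrative, not taken from any particular tool):

```python
import re

# A typical pattern-based rule: any quoted assignment to a password-like name
NAIVE_RULE = re.compile(
    r"""(password|secret|api_key)\s*=\s*['"][^'"]+['"]""", re.IGNORECASE
)

lines = [
    "password = 'hunter2'",                       # real secret
    "# Example: set password = 'your-password'",  # harmless doc comment
]
hits = [bool(NAIVE_RULE.search(line)) for line in lines]
print(hits)  # the regex flags both lines; context would tell them apart
```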
## Installation & Requirements
```bash
pip install transformers torch
# Optional: For GPU support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
## Deployment Recommendations
### Development (CPU)
```python
detector = SecretDetector(device="cpu", compile=False)
```
- **Setup Time**: ~5 seconds
- **Per-line Latency**: ~50-500ms
- **Use Case**: Local development, testing
### Production (GPU)
```python
detector = SecretDetector(device="cuda", compile=True)
```
- **Setup Time**: ~15 seconds (first run, compilation included)
- **Per-line Latency**: ~100-200ms
- **Use Case**: CI/CD pipelines, high-volume scanning
### CI/CD Best Practices
```yaml
# Optimize for speed in pipelines: cache model weights between runs
jobs:
  scan:
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: ["3.10"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/cache@v3
        with:
          path: ~/.cache/huggingface
          key: huggingface-models  # Reuse cached model weights
```
## FAQ
**Q: How accurate is this model?**
A: Trained on real secret examples, optimized for F1-score. Achieves high precision/recall for generic secrets while minimizing false positives on code comments and documentation.
**Q: Can it run in 5 seconds for a pre-receive hook?**
A: Yes! A single batch takes <100ms (GPU) or <500ms (CPU), and the model weights are cached locally after the first download. On the first push, allow ~2-3 seconds for the initial load.
**Q: Does it detect all secrets?**
A: It excels at generic secrets (any password/token/key assignment). Highly specific patterns (proprietary internal formats) may need custom fine-tuning.
**Q: Is my code sent to Hugging Face?**
A: No. Run it locally or on your servers. The model is just weights - inference is completely on-premise.
**Q: What about false positives?**
A: The model learns context, so `password_hint = "12345"` in documentation won't trigger false alarms like regex tools would.
**Q: Can I fine-tune it on my organization's patterns?**
A: Yes! The base model is open-source. If you have labeled examples of secrets you want to catch, you can fine-tune a copy.
## Limitations
- **Single-Line Processing**: Analyzes **one line at a time** (128 tokens max). This means:
- Multi-line private keys (PEM format, etc.) won't be fully caught—use regex tools for those
- Multi-line JWTs *will* be flagged
- For maximum coverage, combine with regex scanners for traditional key formats
- **Language Bias**: Trained mainly on Python/JavaScript - other languages may be less accurate
- **Redacted Patterns**: `ACCESS_TOKEN_REDACTED` won't be flagged (by design)
- **Comments**: Intentional documentation of secrets (for educational purposes) may be flagged
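As the limitations suggest, a small regex pass for multi-line key material makes a cheap complement to the single-line model. This sketch only checks PEM `BEGIN ... PRIVATE KEY` banners and is an illustration, not an exhaustive rule set:

```python
import re

# PEM block headers are a reliable regex target even when the key body
# spans many lines that the single-line model would see out of context.
PEM_HEADER = re.compile(r"-----BEGIN (?:[A-Z ]+ )?PRIVATE KEY-----")

def has_pem_key(blob: str) -> bool:
    """Flag any text containing a PEM private-key header."""
    return bool(PEM_HEADER.search(blob))

print(has_pem_key("-----BEGIN RSA PRIVATE KEY-----\nMIIEow..."))  # True
print(has_pem_key("print('hello')"))                              # False
```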
## Contributing
Have secret patterns your regex tools miss? Want to improve accuracy for specific languages?
1. Fork the [model repo](https://huggingface.co/hypn05/secrets-sentinel)
2. Fine-tune with your labeled examples
3. Share improvements back with the community
## License
MIT License - Free to use, modify, and deploy in commercial systems. Please include attribution when using this model in your projects.
```
MIT License
Copyright (c) 2026 Hypn05
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
```
## Citation
If you use this model in your security workflow, please cite:
```bibtex
@misc{secrets_sentinel_2026,
title={Secrets Sentinel},
author={Hypn05},
year={2026},
url={https://huggingface.co/hypn05/secrets-sentinel}
}
```
---
**Happy secure coding!**