# Jailbreak Detection Model

## Model Description

This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.

**Base Model:** microsoft/deberta-v3-small
**Training Dataset:** jackhhao/jailbreak-classification
**Training Date:** 2025-10-16

## Performance Metrics

- **Accuracy:** 0.9962
- **Precision:** 1.0000
- **Recall:** 0.9928
- **F1 Score:** 0.9964

## Training Details

- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: Enabled
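The card states that class weighting was enabled but not how the weights were computed. A common scheme, shown here purely as an assumption, is inverse-frequency ("balanced") weighting, where rarer classes receive proportionally larger loss weights:

```python
from collections import Counter

def compute_class_weights(labels):
    """Inverse-frequency ("balanced") weights: total / (num_classes * count).
    The minority class gets a proportionally larger weight in the loss."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {c: total / (num_classes * counts[c]) for c in sorted(counts)}

# Illustrative class balance (not the actual dataset statistics):
# 700 benign (0) examples vs. 300 jailbreak (1) examples
weights = compute_class_weights([0] * 700 + [1] * 300)
# weights[1] > weights[0]: the rarer jailbreak class is up-weighted
```

These weights would typically be passed to the loss function (e.g. a weighted cross-entropy) during fine-tuning.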
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```

## Labels

- **BENIGN (0):** Safe, normal prompts
- **JAILBREAK (1):** Potential jailbreak attempts

## Label Mapping

- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
## Limitations

- The model may not detect novel jailbreak techniques
- Performance depends on how similar the input is to the training data
- It should be used as one layer in a layered security approach, not as the sole defense
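The layered-approach point can be sketched as follows. The regex patterns, the `classify` callable interface, and the threshold are all illustrative assumptions, not part of this model:

```python
import re

# Illustrative heuristic patterns -- an assumption, not shipped with the model
SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"\bdeveloper mode\b",
]

def layered_check(prompt, classify, threshold=0.5):
    """Layer 1: cheap regex prefilter for obvious attempts.
    Layer 2: the classifier, called as
    classify(prompt) -> [{"label": ..., "score": ...}]."""
    if any(re.search(p, prompt, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS):
        return "JAILBREAK"
    top = classify(prompt)[0]
    if top["label"] == "JAILBREAK" and top["score"] >= threshold:
        return "JAILBREAK"
    return "BENIGN"
```

Here the prefilter catches obvious patterns cheaply and the model handles everything else; neither layer alone is sufficient, which is the point of the limitation above.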
## Training Configuration

```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```