# Jailbreak Detection Model

## Model Description
This model is fine-tuned to detect jailbreak attempts in LLM prompts. It classifies prompts as either BENIGN or JAILBREAK.
- Base Model: microsoft/deberta-v3-small
- Training Dataset: jackhhao/jailbreak-classification
- Training Date: 2025-10-16
## Performance Metrics
- Accuracy: 0.9962
- Precision: 1.0000
- Recall: 0.9928
- F1 Score: 0.9964
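The reported metrics are internally consistent: F1 is the harmonic mean of precision and recall, which can be checked directly.

```python
precision, recall = 1.0000, 0.9928

# F1 = 2PR / (P + R); rounds to the reported 0.9964.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```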
## Training Details
- Learning Rate: 2e-05
- Batch Size: 16
- Epochs: 5
- Max Length: 512
- Class Weighting: True
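The card notes that class weighting was enabled but does not specify the scheme. A common choice is inverse-frequency weighting (as in scikit-learn's "balanced" mode); the sketch below illustrates that scheme only, and may not match the exact weights used in training.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).

    A minimal sketch of the common "balanced" scheme; the exact
    weighting used to train this model is not specified in the card.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# A skewed toy label set: the rare class gets the larger weight.
print(class_weights([0, 0, 0, 1]))
```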
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="traromal/AIccel_Jailbreak")
result = classifier("Your prompt here")
print(result)
```
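The pipeline returns a list of prediction dicts like `{"label": "JAILBREAK", "score": 0.97}`. One way to act on a prediction is to gate on a confidence threshold; the 0.9 cutoff below is illustrative, not a calibrated value.

```python
def is_jailbreak(prediction, threshold=0.9):
    """Flag a prompt only when the model is confident.

    `prediction` is a single dict from the pipeline output, e.g.
    {"label": "JAILBREAK", "score": 0.97}. The 0.9 threshold is an
    illustrative choice; tune it against your own traffic.
    """
    return prediction["label"] == "JAILBREAK" and prediction["score"] >= threshold
```

For a single input string, pass `result[0]` from the usage snippet above.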
## Labels
- BENIGN (0): Safe, normal prompts
- JAILBREAK (1): Potential jailbreak attempts
## Label Mapping
- Original dataset labels: "benign" -> 0, "jailbreak" -> 1
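In code, the mapping above amounts to a pair of dictionaries (note the dataset uses lowercase strings while the model reports uppercase labels):

```python
# id <-> label maps matching the card above. The source dataset used
# lowercase "benign"/"jailbreak"; the model emits uppercase labels.
label2id = {"BENIGN": 0, "JAILBREAK": 1}
id2label = {i: lbl for lbl, i in label2id.items()}
print(id2label)
```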
## Limitations
- Model may not detect novel jailbreak techniques
- Performance may degrade on prompts dissimilar to the training distribution
- Should be used as part of a layered security approach
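One way to realize a layered approach is to combine the classifier's verdict with an independent heuristic, so that a miss by either layer can still be caught by the other. The marker list below is purely illustrative; real deployments would add further layers (rate limiting, output filtering, human review).

```python
# Illustrative substring markers only; not an exhaustive or vetted list.
SUSPICIOUS_MARKERS = (
    "ignore previous instructions",
    "developer mode",
    "do anything now",
)

def screen_prompt(prompt, classifier_flagged):
    """Block a prompt if EITHER the model or a keyword heuristic fires.

    `classifier_flagged` is the boolean verdict from the model layer.
    Returns True when the prompt should be blocked or escalated.
    """
    lowered = prompt.lower()
    heuristic_hit = any(marker in lowered for marker in SUSPICIOUS_MARKERS)
    return classifier_flagged or heuristic_hit
```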
## Training Configuration

```json
{
  "learning_rate": 2e-05,
  "batch_size": 16,
  "num_epochs": 5,
  "max_length": 512,
  "weight_decay": 0.01,
  "warmup_ratio": 0.1
}
```
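A sketch of how this configuration maps onto Hugging Face `TrainingArguments`. The `output_dir` value and the tokenization helper are illustrative assumptions, and `max_length` applies at tokenization time rather than in `TrainingArguments`.

```python
from transformers import AutoTokenizer, TrainingArguments

# Hyperparameters taken from the configuration above; output_dir is a
# placeholder chosen for this example.
args = TrainingArguments(
    output_dir="jailbreak-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_ratio=0.1,
)

# max_length is enforced when encoding prompts, not in TrainingArguments.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
```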