hurtmongoose
/

Jailbreak-Detection-Models

Model card Files Files and versions

Jailbreak-Detection-Models / README.md

hurtmongoose's picture

Upload README.md with huggingface_hub

19767d2 verified 6 months ago

|

history blame contribute delete

812 Bytes

	# Jailbreak Detection Model 🚀

	This model is fine-tuned to detect jailbreak prompts / unsafe instructions.

	## 📊 Training Metrics
	- Training steps: 0
	- Final Training Loss: N/A
	- Final Eval Loss: 0.07551019638776779

	## 📈 Training Curve
	![Training Curve](./training_loss.png)

	## 🛠 How to Use
	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	model = AutoModelForSequenceClassification.from_pretrained("hurtmongoose/Jailbreak-Detection-Models")
	tokenizer = AutoTokenizer.from_pretrained("hurtmongoose/Jailbreak-Detection-Models")

	inputs = tokenizer("This is a test jailbreak prompt", return_tensors="pt")
	outputs = model(**inputs)
	print(outputs.logits)
	📌 Notes

	Trained on jailbreak detection dataset

	Can be improved with more adversarial prompts