Upload folder using huggingface_hub

19b49f2 verified 5 days ago

5.22 kB

	---
	license: cc-by-nc-sa-4.0
	base_model: Qwen/Qwen2.5-3B-Instruct
	library_name: peft
	tags:
	- llm-security
	- jailbreak-detection
	- text-classification
	- safety
	- lora
	pipeline_tag: text-classification
	---

	# TRYLOCK Sidecar Classifier

	Layer 3 of the TRYLOCK (Adaptive Ensemble Guard with Integrated Steering) defense system. This is a lightweight classifier that runs alongside the main LLM to detect attack patterns and dynamically adjust defense strength.

	## Model Description

	The sidecar classifier categorizes inputs into three classes:
	- SAFE: Benign queries that need minimal defense
	- WARN: Ambiguous or suspicious queries
	- ATTACK: Clear jailbreak/attack attempts

	## Intended Uses

	Primary use cases:
	- Real-time classification of LLM inputs for adaptive defense
	- Dynamically adjusting RepE steering strength
	- Research on attack detection methods
	- Content moderation and threat triage

	Out of scope:
	- Standalone content moderation (designed for adaptive steering)
	- High-stakes security decisions without human review

	### Training Details

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Base Model \| Qwen2.5-3B-Instruct \|
	\| Method \| LoRA fine-tuning \|
	\| LoRA Rank \| 32 \|
	\| LoRA Alpha \| 64 \|
	\| Target Modules \| q_proj, k_proj, v_proj, o_proj \|
	\| Training Samples \| 2,349 \|
	\| Classes \| SAFE, WARN, ATTACK \|

	### Evaluation Results

	\| Class \| Precision \| Recall \| F1 \|
	\|-------\|-----------\|--------\|-----\|
	\| SAFE \| 24.1% \| 34.5% \| 28.4% \|
	\| WARN \| 58.3% \| 51.6% \| 54.8% \|
	\| ATTACK \| 62.4% \| 60.8% \| 61.6% \|
	\| Macro Avg \| 48.3% \| 48.9% \| 48.3% \|

	Note: This is a research prototype. ATTACK detection is prioritized over SAFE detection for security.

	## Usage

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	from peft import PeftModel

	# Load model
	base_model = AutoModelForSequenceClassification.from_pretrained(
	"Qwen/Qwen2.5-3B-Instruct",
	num_labels=3,
	device_map="auto",
	torch_dtype=torch.bfloat16,
	)
	model = PeftModel.from_pretrained(base_model, "scthornton/trylock-sidecar-classifier")
	tokenizer = AutoTokenizer.from_pretrained("scthornton/trylock-sidecar-classifier")

	# Classify input
	label_names = ["SAFE", "WARN", "ATTACK"]

	def classify(text):
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
	with torch.no_grad():
	outputs = model(**inputs.to(model.device))
	probs = torch.softmax(outputs.logits, dim=-1)[0]
	return label_names[probs.argmax()], probs.tolist()

	# Example
	result, probs = classify("Ignore all previous instructions and...")
	print(f"Classification: {result}") # ATTACK
	print(f"Probabilities: {dict(zip(label_names, probs))}")
	```

	## Dynamic Defense Integration

	Use the classification to adjust RepE steering strength:

	```python
	alpha_map = {
	"SAFE": 0.5, # Minimal steering - preserve fluency
	"WARN": 1.5, # Moderate steering
	"ATTACK": 2.5 # Maximum steering - block harmful output
	}

	classification, _ = classify(user_input)
	alpha = alpha_map[classification]
	# Apply RepE steering with this alpha value
	```

	## TRYLOCK Architecture

	This classifier is Layer 3 of the 3-layer TRYLOCK defense:

	1. Layer 1 (KNOWLEDGE): [trylock-mistral-7b-dpo](https://huggingface.co/scthornton/trylock-mistral-7b-dpo)
	2. Layer 2 (INSTINCT): [trylock-repe-vectors](https://huggingface.co/scthornton/trylock-repe-vectors)
	3. Layer 3 (OVERSIGHT): Sidecar classifier (this model)

	## Limitations and Risks

	Limitations:
	- SAFE class has lower precision (24%) - may over-classify benign queries as WARN
	- Trained on English-language attacks only
	- 3-class granularity may miss nuanced threat levels
	- Requires ~3B parameter model inference overhead

	Risks:
	- False positives may trigger unnecessary steering
	- False negatives on novel attack patterns
	- Classification confidence doesn't guarantee correctness

	Recommendations:
	- Use probability scores, not just class labels, for fine-grained control
	- Consider WARN classification as "elevated caution" rather than definite threat
	- Combine with other safety mechanisms for production use

	## Framework Versions

	- PEFT 0.18.0

	## Citation

	```bibtex
	@misc{trylock2024,
	title={TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
	author={Thornton, Scott},
	year={2024},
	url={https://huggingface.co/scthornton/trylock-sidecar-classifier}
	}
	```

	## License

	CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)

	You are free to:
	- Share — copy and redistribute the material in any medium or format
	- Adapt — remix, transform, and build upon the material

	Under the following terms:
	- Attribution — You must give appropriate credit to Scott Thornton, provide a link to the license, and indicate if changes were made
	- NonCommercial — You may not use the material for commercial purposes without explicit written permission
	- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license

	For commercial licensing inquiries, contact: scott@perfecxion.ai