	---
	language:
	- en
	license: apache-2.0
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- safetensors
	- security
	- defense
	- multi-agent
	- arxiv:1910.09700
	---

	# coliseum034/coliseum-defender-sft

	This is a supervised fine-tuned (SFT) model, trained with [Unsloth](https://github.com/unslothai/unsloth) for its advertised 2x faster training.

	This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.

	## ⚙️ Model Details

	* License: Apache 2.0
	* Architecture: ~1.5B parameters, of which 36,929,536 (2.34%) were trainable
	* Language: English
	* Training Type: Supervised Fine-Tuning (SFT)

	## 🛡️ Post-SFT Evaluation Results

	The model was evaluated on its ability to classify prompts as `SAFE` (ALLOW) or `UNSAFE` (BLOCK). On 150 held-out evaluation samples, it reached 90.00% accuracy with a precision of 1.0 on the `UNSAFE` class.

	### Core Metrics
	* Accuracy: 0.9000 (90.00%)
	* Precision: 1.0000
	* Recall: 0.7917
	* F1 Score: 0.8837
	* Average Confidence: 0.879

	### Classification Report

	| Class | Precision | Recall | F1-Score | Support |
	| :--- | :---: | :---: | :---: | :---: |
	| SAFE | 0.8387 | 1.0000 | 0.9123 | 78 |
	| UNSAFE | 1.0000 | 0.7917 | 0.8837 | 72 |
	| Macro Avg | 0.9194 | 0.8958 | 0.8980 | 150 |
	| Weighted Avg | 0.9161 | 0.9000 | 0.8986 | 150 |

	### Confusion Matrix

	| | Predicted: ALLOW | Predicted: BLOCK |
	| :--- | :---: | :---: |
	| True: SAFE | 78 | 0 |
	| True: UNSAFE | 15 | 57 |

	Note: The model shows a 0% false positive rate (precision 1.0 on `UNSAFE`), meaning it never blocked a safe prompt in this evaluation set. The cost is recall: 15 of the 72 unsafe prompts (20.83%) were allowed through.
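The core metrics can be re-derived from the confusion matrix. The snippet below is a minimal sketch of that arithmetic, treating `UNSAFE` as the positive class; the variable names are illustrative and are not taken from the original evaluation script.

```python
# Confusion-matrix counts from the model card, with UNSAFE as the positive class.
tn, fp = 78, 0   # true SAFE prompts: allowed, blocked
fn, tp = 15, 57  # true UNSAFE prompts: allowed, blocked

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # share of blocked prompts that were truly unsafe
recall = tp / (tp + fn)               # share of unsafe prompts that were blocked
f1 = 2 * precision * recall / (precision + recall)
false_negative_rate = fn / (fn + tp)  # share of unsafe prompts that were allowed

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fnr={false_negative_rate:.4f}")
```

The results match the reported accuracy (0.9000), precision (1.0000), recall (0.7917), and F1 score (0.8837).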

	## 📊 Training Procedure & Hyperparameters

	The model was trained on 2,316 examples, with the loss computed on assistant responses only. Masking was verified before training to confirm that gradient updates applied only to assistant tokens and to guard against NaN loss.

	* Token Masking: `train_on_responses_only` confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens).
	* Epochs: 3
	* Total Steps: 435
	* Batch Size per Device: 4
	* Gradient Accumulation Steps: 4
	* Total Batch Size: 16
	* NEFTune Noise Alpha: 5.0
	* Gradient Clipping: 1.0
	* Total Training Runtime: ~35.4 minutes
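The step count follows from the dataset size and batch settings listed above. The following is a rough consistency check, not the trainer's own logic: it assumes a single training device and that the last partial batch of each epoch counts as one optimizer step. `num_devices` is an assumption that does not appear in the original card.

```python
import math

num_examples = 2316
per_device_batch = 4
grad_accum_steps = 4
num_devices = 1          # assumption: single-device training
epochs = 3
runtime_minutes = 35.4   # approximate runtime from the card

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices
# Round up so the final partial batch of each epoch counts as a step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = runtime_minutes * 60 / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(seconds_per_step, 2))
```

Under these assumptions the arithmetic gives an effective batch size of 16 and 435 total steps, which agrees with the values listed above.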

	### Training Loss Progression

	| Step | Training Loss | Validation Loss |
	| :---: | :---: | :---: |
	| 50 | 0.6295 | 0.5256 |
	| 100 | 0.6155 | 0.5327 |
	| 150 | 0.4268 | 0.5315 |
	| 200 | 0.3806 | 0.5336 |
	| 250 | 0.3786 | 0.5238 |
	| 300 | 0.2329 | 0.5357 |
	| 350 | 0.2043 | 0.5740 |
	| 400 | 0.2016 | 0.5744 |

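	* Final Training Loss: `0.4178`. This is higher than the last logged step loss (0.2016 at step 400), so it most likely reflects the mean loss over the whole run rather than the loss at the final step.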
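One way to read the loss table is to look for the step with the lowest validation loss and at how the gap between validation and training loss changes. The sketch below does this with the logged values; it only illustrates how such a log might be inspected and says nothing about which checkpoint was actually published.

```python
# (step, training loss, validation loss) rows copied from the loss table.
log = [
    (50, 0.6295, 0.5256),
    (100, 0.6155, 0.5327),
    (150, 0.4268, 0.5315),
    (200, 0.3806, 0.5336),
    (250, 0.3786, 0.5238),
    (300, 0.2329, 0.5357),
    (350, 0.2043, 0.5740),
    (400, 0.2016, 0.5744),
]

# Step with the lowest validation loss.
best_step, _, best_val = min(log, key=lambda row: row[2])

# Validation-minus-training gap per step; a widening gap suggests overfitting.
gaps = {step: val - train for step, train, val in log}

print(f"lowest validation loss {best_val:.4f} at step {best_step}")
for step, gap in gaps.items():
    print(f"step {step:>3}: gap {gap:+.4f}")
```

In this log, validation loss is lowest at step 250 and rises afterwards while training loss keeps falling, so an earlier checkpoint may generalize slightly better than the final one.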

	## 💻 Framework Versions

	* PEFT
	* Transformers
	* Unsloth
	* Safetensors
	* PyTorch

	## 🚀 Usage

	The model can be loaded with the standard `transformers` library or served with `text-generation-inference`.

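	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "coliseum034/coliseum-defender-sft"

	# Load the tokenizer and model weights from the Hugging Face Hub.
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)

	# Instruction prefix; append the text to be classified after the colon.
	prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
	inputs = tokenizer(prompt, return_tensors="pt")

	# Generate up to 100 new tokens for the model's verdict.
	outputs = model.generate(**inputs, max_new_tokens=100)

	# Decode the full sequence (prompt plus generated tokens).
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```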