---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- safetensors
- security
- defense
- multi-agent
- arxiv:1910.09700
---

# coliseum034/coliseum-defender-sft

This is a Supervised Fine-Tuned (SFT) model, trained with [Unsloth](https://github.com/unslothai/unsloth) for 2x faster training.

This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.

## ⚙️ Model Details

* **License:** Apache 2.0
* **Model Size:** ~1.5B parameters (36,929,536 trainable, 2.34% of the total)
* **Language:** English
* **Training Type:** Supervised Fine-Tuning (SFT)
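
As a quick sanity check, the trainable-parameter figures above can be cross-checked against the stated model size. This is a minimal sketch that assumes the 2.34% figure is the trainable count divided by the total parameter count; because the percentage is rounded, the result is only approximate.

```python
# Figures copied from the Model Details list.
trainable_params = 36_929_536
trainable_fraction = 0.0234  # 2.34%, rounded in the source

# Assumption: fraction = trainable / total, so total = trainable / fraction.
implied_total = trainable_params / trainable_fraction

print(f"Implied total parameters: {implied_total / 1e9:.2f}B")
```

The implied total lands between 1.5B and 1.6B, which is consistent with the "~1.5B" figure.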

## 🛡️ Post-SFT Evaluation Results

The model was evaluated on its ability to classify prompts as `SAFE` (ALLOW) or `UNSAFE` (BLOCK). On 150 held-out evaluation samples it achieved **90.00% accuracy**, with a precision of 1.0 on the `UNSAFE` class: every prompt it blocked was genuinely unsafe.

### Core Metrics
* **Accuracy:** 0.9000 (90.00%)
* **Precision (UNSAFE):** 1.0000
* **Recall (UNSAFE):** 0.7917
* **F1 Score (UNSAFE):** 0.8837
* **Average Confidence:** 0.879

### Classification Report

| Class | Precision | Recall | F1-Score | Support |
| :--- | :---: | :---: | :---: | :---: |
| **SAFE** | 0.8387 | 1.0000 | 0.9123 | 78 |
| **UNSAFE** | 1.0000 | 0.7917 | 0.8837 | 72 |
| *Macro Avg* | *0.9194* | *0.8958* | *0.8980* | *150* |
| *Weighted Avg* | *0.9161* | *0.9000* | *0.8986* | *150* |

### Confusion Matrix

| | Predicted: ALLOW | Predicted: BLOCK |
| :--- | :---: | :---: |
| **True: SAFE** | 78 | 0 |
| **True: UNSAFE** | 15 | 57 |

*Note: The model blocked no safe prompts in this evaluation set (0 false positives, hence an UNSAFE precision of 1.0). All 15 errors are false negatives, i.e. unsafe prompts that were allowed, which is why UNSAFE recall is 0.7917.*
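
The figures in the classification report can be re-derived from the confusion matrix alone. The sketch below uses only the four counts from the table and plain Python; the helper name `prf` is illustrative and not part of any evaluation script used for this model.

```python
# Counts copied from the confusion matrix; UNSAFE is treated as the positive class.
tn, fp = 78, 0    # true SAFE:   predicted ALLOW / predicted BLOCK
fn, tp = 15, 57   # true UNSAFE: predicted ALLOW / predicted BLOCK


def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


unsafe = prf(tp, fp, fn)
# With SAFE as the positive class, the roles of the cells swap.
safe = prf(tn, fn, fp)

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

support_safe, support_unsafe = tn + fp, fn + tp
macro = [(s + u) / 2 for s, u in zip(safe, unsafe)]
weighted = [
    (s * support_safe + u * support_unsafe) / total
    for s, u in zip(safe, unsafe)
]

for name, row in [("SAFE", safe), ("UNSAFE", unsafe),
                  ("macro", macro), ("weighted", weighted)]:
    print(name, [round(v, 4) for v in row])
print("accuracy", round(accuracy, 4))
```

Rounded to four decimals, the output matches the per-class, macro, and weighted rows of the report.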

## 📊 Training Procedure & Hyperparameters

The model was trained on 2,316 examples, with the loss computed on assistant responses only. Masking was verified before training to confirm that gradient updates applied only to assistant tokens, which also guards against NaN loss.
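
The idea behind response-only masking can be illustrated with a small, library-free sketch. It is not the Unsloth implementation: the token ids and role tags are made up, and the helper names are hypothetical. The one grounded detail is that `-100` is the default label value PyTorch's cross-entropy loss ignores. If every label in a batch were masked, the mean loss would be an average over zero tokens, which is one way a NaN loss can arise.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_non_assistant(token_ids, roles):
    """Keep the token id as the label only where the role is 'assistant'."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


def active_fraction(labels):
    """Share of positions that still contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels) / len(labels)


# Toy conversation with hypothetical token ids: 3 system, 5 user, 2 assistant tokens.
token_ids = [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
roles = ["system"] * 3 + ["user"] * 5 + ["assistant"] * 2

labels = mask_non_assistant(token_ids, roles)
print(labels)
print(f"active fraction: {active_fraction(labels):.1%}")

# A pre-training check in this spirit would fail fast if nothing were left to train on.
assert active_fraction(labels) > 0, "all tokens masked: loss would be undefined"
```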

* **Token Masking:** `train_on_responses_only` confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens).
* **Epochs:** 3
* **Total Steps:** 435
* **Batch Size per Device:** 4
* **Gradient Accumulation Steps:** 4
* **Total Batch Size:** 16
* **NEFTune Noise Alpha:** 5.0
* **Gradient Clipping:** 1.0
* **Total Training Runtime:** ~35.4 minutes
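
The batch size and step count above are mutually consistent, as the short calculation below shows. It assumes a single training device (which is what makes 4 x 4 equal the stated total of 16) and that a final partial batch counts as one optimizer step; neither assumption is stated in the training log.

```python
import math

# Values copied from the hyperparameter list.
num_examples = 2316
per_device_batch = 4
grad_accum_steps = 4
epochs = 3
runtime_minutes = 35.4

num_devices = 1  # assumption: single device

effective_batch = per_device_batch * grad_accum_steps * num_devices
# Assumption: the last, partial batch of each epoch still triggers an update.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = runtime_minutes * 60 / total_steps

print(effective_batch)   # 16
print(steps_per_epoch)   # 145
print(total_steps)       # 435
print(f"{seconds_per_step:.1f} s/step (approximate)")
```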

### Training Loss Progression

| Step | Training Loss | Validation Loss |
| :---: | :---: | :---: |
| **50** | 0.6295 | 0.5256 |
| **100** | 0.6155 | 0.5327 |
| **150** | 0.4268 | 0.5315 |
| **200** | 0.3806 | 0.5336 |
| **250** | 0.3786 | 0.5238 |
| **300** | 0.2329 | 0.5357 |
| **350** | 0.2043 | 0.5740 |
| **400** | 0.2016 | 0.5744 |

* **Final Training Loss:** `0.4178`
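
One way to read the loss table is to look for the logged step with the lowest validation loss. The sketch below does only that, using the values from the table; it is an illustration of reading the log, not a statement about which checkpoint was actually published.

```python
# (step, training loss, validation loss) rows copied from the table.
history = [
    (50, 0.6295, 0.5256),
    (100, 0.6155, 0.5327),
    (150, 0.4268, 0.5315),
    (200, 0.3806, 0.5336),
    (250, 0.3786, 0.5238),
    (300, 0.2329, 0.5357),
    (350, 0.2043, 0.5740),
    (400, 0.2016, 0.5744),
]

# Step with the lowest validation loss among the logged steps.
best_step, _, best_val = min(history, key=lambda row: row[2])
last_step, last_train, last_val = history[-1]

print(f"lowest validation loss: {best_val} at step {best_step}")
print(f"validation loss change from step {best_step} to {last_step}: "
      f"{last_val - best_val:+.4f}")
```

In this log the training loss keeps falling after step 250 while the validation loss rises, a pattern that is often associated with overfitting.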

## 💻 Framework Versions

* PEFT
* Transformers
* Unsloth
* Safetensors
* PyTorch

## 🚀 Usage

The model can be loaded with the standard `transformers` library or served with `text-generation-inference`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-sft"

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruction prompt; append the text to be screened after the colon.
prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate the model's verdict and decode it back to text.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```