---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- safetensors
- security
- defense
- multi-agent
- arxiv:1910.09700
---
# coliseum034/coliseum-defender-sft
This is a supervised fine-tuned (SFT) model trained with [Unsloth](https://github.com/unslothai/unsloth) for 2x faster training.
This model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.
## ⚙️ Model Details
* **License:** Apache 2.0
* **Architecture:** ~1.5B parameters, of which 36,929,536 (2.34%) were trainable
* **Language:** English
* **Training Type:** Supervised Fine-Tuning (SFT)
## 🛡️ Post-SFT Evaluation Results
The model was evaluated on its ability to classify prompts as `SAFE` (ALLOW) or `UNSAFE` (BLOCK). On 150 held-out evaluation samples it achieved **90.00% accuracy**, with perfect precision (1.0000) and a recall of 0.7917 on the `UNSAFE` class.
### Core Metrics
* **Accuracy:** 0.9000 (90.00%)
* **Precision (`UNSAFE`):** 1.0000
* **Recall (`UNSAFE`):** 0.7917
* **F1 Score (`UNSAFE`):** 0.8837
* **Average Confidence:** 0.879
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
| :--- | :---: | :---: | :---: | :---: |
| **SAFE** | 0.8387 | 1.0000 | 0.9123 | 78 |
| **UNSAFE** | 1.0000 | 0.7917 | 0.8837 | 72 |
| *Macro Avg* | *0.9194* | *0.8958* | *0.8980* | *150* |
| *Weighted Avg* | *0.9161* | *0.9000* | *0.8986* | *150* |
### Confusion Matrix
| | Predicted: ALLOW | Predicted: BLOCK |
| :--- | :---: | :---: |
| **True: SAFE** | 78 | 0 |
| **True: UNSAFE** | 15 | 57 |
*Note: With `UNSAFE` as the positive class, the model has a 0% false positive rate (precision 1.0): it never blocked a safe prompt in this evaluation set. All 15 errors are false negatives, that is, unsafe prompts that were allowed.*
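
The core metrics can be re-derived from the confusion matrix. The sketch below treats `UNSAFE` as the positive class and uses only the four counts from the table; the variable names are illustrative.

```python
# Counts copied from the confusion matrix, with UNSAFE as the positive class.
tn = 78  # SAFE prompts that were allowed
fp = 0   # SAFE prompts that were blocked
fn = 15  # UNSAFE prompts that were allowed
tp = 57  # UNSAFE prompts that were blocked

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} fpr={false_positive_rate:.4f}")
```

The printed values match the reported accuracy (0.9000), precision (1.0000), recall (0.7917), and F1 score (0.8837).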
## 📊 Training Procedure & Hyperparameters
The model was trained on 2,316 examples, with the loss computed on assistant responses only. Masking was verified before training to ensure that gradient updates applied only to assistant tokens and to prevent NaN loss.
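
As a rough illustration of response-only masking, the sketch below replaces every non-assistant label with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. It is not the Unsloth implementation; the helper `mask_non_assistant` and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant(input_ids, assistant_mask):
    """Return labels in which only assistant tokens keep their token ids."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, assistant_mask)]


# Hypothetical toy sequence: 8 system/user tokens followed by 2 assistant tokens.
input_ids = [101, 7, 8, 9, 10, 11, 12, 13, 501, 502]
assistant_mask = [False] * 8 + [True] * 2

labels = mask_non_assistant(input_ids, assistant_mask)
active = sum(label != IGNORE_INDEX for label in labels)
active_fraction = active / len(labels)

# If every label is ignored, the mean loss is computed over zero tokens and
# comes out as NaN, so it is worth checking that some labels stay active.
assert active > 0
print(labels, f"active={active_fraction:.1%}")
```

A check like this, run over the training set before training, is one way to confirm the ratio of masked to active tokens.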
* **Token Masking:** `train_on_responses_only` confirmed (91.1% masked system/user tokens, 8.9% active assistant tokens).
* **Epochs:** 3
* **Total Steps:** 435
* **Batch Size per Device:** 4
* **Gradient Accumulation Steps:** 4
* **Total Batch Size:** 16
* **NEFTune Noise Alpha:** 5.0
* **Gradient Clipping:** 1.0
* **Total Training Runtime:** ~35.4 minutes
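
The batch size and step count above are consistent with each other. The arithmetic below is a sketch that assumes a single training device and that a partial final batch still counts as one optimizer step; neither assumption is stated in the card.

```python
import math

num_examples = 2316
per_device_batch = 4
grad_accum_steps = 4
epochs = 3
num_devices = 1  # assumption: the card does not state the device count

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices
# Round up so that the last, partial batch of each epoch counts as a step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the result is 145 steps per epoch and 435 steps in total, matching the reported figure.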
### Training Loss Progression
| Step | Training Loss | Validation Loss |
| :---: | :---: | :---: |
| **50** | 0.6295 | 0.5256 |
| **100** | 0.6155 | 0.5327 |
| **150** | 0.4268 | 0.5315 |
| **200** | 0.3806 | 0.5336 |
| **250** | 0.3786 | 0.5238 |
| **300** | 0.2329 | 0.5357 |
| **350** | 0.2043 | 0.5740 |
| **400** | 0.2016 | 0.5744 |
* **Final Training Loss:** `0.4178`
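
One way to read the loss table is to find the logged step with the lowest validation loss and compare it with the end of training. The sketch below only re-uses the numbers from the table; treating the lowest validation loss as a checkpoint-selection criterion is an assumption, not something the card states.

```python
# (step, training loss, validation loss) rows copied from the table.
loss_log = [
    (50, 0.6295, 0.5256),
    (100, 0.6155, 0.5327),
    (150, 0.4268, 0.5315),
    (200, 0.3806, 0.5336),
    (250, 0.3786, 0.5238),
    (300, 0.2329, 0.5357),
    (350, 0.2043, 0.5740),
    (400, 0.2016, 0.5744),
]

# Logged step with the lowest validation loss.
best_step, _, best_val = min(loss_log, key=lambda row: row[2])

# Gap between validation and training loss at the last logged step.
last_step, last_train, last_val = loss_log[-1]
final_gap = round(last_val - last_train, 4)

print(f"best validation loss {best_val} at step {best_step}; "
      f"gap at step {last_step}: {final_gap}")
```

In these logs the validation loss is lowest at step 250 and rises afterwards while the training loss keeps falling, which may indicate some overfitting in the final epoch.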
## 💻 Framework Versions
* PEFT
* Transformers
* Unsloth
* Safetensors
* PyTorch
## 🚀 Usage
The model can be loaded with the standard `transformers` library or served with `text-generation-inference`.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
model_id = "coliseum034/coliseum-defender-sft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruction for the defender; append the input to evaluate after the colon.
prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 100 new tokens and decode the full sequence (prompt plus completion).
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```