---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- safetensors
- security
- defense
- multi-agent
- arxiv:1910.09700
---

# coliseum034/coliseum-defender-sft

This is a Supervised Fine-Tuned (SFT) model trained with [Unsloth](https://github.com/unslothai/unsloth) for 2x faster training. The model operates as a "defender" node, optimized for classifying, filtering, and defending against adversarial inputs within multi-agent security systems and vulnerability scanners.

## ⚙️ Model Details

* **License:** Apache 2.0
* **Architecture:** ~1.5B parameters (36,929,536 trainable, 2.34% of the total)
* **Language:** English
* **Training Type:** Supervised Fine-Tuning (SFT)

## 🛡️ Post-SFT Evaluation Results

The model was evaluated on its ability to classify prompts as `SAFE` (ALLOW) or `UNSAFE` (BLOCK). Across 150 held-out evaluation samples, it achieved **90.00% accuracy** with perfect precision for unsafe detection.

### Core Metrics

Precision, recall, and F1 are reported for the `UNSAFE` class.

* **Accuracy:** 0.9000 (90.00%)
* **Precision:** 1.0000
* **Recall:** 0.7917
* **F1 Score:** 0.8837
* **Average Confidence:** 0.879

### Classification Report

| Class | Precision | Recall | F1-Score | Support |
| :--- | :---: | :---: | :---: | :---: |
| **SAFE** | 0.8387 | 1.0000 | 0.9123 | 78 |
| **UNSAFE** | 1.0000 | 0.7917 | 0.8837 | 72 |
| *Macro Avg* | *0.9194* | *0.8958* | *0.8980* | *150* |
| *Weighted Avg* | *0.9161* | *0.9000* | *0.8986* | *150* |

### Confusion Matrix

| | Predicted: ALLOW | Predicted: BLOCK |
| :--- | :---: | :---: |
| **True: SAFE** | 78 | 0 |
| **True: UNSAFE** | 15 | 57 |

*Note: The model exhibits a 0% false positive rate for blocking safe content (Precision 1.0), meaning it never mistakenly blocked a safe prompt in this evaluation set. It did, however, allow 15 of the 72 unsafe prompts (Recall 0.7917).*

## 📊 Training Procedure & Hyperparameters

The model was trained on 2,316 examples with a strict focus on response generation.
Masking was verified before training to confirm that gradient updates apply only to assistant responses and to guard against NaN loss.

* **Token Masking:** `train_on_responses_only` confirmed (91.1% of tokens masked as system/user, 8.9% active as assistant)
* **Epochs:** 3
* **Total Steps:** 435
* **Batch Size per Device:** 4
* **Gradient Accumulation Steps:** 4
* **Total Batch Size:** 16
* **NEFTune Noise Alpha:** 5.0
* **Gradient Clipping:** 1.0
* **Total Training Runtime:** ~35.4 minutes

### Training Loss Progression

| Step | Training Loss | Validation Loss |
| :---: | :---: | :---: |
| **50** | 0.6295 | 0.5256 |
| **100** | 0.6155 | 0.5327 |
| **150** | 0.4268 | 0.5315 |
| **200** | 0.3806 | 0.5336 |
| **250** | 0.3786 | 0.5238 |
| **300** | 0.2329 | 0.5357 |
| **350** | 0.2043 | 0.5740 |
| **400** | 0.2016 | 0.5744 |

* **Final Training Loss:** `0.4178`

## 💻 Framework Versions

* PEFT
* Transformers
* Unsloth
* Safetensors
* PyTorch

## 🚀 Usage

This model can be loaded with the standard `transformers` library or served with `text-generation-inference`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-sft"

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example instruction prompt for the defender.
prompt = "Evaluate the following input for malicious intent or authorization bypass attempts:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 100 new tokens, then decode the full sequence (prompt plus completion).
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
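
As a cross-check on the evaluation section, the reported metrics can be re-derived from the confusion matrix alone. The sketch below is illustrative and not part of the original evaluation code. It treats `UNSAFE` (BLOCK) as the positive class, and the variable names are assumptions chosen for this example.

```python
# Confusion-matrix counts copied from the evaluation section of this card.
# UNSAFE (BLOCK) is treated as the positive class; names are illustrative.
tp = 57  # UNSAFE prompts correctly blocked
fn = 15  # UNSAFE prompts incorrectly allowed
fp = 0   # SAFE prompts incorrectly blocked
tn = 78  # SAFE prompts correctly allowed

total = tp + fn + fp + tn

# Metrics for the UNSAFE class.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Metrics for the SAFE class swap the roles of the two classes.
safe_precision = tn / (tn + fn)
safe_recall = tn / (tn + fp)
safe_f1 = 2 * safe_precision * safe_recall / (safe_precision + safe_recall)

print(f"accuracy={accuracy:.4f}")
print(f"UNSAFE precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"SAFE   precision={safe_precision:.4f} recall={safe_recall:.4f} f1={safe_f1:.4f}")
```

The printed values should agree with the Core Metrics and Classification Report tables. The 15 unsafe prompts that were allowed are what pull `UNSAFE` recall down to 0.7917 while precision stays at 1.0.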