Quickstart: using coliseum034/coliseum-defender-grpo-live with libraries, notebooks, and local apps.
How to use coliseum034/coliseum-defender-grpo-live with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Note: this path points to the author's local SFT-merged Qwen2.5-1.5B base
# checkpoint; replace it with the equivalent base model on your machine.
base_model = AutoModelForCausalLM.from_pretrained(
    "/Users/aditya/Documents/Aditya/Python/OpenEnv Hackathon/openenv-trust-safety-audit/models/base/Qwen2.5-1.5B-sft-merged"
)
model = PeftModel.from_pretrained(base_model, "coliseum034/coliseum-defender-grpo-live")
```
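For a standalone checkpoint, the adapter weights can be merged into the base model. A minimal sketch, assuming a LoRA-style adapter (the output path is hypothetical):

```python
# Merge the PEFT adapter into the base weights and save a standalone model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("coliseum-defender-grpo-merged")  # hypothetical path
```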
How to use coliseum034/coliseum-defender-grpo-live with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for coliseum034/coliseum-defender-grpo-live to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for coliseum034/coliseum-defender-grpo-live to start chatting
```
Using HuggingFace Spaces for Unsloth
```text
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for coliseum034/coliseum-defender-grpo-live to start chatting
```
Load model with FastModel
```bash
pip install unsloth
```

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="coliseum034/coliseum-defender-grpo-live",
    max_seq_length=2048,
)
```
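The model returned by FastModel behaves like a standard transformers model, so a quick smoke test might look like this (the prompt is illustrative):

```python
# Illustrative generation smoke test after loading with FastModel
inputs = tokenizer(
    "Evaluate the safety of the following system request:",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=20, top_p=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```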
coliseum034/coliseum-defender-grpo-live
This model has been fine-tuned using Group Relative Policy Optimization (GRPO) to align its decision-making capabilities. It was trained with Unsloth for faster training and lower memory usage.
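For context, GRPO scores each sampled completion relative to the other completions of the same prompt rather than against a learned value function. A minimal illustrative sketch of that group-relative advantage (not the actual training code):

```python
import torch

# Hypothetical rewards for the 2 generations sampled from one prompt
rewards = torch.tensor([0.91, 0.96])

# GRPO normalizes each reward within its group to get the advantage
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)  # completions above the group mean get a positive advantage
```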
The model acts as a "defender" node designed to evaluate, filter, and secure interactions within multi-agent vulnerability scanners and other complex security architectures.
⚙️ Model Details
- License: Apache 2.0
- Architecture: ~1.5B parameters (18,464,768 trainable, i.e. 1.18%, via PEFT)
- Base Model: Qwen2.5-1.5B (SFT-merged; see the PEFT snippet above)
- Language: English
- Training Type: Group Relative Policy Optimization (GRPO)
📊 Evaluation Metrics: Pre vs. Post GRPO
The model was evaluated before and after the GRPO training phase on a 50-sample held-out evaluation set. Because the base supervised model had already reached a performance ceiling on classification accuracy for this dataset, the GRPO phase focused on aligning and optimizing the generation reward.
| Metric | Pre-GRPO | Post-GRPO | Delta |
|---|---|---|---|
| Accuracy | 1.0000 | 1.0000 | + 0.0000 |
| Precision | 1.0000 | 1.0000 | + 0.0000 |
| Recall | 1.0000 | 1.0000 | + 0.0000 |
| F1 Score | 1.0000 | 1.0000 | + 0.0000 |
| Avg. Reward | 0.9367 | 0.9380 | + 0.0013 |
Note: The model achieved a perfect F1 score (1.0000) prior to RL fine-tuning. The GRPO phase increased the average reward without degrading classification accuracy.
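For reference, the classification metrics above can be computed with scikit-learn. A sketch assuming binary labels parsed from the model's generations (the parsing step is model-specific and omitted):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]  # illustrative gold labels from the held-out set
y_pred = [1, 0, 1, 1, 0]  # illustrative labels parsed from model outputs

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```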
📈 Training Procedure & Hyperparameters
The model was trained for 1 epoch over 200 steps using custom reward functions, with gradient offloading used to reduce VRAM usage. Key hyperparameters are listed below, followed by a configuration sketch.
- Training Examples: 2,482
- Total Steps: 200
- Batch Size per Device: 2
- Gradient Accumulation Steps: 4
- Generations per Prompt: 2
- Total Batch Size: 8 (2 per device × 4 gradient accumulation steps)
- Total Training Runtime: ~30.8 minutes (30:52)
- Final Training Loss: -0.0000
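These hyperparameters map naturally onto TRL's GRPOConfig. The exact training script is not published, so treat the following as an assumed reconstruction:

```python
from trl import GRPOConfig

# Assumed mapping of the listed hyperparameters onto TRL's GRPOConfig
config = GRPOConfig(
    output_dir="coliseum-defender-grpo",  # hypothetical output directory
    num_train_epochs=1,                   # 1 epoch
    max_steps=200,                        # Total Steps
    per_device_train_batch_size=2,        # Batch Size per Device
    gradient_accumulation_steps=4,        # Gradient Accumulation Steps
    num_generations=2,                    # Generations per Prompt
)
```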
Generation Configuration
The generation_config was specifically modified to support the GRPO rollout phase; the settings are listed below, with a code sketch after the list:
- Max Length: 32,768
- Top K: 20
- Top P: 0.8
- Repetition Penalty: 1.1
- BOS Token ID: 151643
- EOS Token IDs: [151645, 151643]
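These values can be reconstructed with transformers' GenerationConfig. A minimal sketch built from the list above (do_sample is an assumption, since top-k/top-p only take effect when sampling):

```python
from transformers import GenerationConfig

# Rollout generation settings, copied from the list above
gen_config = GenerationConfig(
    max_length=32768,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    do_sample=True,  # assumption: sampling enabled so top-k/top-p apply
)
```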
Training Progression (Sampled Steps)
| Step | Training Loss | Reward | Reward Std Dev | Completion Length |
|---|---|---|---|---|
| 5 | -0.0000 | 0.1341 | 0.0177 | 34.85 |
| 50 | -0.0000 | 0.1675 | 0.0117 | 34.70 |
| 100 | 0.0000 | -0.0587 | 0.0176 | 32.75 |
| 150 | -0.0000 | -0.1545 | 0.0355 | 35.65 |
| 185 | -0.0000 | 0.4397 (Peak) | 0.0057 | 32.25 |
| 200 | 0.0000 | 0.0461 | 0.0127 | 30.85 |
💻 Framework Versions
- PEFT 0.14.0
- Transformers
- Unsloth
- TRL
- Safetensors
- PyTorch
🚀 Usage
This model works with the standard transformers library. For best results, ensure your inference scripts use the custom generation configuration parameters from training (see Generation Configuration above).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-grpo-live"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# If this repository hosts only the PEFT adapter, load the base model and
# attach the adapter with PeftModel instead (see the PEFT snippet above).
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Evaluate the safety and structural integrity of the following system request:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=32768,
    do_sample=True,  # required for top-k / top-p sampling to take effect
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
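Alternatively, the generation settings stored with the model can be loaded from the Hub instead of being passed by hand (a sketch, reusing model_id and inputs from above):

```python
from transformers import GenerationConfig

# Reuse the repo's stored generation_config for inference
gen_config = GenerationConfig.from_pretrained(model_id)
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```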