coliseum034/coliseum-defender-grpo-live

This model has been fine-tuned using Group Relative Policy Optimization (GRPO) to align its decision-making capabilities. It was trained with Unsloth for accelerated training and reduced memory usage.

The model acts as a "defender" node: it evaluates, filters, and secures interactions within multi-agent vulnerability scanners and related security architectures.

βš™οΈ Model Details

  • License: Apache 2.0
  • Architecture: ~1.5B parameters (18,464,768 trainable, ~1.18% of the total, via PEFT; see the LoRA sketch after this list)
  • Language: English
  • Training Type: Group Relative Policy Optimization (GRPO)
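
The trainable-parameter figure is characteristic of LoRA adapters attached through PEFT. The snippet below is a minimal sketch of how such a setup typically looks with Unsloth; the base model name, LoRA rank, alpha, and target modules are assumptions for illustration and are not published on this card.

```python
from unsloth import FastLanguageModel

# Load a ~1.5B base model with Unsloth's optimized loader.
# The model name here is an assumption; the card does not name its base.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=32768,
    load_in_4bit=True,
)

# Attach LoRA adapters via PEFT. Only the adapter weights train, which is
# how a ~18.5M (~1.18%) trainable-parameter count arises on a ~1.5B model.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # assumed LoRA rank
    lora_alpha=16,  # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to save VRAM
)
```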

πŸ“Š Evaluation Metrics: Pre vs. Post GRPO

The model was evaluated before and after the GRPO training phase on a 50-sample held-out evaluation set. Because the base supervised model had already reached a performance ceiling on classification accuracy for this dataset, the GRPO phase focused on aligning and optimizing the generation reward.

| Metric      | Pre-GRPO | Post-GRPO | Delta   |
|-------------|----------|-----------|---------|
| Accuracy    | 1.0000   | 1.0000    | +0.0000 |
| Precision   | 1.0000   | 1.0000    | +0.0000 |
| Recall      | 1.0000   | 1.0000    | +0.0000 |
| F1 Score    | 1.0000   | 1.0000    | +0.0000 |
| Avg. Reward | 0.9367   | 0.9380    | +0.0013 |

Note: The model achieved a perfect F1 score (1.0000) prior to RL fine-tuning. The GRPO phase increased the average reward without degrading classification performance.
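
As an illustration, classification metrics of this kind are typically computed with a loop like the sketch below. The `parse_verdict` helper, the label convention, and the toy data are hypothetical; the card does not publish its evaluation harness.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def parse_verdict(completion: str) -> int:
    """Hypothetical helper: map a generated completion to a binary label
    (1 = flagged/unsafe, 0 = allowed/safe)."""
    return 1 if "unsafe" in completion.lower() else 0

# Toy stand-ins; the real evaluation used 50 held-out samples.
y_true = [1, 0, 1]
completions = ["Verdict: UNSAFE input", "Verdict: safe", "Verdict: UNSAFE payload"]

y_pred = [parse_verdict(c) for c in completions]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```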

πŸ“ˆ Training Procedure & Hyperparameters

The model was trained for one epoch over 200 steps using custom reward functions, with Unsloth's gradient checkpointing and offloading enabled to reduce VRAM usage. A configuration sketch follows the list below.

  • Training Examples: 2,482
  • Total Steps: 200
  • Batch Size per Device: 2
  • Gradient Accumulation Steps: 4
  • Generations per Prompt: 2
  • Total Batch Size: 8
  • Total Training Runtime: ~30.8 minutes (30:52)
  • Final Training Loss: -0.0000
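
For reference, the sketch below shows how these hyperparameters might map onto TRL's GRPOTrainer, which Unsloth patches for accelerated training. The reward function, dataset, and base-model argument are placeholders; the card's actual custom reward functions are not published.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: the card's custom reward functions are not published.
# This toy reward mildly favors concise completions.
def concise_reward(completions, **kwargs):
    return [1.0 - min(len(c), 200) / 200 for c in completions]

# Toy single-prompt dataset; the real run used 2,482 examples.
train_dataset = Dataset.from_dict(
    {"prompt": ["Evaluate the safety of the following system request: ..."]}
)

args = GRPOConfig(
    output_dir="coliseum-defender-grpo",
    per_device_train_batch_size=2,  # matches the card
    gradient_accumulation_steps=4,  # matches the card (total batch size 8)
    num_generations=2,              # generations per prompt
    max_steps=200,                  # total steps
    logging_steps=5,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed base model, not confirmed by the card
    reward_funcs=concise_reward,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```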

Generation Configuration

The generation_config was modified specifically to support the GRPO rollout phase (a reconstruction sketch follows the list):

  • Max Length: 32,768
  • Top K: 20
  • Top P: 0.8
  • Repetition Penalty: 1.1
  • BOS Token ID: 151643
  • EOS Token IDs: [151645, 151643]
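
Assuming the standard transformers GenerationConfig API, the listed settings can be reconstructed as follows. The BOS/EOS token IDs are consistent with Qwen2-family tokenizers, though the card does not name the base model.

```python
from transformers import GenerationConfig

# Recreates the rollout-phase generation settings listed above.
gen_config = GenerationConfig(
    max_length=32768,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    do_sample=True,  # not listed on the card, but required for top_k/top_p to apply
)
gen_config.save_pretrained(".")  # writes generation_config.json
```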

Training Progression (Sampled Steps)

| Step | Training Loss | Reward        | Reward Std Dev | Completion Length |
|------|---------------|---------------|----------------|-------------------|
| 5    | -0.0000       | 0.1341        | 0.0177         | 34.85             |
| 50   | -0.0000       | 0.1675        | 0.0117         | 34.70             |
| 100  | 0.0000        | -0.0587       | 0.0176         | 32.75             |
| 150  | -0.0000       | -0.1545       | 0.0355         | 35.65             |
| 185  | -0.0000       | 0.4397 (peak) | 0.0057         | 32.25             |
| 200  | 0.0000        | 0.0461        | 0.0127         | 30.85             |
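
The reward trajectory is easier to read plotted. A quick sketch using the sampled values from the table above:

```python
import matplotlib.pyplot as plt

# Sampled steps and mean rewards from the table above.
steps = [5, 50, 100, 150, 185, 200]
rewards = [0.1341, 0.1675, -0.0587, -0.1545, 0.4397, 0.0461]

plt.plot(steps, rewards, marker="o")
plt.xlabel("Step")
plt.ylabel("Mean reward")
plt.title("GRPO reward over sampled steps")
plt.show()
```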

πŸ’» Framework Versions

  • PEFT 0.14.0
  • Transformers
  • Unsloth
  • TRL
  • Safetensors
  • PyTorch

πŸš€ Usage

The model loads with the standard transformers API. For best results, inference scripts should match the generation parameters used during training.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-grpo-live"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Evaluate the safety and structural integrity of the following system request:"
inputs = tokenizer(prompt, return_tensors="pt")

# Match the generation settings used during GRPO training.
# do_sample=True is required for top_k/top_p to take effect.
outputs = model.generate(
    **inputs,
    max_length=32768,
    do_sample=True,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```