coliseum034/coliseum-defender-grpo-live

This model has been fine-tuned using Group Relative Policy Optimization (GRPO) to align its decision-making capabilities. It was trained with Unsloth for accelerated training and reduced memory usage.

The model acts as a "defender" node: it evaluates, filters, and secures interactions within multi-agent vulnerability scanners and related security architectures.

βš™οΈ Model Details

  • License: Apache 2.0
  • Architecture: ~1.5B parameters (18,464,768 trainable, ~1.18% of the total, via PEFT; see the LoRA sketch after this list)
  • Language: English
  • Training Type: Group Relative Policy Optimization (GRPO)
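
The trainable-parameter figure is characteristic of LoRA adapters attached through PEFT. The snippet below is a minimal sketch of how such a setup typically looks with Unsloth; the base model name, LoRA rank, alpha, and target modules are assumptions for illustration and are not published on this card.

```python
from unsloth import FastLanguageModel

# Load a ~1.5B base model with Unsloth's optimized loader.
# The model name here is an assumption; the card does not name its base.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=32768,
    load_in_4bit=True,
)

# Attach LoRA adapters via PEFT. Only the adapter weights train, which is
# how a ~18.5M (~1.18%) trainable-parameter count arises on a ~1.5B model.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # assumed LoRA rank
    lora_alpha=16,  # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to save VRAM
)
```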

πŸ“Š Evaluation Metrics: Pre vs. Post GRPO

The model was evaluated before and after the GRPO training phase on a 50-sample held-out evaluation set. Because the base supervised model had already reached a performance ceiling on classification accuracy for this dataset, the GRPO phase focused on aligning and optimizing the generation reward.

| Metric      | Pre-GRPO | Post-GRPO | Delta   |
|-------------|----------|-----------|---------|
| Accuracy    | 1.0000   | 1.0000    | +0.0000 |
| Precision   | 1.0000   | 1.0000    | +0.0000 |
| Recall      | 1.0000   | 1.0000    | +0.0000 |
| F1 Score    | 1.0000   | 1.0000    | +0.0000 |
| Avg. Reward | 0.9367   | 0.9380    | +0.0013 |

Note: The model achieved a perfect F1 score (1.0000) prior to RL fine-tuning. The GRPO phase increased the average reward without degrading classification performance.
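
As an illustration, classification metrics of this kind are typically computed with a loop like the sketch below. The `parse_verdict` helper, the label convention, and the toy data are hypothetical; the card does not publish its evaluation harness.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def parse_verdict(completion: str) -> int:
    """Hypothetical helper: map a generated completion to a binary label
    (1 = flagged/unsafe, 0 = allowed/safe)."""
    return 1 if "unsafe" in completion.lower() else 0

# Toy stand-ins; the real evaluation used 50 held-out samples.
y_true = [1, 0, 1]
completions = ["Verdict: UNSAFE input", "Verdict: safe", "Verdict: UNSAFE payload"]

y_pred = [parse_verdict(c) for c in completions]
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```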

πŸ“ˆ Training Procedure & Hyperparameters

The model was trained for one epoch over 200 steps using custom reward functions, with Unsloth's gradient checkpointing and offloading enabled to reduce VRAM usage. A configuration sketch follows the list below.

  • Training Examples: 2,482
  • Total Steps: 200
  • Batch Size per Device: 2
  • Gradient Accumulation Steps: 4
  • Generations per Prompt: 2
  • Total Batch Size: 8
  • Total Training Runtime: ~30.8 minutes (30:52)
  • Final Training Loss: -0.0000
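
For reference, the sketch below shows how these hyperparameters might map onto TRL's GRPOTrainer, which Unsloth patches for accelerated training. The reward function, dataset, and base-model argument are placeholders; the card's actual custom reward functions are not published.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: the card's custom reward functions are not published.
# This toy reward mildly favors concise completions.
def concise_reward(completions, **kwargs):
    return [1.0 - min(len(c), 200) / 200 for c in completions]

# Toy single-prompt dataset; the real run used 2,482 examples.
train_dataset = Dataset.from_dict(
    {"prompt": ["Evaluate the safety of the following system request: ..."]}
)

args = GRPOConfig(
    output_dir="coliseum-defender-grpo",
    per_device_train_batch_size=2,  # matches the card
    gradient_accumulation_steps=4,  # matches the card (total batch size 8)
    num_generations=2,              # generations per prompt
    max_steps=200,                  # total steps
    logging_steps=5,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed base model, not confirmed by the card
    reward_funcs=concise_reward,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```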

Generation Configuration

The generation_config was modified specifically to support the GRPO rollout phase (a reconstruction sketch follows the list):

  • Max Length: 32,768
  • Top K: 20
  • Top P: 0.8
  • Repetition Penalty: 1.1
  • BOS Token ID: 151643
  • EOS Token IDs: [151645, 151643]
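
Assuming the standard transformers GenerationConfig API, the listed settings can be reconstructed as follows. The BOS/EOS token IDs are consistent with Qwen2-family tokenizers, though the card does not name the base model.

```python
from transformers import GenerationConfig

# Recreates the rollout-phase generation settings listed above.
gen_config = GenerationConfig(
    max_length=32768,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
    bos_token_id=151643,
    eos_token_id=[151645, 151643],
    do_sample=True,  # not listed on the card, but required for top_k/top_p to apply
)
gen_config.save_pretrained(".")  # writes generation_config.json
```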

Training Progression (Sampled Steps)

| Step | Training Loss | Reward        | Reward Std Dev | Completion Length |
|------|---------------|---------------|----------------|-------------------|
| 5    | -0.0000       | 0.1341        | 0.0177         | 34.85             |
| 50   | -0.0000       | 0.1675        | 0.0117         | 34.70             |
| 100  | 0.0000        | -0.0587       | 0.0176         | 32.75             |
| 150  | -0.0000       | -0.1545       | 0.0355         | 35.65             |
| 185  | -0.0000       | 0.4397 (peak) | 0.0057         | 32.25             |
| 200  | 0.0000        | 0.0461        | 0.0127         | 30.85             |
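
The reward trajectory is easier to read plotted. A quick sketch using the sampled values from the table above:

```python
import matplotlib.pyplot as plt

# Sampled steps and mean rewards from the table above.
steps = [5, 50, 100, 150, 185, 200]
rewards = [0.1341, 0.1675, -0.0587, -0.1545, 0.4397, 0.0461]

plt.plot(steps, rewards, marker="o")
plt.xlabel("Step")
plt.ylabel("Mean reward")
plt.title("GRPO reward over sampled steps")
plt.show()
```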

πŸ’» Framework Versions

  • PEFT 0.14.0
  • Transformers
  • Unsloth
  • TRL
  • Safetensors
  • PyTorch

πŸš€ Usage

The model loads with the standard transformers API. For best results, inference scripts should match the generation parameters used during training.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coliseum034/coliseum-defender-grpo-live"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Evaluate the safety and structural integrity of the following system request:"
inputs = tokenizer(prompt, return_tensors="pt")

# Match the generation settings used during GRPO training.
# do_sample=True is required for top_k/top_p to take effect.
outputs = model.generate(
    **inputs,
    max_length=32768,
    do_sample=True,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```