metadata
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
tags:
  - sevzero
  - openenv
  - grpo
  - lora
  - sre

SevZero GRPO-primary adapter

LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.

Training recipe

  • Initialization: PhaseOfCode/sevzero-llama3-8b-sft-primary
  • Base model: unsloth/Meta-Llama-3.1-8B-Instruct
  • RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface
  • Steps: 120
  • Learning rate: 7e-6
  • Group size: 4 generations
  • Temperature: 0.85
  • Beta: 0.04
  • Scheduler: cosine
  • vLLM: colocate mode, GPU memory utilization 0.55
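The recipe above could be expressed as TRL configuration roughly as follows. This is a minimal sketch, not the project's actual training script: the keyword names assume a recent TRL release whose `GRPOConfig` accepts `vllm_mode` and `vllm_gpu_memory_utilization`, and `output_dir` is a hypothetical placeholder. Only the values are taken from the list above.

```python
# Hyperparameters copied from the training recipe; the key names are assumed
# to match GRPOConfig fields in a recent TRL release.
GRPO_KWARGS = {
    "max_steps": 120,
    "learning_rate": 7e-6,
    "num_generations": 4,          # group size
    "temperature": 0.85,
    "beta": 0.04,                  # KL penalty coefficient
    "lr_scheduler_type": "cosine",
    "use_vllm": True,
    "vllm_mode": "colocate",
    "vllm_gpu_memory_utilization": 0.55,
}


def build_grpo_config(output_dir="sevzero-grpo-primary"):
    """Build a GRPOConfig from the recipe (output_dir is a hypothetical name)."""
    # Imported lazily so the dictionary above can be inspected without TRL installed.
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

Settings not listed in the recipe, such as batch size or LoRA rank, are left at library defaults here because the card does not state them.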

The training loop produced nonzero reward variance, gradients, and KL movement. The held-out eval did not show score lift.
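Nonzero reward variance matters because GRPO normalizes each reward against the other completions in its group. The sketch below shows that standard group-relative normalization in plain Python. It uses the population standard deviation and a hypothetical `eps` value for illustration, so it may differ in detail from what TRL computes.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of 4 generations with identical rewards carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])

# A group with spread in its rewards yields nonzero advantages.
varied = group_advantages([0.2, 0.9, 0.4, 0.7])
```

With identical rewards every advantage is zero, so that group contributes nothing to the policy gradient. Spread within a group is a precondition for learning, but it does not guarantee better held-out scores.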

Eval summary

Held-out seeds: 13, 99, 777. Tasks: Easy, Medium, Hard.

| Model | Easy | Medium | Hard | Mean |
|---|---|---|---|---|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
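The Mean column can be checked with a few lines of Python. This sketch assumes Mean is the unweighted average of the three task scores, which is consistent with the reported numbers. The variable names are illustrative only.

```python
# Per-task scores copied from the eval table.
scores = {
    "Untrained Llama-3.1-8B-Instruct": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
    "GRPO-primary": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
}


def task_mean(row):
    """Unweighted mean over tasks, rounded to four decimals like the table."""
    return round(sum(row.values()) / len(row), 4)


means = {name: task_mean(row) for name, row in scores.items()}

# Score lift of the adapter over the untrained base model.
lift = means["GRPO-primary"] - means["Untrained Llama-3.1-8B-Instruct"]
```

Both rows average to 0.7996, so the computed lift is exactly zero.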

The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, training harness, and reproducible failure surface.

Links