
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
tags:
  - sevzero
  - openenv
  - grpo
  - lora
  - sre

SevZero GRPO-primary adapter

LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
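A minimal loading sketch follows. It assumes the adapter is published as PhaseOfCode/sevzero-llama3-8b-grpo-primary and attaches directly to the base model named in the metadata. Whether the SFT initialization must be merged first is not stated here, so treat that as an open assumption. The function name `load_adapter` is hypothetical, and `device_map="auto"` additionally requires the `accelerate` package.

```python
BASE_MODEL_ID = "unsloth/Meta-Llama-3.1-8B-Instruct"
ADAPTER_ID = "PhaseOfCode/sevzero-llama3-8b-grpo-primary"  # assumed repo id


def load_adapter(base_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model and attach the LoRA adapter with PEFT.

    Assumption: the adapter applies directly to the base model.
    """
    # Imported lazily so the constants above can be used without
    # installing the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="auto", device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    return tokenizer, model.eval()
```

Calling `tokenizer, model = load_adapter()` downloads the weights, so it needs network access and enough GPU or CPU memory for an 8B model.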

Training recipe

Initialization: PhaseOfCode/sevzero-llama3-8b-sft-primary
Base model: unsloth/Meta-Llama-3.1-8B-Instruct
RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface
Steps: 120
Learning rate: 7e-6
Group size: 4 generations
Temperature: 0.85
Beta: 0.04
Scheduler: cosine
vLLM: colocate mode, GPU memory utilization 0.55
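The recipe above could be expressed as a TRL configuration roughly as follows. This is a sketch, not the actual training script: the keyword names follow `GRPOConfig` in recent TRL releases and should be checked against the installed version, while `output_dir`, the `RECIPE` dict, and `build_config` are hypothetical. Batch size, reward function, and the environment client are not given in this card and are left at defaults or omitted.

```python
# Hyperparameters copied from the training recipe list.
RECIPE = {
    "max_steps": 120,
    "learning_rate": 7e-6,
    "num_generations": 4,            # group size
    "temperature": 0.85,
    "beta": 0.04,                    # KL penalty coefficient
    "lr_scheduler_type": "cosine",
    "use_vllm": True,
    "vllm_mode": "colocate",
    "vllm_gpu_memory_utilization": 0.55,
}


def build_config(output_dir="sevzero-grpo-primary"):
    """Build a GRPOConfig from RECIPE (output_dir is a placeholder)."""
    # Imported lazily; requires `trl` to be installed.
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **RECIPE)
```

Keeping the values in a plain dict makes it easy to log the exact recipe alongside each run.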

The training loop produced nonzero reward variance, nonzero gradients, and KL movement, so the policy was being updated. The held-out eval, however, showed no score lift.
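Reward variance matters because GRPO scores each completion relative to the other completions in its group: if all rewards in a group are equal, every advantage is zero and that group contributes no policy gradient. The sketch below illustrates the idea for a group of 4. The reward values are made up, `group_advantages` is a hypothetical helper, and the population standard deviation with a small `eps` is one common choice rather than necessarily what this run used.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards only, not taken from the SevZero run.
flat_group = [0.5, 0.5, 0.5, 0.5]      # no variance: no learning signal
varied_group = [0.2, 0.8, 0.5, 0.5]    # variance: signed advantages

flat_adv = group_advantages(flat_group)
varied_adv = group_advantages(varied_group)
```

Here `flat_adv` is all zeros, while `varied_adv` is negative for the lowest reward and positive for the highest.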

Eval summary

Held-out seeds: 13, 99, 777. Tasks: Easy, Medium, Hard.

Model	Easy	Medium	Hard	Mean
Untrained Llama-3.1-8B-Instruct	0.8199	0.9419	0.6369	0.7996
GRPO-primary	0.8199	0.9419	0.6369	0.7996
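The table can be checked with a few lines of arithmetic. The sketch below assumes the Mean column is the unweighted average of the three task scores; the helper names are hypothetical.

```python
TASKS = ("Easy", "Medium", "Hard")

# Scores copied from the eval table.
SCORES = {
    "Untrained Llama-3.1-8B-Instruct": (0.8199, 0.9419, 0.6369),
    "GRPO-primary": (0.8199, 0.9419, 0.6369),
}


def mean_score(scores):
    """Unweighted mean over tasks."""
    return sum(scores) / len(scores)


def per_task_lift(baseline, trained):
    """Score difference (trained minus baseline) for each task."""
    return {t: tr - b for t, b, tr in zip(TASKS, baseline, trained)}


means = {name: round(mean_score(s), 4) for name, s in SCORES.items()}
lift = per_task_lift(
    SCORES["Untrained Llama-3.1-8B-Instruct"], SCORES["GRPO-primary"]
)
```

Both entries of `means` round to 0.7996 and every value in `lift` is 0.0, matching the reported absence of score lift.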

The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. The GRPO-primary scores match the untrained baseline to four decimal places on every task. SevZero's contribution is the environment, the training harness, and a reproducible failure surface.
