---
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
tags:
- sevzero
- openenv
- grpo
- lora
- sre
---

# SevZero GRPO-primary adapter

LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.

## Training recipe

- Initialization: `PhaseOfCode/sevzero-llama3-8b-sft-primary`
- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct`
- RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface
- Steps: 120
- Learning rate: `7e-6`
- Group size: 4 generations
- Temperature: 0.85
- Beta: 0.04
- Scheduler: cosine
- vLLM: colocate mode, GPU memory utilization 0.55

The training loop produced nonzero reward variance, gradients, and KL movement. The held-out eval did not show score lift.

## Eval summary

Held-out seeds: `13`, `99`, `777`. Tasks: Easy, Medium, Hard.

| Model | Easy | Medium | Hard | Mean |
|---|---:|---:|---:|---:|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |

The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, training harness, and reproducible failure surface.

## Links

- Final mirrored adapter: https://huggingface.co/Mist-ic/sevzero-llama3-8b-grpo
- Environment Space: https://huggingface.co/spaces/Mist-ic/sevzero-env
- Blog: https://huggingface.co/spaces/Mist-ic/sevzero-env/blob/main/BLOG.md
- Eval dataset: https://huggingface.co/datasets/Mist-ic/sevzero-eval-results
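
The training recipe notes that the loop produced nonzero reward variance. That matters because GRPO scores each completion relative to the other completions sampled for the same prompt, so a group with identical rewards contributes no learning signal. The sketch below illustrates this with a group size of 4. The reward values are hypothetical, the `eps` constant is an assumption, and the population standard deviation is used for simplicity (some implementations use the sample standard deviation instead). It is not the SevZero training code.

```python
from statistics import fmean, pstdev

GROUP_SIZE = 4  # "Group size: 4 generations" from the training recipe


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for one prompt's group of 4 completions.
mixed_group = [0.2, 0.8, 0.5, 0.5]
flat_group = [0.6, 0.6, 0.6, 0.6]

mixed_adv = group_relative_advantages(mixed_group)  # below-mean < 0, above-mean > 0
flat_adv = group_relative_advantages(flat_group)    # zero variance: every advantage is 0

print(mixed_adv)
print(flat_adv)
```

A group like `flat_group` yields all-zero advantages, which is why nonzero reward variance is a precondition for nonzero gradients.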
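
The Mean column in the eval summary can be re-derived from the per-task scores. The sketch below assumes Mean is the unweighted average of the Easy, Medium, and Hard scores, which is consistent with the reported values. The variable and function names are illustrative.

```python
from statistics import fmean

# Per-task scores copied from the eval summary table.
SCORES = {
    "Untrained Llama-3.1-8B-Instruct": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
    "GRPO-primary": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
}


def mean_score(per_task):
    """Unweighted mean over tasks, rounded to the table's 4 decimal places."""
    return round(fmean(per_task.values()), 4)


means = {model: mean_score(tasks) for model, tasks in SCORES.items()}

# Score lift of the adapter over the untrained baseline.
lift = round(means["GRPO-primary"] - means["Untrained Llama-3.1-8B-Instruct"], 4)

print(means)
print("lift:", lift)
```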
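
To try the adapter, one option is to load the base model and attach the adapter with PEFT. This is a minimal sketch, assuming the mirrored repository holds standard PEFT adapter weights (as the `library_name: peft` metadata suggests) and that `transformers` and `peft` are installed. The helper name `load_adapter` is hypothetical.

```python
BASE_MODEL_ID = "unsloth/Meta-Llama-3.1-8B-Instruct"  # base model from the model card metadata
ADAPTER_ID = "Mist-ic/sevzero-llama3-8b-grpo"         # final mirrored adapter from the links


def load_adapter(base_model_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base model, then attach the LoRA adapter on top of it."""
    # Imported lazily so defining this helper does not require the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_adapter()` downloads the full base model weights in addition to the adapter, so it needs network access and enough memory for an 8B-parameter model.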