---
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
tags:
- sevzero
- openenv
- grpo
- lora
- sre
---
# SevZero GRPO-primary adapter
LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
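A minimal sketch of loading the adapter with PEFT, assuming `peft` and `transformers` are installed and that there is enough memory for the 8B base model. The `load_adapter` helper name is hypothetical, and loading the tokenizer from the base repository is an assumption, since this card does not say whether the adapter repository ships its own tokenizer files.

```python
def load_adapter(
    base_id="unsloth/Meta-Llama-3.1-8B-Instruct",
    adapter_id="PhaseOfCode/sevzero-llama3-8b-grpo-primary",
):
    """Load the base model and attach the LoRA adapter (downloads weights)."""
    # Imports are local so that defining this helper does not require the libraries.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the base repository's tokenizer is the right one for the adapter.
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    # Wrap the base model with the LoRA weights from the adapter repository.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_adapter()` triggers the downloads, so it is left out of the snippet itself.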
## Training recipe
- Initialization: `PhaseOfCode/sevzero-llama3-8b-sft-primary`
- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct`
- RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface
- Steps: 120
- Learning rate: `7e-6`
- Group size: 4 generations
- Temperature: 0.85
- Beta: 0.04
- Scheduler: cosine
- vLLM: colocate mode, GPU memory utilization 0.55
The training loop produced nonzero reward variance, nonzero gradients, and measurable KL movement, but the held-out eval showed no score lift.
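The hyperparameters in this section could be expressed as keyword arguments for TRL's `GRPOConfig`, roughly as sketched below. This is not the actual training script: the mapping of each recipe entry to a field name assumes a recent TRL release that supports `vllm_mode`, and `build_grpo_config` and the output directory are hypothetical names used only for illustration.

```python
# Values copied from the recipe above. The key names are an assumption that
# they match fields of trl.GRPOConfig in the installed TRL version.
GRPO_KWARGS = {
    "max_steps": 120,                      # Steps
    "learning_rate": 7e-6,                 # Learning rate
    "num_generations": 4,                  # Group size
    "temperature": 0.85,                   # Sampling temperature
    "beta": 0.04,                          # KL coefficient
    "lr_scheduler_type": "cosine",         # Scheduler
    "use_vllm": True,                      # Generate rollouts with vLLM
    "vllm_mode": "colocate",               # vLLM shares the training GPU
    "vllm_gpu_memory_utilization": 0.55,   # Fraction of GPU memory for vLLM
}


def build_grpo_config(output_dir="outputs/grpo-primary"):  # hypothetical path
    """Build a GRPOConfig from the recipe values (requires `trl`)."""
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

Batch size, LoRA settings, and the reward function that calls the SevZero environment are not stated in this card, so they are left out rather than guessed.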
## Eval summary
Held-out seeds: `13`, `99`, `777`. Tasks: Easy, Medium, Hard.
| Model | Easy | Medium | Hard | Mean |
|---|---:|---:|---:|---:|
| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
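The Mean column and the lift over the baseline can be rechecked from the per-task scores with a few lines of Python. This sketch assumes the mean is an unweighted average over the three tasks; the variable and function names are illustrative only.

```python
# Per-task scores copied from the table above.
scores = {
    "Untrained Llama-3.1-8B-Instruct": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
    "GRPO-primary": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
}


def mean_score(task_scores):
    """Unweighted mean over tasks (assumed to be how the Mean column is computed)."""
    return sum(task_scores.values()) / len(task_scores)


# Round to four decimals to match the table's precision.
means = {name: round(mean_score(s), 4) for name, s in scores.items()}

# Lift of the adapter over the untrained baseline.
lift = means["GRPO-primary"] - means["Untrained Llama-3.1-8B-Instruct"]

print(means)
print(f"lift: {lift:+.4f}")
```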
The honest conclusion: the adapter matches the untrained baseline to four decimal places on every task, so 120 GRPO steps were not enough to change deterministic held-out outcomes. SevZero's contribution is the environment, the training harness, and a reproducible failure surface.
## Links
- Final mirrored adapter: https://huggingface.co/Mist-ic/sevzero-llama3-8b-grpo
- Environment Space: https://huggingface.co/spaces/Mist-ic/sevzero-env
- Blog: https://huggingface.co/spaces/Mist-ic/sevzero-env/blob/main/BLOG.md
- Eval dataset: https://huggingface.co/datasets/Mist-ic/sevzero-eval-results