PhaseOfCode/sevzero-llama3-8b-grpo-primary


	---
	base_model: unsloth/Meta-Llama-3.1-8B-Instruct
	library_name: peft
	license: llama3.1
	tags:
	- sevzero
	- openenv
	- grpo
	- lora
	- sre
	---

	# SevZero GRPO-primary adapter

	LoRA adapter produced by the primary GRPO run for the SevZero OpenEnv India Hackathon 2026 submission.
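
Since the card metadata lists `peft` as the library and `unsloth/Meta-Llama-3.1-8B-Instruct` as the base model, loading the adapter might look like the sketch below. It assumes this repository contains standard PEFT adapter files and that `transformers`, `peft`, and `accelerate` are installed. The helper name `load_adapter` is hypothetical, not something shipped with SevZero.

```python
# Repository IDs taken from this card; everything else is illustrative.
BASE_MODEL_ID = "unsloth/Meta-Llama-3.1-8B-Instruct"
ADAPTER_ID = "PhaseOfCode/sevzero-llama3-8b-grpo-primary"


def load_adapter(device_map="auto"):
    """Load the base model and attach the LoRA adapter (hypothetical helper).

    Imports are kept inside the function so the module can be imported
    without pulling in the heavy dependencies. device_map="auto" assumes
    the accelerate package is available.
    """
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID, device_map=device_map)
    # Wrap the base model with the adapter weights from this repository.
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    return model, tokenizer
```

Calling `load_adapter()` downloads the full 8B base model, so it is best run on a machine with a suitable GPU.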

	## Training recipe

	- Initialization: `PhaseOfCode/sevzero-llama3-8b-sft-primary`
	- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct`
	- RL method: GRPO through TRL against the live SevZero FastAPI/OpenEnv surface
	- Steps: 120
	- Learning rate: `7e-6`
	- Group size: 4 generations
	- Temperature: 0.85
	- Beta: 0.04
	- Scheduler: cosine
	- vLLM: colocate mode, GPU memory utilization 0.55
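
The recipe above could be expressed as a TRL `GRPOConfig` roughly as follows. This is a sketch, not the actual training script: the keyword names assume a recent TRL release that supports `vllm_mode`, the `output_dir` value is a placeholder, and settings the card does not state (batch size, LoRA rank, sequence lengths) are left at library defaults.

```python
# Hyperparameters copied from the recipe list above.
# Key names are assumed to match TRL's GRPOConfig in a recent release.
GRPO_KWARGS = {
    "max_steps": 120,
    "learning_rate": 7e-6,
    "num_generations": 4,          # group size
    "temperature": 0.85,
    "beta": 0.04,                  # KL penalty coefficient
    "lr_scheduler_type": "cosine",
    "use_vllm": True,
    "vllm_mode": "colocate",
    "vllm_gpu_memory_utilization": 0.55,
}


def build_config(output_dir="sevzero-grpo-primary"):
    """Build a GRPOConfig from the recipe (output_dir is a placeholder)."""
    from trl import GRPOConfig  # imported lazily; requires trl to be installed

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

Keeping the values in a plain dictionary makes it easy to log the recipe alongside eval results without importing TRL.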

	During training, reward variance, gradients, and KL divergence were all nonzero, so the policy was receiving a learning signal. The held-out eval nonetheless showed no score lift.
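
To illustrate why nonzero reward variance matters for GRPO, the sketch below computes group-relative advantages for one group of 4 generations. It is a simplified illustration, not the TRL implementation: the population standard deviation and the `eps` value are assumptions, and the reward values are made up for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Identical rewards in a group give zero advantages, hence no policy gradient.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])

# Varied rewards give signed advantages that rank the generations.
mixed = group_advantages([0.2, 0.4, 0.6, 0.8])
```

A group whose generations all earn the same reward contributes nothing to the update, which is why reward variance is worth monitoring during a run.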

	## Eval summary

	Held-out seeds: `13`, `99`, `777`. Tasks: Easy, Medium, Hard.

	| Model | Easy | Medium | Hard | Mean |
	|---|---:|---:|---:|---:|
	| Untrained Llama-3.1-8B-Instruct | 0.8199 | 0.9419 | 0.6369 | 0.7996 |
	| GRPO-primary | 0.8199 | 0.9419 | 0.6369 | 0.7996 |

	The honest conclusion: 120 GRPO steps were not enough to change deterministic held-out outcomes. GRPO-primary matches the untrained baseline to four decimal places on every task. SevZero's contribution is the environment, the training harness, and a reproducible failure surface.
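
The Mean column and the zero lift can be checked directly from the per-task scores. The sketch assumes Mean is the unweighted average of the Easy, Medium, and Hard scores, which is consistent with the reported 0.7996.

```python
# Per-task scores copied from the eval table.
SCORES = {
    "Untrained Llama-3.1-8B-Instruct": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
    "GRPO-primary": {"Easy": 0.8199, "Medium": 0.9419, "Hard": 0.6369},
}


def task_mean(task_scores):
    """Unweighted mean across tasks, rounded to four decimals like the table."""
    return round(sum(task_scores.values()) / len(task_scores), 4)


means = {name: task_mean(scores) for name, scores in SCORES.items()}

# Score lift of the GRPO adapter over the untrained baseline.
lift = means["GRPO-primary"] - means["Untrained Llama-3.1-8B-Instruct"]
```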

	## Links

	- Final mirrored adapter: https://huggingface.co/Mist-ic/sevzero-llama3-8b-grpo
	- Environment Space: https://huggingface.co/spaces/Mist-ic/sevzero-env
	- Blog: https://huggingface.co/spaces/Mist-ic/sevzero-env/blob/main/BLOG.md
	- Eval dataset: https://huggingface.co/datasets/Mist-ic/sevzero-eval-results