# Fine-tuned Llama Model with GRPO

This model is a fine-tuned version of `FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps`, further trained with Group Relative Policy Optimization (GRPO).
## Training Details
- Base Model: FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps
- Training Method: GRPO (Group Relative Policy Optimization)
- Training Steps: 1000
- Dataset: Protocol completion task dataset (FULL)
- Hardware: 8x GPU distributed training with DeepSpeed ZeRO-3
- Precision: FP16
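The exact DeepSpeed launch configuration is not published; a minimal ZeRO-3 + FP16 config consistent with the details above would look something like this (all keys are standard DeepSpeed options, but the specific values are assumptions):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": { "enabled": true },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` values let the Hugging Face `Trainer` integration fill in batch-size and accumulation settings from its own arguments.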
## Model Description
This model has been fine-tuned specifically for protocol completion tasks, using custom reward functions that evaluate:
- Semantic correctness of protocol steps
- Proper XML format adherence
- Step-by-step reasoning quality
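As an illustration of the format-adherence criterion, a GRPO reward function of this kind can be a simple binary check on the completion's structure. The tag names below (`<reasoning>`, `<answer>`) are assumptions for the sketch; the actual XML schema and reward weighting used in training are not published:

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its output in the expected XML
    tags, 0.0 otherwise. Tag names are illustrative assumptions, not the
    model's actual training schema."""
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0
```

In GRPO, such per-completion scores are compared within a group of samples for the same prompt, so even a coarse 0/1 signal produces usable relative advantages.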
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps"
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,   # required for temperature to take effect
        temperature=0.7,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Configuration
- LoRA rank: 32
- LoRA alpha: 64
- Learning rate: 1e-5
- Batch size: 8 (effective)
- Max sequence length: 1024
- Beta (GRPO): 0.05
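For reference, the hyperparameters above map naturally onto `peft`'s `LoraConfig` and `trl`'s `GRPOConfig`. This is a hypothetical reconstruction, not the actual training script, and the per-device batch size is inferred from "8 (effective)" on 8 GPUs:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings from the card
lora_config = LoraConfig(
    r=32,             # LoRA rank
    lora_alpha=64,    # scaling factor (alpha / r = 2.0)
    task_type="CAUSAL_LM",
)

# GRPO training settings from the card (per-device batch size is an assumption)
training_args = GRPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # x 8 GPUs -> effective batch size 8
    max_completion_length=1024,
    beta=0.05,                       # KL penalty coefficient
    max_steps=1000,
    fp16=True,
)
```

Details such as LoRA target modules, dropout, and the prompt/completion length split are not listed on the card, so they are left at library defaults here.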