Fine-tuned Llama Model with GRPO

This model is a fine-tuned version of FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps using Group Relative Policy Optimization (GRPO).

Training Details

  • Base Model: FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Training Steps: 1000
  • Dataset: Protocol completion task dataset (FULL)
  • Hardware: 8x GPU distributed training with DeepSpeed ZeRO-3
  • Precision: FP16
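The published card does not include the DeepSpeed configuration itself. As a rough illustration of what a ZeRO-3 + FP16 setup consistent with the bullets above could look like (values here are assumptions, not the actual training config):

```python
# Illustrative DeepSpeed ZeRO-3 config (assumed; the actual file is not published).
ds_config = {
    "fp16": {"enabled": True},  # matches the FP16 precision noted above
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: shard optimizer state, gradients, and parameters
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": 1,  # 8 GPUs x 1 -> effective batch size 8
}
```

Such a dictionary is typically written to a JSON file and passed to the launcher (e.g. via accelerate or the trainer's `deepspeed` argument).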

Model Description

This model has been fine-tuned specifically for protocol completion tasks, using custom reward functions that evaluate:

  • Semantic correctness of protocol steps
  • Proper XML format adherence
  • Step-by-step reasoning quality
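The reward functions themselves are not released. As a minimal sketch of what the XML-format-adherence component might look like, in the shape TRL's `GRPOTrainer` expects for custom reward functions (a callable mapping completions to per-sample scores), and assuming hypothetical `<steps>`/`<step>` tags:

```python
import re

# Hypothetical XML schema: the actual tags used in training are not published.
TAG_PATTERN = re.compile(r"<steps>\s*(<step>.+?</step>\s*)+</steps>", re.DOTALL)

def xml_format_reward(completions, **kwargs):
    """Score 1.0 for completions wrapped in well-formed <steps>/<step> tags, else 0.0.

    Follows the signature TRL's GRPOTrainer accepts for custom reward functions:
    a list of completion strings in, a list of floats out.
    """
    return [1.0 if TAG_PATTERN.search(c) else 0.0 for c in completions]
```

The semantic-correctness and reasoning-quality rewards would be separate functions combined by the trainer; their implementations are not documented here.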

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps")

# Generate text
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
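Since the model is trained to emit XML-formatted protocol steps, the generated text can be post-processed into a step list. A small self-contained sketch, again assuming hypothetical `<step>` tags (the real output schema is not documented):

```python
import re

def extract_steps(text):
    """Pull individual protocol steps out of XML-formatted model output.

    Assumes hypothetical <step>...</step> tags; adjust the pattern to the
    schema the model actually emits.
    """
    return [m.strip() for m in re.findall(r"<step>(.*?)</step>", text, re.DOTALL)]

steps = extract_steps("<steps><step>Heat to 95 C.</step><step>Cool on ice.</step></steps>")
```

`re.DOTALL` lets a single step span multiple lines, which is common in generated protocols.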

Training Configuration

  • LoRA rank: 32
  • LoRA alpha: 64
  • Learning rate: 1e-5
  • Batch size: 8 (effective)
  • Max sequence length: 1024
  • Beta (GRPO): 0.05
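The hyperparameters above can be expressed as a TRL/PEFT configuration fragment. This is an illustrative reconstruction using current library names, not the original training script; parameter names may differ from the version used for training.

```python
from peft import LoraConfig
from trl import GRPOConfig

# Illustrative mapping of the listed hyperparameters (assumed, not the original script).
peft_config = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

training_args = GRPOConfig(
    learning_rate=1e-5,
    beta=0.05,                      # GRPO KL-penalty coefficient
    per_device_train_batch_size=1,  # 8 GPUs -> effective batch size 8
    max_steps=1000,
    fp16=True,
    # The 1024 max sequence length would be split between prompt and
    # completion limits; the exact split is not published.
)
```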
Model size: 8B parameters (Safetensors, F16)
