# Fine-tuned Llama Model with GRPO

This model is a fine-tuned version of `FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps`, further trained with Group Relative Policy Optimization (GRPO).
## Training Details
- Base Model: FrontierInstruments/finetuning_llama_grpo_50_8gpu_1000steps
- Training Method: GRPO (Group Relative Policy Optimization)
- Training Steps: 1000
- Dataset: Protocol completion task dataset (FULL)
- Hardware: 8x GPU distributed training with DeepSpeed ZeRO-3
- Precision: FP16
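The exact DeepSpeed launch configuration is not published; a minimal ZeRO-3 + FP16 config consistent with the details above would look something like this (all keys are standard DeepSpeed options, but the specific values are assumptions):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": { "enabled": true },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` values let the Hugging Face `Trainer` integration fill in batch-size and accumulation settings from its own arguments.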
## Model Description
This model has been fine-tuned specifically for protocol completion tasks, using custom reward functions that evaluate:
- Semantic correctness of protocol steps
- Proper XML format adherence
- Step-by-step reasoning quality
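As an illustration of the format-adherence criterion, a GRPO reward function of this kind can be a simple binary check on the completion's structure. The tag names below (`<reasoning>`, `<answer>`) are assumptions for the sketch; the actual XML schema and reward weighting used in training are not published:

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its output in the expected XML
    tags, 0.0 otherwise. Tag names are illustrative assumptions, not the
    model's actual training schema."""
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0
```

In GRPO, such per-completion scores are compared within a group of samples for the same prompt, so even a coarse 0/1 signal produces usable relative advantages.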
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "FrontierInstruments/finetuning_llama_grpo_full_8gpu_1000steps"
)

# Generate text (move inputs to the model's device)
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,   # required for temperature to take effect
        temperature=0.7,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Configuration
- LoRA rank: 32
- LoRA alpha: 64
- Learning rate: 1e-5
- Batch size: 8 (effective)
- Max sequence length: 1024
- Beta (GRPO): 0.05
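For reference, the hyperparameters above map naturally onto `peft`'s `LoraConfig` and `trl`'s `GRPOConfig`. This is a hypothetical reconstruction, not the actual training script, and the per-device batch size is inferred from "8 (effective)" on 8 GPUs:

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA adapter settings from the card
lora_config = LoraConfig(
    r=32,             # LoRA rank
    lora_alpha=64,    # scaling factor (alpha / r = 2.0)
    task_type="CAUSAL_LM",
)

# GRPO training settings from the card (per-device batch size is an assumption)
training_args = GRPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # x 8 GPUs -> effective batch size 8
    max_completion_length=1024,
    beta=0.05,                       # KL penalty coefficient
    max_steps=1000,
    fp16=True,
)
```

Details such as LoRA target modules, dropout, and the prompt/completion length split are not listed on the card, so they are left at library defaults here.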