dh_ministal_gpro

LoRA adapter for mistralai/Ministral-3-8B-Instruct-2512-BF16, fine-tuned with GRPO to play Duck Hunt.

Training

  • Method: Group Relative Policy Optimization (GRPO)
  • LoRA rank: 16, alpha: 32
  • Target modules: q_proj, k_proj, v_proj, o_proj
  • Learning rate: 5e-06
  • Generations per prompt: 4
  • Best eval hit rate: -100.0%
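The core idea of GRPO is that, instead of a learned value baseline, each sampled completion's reward is normalized against the other completions for the same prompt (here, groups of 4 generations). A minimal sketch of that group-relative advantage computation, with illustrative reward values not taken from this training run:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: center each reward on the group mean
    and scale by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal for this group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# 4 generations per prompt, as in this training config;
# rewards here are hypothetical hit/miss scores
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's mean get positive advantages and are reinforced; below-mean completions are penalized, all without a critic network.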

Usage

The adapter emits Mistral-native tool calls ([TOOL_CALLS]) and is compatible with the OpenAI SDK when served via vLLM or TGI.

from peft import AutoPeftModelForCausalLM
from transformers import AutoProcessor

# Loads the base model and applies the LoRA adapter in one step
model = AutoPeftModelForCausalLM.from_pretrained("dmayboroda/dh_ministal_gpro")
processor = AutoProcessor.from_pretrained("dmayboroda/dh_ministal_gpro")
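If you decode raw completions yourself rather than going through an OpenAI-compatible server, you need to pull the structured calls out of the [TOOL_CALLS] marker. A minimal sketch, assuming the completion carries a JSON array after the marker; the shoot(x, y) tool and coordinates below are illustrative assumptions, not the adapter's actual tool schema:

```python
import json

def parse_tool_calls(text):
    """Extract the JSON tool-call list that follows a Mistral-style
    [TOOL_CALLS] marker; returns [] if the completion has no tool calls."""
    marker = "[TOOL_CALLS]"
    if marker not in text:
        return []
    payload = text.split(marker, 1)[1]
    return json.loads(payload)

# Hypothetical completion text for illustration only
completion = '[TOOL_CALLS][{"name": "shoot", "arguments": {"x": 412, "y": 230}}]'
calls = parse_tool_calls(completion)
```

When serving via vLLM or TGI with tool-call parsing enabled, this extraction is done server-side and the calls arrive as structured fields on the OpenAI SDK response instead.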