LLM2025_Advanced_DPO_5

This repository provides a DPO-fine-tuned model based on rokugatsu/LLM2025_Advanced_5 using trl.DPOTrainer.

This model has undergone Direct Preference Optimization (DPO) to align with human preferences, using trajectories from agent-based tasks.

Training Objective

This model was fine-tuned using DPO to improve multi-turn agent task performance by learning preferences from the u-10bei/sft_alfworld_trajectory_dataset_v2 dataset. The DPO training process aims to increase the likelihood of generating 'chosen' responses and decrease the likelihood of 'rejected' responses for given prompts.
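As a reference, the standard sigmoid DPO loss (with the β = 0.1 used here) can be sketched in a few lines. The log-probabilities below are illustrative numbers, not values from this model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for a single preference pair.

    Inputs are summed log-probabilities of the chosen/rejected
    completions under the policy and the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers
    # 'chosen' over 'rejected', relative to the reference model.
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Illustrative numbers: the policy already prefers the chosen response.
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0)  # margin = 2.0, loss ≈ 0.598
```

Minimizing this loss pushes the policy's log-probability of 'chosen' up and of 'rejected' down relative to the reference model, which is exactly the behavior described above.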

Training Configuration (DPO)

  • Base SFT Model: rokugatsu/LLM2025_Advanced_5
  • DPO Dataset: u-10bei/sft_alfworld_trajectory_dataset_v2
  • DPO Method: Direct Preference Optimization (DPO)
  • Max sequence length: 2048
  • Epochs: 0.25
  • Learning rate: 2e-06
  • Beta parameter (DPO loss): 0.1
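A minimal sketch of how a run with these settings could be launched via trl.DPOTrainer. This is not the original training script; it assumes the dataset exposes the prompt/chosen/rejected columns DPOTrainer expects, and the output path is hypothetical:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "rokugatsu/LLM2025_Advanced_5"  # base SFT model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("u-10bei/sft_alfworld_trajectory_dataset_v2", split="train")

args = DPOConfig(
    output_dir="llm2025-advanced-dpo-5",  # hypothetical output path
    beta=0.1,                 # DPO loss temperature from the table above
    learning_rate=2e-6,
    num_train_epochs=0.25,
    max_length=2048,
)
trainer = DPOTrainer(
    model=model,              # a frozen reference copy is created internally
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```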

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "rokugatsu/LLM2025_Advanced_DPO_5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use bfloat16 if your GPU supports it
    device_map="auto",
)
# The adapter weights are already merged into the model, so no peft wrapping is needed

# Example inference (assuming the tokenizer defines a chat template)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

Sources & Terms (IMPORTANT)

Training data: u-10bei/sft_alfworld_trajectory_dataset_v2

Dataset License: MIT License. The dataset is used and distributed under the terms of the MIT License. Users must comply with the MIT License (including retention of the copyright notice) and with the base model's original terms of use.

Model size: 4B parameters (Safetensors, BF16 tensors)