---
base_model: rokugatsu/LLM2025_Advanced_5
datasets:
- u-10bei/sft_alfworld_trajectory_dataset_v2
language:
- en
license: apache-2.0
library_name: trl
pipeline_tag: text-generation
tags:
- dpo
- agent
- tool-use
- alfworld
---

# LLM2025_Advanced_DPO_5
This repository provides a DPO-fine-tuned model based on `rokugatsu/LLM2025_Advanced_5`, trained with `trl.DPOTrainer`.
The model has undergone Direct Preference Optimization (DPO) to align its behavior with preferences derived from agent-task trajectories.
## Training Objective
This model was fine-tuned with DPO to improve multi-turn agent task performance,
learning preferences from the `u-10bei/sft_alfworld_trajectory_dataset_v2` dataset.
For each prompt, DPO training increases the likelihood of generating the 'chosen' response
and decreases the likelihood of generating the 'rejected' response.
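For reference, the per-pair DPO objective can be sketched in plain Python. This is a minimal illustration of the loss, not the actual training code; the log-probability values are made up, and `beta` corresponds to the beta parameter listed below:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy favors the chosen response more than the reference model does,
# the margin is positive and the loss drops below log(2) (~0.693).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # margin = 2.0, loss ≈ 0.598
```

Minimizing this loss pushes the policy's margin between chosen and rejected responses above the reference model's margin.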
## Training Configuration (DPO)

- Base SFT model: `rokugatsu/LLM2025_Advanced_5`
- DPO dataset: `u-10bei/sft_alfworld_trajectory_dataset_v2`
- Method: Direct Preference Optimization (DPO)
- Max sequence length: 2048
- Epochs: 0.25
- Learning rate: 2e-06
- Beta parameter (DPO loss): 0.1
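The hyperparameters above map onto `trl` roughly as follows. This is a hedged sketch, not the exact training script used for this model: batch size and other unlisted settings are assumptions, and the `DPOTrainer` argument names (e.g. `processing_class`) vary between `trl` versions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "rokugatsu/LLM2025_Advanced_5"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("u-10bei/sft_alfworld_trajectory_dataset_v2", split="train")

config = DPOConfig(
    output_dir="LLM2025_Advanced_DPO_5",
    beta=0.1,              # DPO loss temperature, as listed above
    learning_rate=2e-6,
    num_train_epochs=0.25,
    max_length=2048,
    bf16=True,
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older trl releases
)
trainer.train()
```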
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rokugatsu/LLM2025_Advanced_DPO_5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use bfloat16 if your GPU supports it
    device_map="auto",
)
# The adapter is already merged into the weights, so PeftModel is not needed.

# Inference example (uses the model's chat template)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Sources & Terms (IMPORTANT)

- Training data: `u-10bei/sft_alfworld_trajectory_dataset_v2`
- Dataset license: MIT License. The dataset is used and distributed under the terms of the MIT License.
- Compliance: users must comply with the MIT License (including retaining the copyright notice) and with the base model's original terms of use.