---
base_model: rokugatsu/LLM2025_Advanced_5
datasets:
- u-10bei/sft_alfworld_trajectory_dataset_v2
language:
- en
license: apache-2.0
library_name: trl
pipeline_tag: text-generation
tags:
- dpo
- agent
- tool-use
- alfworld
---

# LLM2025_Advanced_DPO_5

This repository provides a **DPO-fine-tuned model** based on
**rokugatsu/LLM2025_Advanced_5**, trained with `trl.DPOTrainer`.

The model was aligned with Direct Preference Optimization (DPO) using
preference pairs derived from agent task trajectories.

## Training Objective

This model was fine-tuned with DPO to improve multi-turn agent task performance,
learning preferences from the `u-10bei/sft_alfworld_trajectory_dataset_v2` dataset.
For each prompt, DPO training increases the likelihood of generating the 'chosen'
response while decreasing the likelihood of the 'rejected' response.

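Concretely, given a prompt $x$ with chosen response $y_w$ and rejected response $y_l$, DPO minimizes the standard DPO loss, where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen SFT reference model, $\sigma$ is the logistic sigmoid, and $\beta$ is the beta parameter listed below:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$
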
## Training Configuration (DPO)

- Base SFT model: rokugatsu/LLM2025_Advanced_5
- DPO dataset: u-10bei/sft_alfworld_trajectory_dataset_v2
- Method: Direct Preference Optimization (DPO)
- Max sequence length: 2048
- Epochs: 0.25
- Learning rate: 2e-06
- Beta parameter (DPO loss): 0.1
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "rokugatsu/LLM2025_Advanced_DPO_5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use bfloat16 if your GPU supports it
    device_map="auto",
)
# The adapter is already merged into the model weights, so no PeftModel step is needed.

# Example inference (assumes the tokenizer defines a chat template)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Sources & Terms (IMPORTANT)

- Training data: u-10bei/sft_alfworld_trajectory_dataset_v2
- Dataset license: MIT License; the dataset is used and distributed under its terms.
- Compliance: users must comply with the MIT License (including retention of the copyright notice) and the base model's original terms of use.