Paper: *Direct Preference Optimization: Your Language Model is Secretly a Reward Model* (arXiv:2305.18290)
This model is a DPO (Direct Preference Optimization) fine-tuned version of marioparreno/emojify-sft for emojify conversion, optimized to prefer high-quality, semantically accurate emojifications. DPO further refines the SFT model by training on preference pairs: for each prompt, the model is shown a "chosen" (preferred) response and a "rejected" response, and learns to align its outputs with human (or stronger-LLM) preferences for emojify conversion.
This model was trained on the marioparreno/emojify-dpo DPO dataset.
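For intuition, the DPO objective over such preference pairs can be sketched in plain Python. This is an illustrative sketch of the loss from the DPO paper, not the actual training code for this model, and the log-probability values in the example are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed token log-probability of the full
    response under the trainable policy or the frozen reference
    (SFT) model; beta controls how far the policy may drift.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the policy already prefers
    # the chosen response relative to the reference model
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: policy favors the chosen response -> low loss
low = dpo_loss(-10.0, -20.0, ref_chosen_logp=-12.0, ref_rejected_logp=-18.0)
# Policy favors the rejected response -> higher loss
high = dpo_loss(-20.0, -10.0, ref_chosen_logp=-18.0, ref_rejected_logp=-12.0)
```

Training then minimizes this loss over the dataset's chosen/rejected pairs, pushing probability mass toward the preferred emojifications.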
```python
from unsloth import FastModel

# Load the fine-tuned model
model, tokenizer = FastModel.from_pretrained(
    model_name="marioparreno/emojify-dpo",
    max_seq_length=256,
    load_in_4bit=True,
)

# Build the chat prompt and run inference
inputs = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Translate this text to emoji:"},
        {"role": "user", "content": "I love coding with AI!"},
    ],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=64)
response = tokenizer.batch_decode(outputs)
```
Base model: google/gemma-3-270m