Self-Rewarding Language Models

Paper: [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020) (arXiv:2401.10020)
This model is trained with Iterative DPO following the Self-Rewarding Language Models recipe (arXiv:2401.10020), in which the model acts as its own judge and improves over multiple training iterations.

The model progressively refines its own judgment criteria through a repeated loop (sketched below):

- sample several candidate responses for each prompt,
- score each candidate with the model's own LLM-as-a-Judge prompt,
- build (chosen, rejected) preference pairs from the highest- and lowest-scored candidates, and
- run a round of DPO on those pairs, then repeat with the updated model.

Compared to a base DPO model trained once on a fixed preference dataset, each iteration here learns from preference pairs produced by an already-improved generator and judge, so both response quality and judgment quality can continue to improve.
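Below is a minimal, illustrative sketch of one such iteration. The helper names (`sample_responses`, `judge_score`, `build_preference_pairs`, `dpo_loss`) and the simplified judge prompt are assumptions for exposition, not code from this repository; the paper uses a more detailed additive 5-point scoring rubric.

```python
import re
import torch.nn.functional as F

# Simplified stand-in for the paper's LLM-as-a-Judge rubric (assumption).
JUDGE_TEMPLATE = (
    "Review the question and the response, then grade the response "
    "on a 0-5 scale.\n\nQuestion: {prompt}\n\nResponse: {response}\n\nScore:"
)

def sample_responses(model, tokenizer, prompt, n=4, max_new_tokens=256):
    """Sample n candidate responses for one prompt."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(
        **inputs, do_sample=True, temperature=0.7,
        num_return_sequences=n, max_new_tokens=max_new_tokens,
    )
    gen = out[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(gen, skip_special_tokens=True)

def judge_score(model, tokenizer, prompt, response):
    """Have the model grade its own response; parse the first number."""
    text = JUDGE_TEMPLATE.format(prompt=prompt, response=response)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    completion = tokenizer.decode(
        out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"\d+(\.\d+)?", completion)
    return float(match.group()) if match else 0.0

def build_preference_pairs(model, tokenizer, prompts, n=4):
    """Keep the highest/lowest scored candidates as (chosen, rejected)."""
    pairs = []
    for prompt in prompts:
        cands = sample_responses(model, tokenizer, prompt, n)
        scores = [judge_score(model, tokenizer, prompt, c) for c in cands]
        if max(scores) > min(scores):  # skip prompts where all scores tie
            pairs.append({
                "prompt": prompt,
                "chosen": cands[scores.index(max(scores))],
                "rejected": cands[scores.index(min(scores))],
            })
    return pairs

def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective over summed response log-probabilities."""
    margin = (pol_chosen_lp - pol_rejected_lp) - (ref_chosen_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()

# Iteration t: model M_t generates and judges its own training data;
# M_{t+1} is obtained by DPO on those pairs, with M_t frozen as reference.
```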
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto"
)

# Load the LoRA adapter on top of it
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-iterative-dpo")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate a response
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The chat template already inserts special tokens, so avoid adding a second BOS here.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
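For adapter-free deployment, the LoRA weights can be folded into the base model with PEFT's `merge_and_unload()`. The output directory below is just an example path.

```python
# Merge the LoRA adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("llama32-1b-iterative-dpo-merged")    # example path
tokenizer.save_pretrained("llama32-1b-iterative-dpo-merged")
```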
```bibtex
@misc{llama32-iterative-dpo,
  author    = {Zickl},
  title     = {Llama-3.2-1B Iterative DPO (Self-Rewarding)},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Zickl/llama32-1b-iterative-dpo}
}
```
Base model: meta-llama/Llama-3.2-1B-Instruct