Self-Rewarding Language Models

Paper: [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020) (arXiv:2401.10020)
This model is trained with Iterative DPO following the Self-Rewarding Language Models recipe (arXiv:2401.10020), in which the model acts as its own judge and improves over multiple training iterations.

The model progressively refines its own judgment criteria through a repeated loop (sketched below):

- sample several candidate responses for each prompt,
- score each candidate with the model's own LLM-as-a-Judge prompt,
- build (chosen, rejected) preference pairs from the highest- and lowest-scored candidates, and
- run a round of DPO on those pairs, then repeat with the updated model.

Compared to a base DPO model trained once on a fixed preference dataset, each iteration here learns from preference pairs produced by an already-improved generator and judge, so both response quality and judgment quality can continue to improve.
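Below is a minimal, illustrative sketch of one such iteration. The helper names (`sample_responses`, `judge_score`, `build_preference_pairs`, `dpo_loss`) and the simplified judge prompt are assumptions for exposition, not code from this repository; the paper uses a more detailed additive 5-point scoring rubric.

```python
import re
import torch.nn.functional as F

# Simplified stand-in for the paper's LLM-as-a-Judge rubric (assumption).
JUDGE_TEMPLATE = (
    "Review the question and the response, then grade the response "
    "on a 0-5 scale.\n\nQuestion: {prompt}\n\nResponse: {response}\n\nScore:"
)

def sample_responses(model, tokenizer, prompt, n=4, max_new_tokens=256):
    """Sample n candidate responses for one prompt."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(
        **inputs, do_sample=True, temperature=0.7,
        num_return_sequences=n, max_new_tokens=max_new_tokens,
    )
    gen = out[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(gen, skip_special_tokens=True)

def judge_score(model, tokenizer, prompt, response):
    """Have the model grade its own response; parse the first number."""
    text = JUDGE_TEMPLATE.format(prompt=prompt, response=response)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    completion = tokenizer.decode(
        out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"\d+(\.\d+)?", completion)
    return float(match.group()) if match else 0.0

def build_preference_pairs(model, tokenizer, prompts, n=4):
    """Keep the highest/lowest scored candidates as (chosen, rejected)."""
    pairs = []
    for prompt in prompts:
        cands = sample_responses(model, tokenizer, prompt, n)
        scores = [judge_score(model, tokenizer, prompt, c) for c in cands]
        if max(scores) > min(scores):  # skip prompts where all scores tie
            pairs.append({
                "prompt": prompt,
                "chosen": cands[scores.index(max(scores))],
                "rejected": cands[scores.index(min(scores))],
            })
    return pairs

def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective over summed response log-probabilities."""
    margin = (pol_chosen_lp - pol_rejected_lp) - (ref_chosen_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()

# Iteration t: model M_t generates and judges its own training data;
# M_{t+1} is obtained by DPO on those pairs, with M_t frozen as reference.
```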
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto"
)

# Load the LoRA adapter on top of it
model = PeftModel.from_pretrained(base_model, "Zickl/llama32-1b-iterative-dpo")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Generate a response
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The chat template already inserts special tokens, so avoid adding a second BOS here.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
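For adapter-free deployment, the LoRA weights can be folded into the base model with PEFT's `merge_and_unload()`. The output directory below is just an example path.

```python
# Merge the LoRA adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("llama32-1b-iterative-dpo-merged")    # example path
tokenizer.save_pretrained("llama32-1b-iterative-dpo-merged")
```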
```bibtex
@misc{llama32-iterative-dpo,
  author    = {Zickl},
  title     = {Llama-3.2-1B Iterative DPO (Self-Rewarding)},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Zickl/llama32-1b-iterative-dpo}
}
```
Base model: meta-llama/Llama-3.2-1B-Instruct