VK LLM Course. Assignment #2. Fine-tuning an LM with DPO

The model is HuggingFaceTB/SmolLM-135M-Instruct fine-tuned on the HumanLLMs/Human-Like-DPO-Dataset dataset. It learns to give more human-like, friendly answers from the positive and negative examples in that dataset.
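
To make "learning from positive and negative examples" concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not the training code used for this model: the function names are illustrative, and the inputs are assumed to be summed log-probabilities of the full response under the tuned (policy) model and the frozen reference model.

```python
import math

BETA = 1.0  # matches the BETA value listed in this card


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """DPO loss for one (chosen, rejected) pair of response log-probabilities."""
    # How much more (or less) likely each response became relative to the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen response gains probability relative to the rejected one
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

When the policy is identical to the reference model, both log-ratios are zero and the loss equals log 2 (about 0.693). It drops below that value once the policy favors the chosen response more strongly than the reference does.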

The dataset was converted to the chat template format. Fine-tuning was run on a Google Colab T4 GPU. Some parameters and characteristics:

  • 1 training epoch, validation every 25 iterations
  • GradScaler (loss scaling for mixed-precision training)
  • Batch size: 4
  • LR = 5e-4
  • MAX_SEQ_LEN = 512, MAX_PROMPT_LEN = 128
  • BETA = 1.0
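
The chat template conversion mentioned above could look roughly like the sketch below. This is an assumption about the preprocessing, not the author's actual code: the helper name `to_chat_format` is hypothetical, and the record is assumed to have `prompt`, `chosen` and `rejected` fields. The only library call used is `tokenizer.apply_chat_template`, which also appears in the usage example in this card.

```python
def to_chat_format(example, tokenizer):
    """Render one preference record as prompt / chosen / rejected strings.

    Assumes `example` has "prompt", "chosen" and "rejected" keys
    (hypothetical field names for this sketch).
    """
    user_turn = [{"role": "user", "content": example["prompt"]}]
    # The prompt ends with the assistant header, so a completion follows it directly
    prompt_text = tokenizer.apply_chat_template(
        user_turn, tokenize=False, add_generation_prompt=True
    )

    def completion(answer):
        full = tokenizer.apply_chat_template(
            user_turn + [{"role": "assistant", "content": answer}],
            tokenize=False,
        )
        # Keep only the text after the shared prompt; fall back to the raw
        # answer if the template does not reproduce the prompt as a prefix
        return full[len(prompt_text):] if full.startswith(prompt_text) else answer

    return {
        "prompt": prompt_text,
        "chosen": completion(example["chosen"]),
        "rejected": completion(example["rejected"]),
    }
```

With the `datasets` library, a helper like this would typically be applied through `dataset.map(lambda ex: to_chat_format(ex, tokenizer))`. Truncation to MAX_PROMPT_LEN and MAX_SEQ_LEN would then happen at tokenization time and is not shown here.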

Usage example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Fine-tuned (DPO) model and the original model it was tuned from
dpo_model = AutoModelForCausalLM.from_pretrained("pbedrin/llm-course-hw2-dpo")
ref_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

# Format the prompt with the chat template, then tokenize it
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt")

# Sample a response from the DPO model; passing **model_inputs also supplies the attention mask
generated_ids = dpo_model.generate(**model_inputs, max_new_tokens=128, do_sample=True)
# The decoded text contains the prompt followed by the generated answer
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Sample a response from the original model for comparison
init_generated_ids = ref_model.generate(**model_inputs, max_new_tokens=128, do_sample=True)
init_response = tokenizer.batch_decode(init_generated_ids, skip_special_tokens=True)[0]

print("======== BEFORE TUNING ========")
print(init_response)

print("======== AFTER TUNING ========")
print(response)

Sample output (sampling is enabled, so results vary between runs):

======== BEFORE TUNING ========
user
What's your morning routine like?
assistant
I'm excited to get out of the office for a while! As a professional, I'm usually my busiest morning. Here's what I prefer:

**Morning Routine (7:00 am - 9:00 am)**

* Wake up, get out of bed, and stretch your legs
* Spend 5-10 minutes doing some stretching or yoga stretches to loosen up these muscles (you can find some free online resources, such as online classes or videos).
* Grab a healthy snack that includes protein, fiber, and healthy carbohydrates (e.g., fruit, cheese, crackers)

======== AFTER TUNING ========
user
What's your morning routine like?
assistant
You're likely used to waking up around 9am, but with the recent changes in my sleep schedule, I'm starting to feel like I'm not sleep-ready for it. It's been working for me every step of the way. Here's what I mean:

**Morning (60-minute cycle)**

So, I woke up feeling... fresh? I mean, like maybe I just needed a little boost in the morning, right? I was lucky I finally got around to going to the bathroom. And it was...interesting! I was making myself up on the toilet, holding onto shampoo, and