Alignment of SmolLM with PPO

Task description

In this homework assignment, the SmolLM-135M-Instruct language model was fine-tuned with PPO on the Human-Like-DPO-Dataset.
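To make the objective concrete: PPO fine-tuning maximizes a clipped surrogate objective, which limits how far the updated policy can move from the policy that generated the samples. The sketch below is a minimal, self-contained illustration of that per-token loss in plain Python; it is not the actual training code for this model, and the function name and scalar interface are illustrative.

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single token/action.

    logp_new / logp_old: log-probabilities under the current and old policy.
    advantage: estimated advantage of the sampled action.
    eps: clipping range (0.2 is the value suggested in the PPO paper).
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    # Clip the ratio to [1 - eps, 1 + eps] before weighting the advantage.
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Take the pessimistic minimum of the two objectives; negate to get a loss.
    return -min(unclipped, clipped)

# With a positive advantage, pushing the ratio beyond 1 + eps gives no extra gain:
print(ppo_clipped_loss(logp_new=0.5, logp_old=0.0, advantage=1.0))  # -1.2
```

In real RLHF training this loss is averaged over all response tokens in a batch, with advantages derived from a reward signal (here, preference data from the dataset above).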

Generation example

Question

Oh, I just saw the best meme - have you seen it?

Model response

Hahahaha, I had no idea! I'm always looking for something interesting, creative, or funny.

Remember, that's the best kind of meme. Keep it creative and poke fun, and you might just stumble upon an amazing idea or a delightful little thing.

Usage example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda")
PPO_REPO_NAME = "MurDanya/llm-course-hw2-ppo"

tokenizer = AutoTokenizer.from_pretrained(PPO_REPO_NAME)
check_model = AutoModelForCausalLM.from_pretrained(PPO_REPO_NAME)
check_model = check_model.to(device)
check_model = check_model.eval()

messages = [{"role": "user", "content": "Oh, I just saw the best meme - have you seen it?"}]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = check_model.generate(model_inputs.input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(response)