---
base_model: HuggingFaceTB/SmolLM-135M-Instruct
datasets: HumanLLMs/Human-Like-DPO-Dataset
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- reward-trainer
licence: license
---

Reward model for an NLP homework assignment at HSE.

This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset.
It has been trained using [TRL](https://github.com/huggingface/trl).

## Training procedure

[Visualize in Weights & Biases](https://wandb.ai/spankevichteam/hw2-rlhf/runs/4b5k1ftd)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/67b0bd703230f308b6a233c4/Sdj-UAXEMW5jl3gAPREED.png)

This model was trained as a reward model with TRL's reward trainer.

### Framework versions

- TRL: 0.15.2
- Transformers: 4.49.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```
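
The training procedure above only says the model was trained as a reward model on a preference dataset. As background, reward trainers of this kind typically minimize a pairwise log-sigmoid (Bradley-Terry style) loss over the scores of the chosen and rejected responses. The sketch below shows that objective for a single pair in plain Python. It is an illustration, not the code used for this run, and the function name `pairwise_reward_loss` is hypothetical.

```python
import math


def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Return -log(sigmoid(chosen_reward - rejected_reward)) for one preference pair.

    Illustrative helper (an assumption, not part of this model's training code).
    """
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the chosen response scores higher than the rejected one,
# and grows when the ordering is wrong.
loss_correct_order = pairwise_reward_loss(2.0, 0.0)
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)
```

When both responses receive the same score the loss is `log(2)`, so a freshly initialized reward head usually starts near that value.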