---
license: apache-2.0
tags:
- dpo
- unsloth
- trl
- qwen
- instruction-tuning
- preference-modeling
- mnlp
datasets:
- Tandogan/sft_dataset_final_train
- Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---

# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M2 project. The model is fine-tuned on a preference dataset to better align its responses with human preferences.

## Model Description

- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **SFT Checkpoint**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)
- **DPO Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)

## Training Procedure

Minimal reproduction sketches for both stages are included at the end of this card.

### Supervised Fine-Tuning (SFT)

- **Dataset**: [`Tandogan/sft_dataset_final_train`](https://huggingface.co/datasets/Tandogan/sft_dataset_final_train) (Alpaca-style prompt–completion pairs)
- **Max sequence length**: 2048
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `3e-5`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Linear with 1% warmup
- **Eval & Checkpointing**: Every epoch

### Direct Preference Optimization (DPO)

Two DPO fine-tuning experiments were run:

#### 1. From Base Model (`Qwen3-0.6B-Base`)
#### 2. From SFT Model ([`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT))

- **Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Max sequence length**: 2048 (prompt and completion each truncated to 1024 tokens)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Based on validation loss

## Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling. It is **not** production-ready and may produce hallucinated, biased, or unsafe outputs. Evaluate it carefully before relying on it for downstream tasks.

## How to Use

You can use the model with the `transformers` and `trl` libraries for inference or evaluation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
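
## Training Sketches

The SFT stage described above can be approximated with TRL's `SFTTrainer`. This is a minimal sketch assembled from the hyperparameters listed in this card, not the exact training script: the actual run used Unsloth for faster training, while plain `transformers` loading is shown here for clarity, and the split name, `text` column, held-out eval split, and output directory are assumptions.

```python
# Minimal SFT sketch based on the hyperparameters listed in this card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

# Assumption: a single "train" split with a pre-formatted "text" column.
dataset = load_dataset("Tandogan/sft_dataset_final_train", split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)  # hypothetical eval split

args = SFTConfig(
    output_dir="sft_out",                  # assumption
    dataset_text_field="text",             # assumption: Alpaca-style prompts rendered to text
    # Card lists a max sequence length of 2048; the matching SFTConfig field is
    # max_seq_length (older TRL) or max_length (newer TRL).
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,                     # 1% warmup
    bf16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers/TRL
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,            # "tokenizer=" on older TRL versions
)
trainer.train()
```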
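
The DPO stage can be approximated with TRL's `DPOTrainer`, starting from the SFT checkpoint. Again a sketch under assumptions: the split names, the `prompt`/`chosen`/`rejected` column layout, and the output directory are not confirmed by this card, while the hyperparameters follow the DPO section above.

```python
# Minimal DPO sketch, initialized from the SFT checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_SFT")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_SFT")

# Assumption: preference data with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset")

args = DPOConfig(
    output_dir="dpo_out",                  # assumption
    beta=0.1,
    max_length=2048,
    max_prompt_length=1024,                # prompt and completion each capped near 1024 tokens
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                     # 1% warmup
    bf16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers/TRL
    save_strategy="epoch",
    load_best_model_at_end=True,           # pick the best epoch by validation loss
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                        # TRL builds the frozen reference from the policy
    args=args,
    train_dataset=dataset["train"],        # assumption: "train"/"validation" splits exist
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,            # "tokenizer=" on older TRL versions
)
trainer.train()
```

With `load_best_model_at_end=True` and per-epoch evaluation and checkpointing, the Trainer restores the checkpoint with the lowest validation loss after training, which matches the best-epoch selection described above.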