---
library_name: transformers
license: apache-2.0
datasets:
- Mehdi-Zogh/MNLP_M2_dpo_dataset
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: question-answering
---

# Model Card for Qwen3-0.6B-MNLP-DPO

This model is a Direct Preference Optimization (DPO) fine-tuned version of [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) trained on the [`Mehdi-Zogh/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset). The goal was to better align the base model's outputs with human preferences in educational assistance use cases.

---

## Model Details

### Model Description

This model was fine-tuned with the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of query-response pairs with annotated preference labels, teaching the model to generate more helpful, appropriate, and preferred responses in instructional contexts.

- **Developed by:** Mehdi Zoghlami
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Dataset:** [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset)

---

## Uses

### Direct Use

This model is trained to serve as an AI tutor specialized in course content at EPFL.

### Downstream Use

It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms and tutoring systems.

### Out-of-Scope Use

- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended for generating factual or up-to-date information (the base model was not trained for retrieval-based tasks).

---

## Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (the </think> token)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

## Training Details

### Training Data

The training data is the [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset), which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
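Each record pairs an instructional prompt with a preferred (`chosen`) and a dispreferred (`rejected`) completion. A representative record might look like the sketch below; the column names follow `trl`'s standard DPO layout, and the field contents are invented for illustration:

```python
# Illustrative preference record (values are invented; column names are
# assumed to follow trl's standard prompt/chosen/rejected DPO layout):
example = {
    "prompt": "Explain the difference between a stack and a queue.",
    "chosen": "A stack is last-in, first-out (LIFO): the most recently added item ...",
    "rejected": "They are both data structures and are basically the same thing ...",
}
```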
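During training, such pairs drive the standard DPO objective (Rafailov et al., 2023), which raises the likelihood of chosen responses relative to rejected ones while staying close to a frozen reference model:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the chosen and rejected completions, $\pi_{\text{ref}}$ is the frozen base model, and $\beta$ controls how far the fine-tuned policy may drift from the reference.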
### Training Procedure

The model was fine-tuned using `trl`'s `DPOTrainer`; a hedged reproduction sketch appears at the end of this card.

#### Training Hyperparameters

| Hyperparameter               | Value |
|------------------------------|-------|
| Learning rate                | 1e-5  |
| Epochs                       | 3     |
| Per-device train batch size  | 1     |
| Per-device eval batch size   | 1     |
| Gradient accumulation steps  | 4     |
| Precision                    | bf16  |
| Early stopping patience      | 3     |

## Evaluation

320 samples from the dataset were held out for validation.

### Testing Data, Factors & Metrics

#### Testing Data

The model was tested on [zechen-nlp/MNLP_dpo_demo](https://huggingface.co/datasets/zechen-nlp/MNLP_dpo_demo).

#### Metrics

- **Preference accuracy:** measures how often the model ranks the preferred response above the rejected one on held-out validation pairs (see the computation sketch at the end of this card).
- This is a standard metric in DPO training for evaluating how well the model's outputs align with human preferences.

### Results

- The model achieved a **preference accuracy of 84% ± 5.2%** on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.
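For reference, the training procedure described above can be approximated with `trl` as in the following sketch. This is not the original training script: the output directory, evaluation split name, `beta`, and the evaluation/saving strategies are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

# Assumed: the dataset ships train/validation splits in prompt/chosen/rejected format.
dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")

training_args = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",   # placeholder
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    beta=0.1,                           # trl's default; the value actually used is not documented
    eval_strategy="epoch",              # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # assumed split name
    processing_class=tokenizer,          # `tokenizer=` on older trl versions
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```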
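Preference accuracy, as reported in the Results section, can be estimated by checking for each held-out pair whether the DPO implicit reward favors the chosen completion. A minimal sketch, assuming `beta = 0.1` and that the prompt tokenizes identically inside the concatenated sequence; this is not the exact evaluation script:

```python
import torch

def sequence_logprob(model, tokenizer, prompt, completion):
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..T-1
    token_lp = log_probs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()       # count completion tokens only

def preference_accuracy(policy, ref_model, tokenizer, pairs, beta=0.1):
    """Fraction of pairs where the DPO implicit reward prefers the chosen response."""
    correct = 0
    for ex in pairs:  # each ex has "prompt", "chosen", "rejected" keys (assumed layout)
        def implicit_reward(completion):
            return beta * (sequence_logprob(policy, tokenizer, ex["prompt"], completion)
                           - sequence_logprob(ref_model, tokenizer, ex["prompt"], completion))
        correct += int(implicit_reward(ex["chosen"]) > implicit_reward(ex["rejected"]))
    return correct / len(pairs)
```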