---
library_name: transformers
license: apache-2.0
datasets:
- Mehdi-Zogh/MNLP_M2_dpo_dataset
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
---

# Model Card for Qwen3-0.6B-MNLP-DPO

This model is a Direct Preference Optimization (DPO) fine-tuned version of [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base), trained on the [`Mehdi-Zogh/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset). The goal is to align the base model's outputs more closely with human preferences in educational-assistance use cases.

---

## Model Details

### Model Description

This model was fine-tuned with the DPO algorithm on top of Qwen3-0.6B-Base. The preference dataset consists of prompts paired with preferred and rejected responses, which teaches the model to generate more helpful and appropriate answers in instructional contexts.
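
For reference, DPO optimizes the standard pairwise objective of Rafailov et al. (2023), where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\text{ref}}$ is the frozen base model, and $\beta$ controls the implicit KL penalty:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$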

- **Developed by:** Mehdi Zoghlami
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Dataset:** [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset)

---

## Uses

### Direct Use

The model is intended to serve as an AI tutor specialized in EPFL course content.

### Downstream Use

It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.

### Out-of-Scope Use

- Not recommended for use in high-stakes settings.
- Not intended for use in languages other than English.
- Not intended for generating factual or up-to-date information (the base model was not trained for retrieval-based tasks).

---

## Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content (rindex of token 151668, i.e. </think>)
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

---

## Training Details

### Training Data

The training data is [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset), which contains instructional prompts paired with preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
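
As a quick sanity check, the dataset can be inspected with `datasets`. This is a minimal sketch; the `prompt`/`chosen`/`rejected` column names follow the conventional TRL DPO format and are an assumption here, as is the split name:

```python
from datasets import load_dataset

# load the preference dataset from the Hub (split name assumed)
ds = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset", split="train")

# print one preference pair; prompt/chosen/rejected column names assumed
example = ds[0]
for key in ("prompt", "chosen", "rejected"):
    print(f"--- {key} ---")
    print(str(example.get(key))[:300])  # truncate long fields for readability
```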

### Training Procedure

The model was fine-tuned with `trl`'s `DPOTrainer`; a configuration sketch follows the hyperparameter table below.

#### Training Hyperparameters

| Hyperparameter              | Value |
|-----------------------------|-------|
| Learning rate               | 1e-5  |
| Epochs                      | 3     |
| Per-device train batch size | 1     |
| Per-device eval batch size  | 1     |
| Gradient accumulation steps | 4     |
| Precision                   | bf16  |
| Early stopping patience     | 3     |
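
The following is a minimal sketch of a comparable training run with `trl`, using the hyperparameters above. The `output_dir` and split names are placeholders, and some keyword names vary across `trl`/`transformers` versions; it is not the exact script used:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# split names are placeholders; the actual train/validation split is not documented here
dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")

# hyperparameters mirror the table above; output_dir is a placeholder
config = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",        # older versions use evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"] if "validation" in dataset else None,
    processing_class=tokenizer,   # older trl versions take tokenizer= instead
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```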

---

## Evaluation

A held-out set of 320 samples from the dataset was used for validation.

### Testing Data, Factors & Metrics

#### Testing Data

The model was tested on [zechen-nlp/MNLP_dpo_demo](https://huggingface.co/datasets/zechen-nlp/MNLP_dpo_demo).

#### Metrics

- **Preference accuracy:** the fraction of held-out pairs in which the model ranks the preferred response above the rejected one. This is a standard metric in DPO training for evaluating how well the model aligns with human preferences; a computation sketch follows below.
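
Preference accuracy can be estimated by scoring each response with its summed token log-probability under the model. The sketch below is illustrative rather than the exact evaluation code: the `prompt`/`chosen`/`rejected` column names and the `train` split are assumptions, and it assumes the prompt's tokenization is a prefix of the prompt-plus-response tokenization:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
model.eval()

def response_logprob(prompt: str, response: str) -> float:
    """Summed log-probability of `response` given `prompt` under the model."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each actual next token, then keep only the response span
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, full_ids[0, 1:].unsqueeze(1)).squeeze(1)
    return token_lp[prompt_len - 1:].sum().item()

ds = load_dataset("zechen-nlp/MNLP_dpo_demo", split="train")  # split name assumed
correct = sum(
    response_logprob(ex["prompt"], ex["chosen"]) > response_logprob(ex["prompt"], ex["rejected"])
    for ex in ds
)
print(f"preference accuracy: {correct / len(ds):.1%}")
```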

### Results

- The model achieved a **preference accuracy of 84% ± 5.2%** on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.