LaQwenTa: A STEM academic assistant
This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen3-0.6B-Base using the Mehdi-Zogh/MNLP_M3_dpo_dataset. The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.
Model Details
Model Description
This model was fine-tuned via the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of query-response pairs with annotated preference labels, aiming to teach the model to generate more helpful, appropriate, and preferred responses in instructional contexts.
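For context, DPO directly optimizes the policy on preference pairs without training a separate reward model. Given a prompt $x$ with preferred completion $y_w$ and rejected completion $y_l$, a trainable policy $\pi_\theta$, a frozen reference model $\pi_{\mathrm{ref}}$, and a KL-strength coefficient $\beta$, the standard DPO objective is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$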
- Developed by: Mehdi Zoghlami
- Model type: Causal Language Model
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen3-0.6B-Base
- Dataset: Mehdi-Zogh/MNLP_M3_dpo_dataset
Uses
Direct Use
This model is trained to serve as an AI tutor specialized in course content taught at EPFL.
Downstream Use
It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.
Out-of-Scope Use
- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended as a source of factual or up-to-date information (the base model was not trained for retrieval-based tasks).
Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mehdi-Zogh/MNLP_M3_dpo_model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Mehdi-Zogh/MNLP_M3_dpo_model",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "What are the phases of cell division?"

# Tokenize the prompt and move it to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with nucleus sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode and print the full sequence (prompt + completion)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Training Details
Training Data
The training data is the Mehdi-Zogh/MNLP_M3_dpo_dataset, which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
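As a quick sanity check, the dataset can be loaded and inspected with the datasets library. The prompt/chosen/rejected fields shown in the comment are the conventional DPO schema and are an assumption here; check dataset.column_names for the actual fields.

```python
from datasets import load_dataset

# Load the preference dataset from the Hugging Face Hub
dataset = load_dataset("Mehdi-Zogh/MNLP_M3_dpo_dataset", split="train")

# "prompt"/"chosen"/"rejected" is the usual DPO schema (assumed here)
print(dataset.column_names)
print(dataset[0])
```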
Training Procedure
The model was fine-tuned using trl's DPOTrainer; a minimal configuration sketch follows the hyperparameter table below.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-6 |
| Epochs | 3 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps | 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
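The following is a minimal sketch of how such a run can be configured, mirroring the hyperparameters in the table above. It is not the exact training script: argument names track recent trl releases (older versions take tokenizer= instead of processing_class=), and details such as the evaluation cadence, output directory, and split names are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

dataset = load_dataset("Mehdi-Zogh/MNLP_M3_dpo_dataset")

# Hyperparameters from the table above; the eval/save strategy and
# output directory are assumptions, not confirmed values
config = DPOConfig(
    output_dir="dpo-qwen3-0.6b",          # hypothetical path
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # required for early stopping
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],   # assumed split name
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```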
Evaluation
A held-out split of 900 samples from the dataset was used for validation.
Testing Data, Factors & Metrics
Testing Data
The model was tested on zechen-nlp/MNLP_dpo_evals.
Metrics
- Preference accuracy: measures how often the model ranks the preferred response above the rejected one in held-out pairs. This is a standard metric in DPO training for evaluating how well the model aligns with human preferences.
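As an illustration, preference accuracy can be approximated by checking, for each held-out pair, whether the model assigns a higher length-normalized log-likelihood to the chosen completion than to the rejected one. This is a simplified sketch, not the evaluation script actually used: the true DPO implicit reward also involves the reference model, and the prompt/chosen/rejected field names are assumptions.

```python
import torch

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt, completion):
    """Average log-probability of `completion` tokens given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the
    tokenization of `prompt + completion`.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits
    # Shift so position t predicts token t+1, then keep completion tokens only
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    comp_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    return comp_lp.mean().item()

def preference_accuracy(model, tokenizer, pairs):
    """Fraction of pairs where the chosen completion outscores the rejected one."""
    wins = sum(
        completion_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > completion_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)
```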
Results
- The model achieved a preference accuracy of 79% on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.