LaQwenTa: A STEM academic assistant
This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen3-0.6B-Base using the Mehdi-Zogh/MNLP_M3_dpo_dataset. The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.
Model Details
Model Description
This model was fine-tuned via the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of query-response pairs with annotated preference labels, aiming to teach the model to generate more helpful, appropriate, and preferred responses in instructional contexts.
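For context, DPO directly optimizes the policy on preference pairs without training a separate reward model. Given a prompt $x$ with preferred completion $y_w$ and rejected completion $y_l$, a trainable policy $\pi_\theta$, a frozen reference model $\pi_{\mathrm{ref}}$, and a KL-strength coefficient $\beta$, the standard DPO objective is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$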
- Developed by: Mehdi Zoghlami
- Model type: Causal Language Model
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen3-0.6B-Base
- Dataset: Mehdi-Zogh/MNLP_M3_dpo_dataset
Uses
Direct Use
This model is trained to serve as an AI tutor specialized in course content taught at EPFL.
Downstream Use
It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.
Out-of-Scope Use
- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended as a source of factual or up-to-date information (the base model was not trained for retrieval-based tasks).
Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Mehdi-Zogh/MNLP_M3_dpo_model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Mehdi-Zogh/MNLP_M3_dpo_model",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "What are the phases of cell division?"

# Tokenize the prompt and move it to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response with nucleus sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode and print the full sequence (prompt + completion)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Training Details
Training Data
The training data is the Mehdi-Zogh/MNLP_M3_dpo_dataset, which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
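As a quick sanity check, the dataset can be loaded and inspected with the datasets library. The prompt/chosen/rejected fields shown in the comment are the conventional DPO schema and are an assumption here; check dataset.column_names for the actual fields.

```python
from datasets import load_dataset

# Load the preference dataset from the Hugging Face Hub
dataset = load_dataset("Mehdi-Zogh/MNLP_M3_dpo_dataset", split="train")

# "prompt"/"chosen"/"rejected" is the usual DPO schema (assumed here)
print(dataset.column_names)
print(dataset[0])
```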
Training Procedure
The model was fine-tuned using trl's DPOTrainer; a minimal configuration sketch follows the hyperparameter table below.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-6 |
| Epochs | 3 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps | 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
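The following is a minimal sketch of how such a run can be configured, mirroring the hyperparameters in the table above. It is not the exact training script: argument names track recent trl releases (older versions take tokenizer= instead of processing_class=), and details such as the evaluation cadence, output directory, and split names are assumptions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

dataset = load_dataset("Mehdi-Zogh/MNLP_M3_dpo_dataset")

# Hyperparameters from the table above; the eval/save strategy and
# output directory are assumptions, not confirmed values
config = DPOConfig(
    output_dir="dpo-qwen3-0.6b",          # hypothetical path
    learning_rate=1e-6,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # required for early stopping
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],   # assumed split name
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```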
Evaluation
A held-out split of 900 samples from the dataset was used for validation.
Testing Data, Factors & Metrics
Testing Data
The model was tested on zechen-nlp/MNLP_dpo_evals.
Metrics
- Preference accuracy: measures how often the model ranks the preferred response above the rejected one in held-out pairs. This is a standard metric in DPO training for evaluating how well the model aligns with human preferences.
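As an illustration, preference accuracy can be approximated by checking, for each held-out pair, whether the model assigns a higher length-normalized log-likelihood to the chosen completion than to the rejected one. This is a simplified sketch, not the evaluation script actually used: the true DPO implicit reward also involves the reference model, and the prompt/chosen/rejected field names are assumptions.

```python
import torch

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt, completion):
    """Average log-probability of `completion` tokens given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the
    tokenization of `prompt + completion`.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits
    # Shift so position t predicts token t+1, then keep completion tokens only
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    comp_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    return comp_lp.mean().item()

def preference_accuracy(model, tokenizer, pairs):
    """Fraction of pairs where the chosen completion outscores the rejected one."""
    wins = sum(
        completion_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > completion_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)
```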
Results
- The model achieved a preference accuracy of 79% on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.