---
library_name: transformers
tags:
- peft
- qlora
- knowledge-distillation
- gsm8k
- metamathqa
- phi-2
- math
datasets:
- meta-math/MetaMathQA
- openai/gsm8k
language:
- en
base_model:
- microsoft/phi-2
pipeline_tag: text-generation
---

# Model Card for Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed

This model is a version of `microsoft/phi-2` fine-tuned with knowledge distillation. The goal was to teach the compact Phi-2 "student" model to replicate the step-by-step mathematical reasoning style of the larger `Qwen/Qwen2.5-Math-7B-Instruct` "teacher" model.

This is **V2** of the project. It uses a significantly larger and more diverse dataset than V1, resulting in more robust reasoning and the ability to generate LaTeX math notation correctly.

## Model Details

### Model Description

This project explores **style distillation**, where a smaller model is trained not just on correct answers, but on the *process* and *format* of a larger, more capable model's output. The primary objective was to transfer the teacher's verbose, step-by-step reasoning methodology, including its use of LaTeX, to the student model.

The model was fine-tuned with QLoRA, whose low memory footprint makes training possible on consumer-grade hardware.

- **Developed by:** Dery Ferd
- **Model type:** Causal Decoder-Only Transformer
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** `microsoft/phi-2`

## How to Get Started with the Model

Use the code below to load the base model, attach the fine-tuned adapter, and run inference.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Adapter repository and the base model it was trained on
repo_id = "DeryFerd/Qwen2.5-Math-Instruct-Distill-Phi2-2.5K-Mixed"
base_model_id = "microsoft/phi-2"

# Load the base model in half precision
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Attach the LoRA adapter to the base model (the weights are not merged)
model = PeftModel.from_pretrained(base_model, repo_id)
model.eval()

# --- Run Inference ---
instruction = "A right-angled triangle has two shorter sides with lengths of 8 cm and 15 cm. What is the length of the longest side (the hypotenuse)? Use the Pythagorean theorem (a^2 + b^2 = c^2) to solve it."
prompt = f"Instruct: {instruction.strip()}\nOutput:"

inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Keep everything after the first "Output:" marker
final_answer = response.split("Output:", 1)[1].strip()
print(final_answer)
```

## Training Details

### Training Data

The model was trained on a combined dataset of **2,500 examples** created specifically for this distillation task. The dataset mixes two sources to improve robustness and mathematical formatting:

- **2,000 examples from GSM8K:** A popular dataset of grade-school math word problems that require multi-step arithmetic reasoning.
- **500 examples from MetaMathQA:** A high-quality dataset covering a broader range of math topics, including algebra and more complex notation.
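
The 2,000 + 500 mix described above could be assembled along the following lines. This is a minimal sketch, not the author's actual script: the helper names, the use of the `train` splits, uniform random sampling, and the seed are all assumptions. Only the dataset IDs and the per-source counts come from this card.

```python
import random


def sample_mixed_questions(gsm8k_questions, metamath_questions,
                           n_gsm8k=2000, n_metamath=500, seed=42):
    """Draw a fixed-size random sample from each source and shuffle the mix.

    Hypothetical helper: the sampling strategy and seed are assumptions.
    """
    rng = random.Random(seed)
    mixed = [{"source": "gsm8k", "instruction": q}
             for q in rng.sample(gsm8k_questions, n_gsm8k)]
    mixed += [{"source": "metamathqa", "instruction": q}
              for q in rng.sample(metamath_questions, n_metamath)]
    rng.shuffle(mixed)
    return mixed


def load_source_questions():
    """Fetch raw questions from the Hub (needs the `datasets` package and network).

    Assumes the `train` splits were used; the card does not say which split.
    """
    from datasets import load_dataset  # imported lazily; only needed here

    gsm8k = load_dataset("openai/gsm8k", "main", split="train")
    metamath = load_dataset("meta-math/MetaMathQA", split="train")
    # GSM8K stores its problems under "question", MetaMathQA under "query".
    return list(gsm8k["question"]), list(metamath["query"])
```

Calling `sample_mixed_questions(*load_source_questions())` would yield 2,500 instruction records. Only the questions are sampled here; the responses come from the teacher model rather than from the source datasets.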

The answers (the "response" column) are not the original dataset answers. They were generated synthetically by the `Qwen/Qwen2.5-Math-7B-Instruct` teacher model to give the student high-quality, step-by-step reasoning examples to learn from.

### Training Procedure

The model was fine-tuned using the `SFTTrainer` from the TRL library on a single NVIDIA T4 GPU in a Kaggle Notebook environment.

#### Preprocessing

Each data sample was formatted into a single string using the following template, which matches the Phi-2 instruction format:

`Instruct: {instruction}\nOutput: {response}<|endoftext|>`

#### Training Hyperparameters

- **Framework:** Transformers, PEFT (QLoRA), TRL
- **Quantization:** 4-bit `nf4` with `float16` compute dtype
- **LoRA `r`:** 16
- **LoRA `alpha`:** 32
- **LoRA `target_modules`:** `["q_proj", "k_proj", "v_proj", "dense"]`
- **Batch Size:** `per_device_train_batch_size` = 1, `gradient_accumulation_steps` = 8 (effective batch size of 8)
- **Optimizer:** `paged_adamw_8bit`
- **Learning Rate:** `2e-4` with a constant scheduler
- **Epochs:** 3
- **Precision:** `fp16`

## Evaluation

### Results & Summary

Evaluation was qualitative: the outputs of three models (base Phi-2, this fine-tuned student model, and the teacher Qwen model) were compared on a variety of math problems.

- **Success in Style Transfer:** The model adopted the verbose, step-by-step reasoning style of the teacher model. This is a significant improvement over the base Phi-2, which tends to give short, direct answers without showing its work.
- **Success in LaTeX Generation:** A key failure of the V1 model was its inability to generate LaTeX math notation correctly. This V2 model, trained on a more diverse dataset, **successfully generates LaTeX** for equations, fractions, and exponents, mirroring the teacher's output format.
- **Efficiency Gains:** As expected from distillation, in these qualitative comparisons the student model produced its detailed answers **significantly faster** and with **fewer generated tokens** than the much larger 7B teacher model, which is the core benefit this project set out to demonstrate.

## Bias, Risks, and Limitations

This is an experimental model trained for a specific purpose and has several limitations:

- It inherits any biases present in the base `microsoft/phi-2` and teacher `Qwen/Qwen2.5-Math-7B-Instruct` models.
- Its primary training objective was **style imitation**, not necessarily improving raw mathematical accuracy. It may produce plausible-sounding but mathematically incorrect reasoning.
- Its knowledge is limited to the patterns within the 2,500 training examples. It may not generalize well to math problems far outside the scope of GSM8K and MetaMathQA (e.g., advanced calculus).

This model should **not** be used for production or critical applications. It is intended as a portfolio project to demonstrate the effectiveness of knowledge distillation.

## More Information

[LinkedIn](https://www.linkedin.com/in/deryferdikaoktoriansah/)

## Developer Contact

[GitHub](https://github.com/DeryFerd)
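
## Appendix: Training Setup Sketch

The preprocessing template and the quantization and LoRA values listed under Training Details can be written out roughly as below. This is a hedged sketch rather than the original training script: `format_example` and `build_qlora_configs` are hypothetical helper names, `task_type="CAUSAL_LM"` is an assumption, and any setting not listed in this card (for example LoRA dropout) is left at the library default.

```python
EOS_TOKEN = "<|endoftext|>"


def format_example(instruction: str, response: str) -> str:
    """Render one training sample with the Instruct/Output template."""
    return f"Instruct: {instruction}\nOutput: {response}{EOS_TOKEN}"


def build_qlora_configs():
    """Return (quantization config, LoRA config) mirroring the listed values."""
    # Imported lazily so the formatting helper works without these libraries.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    # 4-bit nf4 quantization with float16 compute, as listed in the card.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    # r, alpha, and target modules come from the card; task_type is assumed.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "dense"],
        task_type="CAUSAL_LM",
    )
    return bnb_config, lora_config
```

The quantization config would typically be passed as `quantization_config` when loading the base model, and the LoRA config as `peft_config` to the `SFTTrainer`.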