|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit |
|
|
library_name: adapter-transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- transformers |
|
|
- unsloth |
|
|
- qwen2 |
|
|
- trl |
|
|
- grpo |
|
|
--- |
|
|
|
|
|
# Uploaded model |
|
|
|
|
|
- **Developed by:** Akshint47 |
|
|
- **License:** apache-2.0 |
|
|
- **Finetuned from model:** unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit
|
|
|
|
|
This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
|
|
|
|
|
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth) |
|
|
# Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset
|
|
|
|
|
This notebook demonstrates fine-tuning the **Qwen2.5-3B-Instruct** model with **GRPO (Group Relative Policy Optimization)** on the **GSM8K** dataset. The goal is to improve the model's ability to solve mathematical reasoning problems by applying reinforcement learning with custom reward functions.
|
|
|
|
|
## Overview |
|
|
|
|
|
The notebook is structured as follows: |
|
|
|
|
|
1. **Installation**: Installs necessary libraries such as `unsloth`, `vllm`, and `trl` for efficient fine-tuning and inference. |
|
|
2. **Unsloth Setup**: Configures the environment for faster fine-tuning using Unsloth's `PatchFastRL` and loads the Qwen2.5-3B-Instruct model with LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. |
|
|
3. **Data Preparation**: Loads and preprocesses the GSM8K dataset, formatting it for training with a system prompt and XML-style reasoning and answer format. |
|
|
4. **Reward Functions**: Defines custom reward functions to evaluate the model's responses, including: |
|
|
- **Correctness Reward**: Checks if the extracted answer matches the ground truth. |
|
|
- **Format Reward**: Ensures the response follows the specified XML format. |
|
|
- **Integer Reward**: Verifies if the extracted answer is an integer. |
|
|
- **XML Count Reward**: Evaluates the completeness of the XML structure in the response. |
|
|
5. **GRPO Training**: Configures and runs the GRPO trainer with vLLM for fast inference, using the defined reward functions to optimize the model's performance. |
|
|
6. **Training Progress**: Monitors the training progress, including rewards, completion length, and KL divergence, to ensure the model is improving over time. |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Efficient Fine-Tuning**: Utilizes Unsloth and LoRA to fine-tune the model with reduced memory usage and faster training times. |
|
|
- **Custom Reward Functions**: Implements multiple reward functions to guide the model towards generating correct and well-formatted responses. |
|
|
- **vLLM Integration**: Leverages vLLM for fast inference during training, enabling efficient generation of multiple responses for reward calculation. |
|
|
- **GSM8K Dataset**: Focuses on improving the model's performance on mathematical reasoning tasks, specifically the GSM8K dataset. |
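To make the reward design concrete, here is a minimal sketch of two of the rewards described above. The exact signatures and weight values in the notebook may differ; TRL's GRPO reward functions operate on batches of completions and return one score per completion.

```python
import re

def extract_xml_answer(text: str) -> str:
    """Grab the text between <answer> tags, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1) if match else ""

def correctness_reward(completions: list[str], answers: list[str]) -> list[float]:
    """2.0 if the extracted answer matches the ground truth, else 0.0."""
    return [2.0 if extract_xml_answer(c) == a else 0.0
            for c, a in zip(completions, answers)]

def format_reward(completions: list[str]) -> list[float]:
    """0.5 if the response follows the <reasoning>/<answer> XML layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0
            for c in completions]
```

Keeping each reward small and independent makes it easy to log them separately during training and to reweight them without touching the others.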
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python 3.11 |
|
|
- Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers` |
|
|
|
|
|
## Installation |
|
|
|
|
|
To set up the environment, run: |
|
|
|
|
|
```bash
pip install unsloth vllm trl
```
|
|
|
|
|
## Usage |
|
|
- **Load the Model**: The notebook loads the Qwen2.5-3B-Instruct model with LoRA for fine-tuning. |
|
|
|
|
|
- **Prepare the Dataset**: The GSM8K dataset is loaded and formatted with a system prompt and XML-style reasoning and answer format. |
|
|
|
|
|
- **Define Reward Functions**: Custom reward functions are defined to evaluate the model's responses. |
|
|
|
|
|
- **Train the Model**: The GRPO trainer is configured and run to fine-tune the model using the defined reward functions. |
|
|
|
|
|
- **Monitor Progress**: The training progress is monitored, including rewards, completion length, and KL divergence. |
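The training steps above can be sketched roughly as follows, assuming TRL's `GRPOConfig`/`GRPOTrainer` and Unsloth's `FastLanguageModel`. Hyperparameter values are illustrative, not the ones used in the notebook, and the reward functions and formatted `train_dataset` are assumed to be defined as described earlier.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the 4-bit base model and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,   # enable vLLM-backed generation
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank (illustrative)
    lora_alpha=16,
)

training_args = GRPOConfig(
    learning_rate=5e-6,        # illustrative values
    num_generations=6,         # completions sampled per prompt
    max_completion_length=256,
    max_steps=250,
    logging_steps=1,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward, format_reward],  # as in the reward section
    args=training_args,
    train_dataset=train_dataset,  # formatted GSM8K split
)
trainer.train()
```

During `trainer.train()`, the per-function rewards, completion length, and KL divergence appear in the training logs, which is what the "Monitor Progress" step inspects.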
|
|
|
|
|
## Results |
|
|
- Training is designed to improve the model's ability to generate correct, well-formatted responses to mathematical reasoning problems. The reward functions guide optimization, and progress metrics (rewards, completion length, KL divergence) are logged for analysis.
|
|
|
|
|
## Future Work |
|
|
- **Hyperparameter Tuning**: Experiment with different learning rates, batch sizes, and reward weights to optimize performance. |
|
|
|
|
|
- **Additional Datasets**: Extend the fine-tuning process to other datasets to improve the model's generalization capabilities. |
|
|
|
|
|
- **Advanced Reward Functions**: Implement more sophisticated reward functions to further refine the model's responses. |
|
|
|
|
|
## Acknowledgments |
|
|
- **Unsloth**: For providing tools to speed up fine-tuning. |
|
|
|
|
|
- **vLLM**: For enabling fast inference during training. |
|
|
|
|
|
- **Hugging Face**: For the trl library and the GSM8K dataset. |
|
|
|
|
|
- Special thanks to @sudhir2016 for mentoring me throughout this fine-tuning project.
|
|
|
|
|
## License |
|
|
This project is licensed under the Apache License 2.0. See the LICENSE file for details.