---
base_model: unsloth/llama-3.2-1b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- gguf
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** forestav
- **License:** apache-2.0
- **Finetuned from model:** [unsloth/llama-3.2-1b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.2-1b-instruct-bnb-4bit)

## Model description

This model refines an earlier LoRA adapter trained on **unsloth/Llama-3.2-3B-Instruct** with the **FineTome-100k** dataset. This version switches to the smaller **unsloth/llama-3.2-1b-instruct-bnb-4bit** base (1B vs. 3B parameters) to speed up training and make the model easier to adapt to specific tasks, such as medical applications.

### Key adjustments:

1. **Reduced Parameter Count:** The model was downsized to 1B parameters to improve training efficiency and ease customization.
2. **Adjusted Learning Rate:** A smaller learning rate was used to prevent overfitting and mitigate catastrophic forgetting, so the model retains its general pretraining knowledge while learning the new task effectively.

The finetuning dataset, **ruslanmv/ai-medical-chatbot**, contains roughly 257k rows of narrowly medical dialogue, which necessitated careful hyperparameter tuning to avoid over-specialization.
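
As a rough illustration of how such a run can be set up, the sketch below loads the 1B base model with Unsloth, attaches a LoRA adapter, and pulls the medical dataset. The LoRA rank, target modules, and sequence length are illustrative assumptions, not the exact values used for this model.

```python
# Minimal sketch: load the 4-bit 1B base with Unsloth, attach a LoRA adapter,
# and load the medical finetuning dataset. LoRA rank/target modules below are
# illustrative assumptions, not the exact settings used for this model.
from unsloth import FastLanguageModel
from datasets import load_dataset

max_seq_length = 2048  # assumed; adjust to your hardware

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank (assumption)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# ~257k rows of patient/doctor dialogue; check the column names on the Hub
# before writing your prompt-formatting function.
dataset = load_dataset("ruslanmv/ai-medical-chatbot", split="train")
```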

---

## Hyperparameters and explanations

- **Learning rate:** `2e-5`

  A smaller learning rate reduces the risk of overfitting and catastrophic forgetting, particularly when working with models containing fewer parameters.

- **Warm-up steps:** `5`

  The learning rate is ramped up over the first 5 steps, giving the optimizer time to gather gradient statistics before training at the full learning rate, which improves stability.

- **Per device train batch size:** `2`

  Each GPU processes 2 training samples per step, which suits resource-constrained environments.

- **Gradient accumulation steps:** `4`

  Gradients are accumulated over 4 steps to simulate a larger batch size (effective batch size: 2 × 4 = 8) without exceeding memory limits.

- **Optimizer:** `AdamW with 8-bit Quantization`

  - **AdamW:** Applies decoupled weight decay to prevent overfitting.
  - **8-bit Quantization:** Reduces memory usage by compressing optimizer states, which also speeds up training.

- **Weight decay:** `0.01`

  A standard weight decay value that works well across a wide range of training setups.

- **Learning rate scheduler type:** `Linear`

  Gradually decreases the learning rate from its initial value to zero over the course of training.
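
Taken together, these settings map directly onto a Hugging Face `TrainingArguments` object, as in the sketch below. The listed values come from this card; fields such as `num_train_epochs`, `logging_steps`, and `output_dir` are placeholders, not the exact values used.

```python
# The hyperparameters above, expressed as transformers TrainingArguments.
# Fields not listed in this card (epochs, logging, output_dir) are
# placeholders, not the exact values used for this model.
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate=2e-5,
    warmup_steps=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size: 2 * 4 = 8
    optim="adamw_8bit",              # 8-bit AdamW via bitsandbytes
    weight_decay=0.01,
    lr_scheduler_type="linear",
    num_train_epochs=1,              # placeholder
    logging_steps=10,                # placeholder
    output_dir="outputs",            # placeholder
)
```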

---

## Quantization details

The model is saved in **16-bit GGUF format**, which:

- Ensures **100% accuracy retention** relative to the finetuned 16-bit weights (no quantization loss).
- Trades off speed and memory for improved precision compared with smaller quantized GGUF variants (e.g. 4-bit or 8-bit).
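
Below is a minimal sketch of a 16-bit GGUF export using Unsloth's `save_pretrained_gguf` helper, followed by a quick load with `llama-cpp-python` as a sanity check. The output path and the exact `quantization_method` option should be verified against the Unsloth version you use.

```python
# Export the finetuned model to 16-bit GGUF with Unsloth's helper, then load
# it with llama-cpp-python for a quick sanity check. File names are
# placeholders; verify the quantization_method option against your Unsloth
# version ("f16" at the time of writing).
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="f16")

from llama_cpp import Llama

llm = Llama(model_path="gguf_model/unsloth.F16.gguf", n_ctx=2048)  # path is an assumption
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are common symptoms of anemia?"}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```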

### Training optimization

Training was accelerated by **2x** using [Unsloth](https://github.com/unslothai/unsloth) in combination with Hugging Face's **TRL** library.
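
In practice this means the Unsloth model from the earlier sketch is handed to TRL's `SFTTrainer` together with the `TrainingArguments` shown above. This is a sketch only: it assumes the dataset has already been formatted into a single `"text"` column, and newer TRL releases move `dataset_text_field`/`max_seq_length` onto `SFTConfig` instead of the trainer.

```python
# Wiring the Unsloth model and the medical dataset into TRL's SFTTrainer.
# Assumes `model`, `tokenizer`, `dataset`, `max_seq_length`, and
# `training_args` from the earlier sketches; the "text" column must be
# produced by your own prompt-formatting step.
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)

trainer.train()
```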

---

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)