---
tags:
- causal-lm
- transformers
- finetuned
- instruction-following
- dpo
license: apache-2.0
datasets:
- agentlans/crash-course
- Intel/orca_dpo_pairs
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

# SmolLM2-135M-Instruct-Plus

This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct), aimed at maximizing the knowledge packed into a small 135M-parameter model.

> [!WARNING]
> ⚠️ Consider this model a creative text generator.
> Without further fine-tuning, it gives wildly inaccurate answers. Don't trust its output without independent verification.

## Model Details

- **Base Model:** [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-tuning Datasets:**
  - [agentlans/crash-course](https://huggingface.co/datasets/agentlans/crash-course) (120K subset)
  - [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
- **Training Procedure** (sketched in the example below):
  1. Supervised Fine-Tuning (SFT) on `crash-course` for 1 epoch.
  2. Direct Preference Optimization (DPO) on `orca_dpo_pairs`.
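
The exact training script is not published, so the following is only a minimal sketch of the two stages using TRL's `SFTTrainer` and `DPOTrainer`; the dataset split and column handling are assumptions.

```python
# Minimal sketch of the two-stage pipeline with TRL -- illustrative, not the
# actual training script; dataset split and column names are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Stage 1: supervised fine-tuning on the crash-course data for one epoch.
# Assumes an SFT-ready schema; adapt to the dataset's actual columns.
sft = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=SFTConfig(output_dir="sft-out", num_train_epochs=1),
    train_dataset=load_dataset("agentlans/crash-course", split="train"),
)
sft.train()
sft.save_model("sft-out")

# Stage 2: DPO on the Orca preference pairs, starting from the SFT checkpoint.
# DPOTrainer expects prompt/chosen/rejected columns, so rename accordingly.
pairs = load_dataset("Intel/orca_dpo_pairs", split="train")
pairs = pairs.rename_column("question", "prompt").remove_columns("system")
dpo = DPOTrainer(
    model="sft-out",
    args=DPOConfig(output_dir="dpo-out"),
    train_dataset=pairs,
)
dpo.train()
```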

## Intended Uses

For research, experimentation, and educational purposes where a small instruction-following model is desired.
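
As a quick start, here is a minimal `transformers` inference sketch. The repo id is inferred from this card's title and is an assumption; substitute the actual one.

```python
# Minimal inference sketch; the repo id below is inferred from the card title.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "agentlans/SmolLM2-135M-Instruct-Plus"  # assumption: adjust if needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# SmolLM2 is a chat model, so format the prompt with its chat template.
messages = [{"role": "user", "content": "Give a one-sentence definition of DNA."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```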

## Limitations

- **Hallucinations:** Prone to generating incorrect information due to its small size.
- **Repetitive Output:** May produce repetitive text.

## Training Details

Both the SFT and DPO stages share common settings: the `liger_kernel` booster, LoRA fine-tuning on a custom model, BF16 compute, a batch size of 2, and a cosine scheduler with a learning rate of 5e-5. RSLoRA is enabled with a rank of 16 and an alpha of 32.
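
In PEFT terms, the shared adapter settings correspond roughly to the configuration below (a sketch; anything not named above, such as target modules, is left at library defaults):

```python
# Approximate PEFT equivalent of the shared LoRA settings; only the values
# named in this card are specified, everything else stays at defaults.
from peft import LoraConfig

shared_adapter = LoraConfig(
    r=16,             # LoRA rank
    lora_alpha=32,    # scaling alpha
    use_rslora=True,  # rank-stabilized scaling: alpha / sqrt(r) instead of alpha / r
    task_type="CAUSAL_LM",
)
```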

The stages differ mainly in their datasets and a few settings: SFT uses the CrashCourse_120K subset with packing enabled and a LoRA dropout of 0, while DPO uses `orca_dpo_pairs` with packing disabled and a LoRA dropout of 0.95.
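
Mapped onto TRL-style configs, the shared arguments plus the per-stage differences would look roughly like this (illustrative only; the actual config files are not published):

```python
# Shared optimizer settings plus the per-stage differences described above.
from trl import SFTConfig, DPOConfig

common = dict(
    bf16=True,                      # BF16 compute
    per_device_train_batch_size=2,  # batch size of 2
    lr_scheduler_type="cosine",
    learning_rate=5e-5,
)

sft_args = SFTConfig(output_dir="sft-out", packing=True, **common)  # packing on
dpo_args = DPOConfig(output_dir="dpo-out", **common)                # packing off
# The LoRA dropout also differs per stage: 0.0 for SFT vs. 0.95 for DPO.
```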

## Evaluation

The model gives coherent and creative answers, but they are often incorrect. Thorough evaluation is recommended before deployment.