---
tags:
- causal-lm
- transformers
- finetuned
- instruction-following
- dpo
license: apache-2.0
datasets:
- agentlans/crash-course
- Intel/orca_dpo_pairs
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
---

# SmolLM2-135M-Instruct-Plus

This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct), aimed at maximizing the knowledge packed into a small 135M-parameter model.

> [!WARNING]
> ⚠️ Consider this model a creative text generator.
> Without further fine-tuning, it gives wildly inaccurate answers. Don't trust its output without independent verification.

## Model Details

- **Base Model:** [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct)
- **Fine-tuning Datasets:**
  - [agentlans/crash-course](https://huggingface.co/datasets/agentlans/crash-course) (120K subset)
  - [Intel/orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
- **Training Procedure** (sketched in the example below):
  1. Supervised Fine-Tuning (SFT) on `crash-course` for 1 epoch.
  2. Direct Preference Optimization (DPO) on `orca_dpo_pairs`.
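
The exact training script is not published, so the following is only a minimal sketch of the two stages using TRL's `SFTTrainer` and `DPOTrainer`; the dataset split and column handling are assumptions.

```python
# Minimal sketch of the two-stage pipeline with TRL -- illustrative, not the
# actual training script; dataset split and column names are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Stage 1: supervised fine-tuning on the crash-course data for one epoch.
# Assumes an SFT-ready schema; adapt to the dataset's actual columns.
sft = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M-Instruct",
    args=SFTConfig(output_dir="sft-out", num_train_epochs=1),
    train_dataset=load_dataset("agentlans/crash-course", split="train"),
)
sft.train()
sft.save_model("sft-out")

# Stage 2: DPO on the Orca preference pairs, starting from the SFT checkpoint.
# DPOTrainer expects prompt/chosen/rejected columns, so rename accordingly.
pairs = load_dataset("Intel/orca_dpo_pairs", split="train")
pairs = pairs.rename_column("question", "prompt").remove_columns("system")
dpo = DPOTrainer(
    model="sft-out",
    args=DPOConfig(output_dir="dpo-out"),
    train_dataset=pairs,
)
dpo.train()
```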

## Intended Uses

For research, experimentation, and educational purposes where a small instruction-following model is desired.
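
As a quick start, here is a minimal `transformers` inference sketch. The repo id is inferred from this card's title and is an assumption; substitute the actual one.

```python
# Minimal inference sketch; the repo id below is inferred from the card title.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "agentlans/SmolLM2-135M-Instruct-Plus"  # assumption: adjust if needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# SmolLM2 is a chat model, so format the prompt with its chat template.
messages = [{"role": "user", "content": "Give a one-sentence definition of DNA."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```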

## Limitations

- **Hallucinations:** Prone to generating incorrect information due to its small size.
- **Repetitive Output:** May produce repetitive text.

## Training Details

Both the SFT and DPO stages share common settings: the `liger_kernel` booster, LoRA fine-tuning on a custom model, BF16 compute, a batch size of 2, and a cosine scheduler with a learning rate of 5e-5. RSLoRA is enabled with a rank of 16 and an alpha of 32.
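
In PEFT terms, the shared adapter settings correspond roughly to the configuration below (a sketch; anything not named above, such as target modules, is left at library defaults):

```python
# Approximate PEFT equivalent of the shared LoRA settings; only the values
# named in this card are specified, everything else stays at defaults.
from peft import LoraConfig

shared_adapter = LoraConfig(
    r=16,             # LoRA rank
    lora_alpha=32,    # scaling alpha
    use_rslora=True,  # rank-stabilized scaling: alpha / sqrt(r) instead of alpha / r
    task_type="CAUSAL_LM",
)
```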

The stages differ mainly in their datasets and a few settings: SFT uses the CrashCourse_120K subset with packing enabled and a LoRA dropout of 0, while DPO uses `orca_dpo_pairs` with packing disabled and a LoRA dropout of 0.95.
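
Mapped onto TRL-style configs, the shared arguments plus the per-stage differences would look roughly like this (illustrative only; the actual config files are not published):

```python
# Shared optimizer settings plus the per-stage differences described above.
from trl import SFTConfig, DPOConfig

common = dict(
    bf16=True,                      # BF16 compute
    per_device_train_batch_size=2,  # batch size of 2
    lr_scheduler_type="cosine",
    learning_rate=5e-5,
)

sft_args = SFTConfig(output_dir="sft-out", packing=True, **common)  # packing on
dpo_args = DPOConfig(output_dir="dpo-out", **common)                # packing off
# The LoRA dropout also differs per stage: 0.0 for SFT vs. 0.95 for DPO.
```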

## Evaluation

The model gives coherent and creative answers, but they are often incorrect. Thorough evaluation is recommended before deployment.