---
base_model: unsloth/llama-3.2-1b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- gguf
license: apache-2.0
language:
- en
---

# Uploaded model

- **Developed by:** forestav
- **License:** apache-2.0
- **Finetuned from model:** [unsloth/llama-3.2-1b-instruct-bnb-4bit](https://huggingface.co/unsloth/llama-3.2-1b-instruct-bnb-4bit)

## Model description

This model refines an earlier LoRA adapter trained on **unsloth/Llama-3.2-3B-Instruct** with the **FineTome-100k** dataset. This version switches to the smaller **unsloth/llama-3.2-1b-instruct-bnb-4bit** base (1B vs. 3B parameters) to speed up training and make the model easier to adapt to specific tasks, such as medical applications.

### Key adjustments:

1. **Reduced Parameter Count:** The model was downsized to 1B parameters to improve training efficiency and ease customization.
2. **Adjusted Learning Rate:** A smaller learning rate was used to prevent overfitting and mitigate catastrophic forgetting, so the model retains its general pretraining knowledge while learning the new task effectively.

The finetuning dataset, **ruslanmv/ai-medical-chatbot**, contains roughly 257k rows of narrowly medical dialogue, which necessitated careful hyperparameter tuning to avoid over-specialization.
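
As a rough illustration of how such a run can be set up, the sketch below loads the 1B base model with Unsloth, attaches a LoRA adapter, and pulls the medical dataset. The LoRA rank, target modules, and sequence length are illustrative assumptions, not the exact values used for this model.

```python
# Minimal sketch: load the 4-bit 1B base with Unsloth, attach a LoRA adapter,
# and load the medical finetuning dataset. LoRA rank/target modules below are
# illustrative assumptions, not the exact settings used for this model.
from unsloth import FastLanguageModel
from datasets import load_dataset

max_seq_length = 2048  # assumed; adjust to your hardware

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank (assumption)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# ~257k rows of patient/doctor dialogue; check the column names on the Hub
# before writing your prompt-formatting function.
dataset = load_dataset("ruslanmv/ai-medical-chatbot", split="train")
```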

---

## Hyperparameters and explanations

- **Learning rate:** `2e-5`

  A smaller learning rate reduces the risk of overfitting and catastrophic forgetting, particularly when working with models containing fewer parameters.

- **Warm-up steps:** `5`

  The learning rate is ramped up over the first 5 steps, giving the optimizer time to gather gradient statistics before training at the full learning rate, which improves stability.

- **Per device train batch size:** `2`

  Each GPU processes 2 training samples per step, which suits resource-constrained environments.

- **Gradient accumulation steps:** `4`

  Gradients are accumulated over 4 steps to simulate a larger batch size (effective batch size: 2 × 4 = 8) without exceeding memory limits.

- **Optimizer:** `AdamW with 8-bit Quantization`

  - **AdamW:** Applies decoupled weight decay to prevent overfitting.
  - **8-bit Quantization:** Reduces memory usage by compressing optimizer states, which also speeds up training.

- **Weight decay:** `0.01`

  A standard weight decay value that works well across a wide range of training setups.

- **Learning rate scheduler type:** `Linear`

  Gradually decreases the learning rate from its initial value to zero over the course of training.
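
Taken together, these settings map directly onto a Hugging Face `TrainingArguments` object, as in the sketch below. The listed values come from this card; fields such as `num_train_epochs`, `logging_steps`, and `output_dir` are placeholders, not the exact values used.

```python
# The hyperparameters above, expressed as transformers TrainingArguments.
# Fields not listed in this card (epochs, logging, output_dir) are
# placeholders, not the exact values used for this model.
from transformers import TrainingArguments

training_args = TrainingArguments(
    learning_rate=2e-5,
    warmup_steps=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size: 2 * 4 = 8
    optim="adamw_8bit",              # 8-bit AdamW via bitsandbytes
    weight_decay=0.01,
    lr_scheduler_type="linear",
    num_train_epochs=1,              # placeholder
    logging_steps=10,                # placeholder
    output_dir="outputs",            # placeholder
)
```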

---

## Quantization details

The model is saved in **16-bit GGUF format**, which:

- Ensures **100% accuracy retention** relative to the finetuned 16-bit weights (no quantization loss).
- Trades off speed and memory for improved precision compared with smaller quantized GGUF variants (e.g. 4-bit or 8-bit).
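
Below is a minimal sketch of a 16-bit GGUF export using Unsloth's `save_pretrained_gguf` helper, followed by a quick load with `llama-cpp-python` as a sanity check. The output path and the exact `quantization_method` option should be verified against the Unsloth version you use.

```python
# Export the finetuned model to 16-bit GGUF with Unsloth's helper, then load
# it with llama-cpp-python for a quick sanity check. File names are
# placeholders; verify the quantization_method option against your Unsloth
# version ("f16" at the time of writing).
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="f16")

from llama_cpp import Llama

llm = Llama(model_path="gguf_model/unsloth.F16.gguf", n_ctx=2048)  # path is an assumption
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What are common symptoms of anemia?"}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```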

### Training optimization

Training was accelerated by **2x** using [Unsloth](https://github.com/unslothai/unsloth) in combination with Hugging Face's **TRL** library.
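
In practice this means the Unsloth model from the earlier sketch is handed to TRL's `SFTTrainer` together with the `TrainingArguments` shown above. This is a sketch only: it assumes the dataset has already been formatted into a single `"text"` column, and newer TRL releases move `dataset_text_field`/`max_seq_length` onto `SFTConfig` instead of the trainer.

```python
# Wiring the Unsloth model and the medical dataset into TRL's SFTTrainer.
# Assumes `model`, `tokenizer`, `dataset`, `max_seq_length`, and
# `training_args` from the earlier sketches; the "text" column must be
# produced by your own prompt-formatting step.
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=training_args,
)

trainer.train()
```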

---

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)