leftfooted
/

LlaMa-DUSFT

	---
	license: apache-2.0
	datasets:
	- Open-Orca/OpenOrca
	base_model:
	- meta-llama/Llama-2-7b-hf
	---
	# LLaMA-2 40-layer model

	## Model Overview

	LlaMa-DUSFT is a custom variant of the LLaMA-2-7B model built with the DUS (Depth Up-Scaling) method. The original LLaMA-2-7B model has 32 layers; this variant duplicates the layer stack, trims each copy, and recombines the copies into a 40-layer architecture, which is then fine-tuned.

	### Key Modifications:

	1. Layer Splitting:

	- The 32-layer LLaMA-2-7B model was duplicated, giving two identical copies.

	- In one copy, the last 12 layers were removed, leaving 20 layers.

	- In the other copy, the first 12 layers were removed, also leaving 20 layers.

	2. Layer Merging:

	- The two resulting 20-layer segments were combined to form a 40-layer model.
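The split-and-merge steps above can be sketched as plain index arithmetic. This is a minimal illustration, not the code used to build this model: the function name `dus_layer_indices` is hypothetical, and the concatenation order (the copy that keeps the early layers first) is an assumption, since the card does not state it.

```python
def dus_layer_indices(n_layers=32, n_removed=12):
    """Return, for each layer of the merged stack, the index of the
    original layer it was copied from."""
    # Copy A: drop the last `n_removed` layers (keeps layers 0..19 by default).
    copy_a = list(range(0, n_layers - n_removed))
    # Copy B: drop the first `n_removed` layers (keeps layers 12..31 by default).
    copy_b = list(range(n_removed, n_layers))
    # Assumed order: copy A followed by copy B.
    return copy_a + copy_b


indices = dus_layer_indices()
print(len(indices))  # number of layers in the merged model
print(indices)       # source layer for each merged position
```

Under these assumptions, original layers 12 through 19 appear twice in the merged stack, while layers 0 through 11 and 20 through 31 appear once.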

	### Purpose:

	This architectural modification was designed to test whether the DUS approach with an expanded layer count improves performance compared to the standard LLaMA-2 architecture.
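For a sense of scale in that comparison, the parameter counts of the 32-layer and 40-layer stacks can be estimated from the layer shapes. The sketch below assumes the publicly documented LLaMA-2-7B dimensions (hidden size 4096, MLP intermediate size 11008, vocabulary size 32000, untied input and output embeddings, no bias terms); none of these values are stated in this card, and the helper name `count_params` is hypothetical.

```python
# Assumed LLaMA-2-7B dimensions (not stated in this model card).
HIDDEN = 4096
INTERMEDIATE = 11008
VOCAB = 32000


def count_params(n_layers):
    """Estimate total parameters for a LLaMA-2-style decoder with n_layers."""
    attention = 4 * HIDDEN * HIDDEN   # q, k, v, and output projections
    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up, and down projections
    norms = 2 * HIDDEN                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * VOCAB * HIDDEN   # input embedding plus output head
    final_norm = HIDDEN
    return n_layers * per_layer + embeddings + final_norm


base = count_params(32)
upscaled = count_params(40)
print(f"32 layers: {base / 1e9:.2f}B parameters")
print(f"40 layers: {upscaled / 1e9:.2f}B parameters")
```

With these assumed dimensions, the 8 extra layers add roughly 1.6B parameters, so any comparison against the base model is also a comparison against a smaller model.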

	## Training Details

	### Dataset:

	- The model was trained on a subset of the OpenOrca dataset, consisting of 5,000 samples.

	### Training Configuration:

	- Batch Size: 1

	- Epochs: 3

	- Optimizer: AdamW

	- Learning Rate: 5e-5

	- Platform: Google Colab Pro
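The configuration above implies a fixed number of optimizer updates. The arithmetic below is a rough sketch that assumes all 5,000 samples are used for training and that there is no gradient accumulation; the card states neither, so the real step count may differ.

```python
num_samples = 5_000     # OpenOrca subset size from this card
batch_size = 1          # from the training configuration
epochs = 3              # from the training configuration
grad_accum_steps = 1    # assumption: no gradient accumulation

# One optimizer update per (batch_size * grad_accum_steps) samples.
steps_per_epoch = num_samples // (batch_size * grad_accum_steps)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)
print(total_steps)
```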

	### Preprocessing:

	Data preprocessing followed the conventions for LLaMA-2 models, so that tokenization and alignment stay consistent with the base model.
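The card does not give the exact prompt template, so the following is only an illustrative sketch of how an OpenOrca record could be flattened into a single training string before tokenization. The `system_prompt`, `question`, and `response` field names follow the OpenOrca dataset; the `### System:` / `### User:` / `### Assistant:` layout and the `format_example` helper are hypothetical and are not the author's confirmed format.

```python
def format_example(record, eos_token="</s>"):
    """Flatten one OpenOrca-style record into a single training string.

    The section markers used here are a hypothetical template.
    """
    system = record.get("system_prompt", "").strip()
    question = record["question"].strip()
    response = record["response"].strip()

    parts = []
    if system:  # some records may have an empty system prompt
        parts.append(f"### System:\n{system}")
    parts.append(f"### User:\n{question}")
    # Append the end-of-sequence token so the model learns where to stop.
    parts.append(f"### Assistant:\n{response}{eos_token}")
    return "\n\n".join(parts)


sample = {
    "system_prompt": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
}
text = format_example(sample)
print(text)
```

The resulting string would then be passed to the base model's tokenizer, which keeps the vocabulary and special tokens consistent with LLaMA-2-7B.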

	## Results and Evaluation

	### Performance Metrics:

	- Because this model is experimental, quantitative evaluation is currently limited.

	- Initial results suggest improved adaptability on some downstream tasks drawn from the OpenOrca dataset.

	### Observations:

	- The DUS layer modification shows potential for increasing model depth without significant performance degradation.

	- Further evaluation with larger datasets and varied tasks is required to confirm generalizability.