# ShortGPT

Unofficial implementations of:
- ["ShortGPT: Layers in Large Language Models are More Redundant Than You Expect"](https://arxiv.org/pdf/2403.03853)
- ["The Unreasonable Ineffectiveness of the Deeper Layers"](https://arxiv.org/abs/2403.17887)
### To Use
- Follow the Llama 2 setup described [here](https://github.com/facebookresearch/llama).
- Reference `short_gpt/short_llama.ipynb` for the necessary function calls.
- For HuggingFace models, reference this [branch](https://github.com/sramshetty/ShortGPT/tree/hf-models).
### Details
- Use a wrapper around Llama to collect hidden states and compute BI (block influence); a minimal sketch of the computation follows this list.
- The BI implementation may change or improve if others find issues; thanks in advance for any reports!
- Sum importance values across layers while running inference on [pg19](https://huggingface.co/datasets/pg19).
  - The dataset can be slow to load from HuggingFace, so you may want to use an alternative.
- Use the sorted layer-wise importance values to determine which layers are least important and subject to removal.
- Demonstrate the *model healing* described in "The Unreasonable Ineffectiveness of the Deeper Layers" with Mistral-7B-v0.1, where finetuning with LoRA after layer removal can recover downstream performance (see the healing sketch below).
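A minimal sketch of the BI computation and layer ranking, assuming a HuggingFace-style causal LM that returns per-layer hidden states via `output_hidden_states=True` (the notebook's wrapper collects the same states from Meta's Llama implementation); `loader`, `rank_layers`, and `n_remove` are illustrative names, not the repo's API:

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> torch.Tensor:
    """BI of one block: 1 minus the cosine similarity between the block's
    input and output hidden states, averaged over all tokens.
    Both tensors have shape (batch, seq_len, hidden_dim)."""
    sim = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # (batch, seq_len)
    return (1.0 - sim).mean()

@torch.no_grad()
def rank_layers(model, loader, num_layers: int, n_remove: int = 9) -> list[int]:
    """Accumulate BI per layer over a calibration set (e.g. pg19), then
    sort ascending: the lowest-BI layers are candidates for removal."""
    importances = torch.zeros(num_layers)
    for input_ids in loader:
        # hidden_states holds num_layers + 1 tensors: the embeddings plus
        # the output of each decoder layer.
        hiddens = model(input_ids, output_hidden_states=True).hidden_states
        for i in range(num_layers):
            importances[i] += block_influence(hiddens[i], hiddens[i + 1])
    return torch.argsort(importances)[:n_remove].tolist()
```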
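And a hedged sketch of layer removal plus LoRA healing, assuming an HF Llama/Mistral-style model whose decoder stack lives at `model.model.layers` and the [peft](https://github.com/huggingface/peft) library; `layers_to_remove` is the hypothetical output of the ranking sketch above:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Drop the least important decoder layers in place.
keep = [i for i in range(len(model.model.layers)) if i not in set(layers_to_remove)]
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)

# Attach LoRA adapters and finetune briefly on a small corpus to recover
# ("heal") downstream performance after pruning.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# ...then train with any standard causal-LM loop (e.g. transformers.Trainer).
```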
### Results
Comparison of the 9 least important layers that ShortGPT removes from Llama-2-7B:

Paper: [27, 26, 25, 28, 24, 29, 23, 21, 22] \
This implementation: [25, 27, 24, 26, 28, 29, 23, 22, 21]

The same layers, in a different order.
### TODO:
- [x] Is order significant? -> The authors mention that layer order varies between datasets, but the relative orderings suggest "similar levels of importance" ([link](https://huggingface.co/papers/2403.03853#65f028667c916f24c80e93b3)).
- [x] Add more models and metrics -> Experimental support for HF models is on this [branch](https://github.com/sramshetty/ShortGPT/tree/hf-models).
- [x] Add angular distance metric (see the sketch after this list).
- [x] Demonstrate model healing using a HuggingFace model [here](https://github.com/sramshetty/ShortGPT/blob/hf-models/short_gpt/short_hf.ipynb).
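For reference, a minimal sketch of the angular distance metric from "The Unreasonable Ineffectiveness of the Deeper Layers", which scores hidden states n layers apart with arccos of their cosine similarity, normalized by π; tensor shapes follow the BI sketch above:

```python
import math
import torch
import torch.nn.functional as F

def angular_distance(x_l: torch.Tensor, x_l_plus_n: torch.Tensor) -> torch.Tensor:
    """d(x_l, x_{l+n}) = arccos(cosine similarity) / pi, averaged over
    tokens; a lower distance marks a more redundant block of n layers."""
    sim = F.cosine_similarity(x_l, x_l_plus_n, dim=-1).clamp(-1.0, 1.0)  # guard arccos domain
    return (torch.arccos(sim) / math.pi).mean()
```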
### Citations
```bibtex
@misc{men2024shortgpt,
    title={ShortGPT: Layers in Large Language Models are More Redundant Than You Expect},
    author={Xin Men and Mingyu Xu and Qingyu Zhang and Bingning Wang and Hongyu Lin and Yaojie Lu and Xianpei Han and Weipeng Chen},
    year={2024},
    eprint={2403.03853},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{gromov2024unreasonable,
    title={The Unreasonable Ineffectiveness of the Deeper Layers},
    author={Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Daniel A. Roberts},
    year={2024},
    eprint={2403.17887},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{song2024sleb,
    title={SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks},
    author={Jiwon Song and Kyungseok Oh and Taesu Kim and Hyungjun Kim and Yulhwa Kim and Jae-Joon Kim},
    year={2024},
    eprint={2402.09025},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@article{raecompressive2019,
    author={Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
    title={Compressive Transformers for Long-Range Sequence Modelling},
    journal={arXiv preprint},
    url={https://arxiv.org/abs/1911.05507},
    year={2019}
}
```