# ShortGPT

Unofficial implementations of:
- ["ShortGPT: Layers in Large Language Models are More Redundant Than You Expect"](https://arxiv.org/pdf/2403.03853)
- ["The Unreasonable Ineffectiveness of the Deeper Layers"](https://arxiv.org/abs/2403.17887)
### To Use
- Follow the Llama 2 setup described [here](https://github.com/facebookresearch/llama).
- Reference `short_gpt/short_llama.ipynb` for the necessary function calls.
- For HuggingFace models, reference this [branch](https://github.com/sramshetty/ShortGPT/tree/hf-models).
### Details
- Use a wrapper around Llama to collect hidden states and compute BI (block influence); a minimal sketch of the computation follows this list.
- The BI implementation may change or improve if others find issues; thanks in advance for any reports!
- Sum importance values across layers while running inference on [pg19](https://huggingface.co/datasets/pg19).
  - The dataset can be slow to load from HuggingFace, so you may want to use an alternative.
- Use the sorted layer-wise importance values to determine which layers are least important and subject to removal.
- Demonstrate the *model healing* described in "The Unreasonable Ineffectiveness of the Deeper Layers" with Mistral-7B-v0.1, where finetuning with LoRA after layer removal can recover downstream performance (see the healing sketch below).
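A minimal sketch of the BI computation and layer ranking, assuming a HuggingFace-style causal LM that returns per-layer hidden states via `output_hidden_states=True` (the notebook's wrapper collects the same states from Meta's Llama implementation); `loader`, `rank_layers`, and `n_remove` are illustrative names, not the repo's API:

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> torch.Tensor:
    """BI of one block: 1 minus the cosine similarity between the block's
    input and output hidden states, averaged over all tokens.
    Both tensors have shape (batch, seq_len, hidden_dim)."""
    sim = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # (batch, seq_len)
    return (1.0 - sim).mean()

@torch.no_grad()
def rank_layers(model, loader, num_layers: int, n_remove: int = 9) -> list[int]:
    """Accumulate BI per layer over a calibration set (e.g. pg19), then
    sort ascending: the lowest-BI layers are candidates for removal."""
    importances = torch.zeros(num_layers)
    for input_ids in loader:
        # hidden_states holds num_layers + 1 tensors: the embeddings plus
        # the output of each decoder layer.
        hiddens = model(input_ids, output_hidden_states=True).hidden_states
        for i in range(num_layers):
            importances[i] += block_influence(hiddens[i], hiddens[i + 1])
    return torch.argsort(importances)[:n_remove].tolist()
```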
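And a hedged sketch of layer removal plus LoRA healing, assuming an HF Llama/Mistral-style model whose decoder stack lives at `model.model.layers` and the [peft](https://github.com/huggingface/peft) library; `layers_to_remove` is the hypothetical output of the ranking sketch above:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Drop the least important decoder layers in place.
keep = [i for i in range(len(model.model.layers)) if i not in set(layers_to_remove)]
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)

# Attach LoRA adapters and finetune briefly on a small corpus to recover
# ("heal") downstream performance after pruning.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# ...then train with any standard causal-LM loop (e.g. transformers.Trainer).
```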
### Results
Comparison of the 9 least important layers that ShortGPT removes from Llama-2-7B:

Paper: [27, 26, 25, 28, 24, 29, 23, 21, 22] \
This implementation: [25, 27, 24, 26, 28, 29, 23, 22, 21]

The same layers, in a different order.
### TODO:
- [x] Is order significant? -> The authors mention that layer order varies between datasets, but the relative orderings suggest "similar levels of importance" ([link](https://huggingface.co/papers/2403.03853#65f028667c916f24c80e93b3)).
- [x] Add more models and metrics -> Experimental support for HF models is on this [branch](https://github.com/sramshetty/ShortGPT/tree/hf-models).
- [x] Add angular distance metric (see the sketch after this list).
- [x] Demonstrate model healing using a HuggingFace model [here](https://github.com/sramshetty/ShortGPT/blob/hf-models/short_gpt/short_hf.ipynb).
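For reference, a minimal sketch of the angular distance metric from "The Unreasonable Ineffectiveness of the Deeper Layers", which scores hidden states n layers apart with arccos of their cosine similarity, normalized by π; tensor shapes follow the BI sketch above:

```python
import math
import torch
import torch.nn.functional as F

def angular_distance(x_l: torch.Tensor, x_l_plus_n: torch.Tensor) -> torch.Tensor:
    """d(x_l, x_{l+n}) = arccos(cosine similarity) / pi, averaged over
    tokens; a lower distance marks a more redundant block of n layers."""
    sim = F.cosine_similarity(x_l, x_l_plus_n, dim=-1).clamp(-1.0, 1.0)  # guard arccos domain
    return (torch.arccos(sim) / math.pi).mean()
```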
### Citations
```bibtex
@misc{men2024shortgpt,
    title={ShortGPT: Layers in Large Language Models are More Redundant Than You Expect},
    author={Xin Men and Mingyu Xu and Qingyu Zhang and Bingning Wang and Hongyu Lin and Yaojie Lu and Xianpei Han and Weipeng Chen},
    year={2024},
    eprint={2403.03853},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{gromov2024unreasonable,
    title={The Unreasonable Ineffectiveness of the Deeper Layers},
    author={Andrey Gromov and Kushal Tirumala and Hassan Shapourian and Paolo Glorioso and Daniel A. Roberts},
    year={2024},
    eprint={2403.17887},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{song2024sleb,
    title={SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks},
    author={Jiwon Song and Kyungseok Oh and Taesu Kim and Hyungjun Kim and Yulhwa Kim and Jae-Joon Kim},
    year={2024},
    eprint={2402.09025},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@article{raecompressive2019,
    author={Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
    title={Compressive Transformers for Long-Range Sequence Modelling},
    journal={arXiv preprint},
    url={https://arxiv.org/abs/1911.05507},
    year={2019}
}
```