---
title: README
emoji: π
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
license: mit
short_description: Compressed Large Language Models
---
# Compressed Large Language Models

This repo contains the compressed LLMs used in the [Decoding Compressed Trust](https://decoding-comp-trust.github.io/) project.
The models are prepared by the [Visual Informatics Group @ University of Texas at Austin (VITA-group)](https://vita-group.github.io/) and
the [Center for Applied Scientific Computing](https://computing.llnl.gov/casc) at [LLNL](https://www.llnl.gov/).

License: [MIT License](https://opensource.org/license/mit/)

Simplified lists:
* Models: Llama-2 13b, Llama-2 chat 13b, Vicuna 13b v1.3
* Compression methods:
  - Pruning: Magnitude-based, Wanda, SparseGPT (2:4 semi-structured)
  - Quantization: AWQ, GPTQ (3,4,8 bits)
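
For readers unfamiliar with the "2:4 semi-structured" pattern listed above, here is a minimal sketch of what it means: in every group of 4 consecutive weights, at most 2 are kept non-zero. The sketch uses plain weight magnitude as the selection criterion and a hypothetical helper name, `prune_2_4`; it is an illustration only, not the code used to produce the models in this hub, and Wanda and SparseGPT select weights using other criteria.

```python
def prune_2_4(weights):
    """Zero the 2 smallest-magnitude values in every group of 4 consecutive weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_4(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

The result is exactly 50% sparse, and the zeros follow a regular layout rather than being scattered freely, which is what distinguishes this pattern from unstructured pruning.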

Setup environment

```shell
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # for gptq
```
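
After installing, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the `EXPECTED` table and the `check_versions` helper are hypothetical names introduced here, and the pins are copied from the pip commands above (packages without a pin are only checked for presence).

```python
from importlib import metadata

# Pins taken from the pip commands above; None means no version was pinned.
EXPECTED = {
    "torch": "2.0.0",
    "transformers": "4.31.0",
    "accelerate": None,
    "auto-gptq": None,
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local version tags such as "+cu117" are ignored in the comparison.
        ok = installed is not None and (pin is None or installed.split("+")[0] == pin)
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_versions(EXPECTED).items():
    print(f"{name}: {installed or 'not installed'} [{'ok' if ok else 'check'}]")
```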

## How to use models

How to use pruned models

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2  # compression degree; used to build the revision name below
model_path = f'compressed-llm/{base_model}_{comp_method}'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)
# Load the tokenizer from the original (uncompressed) base model.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```
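
The pruned-model example above builds the repository name and the revision from three variables. A minimal sketch of that naming as a reusable helper follows; `pruned_model_spec` is a hypothetical name, and the pattern is inferred from that single example, so other combinations of model, method, and degree are not guaranteed to exist on the Hub.

```python
def pruned_model_spec(base_model, comp_method, comp_degree):
    """Return (repo_id, revision) following the naming used in the example above."""
    repo_id = f"compressed-llm/{base_model}_{comp_method}"
    revision = f"s{comp_degree}"
    return repo_id, revision


repo_id, revision = pruned_model_spec("llama-2-7b", "magnitude_unstructured", 0.2)
print(repo_id, revision)
# compressed-llm/llama-2-7b_magnitude_unstructured s0.2
```

The two returned values can be passed to `AutoModelForCausalLM.from_pretrained` as the model path and the `revision` argument, respectively.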

How to use wanda+gptq models

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # alternative to disable_exllama=True
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```

How to use gptq models

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'compressed-llm/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # alternative to disable_exllama=True
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # Hub revision selecting the quantization configuration
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
```

## Citations

If you are using models in this hub, please consider citing our papers.

```bibtex
@article{hong2024comptrust,
  title={Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression},
  author={Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng
          and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James
          and Bartoldson, Brian and Jaiswal, Ajay and Xu, Kaidi and Kailkhura, Bhavya
          and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo},
  journal={arXiv},
  year={2024}
}
```

Some of the models were used in previous publications.

```bibtex
@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}
```

## Acknowledgement

Main credits go to Ajay Jaiswal, Jinhao Duan, Zhangheng Li, and Junyuan Hong. We also thank Zhenyu Zhang, Lu Yin, and Shiwei Liu for their help with some of the preparations.

For any questions, please contact [Junyuan Hong](mailto:jyhong@utexas.edu).