| title: "Run a ChatGPT-like Chatbot on a Single GPU with ROCm" | |
| thumbnail: /blog/assets/chatbot-amd-gpu/thumbnail.png | |
| authors: | |
| - user: andyll7772 | |
| guest: true | |
| # Run a ChatGPT-like Chatbot on a Single GPU with ROCm | |
| ## Introduction | |
| ChatGPT, OpenAI's groundbreaking language model, has become an | |
| influential force in the realm of artificial intelligence, paving the | |
| way for a multitude of AI applications across diverse sectors. With its | |
| staggering ability to comprehend and generate human-like text, ChatGPT | |
| has transformed industries, from customer support to creative writing, | |
| and has even served as an invaluable research tool. | |
| Several efforts have produced open-source large language models that | |
| demonstrate strong capabilities at smaller sizes, such as | |
| [OPT](https://huggingface.co/docs/transformers/model_doc/opt), | |
| [LLaMA](https://github.com/facebookresearch/llama), | |
| [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and | |
| [Vicuna](https://github.com/lm-sys/FastChat). | |
| In this blog, we will delve into the world of Vicuna, and explain how to | |
| run the Vicuna 13B model on a single AMD GPU with ROCm. | |
| **What is Vicuna?** | |
| Vicuna is an open-source chatbot with 13 billion parameters, developed | |
| by a team from UC Berkeley, CMU, Stanford, and UC San Diego. To create | |
| Vicuna, a LLaMA base model was fine-tuned using about 70K user-shared | |
| conversations collected from ShareGPT.com via public APIs. According to | |
| initial assessments where GPT-4 is used as a reference, Vicuna-13B | |
| achieves more than 90%\* of the quality of OpenAI ChatGPT. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/01.png" style="width: 60%; height: auto;"> | |
| </p> | |
| It was released on [GitHub](https://github.com/lm-sys/FastChat) on Apr | |
| 11, just a few weeks ago. It is worth mentioning that the dataset, | |
| training code, evaluation metrics, and training cost of Vicuna are all public. Its total training cost was just | |
| around \$300, making it a cost-effective solution for the general public. | |
| For more details about Vicuna, please check out | |
| <https://vicuna.lmsys.org>. | |
| **Why do we need a quantized GPT model?** | |
| Running the Vicuna-13B model in fp16 requires around 28GB of GPU RAM. To | |
| reduce the memory footprint further, optimization techniques are | |
| required. A recent research paper, GPTQ, | |
| proposes accurate post-training quantization for GPT models at lower | |
| bit precision. As illustrated below, for models with more than | |
| 10B parameters, 4-bit or 3-bit GPTQ can achieve accuracy comparable | |
| to fp16. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/02.png" style="width: 70%; height: auto;"> | |
| </p> | |
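The memory argument above can be checked with back-of-the-envelope arithmetic. The following is a minimal sketch, not a measurement: it counts weight storage only, and the per-group overhead (one 16-bit scale and one 4-bit zero point per group of 128 weights) is an assumption made for illustration rather than a description of any particular GPTQ file format. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight, groupsize=None,
                     scale_bits=16, zero_bits=4):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    groupsize, scale_bits and zero_bits model the per-group metadata of a
    group-wise quantizer; the default sizes are illustrative assumptions.
    """
    total_bits = n_params * bits_per_weight
    if groupsize is not None:
        n_groups = n_params / groupsize
        total_bits += n_groups * (scale_bits + zero_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 13e9  # Vicuna-13B, rounded to 13 billion parameters

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4, groupsize=128)

print(f"fp16 weights : {fp16_gb:.2f} GB")
print(f"4-bit weights: {int4_gb:.2f} GB")
```

The estimate covers weights only, so it comes out below the footprints quoted in this post; the difference is presumably runtime buffers and layers that stay unquantized.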
| Moreover, the large number of parameters in these models severely hurts | |
| latency, because GPT token generation is limited more by | |
| memory bandwidth (GB/s) than by computation (TFLOPs or TOPs). For this | |
| reason, when the GPU is memory bound, a quantized model does not degrade | |
| token generation latency. | |
| Refer to [the GPTQ quantization paper](<https://arxiv.org/abs/2210.17323>) and [GitHub repo](<https://github.com/IST-DASLab/gptq>). | |
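To make the memory-bound argument concrete, here is a rough sketch of a lower bound on per-token latency, assuming every weight is read from GPU memory once per generated token and nothing else costs time. The bandwidth value is an illustrative placeholder, not a measured figure for the GPUs tested in this post, and the function name is hypothetical.

```python
def min_token_latency_ms(weight_bytes, bandwidth_gb_per_s):
    """Lower bound on per-token latency if reading the weights once
    from GPU memory were the only cost (a deliberately crude model)."""
    seconds = weight_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e3


# Illustrative placeholder; substitute the bandwidth of your own GPU.
BANDWIDTH_GB_S = 512.0

fp16_ms = min_token_latency_ms(26e9, BANDWIDTH_GB_S)   # 13B params x 2 bytes
int4_ms = min_token_latency_ms(6.5e9, BANDWIDTH_GB_S)  # 13B params x 0.5 byte

print(f"fp16 lower bound : {fp16_ms:.1f} ms/token")
print(f"4-bit lower bound: {int4_ms:.1f} ms/token")
```

Real kernels also spend time on dequantization and the matmul itself, so measured latencies will not shrink by the full ratio; the sketch only shows why moving fewer bytes leaves room to absorb that extra work.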
| By leveraging this technique, several 4-bit quantized Vicuna models are | |
| available from Hugging Face, as shown below: | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/03.png" style="width: 50%; height: auto;"> | |
| </p> | |
| ## Running Vicuna 13B Model on AMD GPU with ROCm | |
| To run the Vicuna 13B model on an AMD GPU, we need to leverage the power | |
| of ROCm (Radeon Open Compute), an open-source software platform that | |
| provides AMD GPU acceleration for deep learning and high-performance | |
| computing applications. | |
| Here's a step-by-step guide on how to set up and run the Vicuna 13B | |
| model on an AMD GPU with ROCm: | |
| **System Requirements** | |
| Before diving into the installation process, ensure that your system | |
| meets the following requirements: | |
| - An AMD GPU that supports ROCm (check the compatibility list on | |
| docs.amd.com page) | |
| - A Linux-based operating system, preferably Ubuntu 18.04 or 20.04 | |
| - Conda or Docker environment | |
| - Python 3.6 or higher | |
| For more information, please check out <https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/Prerequisites.html>. | |
| This example has been tested on [**Instinct | |
| MI210**](https://www.amd.com/en/products/server-accelerators/amd-instinct-mi210) | |
| and [**Radeon | |
| RX6900XT**](https://www.amd.com/en/products/graphics/amd-radeon-rx-6900-xt) | |
| GPUs with ROCm 5.4.3 and PyTorch 2.0. | |
| **Quick Start** | |
| **1 ROCm installation and Docker container setup (Host machine)** | |
| **1.1 ROCm installation** | |
| The following is for ROCm 5.4.3 and Ubuntu 22.04. Please modify it | |
| according to your target ROCm and Ubuntu versions, following: | |
| <https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html> | |
| ``` | |
| sudo apt update && sudo apt upgrade -y | |
| wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/jammy/amdgpu-install_5.4.50403-1_all.deb | |
| sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb | |
| sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms | |
| sudo amdgpu-install --list-usecase | |
| sudo reboot | |
| ``` | |
| **1.2 ROCm installation verification** | |
| ``` | |
| rocm-smi | |
| sudo rocminfo | |
| ``` | |
| **1.3 Pull the Docker image and run a Docker container** | |
| The following uses PyTorch 2.0 on ROCm 5.4.2. Please use the | |
| appropriate Docker image for your target ROCm and PyTorch | |
| versions: <https://hub.docker.com/r/rocm/pytorch/tags> | |
| ``` | |
| docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview | |
| sudo docker run --device=/dev/kfd --device=/dev/dri --group-add video \ | |
| --shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \ | |
| --ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER} \ | |
| rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview | |
| ``` | |
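Once inside the container, it can help to confirm that PyTorch sees the AMD GPU before moving on. The sketch below is one way to do that, assuming a ROCm build of PyTorch, which exposes the GPU through the `torch.cuda` API and reports its HIP version in `torch.version.hip`. The helper name `describe_gpu` is hypothetical, and the function falls back gracefully when PyTorch is not installed.

```python
def describe_gpu():
    """Return a small report on what PyTorch can see, without raising."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None on builds that were not compiled for ROCm/HIP.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


gpu_report = describe_gpu()
for key, value in gpu_report.items():
    print(f"{key}: {value}")
```

If `gpu_available` is false inside the container, the `--device=/dev/kfd --device=/dev/dri` flags and the `video` group membership from the `docker run` command are the first things to check.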
| **2 Model quantization and model inference (inside the Docker container)** | |
| You can either download the quantized Vicuna-13b model from Hugging Face or | |
| quantize the floating-point model. Please check out **Appendix - GPTQ | |
| model quantization** if you want to quantize the floating-point model. | |
| **2.1 Download the quantized Vicuna-13b model** | |
| Use the download-model.py script from the following git repo. | |
| ``` | |
| git clone https://github.com/oobabooga/text-generation-webui.git | |
| cd text-generation-webui | |
| python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g | |
| ``` | |
| **2.2 Running the Vicuna 13B GPTQ Model on AMD GPU** | |
| ``` | |
| git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda | |
| cd GPTQ-for-LLaMa | |
| python setup_cuda.py install | |
| ``` | |
| These commands compile the HIPIFIED CUDA-equivalent kernels and link | |
| the resulting binaries to | |
| Python as C extensions. The kernels of this implementation are composed | |
| of dequantization + FP32 Matmul. If you want to use dequantization + | |
| FP16 Matmul for additional speed-up, please check out **Appendix - GPTQ | |
| Dequantization + FP16 Matmul kernel for AMD GPUs**. | |
| ``` | |
| # run from the GPTQ-for-LLaMa directory built in the previous step | |
| # model inference | |
| python llama_inference.py ../../models/vicuna-13b --wbits 4 --load \ | |
| ../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text "Your input text here" | |
| ``` | |
| Now that you have everything set up, it's time to run the Vicuna 13B | |
| model on your AMD GPU. Use the commands above to run the model. Replace | |
| *"Your input text here"* with the text you want to use as input for | |
| the model. If everything is set up correctly, you should see the model | |
| generating output text based on your input. | |
| **3. Expose the quantized Vicuna model to the Web API server** | |
| Change the path of GPTQ python modules (GPTQ-for-LLaMa) in the following | |
| line: | |
| <https://github.com/thisserand/FastChat/blob/4a57c928a906705404eae06f7a44b4da45828487/fastchat/serve/load_gptq_model.py#L7> | |
| To launch the web UI from the gradio library, you need to set up the | |
| controller, the worker (Vicuna model worker), and the web server by running them as | |
| background jobs. | |
| ``` | |
| nohup python -W ignore::UserWarning -m fastchat.serve.controller & | |
| nohup python -W ignore::UserWarning -m fastchat.serve.model_worker --model-path /path/to/quantized_vicuna_weights \ | |
| --model-name vicuna-13b-quantization --wbits 4 --groupsize 128 & | |
| nohup python -W ignore::UserWarning -m fastchat.serve.gradio_web_server & | |
| ``` | |
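Because the three processes run in the background under `nohup`, a quick way to see whether they came up is to probe their ports. The sketch below assumes the upstream FastChat defaults (21001 for the controller, 21002 for the model worker) and the gradio default of 7860; the fork used here may differ, so treat these port numbers as assumptions and adjust them to your setup. The helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed default ports; change them if you passed --port to any service.
ASSUMED_PORTS = {
    "controller": 21001,
    "model_worker": 21002,
    "gradio_web_server": 7860,
}

for name, port in ASSUMED_PORTS.items():
    state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
    print(f"{name:18s} port {port}: {state}")
```

If a service is not reachable, the `nohup.out` file in the directory where the commands were launched is the first place to look for the error.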
| Now the 4-bit quantized Vicuna-13B model fits in the memory of the | |
| RX6900XT GPU, which has 16GB. Only 7.52GB (about 47% of 16GB) is | |
| needed to run the 13B model, whereas the model needs more than 28GB | |
| in the fp16 datatype. The latency and accuracy penalties are | |
| also minimal, and the related metrics are provided at the end of | |
| this article. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/04.png" style="width: 60%; height: auto;"> | |
| </p> | |
| **Test the quantized Vicuna model in the Web API server** | |
| Let us give it a try. First, let us use fp16 Vicuna model for language | |
| translation. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/05.png" style="width: 80%; height: auto;"> | |
| </p> | |
| It does a better job than me. Next, let us ask something about soccer. The answer looks good to me. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/06.png" style="width: 80%; height: auto;"> | |
| </p> | |
| When we switch to the 4-bit model, for the same question, the answer is | |
| a bit different. There is a duplicated “Lionel Messi” in it. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/07.png" style="width: 80%; height: auto;"> | |
| </p> | |
| **Vicuna fp16 and 4bit quantized model comparison** | |
| Test environment: | |
| \- GPU: Instinct MI210, RX6900XT | |
| \- python: 3.10 | |
| \- pytorch: 2.1.0a0+gitfa08e54 | |
| \- rocm: 5.4.3 | |
| **Metrics - Model size (GB)** | |
| - Model parameter size. When the models are preloaded into GPU memory, the | |
| actual memory consumption is larger than the model itself because of caching | |
| for input and output tokens. | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/08.png" style="width: 70%; height: auto;"> | |
| </p> | |
| **Metrics – Accuracy (PPL: Perplexity)** | |
| - Measured on 2048 examples of C4 | |
| (<https://paperswithcode.com/dataset/c4>) dataset | |
| - Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul | |
| - Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul | |
| - Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/09.png" style="width: 70%; height: auto;"> | |
| </p> | |
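For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so lower is better. The following minimal sketch shows the formula on made-up log-probabilities; it is not the evaluation script used for the numbers in this post.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs holds the natural-log probability the model assigned
    to each reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 choices at every step
# has a perplexity of about 4: it is "as confused as" a 4-way guess.
uniform_ppl = perplexity([math.log(0.25)] * 8)
print(f"uniform 4-way model: {uniform_ppl:.3f}")

# A more confident model scores lower.
confident_ppl = perplexity([math.log(0.9), math.log(0.8), math.log(0.95)])
print(f"confident model    : {confident_ppl:.3f}")
```

A quantized model whose perplexity stays close to the fp16 baseline is assigning nearly the same probabilities to the reference text, which is the sense in which accuracy is preserved.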
| **Metrics – Latency (Token generation latency, ms)** | |
| - Measured during token generation phases. | |
| - Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul | |
| - Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul | |
| - Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/10.png" style="width: 70%; height: auto;"> | |
| </p> | |
| ## Conclusion | |
| Large language models (LLMs) have made significant advancements in | |
| chatbot systems, as seen in OpenAI’s ChatGPT. Vicuna-13B, an open-source | |
| LLM, has been developed and has demonstrated excellent capability and quality. | |
| By following this guide, you should now have a better understanding of | |
| how to set up and run the Vicuna 13B model on an AMD GPU with ROCm. This | |
| will enable you to unlock the full potential of this cutting-edge | |
| language model for your research and personal projects. | |
| Thanks for reading! | |
| ## Appendix - GPTQ model quantization | |
| **Building Vicuna quantized model from the floating-point LLaMA model** | |
| **a. Download LLaMA and Vicuna delta models from Hugging Face** | |
| The developers of Vicuna (lmsys) provide only delta models that can be | |
| applied to the LLaMA model. Download LLaMA in Hugging Face format and the | |
| Vicuna delta parameters from Hugging Face individually. Currently, 7b and | |
| 13b delta models of Vicuna are available. | |
| <https://huggingface.co/models?sort=downloads&search=huggyllama> | |
| <https://huggingface.co/models?sort=downloads&search=lmsys> | |
| <p align="center"> | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/13.png" style="width: 60%; height: auto;"> | |
| </p> | |
| **b. Convert LLaMA to Vicuna by using Vicuna-delta model** | |
| ``` | |
| git clone https://github.com/lm-sys/FastChat | |
| cd FastChat | |
| ``` | |
| Convert the LLaMA parameters by using this command: | |
| (Note: do not use vicuna-{7b, 13b}-\*delta-v0 because its vocab_size is | |
| different from that of LLaMA and the model cannot be converted) | |
| ``` | |
| python -m fastchat.model.apply_delta --base /path/to/llama-13b --delta lmsys/vicuna-13b-delta-v1.1 \ | |
| --target ./vicuna-13b | |
| ``` | |
| Now Vicuna-13b model is ready. | |
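Conceptually, applying a delta model means adding the delta tensors to the matching base tensors, one by one. The sketch below illustrates that idea on plain Python lists; it is a simplified stand-in and not FastChat's `apply_delta` implementation, and the tensor names in the example are made up. It also shows why mismatched shapes, such as a different vocab_size, stop the conversion.

```python
def apply_delta(base_weights, delta_weights):
    """Add each delta tensor to the base tensor of the same name.

    Both arguments map tensor names to flat lists of floats. Mismatched
    names or lengths raise an error instead of being merged silently.
    """
    if base_weights.keys() != delta_weights.keys():
        raise ValueError("base and delta must contain the same tensor names")

    merged = {}
    for name, base in base_weights.items():
        delta = delta_weights[name]
        if len(base) != len(delta):
            raise ValueError(f"shape mismatch for tensor '{name}'")
        merged[name] = [b + d for b, d in zip(base, delta)]
    return merged


# Hypothetical two-tensor "model" for illustration only.
base = {"embed": [1.0, 2.0], "proj": [0.5, 0.5, 0.5]}
delta = {"embed": [0.5, -0.5], "proj": [0.0, 0.25, -0.25]}

merged = apply_delta(base, delta)
print(merged)
```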
| **c. Quantize Vicuna to 2/3/4 bits** | |
| To apply the GPTQ to LLaMA and Vicuna, | |
| ``` | |
| git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda | |
| cd GPTQ-for-LLaMa | |
| ``` | |
| (Note: do not use <https://github.com/qwopqwop200/GPTQ-for-LLaMa> for | |
| now, because the 2-, 3- and 4-bit quantization + MatMul kernels implemented in that | |
| repo do not parallelize the dequant+matmul and hence show lower token | |
| generation performance.) | |
| Quantize the Vicuna-13b model with this command. Calibration for the quantization uses the c4 | |
| dataset, but you can also use other datasets, such as wikitext2. | |
| (Note: try different combinations of wbits and groupsize. Under some | |
| combinations, model accuracy can be increased | |
| significantly.) | |
| ``` | |
| python llama.py ./Vicuna-13b c4 --wbits 4 --true-sequential --act-order \ | |
| --save_safetensors Vicuna-13b-4bit-act-order.safetensors | |
| ``` | |
| Now the model is ready and saved as | |
| **Vicuna-13b-4bit-act-order.safetensors**. | |
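To give a feel for what the `--wbits` and `--groupsize` options control, here is a minimal sketch of group-wise quantization using plain round-to-nearest. This is deliberately simpler than GPTQ, which additionally compensates for rounding error using calibration data; the function names are hypothetical and the input values are made up.

```python
def quantize_group(values, wbits=4):
    """Map one group of floats to integer codes in [0, 2**wbits - 1]."""
    qmax = 2 ** wbits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


def quantize(values, wbits=4, groupsize=128):
    """Quantize a flat weight list group by group; each group keeps its
    own scale and offset, which is what a smaller groupsize buys."""
    return [
        quantize_group(values[start:start + groupsize], wbits)
        for start in range(0, len(values), groupsize)
    ]


weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 2.0, 3.0]
groups = quantize(weights, wbits=4, groupsize=4)

restored = []
for codes, scale, lo in groups:
    restored.extend(dequantize_group(codes, scale, lo))

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_error:.4f}")
```

Smaller groups track the local range of the weights more closely at the cost of storing more scales, which is the trade-off explored when trying different wbits and groupsize combinations.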
| **GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs** | |
| The more optimized kernel implementation in | |
| <https://github.com/oobabooga/GPTQ-for-LLaMa/blob/57a26292ed583528d9941e79915824c5af012279/quant_cuda_kernel.cu#L891> | |
| targets the A100 GPU and is not compatible with the ROCm 5.4.3 HIPIFY | |
| toolkits. It needs to be modified as follows. The same applies to the | |
| VecQuant2MatMulKernelFaster, VecQuant3MatMulKernelFaster, and | |
| VecQuant4MatMulKernelFaster kernels. | |
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/14.png" style="width: 100%; height: auto;"> | |
| For convenience, all the modified code is available in a [GitHub Gist](https://gist.github.com/seungrokjung/110943b70503732c4a398607e1cbdd6c). | |