	---
	title: "Run a ChatGPT-like Chatbot on a Single GPU with ROCm"
	thumbnail: /blog/assets/chatbot-amd-gpu/thumbnail.png
	authors:
	- user: andyll7772
	guest: true
	---

	# Run a ChatGPT-like Chatbot on a Single GPU with ROCm

	## Introduction

	ChatGPT, OpenAI's groundbreaking language model, has become an
	influential force in the realm of artificial intelligence, paving the
	way for a multitude of AI applications across diverse sectors. With its
	staggering ability to comprehend and generate human-like text, ChatGPT
	has transformed industries, from customer support to creative writing,
	and has even served as an invaluable research tool.

	Various efforts have been made to provide open-source large language
	models that demonstrate strong capabilities at smaller sizes, such as
	[OPT](https://huggingface.co/docs/transformers/model_doc/opt),
	[LLaMA](https://github.com/facebookresearch/llama),
	[Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and
	[Vicuna](https://github.com/lm-sys/FastChat).

	In this blog, we will delve into the world of Vicuna, and explain how to
	run the Vicuna 13B model on a single AMD GPU with ROCm.

	What is Vicuna?

	Vicuna is an open-source chatbot with 13 billion parameters, developed
	by a team from UC Berkeley, CMU, Stanford, and UC San Diego. To create
	Vicuna, a LLaMA base model was fine-tuned on about 70K user-shared
	conversations collected from ShareGPT.com via public APIs. According to
	a preliminary evaluation that uses GPT-4 as the judge, Vicuna-13B
	achieves more than 90%\* of the quality of OpenAI ChatGPT.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/01.png" style="width: 60%; height: auto;">
	</p>

	It was released on [GitHub](https://github.com/lm-sys/FastChat) on Apr
	11, just a few weeks ago. Notably, the dataset, training code,
	evaluation metrics, and training cost are all disclosed for Vicuna. Its
	total training cost was only around \$300, making it a cost-effective
	solution for the general public.

	For more details about Vicuna, please check out
	<https://vicuna.lmsys.org>.

	Why do we need a quantized GPT model?

	Running the Vicuna-13B model in fp16 requires around 28GB of GPU RAM. To
	reduce the memory footprint further, optimization techniques are
	required. A recent research paper, GPTQ, proposes accurate
	post-training quantization for GPT models at lower bit precision. As
	illustrated below, for models with more than 10B parameters, 4-bit or
	3-bit GPTQ can achieve accuracy comparable to fp16.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/02.png" style="width: 70%; height: auto;">
	</p>
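
As a rough back-of-the-envelope check on the memory figures above, the sketch below estimates the storage needed for the weights alone at different bit widths. It assumes a parameter count rounded to 13 billion and ignores activations, caches, and per-group quantization metadata, so the results are lower than what a running model actually consumes.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 13e9  # assumption: parameter count rounded to 13 billion

for name, bits in [("fp16", 16), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{name:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The fp16 estimate lands a little under the 28GB quoted above, which is expected because runtime buffers are not counted here.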

	Moreover, the large parameter count of these models severely hurts
	latency, because GPT token generation is limited more by memory
	bandwidth (GB/s) than by compute throughput (TFLOPs or TOPs). For this
	reason, when the GPU is memory bound, a quantized model does not degrade
	token generation latency.
	For details, refer to [the GPTQ quantization paper](<https://arxiv.org/abs/2210.17323>) and the [GitHub repo](<https://github.com/IST-DASLab/gptq>).
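
To make the memory-bound argument concrete, here is a minimal sketch that computes a lower bound on per-token latency, assuming every weight is read from GPU memory exactly once per generated token. The 512 GB/s bandwidth is an illustrative assumption, not a figure from this article, and the bound ignores dequantization and compute cost, so measured latencies will differ.

```python
def min_token_latency_ms(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token."""
    return model_bytes / bandwidth_bytes_per_s * 1e3


BANDWIDTH = 512e9        # assumption: illustrative peak memory bandwidth (512 GB/s)
FP16_BYTES = 13e9 * 2    # 13B parameters at 2 bytes each
INT4_BYTES = 13e9 * 0.5  # 13B parameters at 4 bits each

print(f"fp16 : {min_token_latency_ms(FP16_BYTES, BANDWIDTH):.1f} ms/token")
print(f"4-bit: {min_token_latency_ms(INT4_BYTES, BANDWIDTH):.1f} ms/token")
```

Under this idealized model, moving fewer bytes per token can only help; whether a real kernel realizes that gain depends on how cheaply it dequantizes.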

	By leveraging this technique, several 4-bit quantized Vicuna models are
	available from Hugging Face, as shown below:

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/03.png" style="width: 50%; height: auto;">
	</p>

	## Running Vicuna 13B Model on AMD GPU with ROCm

	To run the Vicuna 13B model on an AMD GPU, we need to leverage the power
	of ROCm (Radeon Open Compute), an open-source software platform that
	provides AMD GPU acceleration for deep learning and high-performance
	computing applications.

	Here's a step-by-step guide on how to set up and run the Vicuna 13B
	model on an AMD GPU with ROCm:

	System Requirements

	Before diving into the installation process, ensure that your system
	meets the following requirements:

	- An AMD GPU that supports ROCm (check the compatibility list on
	the docs.amd.com page)

	- A Linux-based operating system, preferably Ubuntu 18.04 or 20.04

	- Conda or Docker environment

	- Python 3.6 or higher

	For more information, please check out <https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/Prerequisites.html>.

	This example has been tested on [**Instinct
	MI210**](https://www.amd.com/en/products/server-accelerators/amd-instinct-mi210)
	and [**Radeon
	RX6900XT**](https://www.amd.com/en/products/graphics/amd-radeon-rx-6900-xt)
	GPUs with ROCm 5.4.3 and PyTorch 2.0.

	Quick Start

	1 ROCm installation and Docker container setup (Host machine)

	1.1 ROCm installation

	The following is for ROCm 5.4.3 and Ubuntu 22.04. Please adjust the
	commands for your target ROCm and Ubuntu versions by following:
	<https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html>

	```
	sudo apt update && sudo apt upgrade -y
	wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/jammy/amdgpu-install_5.4.50403-1_all.deb
	sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb
	sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms
	sudo amdgpu-install --list-usecase
	sudo reboot
	```

	1.2 ROCm installation verification
	```
	rocm-smi
	sudo rocminfo
	```
	1.3 Pull the Docker image and run a Docker container

	The following uses PyTorch 2.0 on ROCm 5.4.2. Please use the
	appropriate Docker image for your target ROCm and PyTorch
	versions: <https://hub.docker.com/r/rocm/pytorch/tags>
	```
	docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview

	sudo docker run --device=/dev/kfd --device=/dev/dri --group-add video \
	--shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
	--ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER} \
	rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
	```
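
Once inside the container, it can help to confirm that PyTorch sees the GPU before moving on. The helper below is a small sketch (the function name `describe_gpu` is ours, not part of any library). It relies on ROCm builds of PyTorch exposing AMD GPUs through the `torch.cuda` API and setting `torch.version.hip`.

```python
def describe_gpu() -> dict:
    """Report whether PyTorch is importable and whether it can see a GPU."""
    info = {
        "torch_installed": False,
        "gpu_available": False,
        "hip_version": None,
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_version"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        info["gpu_available"] = True
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_gpu())
```

If `gpu_available` is false inside the container, the `--device=/dev/kfd --device=/dev/dri` options of the `docker run` command are a reasonable first thing to double-check.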
	2 Model quantization and model inference (inside the Docker container)

	You can either download the quantized Vicuna-13b model from Hugging Face
	or quantize the floating-point model yourself. Please check out
	**Appendix - GPTQ model quantization** if you want to quantize the
	floating-point model.

	2.1 Download the quantized Vicuna-13b model

	Use the `download-model.py` script from the following git repo.
	```
	git clone https://github.com/oobabooga/text-generation-webui.git
	cd text-generation-webui
	python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
	```
	2.2 Running the Vicuna 13B GPTQ model on an AMD GPU
	```
	git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
	cd GPTQ-for-LLaMa
	python setup_cuda.py install
	```
	These commands will compile and link the hipified CUDA-equivalent kernel
	binaries to Python as C extensions. The kernels of this implementation
	are composed of dequantization + FP32 Matmul. If you want to use
	dequantization + FP16 Matmul for additional speed-up, please check out
	**Appendix - GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs**.
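
To illustrate what "dequantization + Matmul" means, here is a simplified pure-Python sketch of a fused dequantize-and-accumulate loop for one group of 4-bit weights. The packing order (eight 4-bit values per 32-bit word, lowest nibble first) and the single scale and zero point per group are assumptions made for illustration; they are not taken from the GPTQ-for-LLaMa kernels.

```python
from typing import List


def pack_int4(values: List[int]) -> int:
    """Pack eight 4-bit integers (0..15) into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word: int) -> List[int]:
    """Inverse of pack_int4."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequant_dot(x: List[float], packed: List[int], scale: float, zero: int) -> float:
    """Dequantize packed 4-bit weights on the fly and accumulate a dot product."""
    acc = 0.0
    for block, word in enumerate(packed):
        for j, q in enumerate(unpack_int4(word)):
            w = scale * (q - zero)       # dequantization step
            acc += x[8 * block + j] * w  # multiply-accumulate step
    return acc


weights_q = [3, 8, 15, 0, 7, 9, 8, 8]  # hypothetical quantized weights, one group
packed = [pack_int4(weights_q)]
x = [1.0] * 8
print(dequant_dot(x, packed, scale=0.5, zero=8))
```

The weights stay packed in memory and are expanded only inside the inner loop, which is why such kernels read far fewer bytes than an fp16 Matmul.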
	```
	# model inference (run from inside the GPTQ-for-LLaMa directory set up above)
	python llama_inference.py ../../models/vicuna-13b --wbits 4 --load \
	../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text "Your input text here"
	```
	Now that you have everything set up, it's time to run the Vicuna 13B
	model on your AMD GPU. Use the commands above to run the model. Replace
	"Your input text here" with the text you want to use as input for
	the model. If everything is set up correctly, you should see the model
	generating output text based on your input.

	3 Expose the quantized Vicuna model to the Web API server

	Change the path of GPTQ python modules (GPTQ-for-LLaMa) in the following
	line:

	<https://github.com/thisserand/FastChat/blob/4a57c928a906705404eae06f7a44b4da45828487/fastchat/serve/load_gptq_model.py#L7>

	To launch the Web UI from the gradio library, you need to set up the
	controller, the worker (Vicuna model worker), and the web_server by
	running them as background jobs.
	```
	nohup python -W ignore::UserWarning -m fastchat.serve.controller &

	nohup python -W ignore::UserWarning -m fastchat.serve.model_worker --model-path /path/to/quantized_vicuna_weights \
	--model-name vicuna-13b-quantization --wbits 4 --groupsize 128 &

	nohup python -W ignore::UserWarning -m fastchat.serve.gradio_web_server &
	```
	Now the 4-bit quantized Vicuna-13B model fits in the 16GB of DDR memory
	on the RX6900XT GPU. Only 7.52GB of DDR (46% of 16GB) is needed to run
	the 13B model, whereas the model needs more than 28GB of DDR in the fp16
	datatype. The latency and accuracy penalties are also minimal, and the
	related metrics are provided at the end of this article.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/04.png" style="width: 60%; height: auto;">
	</p>

	Test the quantized Vicuna model in the Web API server

	Let us give it a try. First, let us use the fp16 Vicuna model for
	language translation.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/05.png" style="width: 80%; height: auto;">
	</p>

	It does a better job than me. Next, let us ask something about soccer. The answer looks good to me.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/06.png" style="width: 80%; height: auto;">
	</p>

	When we switch to the 4-bit model, for the same question, the answer is
	a bit different. There is a duplicated “Lionel Messi” in it.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/07.png" style="width: 80%; height: auto;">
	</p>

	Vicuna fp16 and 4bit quantized model comparison

	Test environment:

	\- GPU: Instinct MI210, RX6900XT

	\- python: 3.10

	\- pytorch: 2.1.0a0+gitfa08e54

	\- rocm: 5.4.3

	Metrics - Model size (GB)

	- Model parameter size. When the models are preloaded into GPU DDR, the
	actual DDR consumption is larger than the model itself because of
	caching for the input and output token spaces.

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/08.png" style="width: 70%; height: auto;">
	</p>
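
One plausible contributor to the extra memory mentioned above is the key/value cache kept for the tokens processed so far. That interpretation, and the LLaMA-13B shape used below (40 layers, hidden size 5120, fp16 cache entries), are assumptions of this sketch rather than numbers from this article.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * hidden_size * seq_len * bytes_per_value


# Assumption: LLaMA-13B shape (40 layers, hidden size 5120), fp16 cache entries.
for seq_len in (512, 2048):
    gib = kv_cache_bytes(40, 5120, seq_len) / 2**30
    print(f"{seq_len:>5} tokens: {gib:.2f} GiB")
```

The cache grows linearly with sequence length and is independent of how the weights are quantized.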

	Metrics – Accuracy (PPL: Perplexity)

	- Measured on 2048 examples of the C4 dataset
	(<https://paperswithcode.com/dataset/c4>)

	- Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul

	- Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul

	- Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/09.png" style="width: 70%; height: auto;">
	</p>
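
For readers unfamiliar with the perplexity metric used above, it is the exponential of the mean negative log-likelihood per token, so lower is better. The snippet below is a minimal sketch of that definition. The log-probabilities are made-up values for illustration, not outputs of Vicuna.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical natural-log probabilities that a model assigned to four tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(log_probs))
```

A model that assigned every token probability 1/8 would have a perplexity of 8, which is why the metric is often read as an effective branching factor.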

	Metrics – Latency (Token generation latency, ms)

	- Measured during token generation phases.

	- Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul

	- Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul

	- Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/10.png" style="width: 70%; height: auto;">
	</p>
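
Per-token latency of this kind can be collected with a simple timing loop. The sketch below shows the general shape of such a measurement and is not the script used for the numbers above. `fake_decode_step` is a hypothetical stand-in for one token-generation step. On a GPU, you would also call `torch.cuda.synchronize()` before reading the timer so that queued kernels are included.

```python
import time
from typing import Callable, List


def time_token_steps(step: Callable[[], object], num_tokens: int,
                     warmup: int = 2) -> List[float]:
    """Return the latency in milliseconds of each of `num_tokens` calls to `step`."""
    for _ in range(warmup):  # discard warm-up iterations
        step()
    latencies = []
    for _ in range(num_tokens):
        start = time.perf_counter()
        step()
        latencies.append((time.perf_counter() - start) * 1e3)
    return latencies


def fake_decode_step() -> int:
    """Hypothetical stand-in for generating one token."""
    return sum(range(10_000))


lat = time_token_steps(fake_decode_step, num_tokens=20)
print(f"mean latency: {sum(lat) / len(lat):.3f} ms over {len(lat)} tokens")
```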


	## Conclusion

	Large language models (LLMs) have made significant advancements in
	chatbot systems, as seen in OpenAI’s ChatGPT. Vicuna-13B, an open-source
	LLM, has been developed and has demonstrated excellent capability and quality.

	By following this guide, you should now have a better understanding of
	how to set up and run the Vicuna 13B model on an AMD GPU with ROCm. This
	will enable you to unlock the full potential of this cutting-edge
	language model for your research and personal projects.

	Thanks for reading!



	## Appendix - GPTQ model quantization

	Building Vicuna quantized model from the floating-point LLaMA model

	a. Download LLaMA and Vicuna delta models from Hugging Face

	The developers of Vicuna (lmsys) provide only delta models that can be
	applied to the LLaMA model. Download LLaMA in Hugging Face format and
	the Vicuna delta parameters from Hugging Face individually. Currently,
	7b and 13b delta models of Vicuna are available.

	<https://huggingface.co/models?sort=downloads&search=huggyllama>

	<https://huggingface.co/models?sort=downloads&search=lmsys>

	<p align="center">
	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/13.png" style="width: 60%; height: auto;">
	</p>

	b. Convert LLaMA to Vicuna by using Vicuna-delta model
	```
	git clone https://github.com/lm-sys/FastChat
	cd FastChat
	```
	Convert the LLaMA parameters by using this command:

	(Note: do not use vicuna-{7b, 13b}-\*delta-v0, because its vocab_size is
	different from that of LLaMA and the model cannot be converted.)
	```
	python -m fastchat.model.apply_delta --base /path/to/llama-13b --delta lmsys/vicuna-13b-delta-v1.1 \
	--target ./vicuna-13b
	```
	Now the Vicuna-13b model is ready.
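
Conceptually, applying a delta means adding the delta weights to the base weights tensor by tensor. The toy sketch below illustrates that idea with plain Python lists. It is not the FastChat implementation, and the tensor name and values are made up.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return base + delta, element-wise, for every named tensor."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            # A shape mismatch (for example a different vocabulary size)
            # means the delta cannot be applied to this base model.
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


base = {"layer0.weight": [1.0, 2.0, 3.0]}     # hypothetical base weights
delta = {"layer0.weight": [0.5, -1.0, 0.25]}  # hypothetical delta weights
print(apply_delta(base, delta))
```

The shape check mirrors the note above: a delta built for a different vocab_size cannot be added to the LLaMA weights.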

	c. Quantize Vicuna to 2/3/4 bits

	To apply GPTQ to LLaMA and Vicuna, clone the following repository:
	```
	git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda
	cd GPTQ-for-LLaMa
	```
	(Note: do not use <https://github.com/qwopqwop200/GPTQ-for-LLaMa> for
	now, because the 2-, 3-, and 4-bit quantization + MatMul kernels
	implemented in that repo do not parallelize the dequant + matmul and
	hence show lower token generation performance.)

	Quantize the Vicuna-13b model with this command. Calibration is done on
	the c4 dataset, but you can also use other datasets, such as wikitext2.

	(Note: try different combinations of wbits and group size. Under some
	combinations, model accuracy improves significantly.)
	```
	python llama.py ./Vicuna-13b c4 --wbits 4 --true-sequential --act-order \
	--save_safetensors Vicuna-13b-4bit-act-order.safetensors
	```
	Now the model is ready and saved as
	Vicuna-13b-4bit-act-order.safetensors.
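
To give some intuition for why group size affects accuracy, the sketch below applies plain round-to-nearest quantization (a much simpler technique than GPTQ, used here only for illustration) with one scale per group. The weights are made-up values, and the function names are ours.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], wbits: int) -> Tuple[List[int], float, float]:
    """Round-to-nearest quantization of one group; returns (codes, scale, minimum)."""
    levels = 2 ** wbits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def fake_quantize(weights: List[float], groupsize: int, wbits: int = 4) -> List[float]:
    """Quantize and immediately dequantize, group by group, to expose rounding error."""
    restored: List[float] = []
    for start in range(0, len(weights), groupsize):
        codes, scale, w_min = quantize_group(weights[start:start + groupsize], wbits)
        restored.extend(w_min + scale * c for c in codes)
    return restored


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights: eight small values followed by eight large ones.
weights = [0.01 * i for i in range(8)] + [1.0 + 0.5 * i for i in range(8)]
for groupsize in (16, 8):
    err = mean_abs_error(weights, fake_quantize(weights, groupsize))
    print(f"groupsize={groupsize:>2}: mean abs error = {err:.4f}")
```

With one group of 16, the small weights share a coarse scale with the large ones and collapse to the same code. With groups of 8, each half gets its own scale, at the cost of storing more scales.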

	GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs

	The more optimized kernel implementation in
	<https://github.com/oobabooga/GPTQ-for-LLaMa/blob/57a26292ed583528d9941e79915824c5af012279/quant_cuda_kernel.cu#L891>
	targets the A100 GPU and is not compatible with the ROCm 5.4.3 HIPIFY
	toolkits. It needs to be modified as follows. The same applies to the
	VecQuant2MatMulKernelFaster, VecQuant3MatMulKernelFaster, and
	VecQuant4MatMulKernelFaster kernels.

	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/chatbot-amd-gpu/14.png" style="width: 100%; height: auto;">

	For convenience, all the modified code is available in this [GitHub Gist](https://gist.github.com/seungrokjung/110943b70503732c4a398607e1cbdd6c).
