Inference is very slow (about 3 secs/token)

#11

by rfernand - opened Nov 22, 2023

Nov 22, 2023

Great to have this model in HF! The inference is super slow - makes it hard to do real-time experiments. Can this be sped up easily?

rfernand

Nov 22, 2023

As measured on Windows 11, CPU: i9-13900KF, 128 GB RAM, GPU: RTX 3090 (24 GB).

PsiPi

Nov 23, 2023

Use a quantized version. Which doesn't exist yet...

YaTharThShaRma999

Nov 23, 2023

@rfernand your best bet is quantization: it should boost speed by a large amount and also take up less VRAM. I think you should use the GPTQ quant format. Loading it with Hugging Face transformers is the simplest route, but something like ExLlamaV2 should get you the fastest speed.
https://huggingface.co/TheBloke/Orca-2-13B-GPTQ

Use the 8-bit one for maximum quality.
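
A minimal sketch of the GPTQ route through transformers, assuming the `optimum` and `auto-gptq` packages are installed alongside `transformers` and a CUDA GPU is available. `load_orca2_gptq` and `generate` are hypothetical helper names, not part of any library. The `revision` argument selects a branch of the repo; the default `"main"` is a placeholder, and the name of the 8-bit branch has to be looked up on the model card.

```python
MODEL_ID = "TheBloke/Orca-2-13B-GPTQ"  # repo linked in the post above


def load_orca2_gptq(revision: str = "main"):
    """Load the GPTQ-quantized model and its tokenizer (downloads on first call)."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        revision=revision,   # assumption: pick the branch you want from the model card
        device_map="auto",   # let accelerate place the weights on the available GPU
    )
    return model, tokenizer


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Run one generation and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would look like `model, tokenizer = load_orca2_gptq()` followed by `print(generate(model, tokenizer, "AI is going to"))`.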

PsiPi

Nov 24, 2023

heh yeah and now they do exist ;)

rfernand

Nov 25, 2023

Thanks @YaTharThShaRma999 and @PsiPi .

This is great - I tried the 4-bit version (https://huggingface.co/TheBloke/Orca-2-13B-GGUF) with the following results:
model loading: 4x faster
inference: 12x faster
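
For scale, here is the arithmetic those numbers imply, assuming the baseline is the roughly 3 secs/token from the thread title and that the 12x figure applies to per-token time:

```python
baseline_s_per_token = 3.0  # "about 3 secs/token" from the thread title
speedup = 12                # reported inference speedup

quantized_s_per_token = baseline_s_per_token / speedup
quantized_tokens_per_s = 1 / quantized_s_per_token

print(f"{quantized_s_per_token:.2f} s/token, {quantized_tokens_per_s:.1f} tokens/s")
# prints "0.25 s/token, 4.0 tokens/s"
```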

TLDR

pip install ctransformers[cuda]
Python script for inference:

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Orca-2-13B-GGUF", model_file="orca-2-13b.Q4_K_M.gguf", model_type="llama", gpu_layers=50)

print(llm("AI is going to"))
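
To put a number on the speedup, one option is to time a streamed generation. This is a rough sketch, not the method used for the figures above: `measure_throughput` and `fake_stream` are hypothetical names, and the fake stream only exists so the snippet runs without a model.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_throughput(token_stream: Iterable[str]) -> Tuple[int, float, float]:
    """Consume a token stream; return (token count, elapsed seconds, tokens/sec)."""
    start = time.perf_counter()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else 0.0
    return count, elapsed, rate


def fake_stream(n: int, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a model: yields n dummy tokens, sleeping between them."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


if __name__ == "__main__":
    count, elapsed, rate = measure_throughput(fake_stream(20))
    print(f"{count} tokens in {elapsed:.2f} s ({rate:.1f} tokens/s)")
```

With ctransformers, `llm("AI is going to", stream=True)` yields pieces of text as they are generated, so it can be passed in place of `fake_stream(20)`. The measured time then includes prompt processing, not just generation.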

rfernand changed discussion status to closed Nov 25, 2023

PsiPi

Nov 26, 2023

Yeah LoneStriker offers an excellent version as well

Vulfgang

Dec 20, 2023

For inference, I get the following error:

`GLIBC_2.29' not found

Anyone know how to resolve this?

Vulfgang

Dec 20, 2023

Specifically

OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /local/home/user_name/anaconda3/envs/odi-ds/lib/python3.9/site-packages/ctransformers/lib/cuda/libctransformers.so)

PsiPi

Dec 20, 2023

It says you have the wrong version of libc. Not to be glib, but... get the right one? You might need to wrap it in an env. I don't know your situation. Good luck @Vulfgang
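
A quick way to see which glibc the Python process is actually running against, as a sketch: `platform.libc_ver()` is in the standard library, while `parse_version` and `glibc_is_at_least` are hypothetical helpers written for this example. The required version 2.29 is taken from the error message above.

```python
import platform
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.29' into (2, 29) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_at_least(required: str, found: str) -> bool:
    """True if the found glibc version is the required version or newer."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    # Returns a (lib, version) pair; both may be empty strings on non-glibc systems.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        verdict = "OK" if glibc_is_at_least("2.29", version) else "too old"
        print(f"glibc {version}: {verdict} for a library that needs 2.29")
    else:
        print("Could not detect glibc on this system")
```

If the reported version is below 2.29, the prebuilt library will not load on that machine as-is.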

Vulfgang

Jan 6, 2024

Thank you for replying. I think I have the right glibc now, but every time I run the code in Jupyter, my kernel dies as soon as I try to download the model from the repo.

Vulfgang

Jan 6, 2024

Wait, never mind the last comment, all good.

Vulfgang

Mar 9, 2025

Is there a recommended EC2 instance that can run this fast? Also, is it faster to use a GPU or CPU for inference?
