Instructions for using bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
```
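Once the pipeline is created, generation is a single call. A minimal sketch (the prompt and sampling parameters below are illustrative choices, not part of the model card):

```python
# Illustrative only: the prompt and sampling parameters are arbitrary choices.
output = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=True, temperature=0.2)
print(output[0]["generated_text"])
```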
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
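Since the vLLM server exposes an OpenAI-compatible API, it can also be called from Python. A minimal sketch using the openai client (the URL, prompt, and parameters assume the server started above):

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible API.
# Assumes the server from the previous step is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="bigcode/starcoder",
    prompt="def quicksort(arr):",
    max_tokens=128,
    temperature=0.5,
)
print(completion.choices[0].text)
```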
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
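Because SGLang exposes the same OpenAI-compatible /v1/completions route, the server can also be queried from plain Python. A minimal sketch with requests (the host, port, and parameters assume the server launched above):

```python
# Minimal sketch: query the SGLang server started above on port 30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "bigcode/starcoder",
        "prompt": "def hello_world():",
        "max_tokens": 64,
        "temperature": 0.5,
    },
)
print(resp.json()["choices"][0]["text"])
```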
How to use bigcode/starcoder with Docker Model Runner:
```bash
docker model run hf.co/bigcode/starcoder
```
How to speed up inference?
Apart from int8, is there any plan to speed up inference, such as using FasterTransformer?
I don't think FasterTransformer is an easy route... maybe TorchScript or PyTorch 2.0 compilation would work.
The easiest way to do this may be to use the inference server:
https://github.com/bigcode-project/starcoder#text-generation-inference
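If you go the inference-server route, the running server can be queried from Python. A minimal sketch using huggingface_hub's InferenceClient (the local URL and port are assumptions about your deployment, not from the linked README):

```python
# Minimal sketch: query a locally running text-generation-inference server.
# The URL and port are assumptions; match them to your own deployment.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
print(client.text_generation("def hello", max_new_tokens=30))
```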
You could try: https://huggingface.co/michaelfeil/ct2fast-starcoder/blob/main/README.md
Amazing! How much of a speedup could ct2fast-starcoder bring compared with the original starcoder?
Did not have time to check for starcoder. For santacoder:
Task: "def hello" -> generate 30 tokens
- transformers pipeline in float16 on CUDA: ~1300 ms per inference
- CTranslate2 in int8 on CUDA: ~315 ms per inference
I assume starcoder's weights are bigger, so maybe a 1.5-2.5x speedup.
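For reference, a minimal sketch of generating with a CTranslate2-converted checkpoint. The "ct2fast-starcoder" directory name is a placeholder for wherever you downloaded or converted the model, and the settings are arbitrary:

```python
# Minimal sketch: generate with a CTranslate2-converted StarCoder checkpoint.
# "ct2fast-starcoder" is a placeholder path to the converted model directory.
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/starcoder")
generator = ctranslate2.Generator("ct2fast-starcoder", device="cuda", compute_type="int8")

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def hello"))
results = generator.generate_batch([prompt_tokens], max_length=30)
print(tokenizer.decode(results[0].sequences_ids[0]))
```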
This works like a charm, 100 times faster than starchat and starcoder. I tried GPUs with 8, 12, and 16 GB but it failed; a GPU with at least 24 GB of VRAM will work.
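The 24 GB requirement is consistent with a back-of-envelope estimate: StarCoder has roughly 15.5B parameters, so even the int8 weights alone occupy about 15.5 GB before the KV cache and activations, which is why 16 GB cards fall short. A quick check of that arithmetic:

```python
# Back-of-envelope VRAM estimate for int8 StarCoder (~15.5B parameters).
params = 15.5e9           # approximate parameter count
bytes_per_param = 1       # int8 quantization: one byte per weight
weights_gb = params * bytes_per_param / 1e9
print(f"int8 weights alone: ~{weights_gb:.1f} GB")  # KV cache and activations come on top
```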