vLLM 0.1.3: CUDA out of memory at runtime

by stevensu - opened Aug 7, 2023

Aug 7, 2023

On an A10, Meta's official llama2-13b-chat loads fine, but loading Llama2-Chinese-13b-Chat fails with CUDA out of memory.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="./Llama2-Chinese-13b-Chat")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

stevensu

Aug 10, 2023

I found the reason: I had set load_in_8bit=True when loading the model with HF transformers, but vLLM does not support 8-bit yet.
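
For anyone hitting the same error, here is a rough back-of-the-envelope sketch of why 8-bit versus 16-bit weights can decide whether a model fits on one card. It only estimates memory for the weights. The 13e9 parameter count and the 24 GiB figure for an A10 are nominal assumptions, and the helper name weight_memory_gib is made up for illustration. KV cache, activations, and framework overhead are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: nominal parameter count for a "13B" model.
NUM_PARAMS = 13e9
# Assumption: nominal A10 memory, treated here as 24 GiB.
A10_MEMORY_GIB = 24.0

# 2 bytes per parameter for fp16, 1 byte per parameter for 8-bit weights.
for label, bytes_per_param in [("fp16", 2), ("int8", 1)]:
    need = weight_memory_gib(NUM_PARAMS, bytes_per_param)
    verdict = "fits" if need < A10_MEMORY_GIB else "does not fit"
    print(f"{label}: ~{need:.1f} GiB for weights, {verdict} in {A10_MEMORY_GIB:.0f} GiB")
```

Under these assumptions the 8-bit weights take about half the memory of the fp16 weights, which is consistent with a model loading under transformers with load_in_8bit=True and then running out of memory once it is loaded at 16-bit precision.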

stevensu changed discussion status to closed Aug 10, 2023
