Instructions to use RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8")
model = AutoModelForCausalLM.from_pretrained("RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8

SGLang

How to use RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 with Docker Model Runner:
```
docker model run hf.co/RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8
```

Are these models limited to H100s?

by RonanMcGovern - opened Jul 24, 2024

Discussion

RonanMcGovern

Jul 24, 2024

I've run well on H100s but on A100s or A6000s, I get:

[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Is it possible to upgrade the A100 or H100? or am I just limited here?

Lin-K76

Jul 24, 2024

Unfortunately this model is limited to GPUs that support the FP8 data format, including the Hopper architecture but excluding the Ampere architecture.
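For reference, the hardware check implied by the error message can be sketched in a few lines. This is only an illustration: `supports_fp8_scaled_mm` and `current_gpu_supports_fp8` are hypothetical helper names, not part of PyTorch or vLLM, and the sketch ignores the ROCm MI300+ case that the error message also mentions. The compute capabilities listed are 8.0 for the A100, 8.6 for the RTX A6000, and 9.0 for the H100.

```python
# Minimum CUDA compute capability named in the torch._scaled_mm error message.
FP8_MIN_CAPABILITY = (8, 9)


def supports_fp8_scaled_mm(major: int, minor: int) -> bool:
    """Hypothetical helper: True if (major, minor) is 8.9 or newer."""
    return (major, minor) >= FP8_MIN_CAPABILITY


def current_gpu_supports_fp8() -> bool:
    """Hypothetical helper: check the current CUDA device, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return supports_fp8_scaled_mm(major, minor)


if __name__ == "__main__":
    known = {"A100": (8, 0), "RTX A6000": (8, 6), "H100": (9, 0)}
    for name, capability in known.items():
        print(name, capability, supports_fp8_scaled_mm(*capability))
```

Running the check before launching a server avoids downloading the weights only to hit the runtime error.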

RonanMcGovern

Jul 24, 2024

Makes a lot of sense. Thanks for the nice work. It's kind of interesting but it seems that using weight-only fp8 I'm able to get pretty much the same results on A100s as with full fp8 on H100 hopper.

RonanMcGovern changed discussion status to closed Jul 24, 2024

mgoin

Red Hat AI org Jul 24, 2024

@RonanMcGovern this model should still run in vLLM on A100, it will just choose to run in the FP8 weight-only pathway. Are you using the latest vLLM release?
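To illustrate what "weight-only" means here, the following is a conceptual sketch in plain Python, not vLLM's actual kernels. The weights are stored as FP8 (E4M3) values plus a scale, then dequantized and multiplied against activations that stay in higher precision. Per-tensor scaling and the helper names (`round_to_e4m3`, `quantize_weights`, `weight_only_matvec`) are assumptions made for the example.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the float8 E4M3 ("fn") format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    exponent = max(math.frexp(a)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_weights(weights):
    """Assumed per-tensor scheme: map the largest magnitude onto E4M3_MAX."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def weight_only_matvec(q_weights, scale, activations):
    """Dequantize the FP8 weights, then take the dot product in full precision."""
    return sum(q * scale * a for q, a in zip(q_weights, activations))


if __name__ == "__main__":
    q, scale = quantize_weights([0.5, -1.0, 2.0])
    print(q, scale, weight_only_matvec(q, scale, [1.0, 1.0, 1.0]))
```

Because only the weights are quantized, this path needs no FP8 matrix-multiply hardware; the saving is in memory and bandwidth rather than in FP8 compute.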

RonanMcGovern

Jul 24, 2024 (edited Jul 24, 2024)

Ok, interesting. I'm using the latest docker image (maybe updates haven't been pushed to that yet?). The error I'm getting is:

[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

mgoin

Red Hat AI org Jul 24, 2024

My apologies, I got confused with the various formats available. Currently this is blocked on https://github.com/vllm-project/vllm/pull/6524. Thanks for reporting; we will work on landing it ASAP.

RonanMcGovern

Jul 24, 2024

ok yeah that would be great, I'll move my models over to the neuralmagic ones once that works because fp8 download is faster and also doesn't require HF_TOKEN

RonanMcGovern

Jul 25, 2024

btw @mgoin the fp8 is almost as fast as Nvidia NIM on an H100 SXM, which is impressive - at least at batch size 1. 130 toks vs 120 toks on a short prompt with 500 tokens generated.

At larger batches, speeds diverge until at a batch of 64, NIM is still managing about 120 toks, while vLLM fp8 neural magic is doing about 35 toks.

I wonder what the difference is behind that. Probably the gap can be closed if the batch size one is so close.
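For anyone wanting to reproduce a comparison like this, a rough way to time a single request against an OpenAI-compatible server is sketched below. It is not the method used for the numbers above. The helper names are hypothetical, the base URL and `max_tokens=500` are assumptions, and the timing is end to end, so it includes prompt processing as well as generation.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_one_request(base_url: str, model: str, prompt: str, max_tokens: int = 500) -> float:
    """Hypothetical helper: send one chat completion and return tokens per second."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # The OpenAI-compatible response reports generated tokens under "usage".
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    # Assumes a server is already running locally on port 8000.
    rate = time_one_request(
        "http://localhost:8000",
        "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8",
        "What is the capital of France?",
    )
    print(f"{rate:.1f} tokens/s")
```

To compare larger batch sizes, the same request would need to be issued concurrently, reporting per-request and aggregate rates separately, since the two can diverge sharply.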
