Are these models limited to H100s?
It runs well on H100s, but on A100s or A6000s I get:
[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
Is it possible to upgrade the A100 or H100, or am I just limited here?
Unfortunately this model is limited to GPUs that support the FP8 data format, including the Hopper architecture but excluding the Ampere architecture.
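As a rough illustration of the requirement quoted in the error message (compute capability 8.9, or 9.0 and above), here is a minimal sketch of a capability check. The helper names `supports_fp8_scaled_mm`, `KNOWN_GPUS`, and `report` are hypothetical and not part of vLLM or PyTorch; the capability values listed for A100 (8.0), RTX A6000 (8.6), and H100 (9.0) are NVIDIA's published figures. The sketch only covers the CUDA side of the message, not ROCm.

```python
# Hypothetical helper: mirrors the CUDA condition in the torch._scaled_mm error,
# i.e. compute capability 8.9 (Ada) or >= 9.0 (Hopper and newer).
def supports_fp8_scaled_mm(major: int, minor: int) -> bool:
    """Return True if the given CUDA compute capability is at least 8.9."""
    return (major, minor) >= (8, 9)


# Published compute capabilities for the GPUs mentioned in this thread.
KNOWN_GPUS = {
    "A100": (8, 0),   # Ampere
    "A6000": (8, 6),  # Ampere (RTX A6000)
    "H100": (9, 0),   # Hopper
}


def report() -> dict:
    """Map each GPU name to whether it meets the FP8 requirement."""
    return {name: supports_fp8_scaled_mm(*cc) for name, cc in KNOWN_GPUS.items()}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'meets' if ok else 'does not meet'} the >= 8.9 requirement")

    # Optional: check the local device if PyTorch with CUDA is installed.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"Local GPU (sm_{major}{minor}):", supports_fp8_scaled_mm(major, minor))
```

Running the check before launching a server can save a long model download on hardware that would hit the same error.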
Makes a lot of sense. Thanks for the nice work. Interestingly, it seems that using weight-only FP8 I'm able to get pretty much the same results on A100s as with full FP8 on H100 (Hopper).
@RonanMcGovern this model should still run in vLLM on A100; it will just choose to run in the FP8 weight-only pathway. Are you using the latest vLLM release?
Ok, interesting. I'm using the latest docker image (maybe updates haven't been pushed to that yet?). The error I'm getting is:
[rank0]: RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
My apologies, I got confused with the various formats available. Currently this is blocked on https://github.com/vllm-project/vllm/pull/6524. Thanks for reporting, we will work on landing it ASAP.
OK, yeah, that would be great. I'll move my models over to the neuralmagic ones once that works, because the FP8 download is faster and also doesn't require HF_TOKEN.
btw @mgoin the FP8 model is almost as fast as Nvidia NIM on an H100 SXM, which is impressive, at least at batch size 1: 130 tok/s vs 120 tok/s on a short prompt with 500 tokens generated.
At larger batches the speeds diverge: at a batch of 64, NIM is still managing about 120 tok/s, while vLLM with the neuralmagic FP8 model is doing about 35 tok/s.
I wonder what the difference behind that is. The gap can probably be closed, given that batch size 1 is so close.
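To put the batch-64 numbers in perspective, here is a small arithmetic sketch. It assumes the quoted figures (about 120 and about 35) are per-request generation speeds in tokens per second; the thread does not state this explicitly, so treat the aggregate numbers as illustrative only. The function name `aggregate_tok_s` is made up for this example.

```python
def aggregate_tok_s(per_request_tok_s: float, batch_size: int) -> float:
    """Total tokens per second across a batch, assuming every request
    in the batch generates at the same per-request speed."""
    return per_request_tok_s * batch_size


BATCH = 64
nim_total = aggregate_tok_s(120, BATCH)   # figure quoted for NIM at batch 64
vllm_total = aggregate_tok_s(35, BATCH)   # figure quoted for vLLM FP8 at batch 64

print(f"NIM aggregate:  {nim_total:.0f} tok/s")
print(f"vLLM aggregate: {vllm_total:.0f} tok/s")
print(f"Ratio: {nim_total / vllm_total:.2f}x")
```

Under that assumption the batch size cancels out of the ratio, so the gap is roughly 3.4x at batch 64 regardless of how the totals are computed.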