MPS support quantification

#39

by tonimelisma - opened Apr 20, 2024

Discussion

tonimelisma

Apr 20, 2024

I'm trying to run this with the transformers library on an M1 MacBook Pro.

With bfloat16, I get:
"TypeError: BFloat16 is not supported on MPS"

With float16, I get:
"NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS."

Is there a quantized model somewhere that I should be using instead? Is there any chance of running this model on the Apple GPU with the Hugging Face libraries?
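
The temporary fix named in the second error message can be sketched as follows. This is a minimal sketch, not a confirmed solution for this model: the helper name `enable_mps_fallback` is hypothetical, and setting the variable before `torch` is first imported is assumed here to be the safe ordering.

```python
import os
import sys


def enable_mps_fallback() -> bool:
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1 for this process.

    Returns True if torch had not been imported yet when the variable
    was set (assumed to be the safe ordering), False otherwise.
    """
    set_before_import = "torch" not in sys.modules
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return set_before_import


# Call this at the very top of the script, before `import torch`.
# Ops that are not implemented for MPS (such as aten::isin in the error
# above) then run on the CPU, which the error message warns is slower.
enable_mps_fallback()
```

The same effect can be had by exporting the variable in the shell before launching Python.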


rileydean

May 13, 2024

Curious, did you ever get this working?

ybelkada

May 14, 2024

Hi @tonimelisma
To use quantized Llama on Apple devices, I recommend MLX: https://huggingface.co/collections/mlx-community/llama-3-662156b069a5d33b3328603c cc @awni @prince-canuma

awni

May 14, 2024

Yup, should be easy to do and reasonably fast with MLX:

pip install mlx-lm
mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --prompt "hello"

More docs here
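
The command above also has a Python counterpart in `mlx-lm`. The following is a rough sketch, assuming an Apple silicon machine with `mlx-lm` installed; the helper names `build_messages` and `generate_with_mlx` and the `max_tokens` default are illustrative choices, not part of the original post.

```python
def build_messages(user_text: str) -> list:
    """Wrap a single user turn in the chat-message format."""
    return [{"role": "user", "content": user_text}]


def generate_with_mlx(
    user_text: str,
    model_id: str = "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
    max_tokens: int = 100,  # illustrative default
) -> str:
    """Generate a reply with the 4-bit MLX build of the model."""
    # Imported lazily because mlx-lm only runs on Apple silicon.
    from mlx_lm import generate, load

    model, tokenizer = load(model_id)
    # Apply the model's chat template so the instruct model sees the
    # special tokens it was trained with.
    prompt = tokenizer.apply_chat_template(
        build_messages(user_text),
        tokenize=False,
        add_generation_prompt=True,
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


if __name__ == "__main__":
    print(build_messages("hello"))
    # On a Mac with mlx-lm installed: print(generate_with_mlx("hello"))
```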

tonimelisma

May 15, 2024

Yes, MLX and llama.cpp work fine. I was asking whether the Hugging Face libraries would work, too.

mantrid-prime

Jun 14, 2024

For MPS you need to use torch.float32.

A lot of things need to be changed elsewhere, but this solves this particular issue. It's probably safe to assume that you need llama.cpp to run on a Mac.
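
The float32 suggestion could look roughly like the sketch below. It is untested against this model on MPS: `load_for_mps` and `weight_memory_gib` are hypothetical helper names, and the 8e9 parameter count is a round-number assumption taken from the model's "8B" name. The memory estimate covers weights only and is included because float32 doubles the footprint compared with float16.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype_name: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 2**30


def load_for_mps(model_id: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
    """Load the model in float32 and move it to MPS when available."""
    # Imported lazily so the estimate above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "mps" if torch.backends.mps.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # float32 avoids the "BFloat16 is not supported on MPS" error.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float32
    )
    model.to(device)
    return tokenizer, model, device


if __name__ == "__main__":
    # Assumed round figure of 8e9 parameters.
    for name in ("float32", "float16"):
        print(f"{name}: {weight_memory_gib(8e9, name):.1f} GiB")
```

Under that assumption the float32 weights alone need close to 30 GiB, which may be one reason a quantized MLX or llama.cpp build is the more practical route on a laptop.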
