Update generation_config.json
When using the instruct model with the chat template, I noticed that the template ends each turn with <|eot_id|> rather than the EOS token <|end_of_text|>, so the assistant ends its responses with <|eot_id|> as well. Unfortunately, the generation config doesn't say to stop generating on <|eot_id|>, so the model keeps writing.
The model card gives a workaround: manually pass eos_token_id in any generate or pipeline call:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
But I think there is a simpler way to fix this! If you just update generation_config.json to stop on both <|end_of_text|> and <|eot_id|>, then it should work automatically and you won't need to build the terminators list.
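A minimal sketch of that config change, for illustration only. It assumes the Llama 3 token ids are 128001 for <|end_of_text|> and 128009 for <|eot_id|>; check them with tokenizer.convert_tokens_to_ids before relying on them. The helper name add_stop_ids and the sample config dict are hypothetical and stand in for the real generation_config.json.

```python
import json

# Assumed token ids for the Llama 3 tokenizer; verify with
# tokenizer.convert_tokens_to_ids("<|eot_id|>") before using them.
END_OF_TEXT_ID = 128001  # <|end_of_text|>
EOT_ID = 128009          # <|eot_id|>


def add_stop_ids(config: dict, extra_ids: list[int]) -> dict:
    """Return a copy of config whose eos_token_id lists every stop id once."""
    current = config.get("eos_token_id")
    if current is None:
        current = []
    elif isinstance(current, int):
        current = [current]
    merged = list(current)
    for token_id in extra_ids:
        if token_id not in merged:
            merged.append(token_id)
    return {**config, "eos_token_id": merged}


# Hypothetical stand-in for the contents of generation_config.json.
original = {"bos_token_id": 128000, "eos_token_id": END_OF_TEXT_ID}
patched = add_stop_ids(original, [EOT_ID])
print(json.dumps(patched, indent=2))
```

With a list in eos_token_id, generate should stop on whichever of the listed ids appears first, which is the same effect as passing terminators by hand.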
Running into the same issue. With the default config, the model doesn't stop at <|eot_id|> and will generate new text for the user.
After updating the config, the model no longer generates user text, but instead ends with an infinite series of <|eot_id|><|start_header_id|>assistant:<|eot_id|><|start_header_id|>assistant:<|eot_id|>...
Is there a way to prevent this?
Hmm @entropy could you provide more details about your setup? Here's what is working for me, referencing this PR:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
revision = "refs/pr/4"  # load the files from the PR branch

tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_path, revision=revision, device_map="auto", torch_dtype=torch.bfloat16)

prompt = "Write a haiku about terminators."
chat = [{'content': prompt, 'role': 'user'}]
# Apply the chat template and append the assistant header so the model replies next
chat_tokens = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(model.device)
# Greedy decoding; no eos_token_id is passed, so stopping relies on the generation config
new_chat_tokens = model.generate(chat_tokens, do_sample=False, max_new_tokens=128)
new_chat_str = tokenizer.decode(new_chat_tokens[0])
print(new_chat_str)
produces:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Write a haiku about terminators.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Metal hearts ablaze
Rise from ashes, cold and dark
Judgment day arrives<|eot_id|>
Same here, I use oobabooga textgen and Llama 3 8B Instruct will not shut up.
To reproduce, just ask it for a single token, for example tell it to say START.
It's the same with TabbyAPI.
In oobabooga text-generation-webui, you also need to uncheck "Skip special tokens" in the Parameters -> Generation tab.
See my latest message here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/14
For me this change was not enough in text-generation-webui.
I had to uncheck "Skip special tokens" and add "<|eot_id|>" to the custom stop strings. After that everything was good.
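To illustrate why both settings seem to matter together: a stop string can only match if it actually shows up in the decoded text, which is presumably why "Skip special tokens" has to be off. The sketch below is a plain-Python illustration of stop-string truncation, not the web UI's actual code; truncate_at_stop and the sample text are hypothetical.

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Hypothetical runaway output: the model answers, then starts another turn.
raw = "START<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nSTART"
reply = truncate_at_stop(raw, ["<|eot_id|>"])
print(reply)
```

If the special tokens were stripped before this check, "<|eot_id|>" would never appear in raw and nothing would be cut.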
Fixed GGUF quant here: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
Yes, and it works fine. I use Meta-Llama-3-8B-Instruct.Q8_0.gguf and Meta-Llama-3-8B-Instruct.Q6_K.gguf, and both stop the conversation cleanly when finished.
Many thanks. :)
Hi guys, is my issue related to the same problem described here? https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/36
If yes, will this repo be fixed?
In the haiku snippet above, please change new_chat_str = tokenizer.decode(new_chat_tokens[0]) to new_chat_str = tokenizer.decode(new_chat_tokens[0], skip_special_tokens=True) so that the special tokens are left out of the printed text.
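As a rough picture of what that flag changes in the printed output, here is a toy sketch that strips <|...|> markers from an already decoded string. It only approximates the visible effect and is not how the tokenizer implements skip_special_tokens (which filters token ids, not text); the regex and the name strip_special_markers are assumptions made for illustration.

```python
import re

# Assumed pattern for Llama 3 style markers such as <|eot_id|>.
SPECIAL_MARKER = re.compile(r"<\|[a-z_0-9]+\|>")


def strip_special_markers(text: str) -> str:
    """Remove <|...|> markers from a decoded string."""
    return SPECIAL_MARKER.sub("", text)


decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Write a haiku about terminators.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Metal hearts ablaze\n"
    "Rise from ashes, cold and dark\n"
    "Judgment day arrives<|eot_id|>"
)
cleaned = strip_special_markers(decoded)
print(cleaned)
```

Note that the role names (user, assistant) remain, because they are ordinary text between the header markers.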
