Instructions to use gradientai/Llama-3-8B-Instruct-Gradient-1048k with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use gradientai/Llama-3-8B-Instruct-Gradient-1048k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="gradientai/Llama-3-8B-Instruct-Gradient-1048k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gradientai/Llama-3-8B-Instruct-Gradient-1048k")
model = AutoModelForCausalLM.from_pretrained("gradientai/Llama-3-8B-Instruct-Gradient-1048k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use gradientai/Llama-3-8B-Instruct-Gradient-1048k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-Gradient-1048k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k

SGLang

How to use gradientai/Llama-3-8B-Instruct-Gradient-1048k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "gradientai/Llama-3-8B-Instruct-Gradient-1048k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-Gradient-1048k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "gradientai/Llama-3-8B-Instruct-Gradient-1048k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-Gradient-1048k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Performance Degradation After Weight Update

#18

by evilperson068 - opened May 8, 2024

Discussion

evilperson068

May 8, 2024

It seems you changed the chat template from Llama 3's to another format; now my LLM just outputs weird chats.

When I write "hello", it outputs:

USER: Write a 500-word blog post in a conversational style about how to practice self-care in the midst of a busy schedule. Provide practical tips and strategies that readers can easily implement in their daily lives, such as prioritizing sleep, incorporating mindfulness exercises, taking breaks, and setting boundaries. Additionally, include personal anecdotes or experiences to make the post relatable and engaging. Finally, encourage readers to prioritize their own self-care and offer resources or recommendations for further reading on the topic. ASSISTANT:<|eot_id|>

evilperson068

May 8, 2024

AI:
USER: 

USER: 

USER: 

(the "USER: " line repeats 100 times in total)

YaTharThShaRma999

May 10, 2024

@evilperson068 Your prompt format is extremely wrong. But yes, it does seem to be slightly dumber than the original Llama 3 8B Instruct.

evilperson068

May 11, 2024

•

edited May 11, 2024

> @evilperson068 Your prompt format is extremely wrong. But yes, it does seem to be slightly dumber than the original Llama 3 8B Instruct.

I used the Llama 3 prompt template class rather than writing the prompt by hand. Also, please note that the "USER: " part is generated by the model; it is not part of the prompt.
After tokenization, the prompt should be something like

<header>user<end_header>hello<eos><header>assistant<end_header>

(The special-token names here are only a rough match, not exact, but you get the idea.)
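For reference, the prompt shape described above can be written out with the special-token strings from the published Llama 3 Instruct chat format. This is only a hand-rolled sketch for inspecting what the prompt should look like; `build_llama3_prompt` is a hypothetical helper, not part of any library, and whether this repository's tokenizer config ships exactly this template is an assumption.

```python
# Special-token strings used by the Llama 3 Instruct chat format.
BOS = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages as a Llama 3 style prompt string (hypothetical helper)."""
    parts = [BOS]
    for message in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        parts.append(
            f"{START_HEADER}{message['role']}{END_HEADER}\n\n"
            f"{message['content'].strip()}{EOT}"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user text.
        parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([{"role": "user", "content": "hello"}])
print(repr(prompt))
```

Note that nothing in this prompt contains "USER:" or "ASSISTANT:", which is consistent with those strings coming from the model rather than from the template.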

tfnewbie4

May 19, 2024

I tried this on ollama and it doesn't understand nearly as well as the normal llama3 70B

evilperson068

May 20, 2024

> I tried this on ollama and it doesn't understand nearly as well as the normal llama3 70B

Did you try the older version of this model?

shipWr3ck

May 31, 2024

Did you figure out the right chat template to use? I want to use it for inference and I'm not sure whether I can simply copy the inference example from the base instruct model for it to work.

leo-pekelis-gradient

DeepSky org Jun 1, 2024

You can follow the same tokenizer recipes as the base models. Please make sure you're using the latest version of the model, and the tokenizer for this model_id as well.
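One way to sanity-check that advice is to render a short conversation with the tokenizer loaded for this model_id and confirm the Llama 3 special tokens show up. The sketch below assumes a tokenizer object exposing `apply_chat_template` (as Transformers tokenizers do); `render_prompt`, `missing_llama3_markers`, and `_StubTokenizer` are hypothetical names, and the stub exists only so the example runs offline.

```python
# Special tokens expected in a Llama 3 style rendered prompt.
LLAMA3_MARKERS = ("<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>")


def render_prompt(tokenizer, messages):
    """Render messages to a string with the tokenizer's own chat template."""
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


def missing_llama3_markers(rendered):
    """Return the expected special tokens that do not appear in the rendered prompt."""
    return [marker for marker in LLAMA3_MARKERS if marker not in rendered]


class _StubTokenizer:
    """Offline stand-in (hypothetical) that mimics a Llama 3 style template."""

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=True):
        text = "<|begin_of_text|>"
        for m in messages:
            text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        if add_generation_prompt:
            text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
        return text


rendered = render_prompt(_StubTokenizer(), [{"role": "user", "content": "hello"}])
print(missing_llama3_markers(rendered))
```

In real use, the stub would be replaced with `AutoTokenizer.from_pretrained("gradientai/Llama-3-8B-Instruct-Gradient-1048k")`; a non-empty result would suggest the tokenizer in use is not applying a Llama 3 style template.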
