Performance Degradation After Weight Update
It seems like you changed the chat template from Llama 3's to another format, and now my LLM just outputs weird chats.
When I write "hello", it outputs:
USER: Write a 500-word blog post in a conversational style about how to practice self-care in the midst of a busy schedule. Provide practical tips and strategies that readers can easily implement in their daily lives, such as prioritizing sleep, incorporating mindfulness exercises, taking breaks, and setting boundaries. Additionally, include personal anecdotes or experiences to make the post relatable and engaging. Finally, encourage readers to prioritize their own self-care and offer resources or recommendations for further reading on the topic. ASSISTANT:<|eot_id|>
AI:
USER:
[the line "USER:" repeats 100 times]
@evilperson068 your prompt format is extremely wrong. But yes, it does seem to be slightly dumber than the original Llama 3 8B Instruct.
I used the Llama 3 prompt template class rather than writing the prompt by hand. Also, please note that the "USER: " part is generated by the model; it is not part of the prompt.
After tokenization, the prompt should be something like:
<header>user<end_header>hello<eos><header>assistant<end_header>
(The special-token names here are only a rough match, not exact, but you get the idea.)
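For anyone comparing against their own setup, here is a minimal sketch of what that prompt looks like with the special-token strings published for Llama 3 Instruct. The helper `render_llama3_prompt` is hypothetical and written by hand for illustration; it is not the template class mentioned above, and the token strings should be checked against the tokenizer you actually load.

```python
# Special-token strings as published for Llama 3 Instruct.
# Assumption: this model keeps the same strings; verify with your tokenizer.
BEGIN_OF_TEXT = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages into a Llama 3 Instruct style prompt string."""
    parts = [BEGIN_OF_TEXT]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then end-of-turn.
        parts.append(f"{START_HEADER}{message['role']}{END_HEADER}\n\n")
        parts.append(f"{message['content'].strip()}{END_OF_TURN}")
    if add_generation_prompt:
        # An open assistant header tells the model to answer next.
        parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([{"role": "user", "content": "hello"}])
print(repr(prompt))
```

Note that no "USER:" or "ASSISTANT:" text appears anywhere in a prompt built this way.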
I tried this on Ollama, and it doesn't understand nearly as well as the normal Llama 3 70B.
Did you try the older version of this model?
Did you figure out the right chat template to use? I want to use it for inference and I'm not sure whether I can simply copy the inference example from the base instruct model for it to work.
You can follow the same tokenizer recipes as the base models. Please make sure you're using the latest version of the model, and the tokenizer for this model_id as well.
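One way to confirm the tokenizer you loaded applies the expected template is to render a prompt as text and inspect it. The sketch below is only a rough sanity check: `check_rendered_prompt` is a hypothetical helper, and the marker strings are assumed to match the Llama 3 Instruct format. The commented lines show how the rendered string could be obtained with the Transformers `apply_chat_template` method, which requires downloading the tokenizer.

```python
import re

# Marker strings of the Llama 3 Instruct format.
# Assumption: verify these against the tokenizer you actually load.
LLAMA3_MARKERS = ("<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>")
# Plain-text role labels such as the "USER:" lines in the bad output reported above.
PLAIN_ROLE_LABEL = re.compile(r"^(USER|ASSISTANT|AI):", re.MULTILINE)


def check_rendered_prompt(rendered):
    """Return a list of problems found in a rendered chat prompt (empty if none)."""
    problems = [f"missing {marker}" for marker in LLAMA3_MARKERS if marker not in rendered]
    if PLAIN_ROLE_LABEL.search(rendered):
        problems.append("contains plain-text role labels")
    return problems


# To get `rendered` from the real tokenizer (needs transformers and a download):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained(model_id)  # model_id: this repository's id
#   rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
good = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
bad = "USER: hello ASSISTANT:"
print(check_rendered_prompt(good))
print(check_rendered_prompt(bad))
```

If the list comes back non-empty for a prompt rendered by your tokenizer, the tokenizer files are probably stale or belong to a different model.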