The model produces nonsensical/repeating output (GGUF)
First, thanks for the model!
I am having issues with the officially linked GGUF models (Q8): the model either keeps generating content continuously or sometimes stops after "The". The first message always seems to be OK.
I am using LM Studio 0.2.21 with the default Llama 3 template and parameters (only the context size is changed, to 100k).
True, but in my case the output is not nonsensical: it makes sense, but it repeats the same content. My laptop has 16 GB of RAM and no GPU; the Qwen coder chat model is consistent by comparison. Happy to see long answers, though, even if they are repetitive.
Make sure to correctly include the End Of Stream (EOS) token in your prompt (as the post above says, in German I think).
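For reference, here is a minimal sketch of what a correctly terminated Llama 3 Instruct prompt looks like, assuming the standard Llama 3 special-token strings. The helper name `format_llama3_prompt` is made up for illustration; in practice the frontend's built-in template or the tokenizer's chat template should produce the same layout.

```python
# Sketch of the Llama 3 Instruct prompt layout.
# Assumption: the standard Llama 3 special-token strings are used unchanged.
BOS = "<|begin_of_text|>"
EOT = "<|eot_id|>"  # end-of-turn token (ID 128009 in the Llama 3 vocabulary)


def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts into one prompt string.

    Hypothetical helper, not part of any library.
    """
    parts = [BOS]
    for msg in messages:
        # Each turn is a role header, a blank line, the text, then EOT.
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n")
        parts.append(f"{msg['content'].strip()}{EOT}")
    # Open the assistant turn; the model is expected to close it with EOT.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([{"role": "user", "content": "hello"}])
print(prompt)
```

Every finished turn ends in `<|eot_id|>`, so a runtime that does not treat that token as a stop token will run straight past the end of the assistant's reply.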
For llama.cpp the solution was to change EOS token:
""
A look at the log file then shows that Llama-3 uses 128009 as the EOS token ID.
However, 128001 is entered in the GGUF file, so this can't work. Luckily, llama.cpp has a small script that allows you to change the EOS token ID. The following call changes the EOS token ID of the Meta-Llama-3-70B-Instruct-Q4_K_M.gguf file to 128009.
python llama.cpp/gguf-py/scripts/gguf-set-metadata.py gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf tokenizer.ggml.eos_token_id 128009 --force
""
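To illustrate why the mismatch described in the quote leads to endless output, here is a toy sketch of a stop-on-EOS generation loop. It is not llama.cpp's actual code: `toy_stream` and `generate` are hypothetical, and the token IDs 1, 2, 3 are placeholders for ordinary text tokens. Only the two special IDs (128001 for `<|end_of_text|>`, 128009 for `<|eot_id|>`) come from the Llama 3 vocabulary.

```python
from itertools import cycle

END_OF_TEXT = 128001  # <|end_of_text|>, the ID stored in the GGUF metadata
EOT_ID = 128009       # <|eot_id|>, the token the instruct model emits to end a turn


def toy_stream():
    """Hypothetical model output: a short reply, EOT, then the same again forever."""
    return cycle([1, 2, 3, EOT_ID])


def generate(token_stream, eos_token_id, max_new_tokens=16):
    """Collect tokens until the configured EOS ID appears or the budget runs out."""
    out = []
    for tok in token_stream:
        if len(out) >= max_new_tokens:
            break  # budget exhausted without ever seeing the configured EOS
        if tok == eos_token_id:
            break  # normal stop
        out.append(tok)
    return out


# Configured with the wrong ID, the loop never sees its stop token.
runaway = generate(toy_stream(), END_OF_TEXT)
# Configured with the ID the model really emits, it stops after one reply.
clean = generate(toy_stream(), EOT_ID)
print(len(runaway), clean)
```

With the wrong EOS ID the loop only ends when the token budget is used up, which matches the "keeps generating continuously" symptom. Once the metadata says 128009, the same stream stops after the first reply.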
Yes, I did do that. Unfortunately, once you insert long text the model breaks and stops following the formatting rules (special tokens) and generates continuous output.
Also it generates weird responses, for example:
user: hello
bot: Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat? I'm happy to hear from you but I'll need a moment to check in with myself before our conversation. I just had another request for help that I need to respond to. Thank you very much for waiting.
Has anything changed?