Instructions to use gradientai/Llama-3-8B-Instruct-262k with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use gradientai/Llama-3-8B-Instruct-262k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="gradientai/Llama-3-8B-Instruct-262k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gradientai/Llama-3-8B-Instruct-262k")
model = AutoModelForCausalLM.from_pretrained("gradientai/Llama-3-8B-Instruct-262k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use gradientai/Llama-3-8B-Instruct-262k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "gradientai/Llama-3-8B-Instruct-262k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-262k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
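
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes the server started above is listening on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request

MODEL_ID = "gradientai/Llama-3-8B-Instruct-262k"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request. Hypothetical helper."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    # Assumes an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.full_url)
```

Calling `ask("What is the capital of France?")` performs the actual HTTP call, so it only works while the server is up.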

Use Docker

docker model run hf.co/gradientai/Llama-3-8B-Instruct-262k

SGLang

How to use gradientai/Llama-3-8B-Instruct-262k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "gradientai/Llama-3-8B-Instruct-262k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-262k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "gradientai/Llama-3-8B-Instruct-262k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gradientai/Llama-3-8B-Instruct-262k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use gradientai/Llama-3-8B-Instruct-262k with Docker Model Runner:
```
docker model run hf.co/gradientai/Llama-3-8B-Instruct-262k
```

The model produces nonsensical/repeating output (GGUF)

#13

by nullt3r - opened Apr 27, 2024

Discussion

nullt3r

Apr 27, 2024 • edited Apr 27, 2024

First, thanks for the model!

I am having issues with the officially linked GGUF models (Q8): the model either keeps generating content continuously or sometimes stops after "The". The first message always seems to be OK.

I am using LM Studio 0.2.21 with the default Llama 3 template and parameters (only the context size is changed, to 100k).
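
For readers unfamiliar with the Llama 3 template mentioned here, the sketch below renders a conversation in the published Llama 3 Instruct layout. The function name `format_llama3_prompt` is hypothetical, and LM Studio's preset may differ in small details such as whitespace handling, so treat this as an illustration rather than the exact preset.

```python
def format_llama3_prompt(messages):
    """Render chat messages in the Llama 3 Instruct prompt layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model writes the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

The model is expected to end its reply with `<|eot_id|>`; if the runtime does not treat that token as a stop signal, generation continues past the end of the turn.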

nullt3r changed discussion title from The model produces nonsensical/repeating output to The model produces nonsensical/repeating output (GGUF) Apr 27, 2024

subbur

Apr 28, 2024

True, but in my case the output is not nonsensical: it makes sense, but it repeats the same content. My laptop has 16 GB of RAM and no GPU; the Qwen coder chat model is consistent on it. Still, I am happy to see long answers, even if they are repetitive.

Michael22

Apr 29, 2024

Try https://www.jentsch.io/meta-llama-3-70b-instruct-q4_k_m-gguf-eos-token-fix/

meirly

Apr 29, 2024

Make sure you correctly include the EOS (end-of-sequence) token in your prompt (as the post linked above says, in German I think).

For llama.cpp the solution was to change the EOS token:
""
A look at the log file then shows that Llama-3 uses 128009 as the EOS token ID.

However, 128001 is entered in the GGUF file. So this can't work. Luckily, llama.cpp has a small script that allows you to change the EOS token ID. The following call changes the EOS token ID of the Meta-Llama-3-70B-Instruct-Q4_K_M.gguf file to 128009.

python llama.cpp/gguf-py/scripts/gguf-set-metadata.py gguf/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf tokenizer.ggml.eos_token_id 128009 --force
""
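
To illustrate why the mismatch described in the quote leads to endless output, here is a toy sketch of a stop check. It is not llama.cpp's actual code: the `generate` helper and the token stream are hypothetical, and only the two special token IDs (128009 for `<|eot_id|>`, 128001 for `<|end_of_text|>`) come from Llama 3.

```python
EOT_ID = 128009          # <|eot_id|>: ends an assistant turn in Llama 3 Instruct
END_OF_TEXT_ID = 128001  # <|end_of_text|>: the ID stored in the affected GGUF file


def generate(token_stream, eos_token_id, max_new_tokens=8):
    """Collect tokens until eos_token_id appears or the token budget runs out."""
    output = []
    for token_id in token_stream:
        if token_id == eos_token_id or len(output) >= max_new_tokens:
            break
        output.append(token_id)
    return output


# Hypothetical model output with arbitrary placeholder IDs:
# a short answer, then <|eot_id|>, then text the user should never see.
stream = [9906, 0, EOT_ID, 791, 791, 791, 791, 791, 791, 791]

stopped = generate(stream, eos_token_id=EOT_ID)          # stops at end of turn
runaway = generate(stream, eos_token_id=END_OF_TEXT_ID)  # runs to the budget
print(len(stopped), len(runaway))  # prints: 2 8
```

With the wrong stop ID, the end-of-turn token is treated as ordinary output and generation only ends when the token limit is hit, which matches the "keeps generating" symptom reported above.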

nullt3r

Apr 29, 2024

Yes, I did that. Unfortunately, once you insert a long text, the model breaks: it stops following the formatting rules (special tokens) and generates continuous output.

Also it generates weird responses, for example:

user: hello
bot: Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat? I'm happy to hear from you but I'll need a moment to check in with myself before our conversation. I just had another request for help that I need to respond to. Thank you very much for waiting.

vihangsharma

May 2, 2024

Has anything changed?
