When will there be a GGML version?

by CUIGuy - opened Aug 21, 2023

Discussion

CUIGuy

Aug 21, 2023

Is it possible to have a GGML version?

pbkowalski

Aug 22, 2023

There is already one from TheBloke (https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GGML). Unfortunately, it only outputs gibberish for me.

CUIGuy

Aug 22, 2023

•

edited Aug 22, 2023

> There is already one from TheBloke ( https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GGML ), unfortunately it only outputs gibberish for me

What prompt are you using? People say this model uses a different prompt than the original Llama chat prompt. @pbkowalski

pbkowalski

Aug 22, 2023

@CUIGuy I've tried both the specified variant, [INST]...[\INST], and others, but the output is just symbols regardless.
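For reference, a minimal sketch of building the instruction prompt in Python. It assumes the closing tag uses a forward slash (`[/INST]`) and the newline layout that appears in the tokenization test later in this thread; `build_prompt` is a hypothetical helper name, not part of any library.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the [INST] ... [/INST] layout.

    Assumption: the closing tag is "[/INST]" (forward slash), followed by
    two newlines, as in the tokenization test quoted later in this thread.
    """
    return f"[INST]\n{instruction}\n[/INST]\n\n"


prompt = build_prompt("Write a poem about cats")
print(repr(prompt))
```

Note that `[\INST]` (backslash) is a different string from `[/INST]`, so it is worth checking which one actually reaches the model.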

CUIGuy

Aug 22, 2023

> @CUIGuy I've tried both the variant specified [INST]...[\INST] and others, but the output is just symbols regardless

Got it.

mauriceweber

Aug 23, 2023

@pbkowalski For which quantization levels did you observe this?

pbkowalski

Aug 23, 2023

•

edited Aug 23, 2023

@mauriceweber I've only tried 2_K, 4_0 and 4_1

The output I get from 4_1:

'[INST]\nWrite a poem about cats\n[\INST]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',

mapa17

Aug 25, 2023

•

edited Aug 25, 2023

I tried different prompts and also only get long sequences of "\n". Could it be that something breaks in the tokenization of the input?
Can someone with access to the unquantized model verify the token sequence for the following?

m.tokenize("[INST]\nWrite a poem about cats\n[/INST]\n\n".encode('utf8'))
[1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274, 1446, 13, 29961, 29914, 25580, 29962, 13, 13]
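One way to run that check is sketched below, assuming the reference ids are obtained separately from the unquantized tokenizer (the commented-out lines show how that might look with Transformers; they are not executed here). `first_mismatch` is a hypothetical helper that reports where two token sequences diverge.

```python
from typing import List, Optional


def first_mismatch(reference: List[int], candidate: List[int]) -> Optional[int]:
    """Return the first index where the sequences differ, or None if identical."""
    for i, (ref_id, cand_id) in enumerate(zip(reference, candidate)):
        if ref_id != cand_id:
            return i
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


# Token ids reported above for the GGML model.
ggml_ids = [1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274, 1446,
            13, 29961, 29914, 25580, 29962, 13, 13]

# Assumption: the reference ids would come from the unquantized tokenizer, e.g.
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
#   reference_ids = tok("[INST]\nWrite a poem about cats\n[/INST]\n\n")["input_ids"]
#   print(first_mismatch(reference_ids, ggml_ids))
```

If the helper returns `None`, the two tokenizers agree on this prompt and the problem lies elsewhere.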

rozek

Aug 31, 2023

In my experience, the Q2 to Q4 quantizations are too small for proper output: even when they generate "useful" text (rather than just newlines), these models hallucinate far too much. The Q8_0 quantization, however, works pretty well. With llama.cpp, 16 GB of RAM allows context lengths up to 16k, and 24 GB allows up to 32k (tested on a 15" MacBook Air with 24 GB of unified RAM).
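A rough back-of-the-envelope check of those memory figures, as a sketch only. It assumes the standard Llama-2-7B shape (32 layers, hidden size 4096, no grouped-query attention) and a 16-bit KV cache, neither of which is stated in this thread; `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(n_tokens: int,
                 n_layers: int = 32,        # assumed Llama-2-7B layer count
                 hidden_size: int = 4096,   # assumed Llama-2-7B hidden size
                 bytes_per_value: int = 2   # assumed 16-bit cache entries
                 ) -> float:
    """Estimate KV cache size in GiB: one key and one value vector per layer per token."""
    total_bytes = 2 * n_layers * hidden_size * bytes_per_value * n_tokens
    return total_bytes / 2**30


for context in (16 * 1024, 32 * 1024):
    print(f"{context} tokens: {kv_cache_gib(context):.1f} GiB of KV cache")
```

Under those assumptions the cache alone takes 8 GiB at 16k tokens and 16 GiB at 32k. Adding roughly 7 GB for the Q8_0 weights (also an estimate) lands close to the 16 GB and 24 GB figures reported above.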
