Instructions for using togethercomputer/LLaMA-2-7B-32K with libraries and local apps.

Libraries

How to use togethercomputer/LLaMA-2-7B-32K with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/LLaMA-2-7B-32K")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/LLaMA-2-7B-32K")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/LLaMA-2-7B-32K")

vLLM

How to use togethercomputer/LLaMA-2-7B-32K with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "togethercomputer/LLaMA-2-7B-32K"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
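
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are illustrative (not part of vLLM), and the response parsing assumes the usual OpenAI-style shape with `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="togethercomputer/LLaMA-2-7B-32K",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumption: the server answers with an OpenAI-style JSON body
    containing choices[0].text.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the generated continuation. For a server listening on another port, pass `base_url`, for example `base_url="http://localhost:30000"`.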

Use Docker

docker model run hf.co/togethercomputer/LLaMA-2-7B-32K

SGLang

How to use togethercomputer/LLaMA-2-7B-32K with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "togethercomputer/LLaMA-2-7B-32K" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "togethercomputer/LLaMA-2-7B-32K" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "togethercomputer/LLaMA-2-7B-32K",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use togethercomputer/LLaMA-2-7B-32K with Docker Model Runner:
```
docker model run hf.co/togethercomputer/LLaMA-2-7B-32K
```

Installed ! pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary but flash_llama still erroring out

#25

by ajash - opened Sep 5, 2023

Discussion

ajash

Sep 5, 2023

edited Sep 5, 2023

I have installed all the dependencies required to run flash-attn:
! pip install flash-attn --no-build-isolation
! pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "togethercomputer/LLaMA-2-7B-32K"  # assumed; the original post does not show this assignment
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto', trust_remote_code=True, torch_dtype=torch.bfloat16, revision="refs/pr/17")
This does not work. It fails with:

ImportError: Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary

I have already installed this dependency.

ajash

Sep 5, 2023

Output of:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto', trust_remote_code=True, torch_dtype=torch.bfloat16)

Downloading (…)lve/main/config.json: 100%
709/709 [00:00<00:00, 62.2kB/s]
Downloading (…)eling_flash_llama.py: 100%
45.3k/45.3k [00:00<00:00, 3.74MB/s]
A new version of the following files was downloaded from https://huggingface.co/togethercomputer/LLaMA-2-7B-32K:
- modeling_flash_llama.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
>>>> Flash Attention installed

ModuleNotFoundError                       Traceback (most recent call last)
~/.cache/huggingface/modules/transformers_modules/togethercomputer/LLaMA-2-7B-32K/aef6d8946ae1015bdb65c478a2dd73b58daaef47/modeling_flash_llama.py in
     51 try:
---> 52     from flash_attn.layers.rotary import apply_rotary_emb_func
     53     flash_rope_installed = True

12 frames
ModuleNotFoundError: No module named 'flash_attn.ops.triton'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
~/.cache/huggingface/modules/transformers_modules/togethercomputer/LLaMA-2-7B-32K/aef6d8946ae1015bdb65c478a2dd73b58daaef47/modeling_flash_llama.py in
     55 except ImportError:
     56     flash_rope_installed = False
---> 57     raise ImportError('Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary')
     58
     59

ImportError: Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary

bjoernp

Sep 5, 2023

edited Sep 5, 2023

This is currently a bug in flash-attn. Try installing v2.1.1 for now:

pip install flash-attn==2.1.1 --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git@v2.1.1#subdirectory=csrc/rotary
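
After reinstalling, it can help to confirm which version the Python environment actually resolves. A minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `check_pin` are illustrative, and the distribution name `flash-attn` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def check_pin(distribution, expected):
    """Describe whether the installed version matches the expected pin."""
    found = installed_version(distribution)
    if found is None:
        return f"{distribution} is not installed"
    if found != expected:
        return f"{distribution} {found} is installed, expected {expected}"
    return f"{distribution} {expected} is installed"


# Report the state of the pin suggested in this thread.
print(check_pin("flash-attn", "2.1.1"))
```

In a notebook, restart the runtime after changing versions so that a previously imported copy of the package is not reused.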

ajash

Sep 6, 2023

That worked, thanks!
How does one figure this out on their own? :)
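
One way to approach this kind of failure: in the traceback above, the first exception is `ModuleNotFoundError: No module named 'flash_attn.ops.triton'`, and the `ImportError` asking to install the RoPE kernels is only raised while handling it. Importing the modules directly surfaces the underlying error instead of the generic hint. A minimal sketch; `probe_import` is an illustrative helper, and the module names are the ones that appear in the traceback.

```python
import importlib
import traceback


def probe_import(module_name):
    """Try to import a module.

    Returns None on success, or the formatted traceback on failure so the
    underlying error stays visible.
    """
    try:
        importlib.import_module(module_name)
        return None
    except Exception:
        return traceback.format_exc()


# Module names taken from the traceback in this thread.
for name in ("flash_attn", "flash_attn.layers.rotary", "flash_attn.ops.triton"):
    error = probe_import(name)
    # Show only the final line of the traceback, which names the real problem.
    status = "ok" if error is None else error.strip().splitlines()[-1]
    print(f"{name}: {status}")
```

If a submodule is reported missing while the top-level package imports fine, the installed package version and the code that imports it are likely out of step, which points toward trying a different release.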

MaZeNsMz

Sep 18, 2023
