Installed the RoPE kernels (! pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary) but flash_llama is still erroring out
I have installed all the required dependencies to run flash-attn:
! pip install flash-attn --no-build-isolation
! pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto', trust_remote_code=True, torch_dtype=torch.bfloat16, revision="refs/pr/17")
This is not working. Error:
ImportError: Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
I have already installed this dependency.
Output of:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto', trust_remote_code=True, torch_dtype=torch.bfloat16)
Downloading (…)lve/main/config.json: 100%
709/709 [00:00<00:00, 62.2kB/s]
Downloading (…)eling_flash_llama.py: 100%
45.3k/45.3k [00:00<00:00, 3.74MB/s]
A new version of the following files was downloaded from https://huggingface.co/togethercomputer/LLaMA-2-7B-32K:
- modeling_flash_llama.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
>>>> Flash Attention installed
ModuleNotFoundError Traceback (most recent call last)
~/.cache/huggingface/modules/transformers_modules/togethercomputer/LLaMA-2-7B-32K/aef6d8946ae1015bdb65c478a2dd73b58daaef47/modeling_flash_llama.py in
51 try:
---> 52 from flash_attn.layers.rotary import apply_rotary_emb_func
53 flash_rope_installed = True
12 frames
ModuleNotFoundError: No module named 'flash_attn.ops.triton'
During handling of the above exception, another exception occurred:
ImportError Traceback (most recent call last)
~/.cache/huggingface/modules/transformers_modules/togethercomputer/LLaMA-2-7B-32K/aef6d8946ae1015bdb65c478a2dd73b58daaef47/modeling_flash_llama.py in
55 except ImportError:
56 flash_rope_installed = False
---> 57 raise ImportError('Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary')
58
59
ImportError: Please install RoPE kernels: pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
This is currently a bug in flash-attn. Try installing v2.1.1 for now:
pip install flash-attn==2.1.1 --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git@v2.1.1#subdirectory=csrc/rotary
that worked... thanks
how does one figure this out by themselves :)
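One way to work this out is to look at the first exception in the traceback rather than the last. The final ImportError is a generic hint raised by modeling_flash_llama.py whenever the rotary import fails for any reason, while the earlier ModuleNotFoundError names the module that is actually missing (flash_attn.ops.triton here). The sketch below repeats that import outside the model code and reports the real cause along with the installed version. The helper name diagnose_import and the shape of the returned dict are made up for illustration; only standard-library calls are used.

```python
import importlib
from importlib import metadata


def diagnose_import(module_name: str, dist_name: str) -> dict:
    """Try an import and report the underlying failure, not a generic hint.

    `diagnose_import` is a hypothetical helper written for this thread.
    """
    report = {
        "module": module_name,
        "ok": False,
        "missing": None,   # module name Python could not find, if any
        "error": None,     # message of any other ImportError
        "version": None,   # installed version of the pip distribution
    }

    # Installed version of the distribution, if pip knows about it.
    try:
        report["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass

    try:
        importlib.import_module(module_name)
        report["ok"] = True
    except ModuleNotFoundError as exc:
        # exc.name is the module that was not found, which can be a
        # dependency deep inside the package rather than module_name itself.
        report["missing"] = exc.name
    except ImportError as exc:
        report["error"] = str(exc)

    return report


if __name__ == "__main__":
    # The same import that modeling_flash_llama.py wraps in try/except.
    print(diagnose_import("flash_attn.layers.rotary", "flash-attn"))
```

If "missing" names a submodule of flash_attn while "version" shows a recent release, the installed package and the code importing it are out of sync, and pinning a known-good version as suggested above is a reasonable thing to try.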
