Instructions for using togethercomputer/Llama-2-7B-32K-Instruct with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use togethercomputer/Llama-2-7B-32K-Instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/Llama-2-7B-32K-Instruct")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
```
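To actually generate text with the pipeline, a minimal sketch follows. The `[INST] ... [/INST]` prompt wrapper and the sampling settings are assumptions based on the usual Llama-2 instruct convention; check the model card for the exact prompt template this checkpoint expects.

```python
# Minimal generation sketch with the pipeline above.
# Assumption: the [INST] ... [/INST] wrapper follows the standard Llama-2
# instruct convention; consult the model card for the exact template.
from transformers import pipeline

pipe = pipeline("text-generation", model="togethercomputer/Llama-2-7B-32K-Instruct")

prompt = "[INST]\nWrite a short summary of why long context windows are useful.\n[/INST]\n\n"
output = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```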
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use togethercomputer/Llama-2-7B-32K-Instruct with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "togethercomputer/Llama-2-7B-32K-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/Llama-2-7B-32K-Instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
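Once the server is up, any OpenAI-compatible client can talk to it. The sketch below uses the `openai` Python package pointed at the local endpoint; the package itself and the placeholder API key are assumptions, not part of the vLLM install above.

```python
# Query the local vLLM server through its OpenAI-compatible API.
# Assumes `pip install openai`; the api_key is a placeholder, since the
# local server does not require authentication by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="togethercomputer/Llama-2-7B-32K-Instruct",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```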
- SGLang
How to use togethercomputer/Llama-2-7B-32K-Instruct with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "togethercomputer/Llama-2-7B-32K-Instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/Llama-2-7B-32K-Instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "togethercomputer/Llama-2-7B-32K-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "togethercomputer/Llama-2-7B-32K-Instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
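The same completions call can be made from Python. This is a sketch of the curl request above using the `requests` package (an assumption; any HTTP client works the same way against the OpenAI-compatible endpoint).

```python
# Python equivalent of the curl call above, using the requests package
# (assumed to be installed; any HTTP client works the same way).
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "togethercomputer/Llama-2-7B-32K-Instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```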
- Docker Model Runner
How to use togethercomputer/Llama-2-7B-32K-Instruct with Docker Model Runner:
```bash
docker model run hf.co/togethercomputer/Llama-2-7B-32K-Instruct
```
Commit History
Update added_tokens.json a9fc7ba
Fast tokenizer b050a6f
Update README.md 35696b9
Update README.md 4d80166
Update README.md e7f027c
update model weights e6bf8ab
Yucheng Lu committed on
Update README.md 294d968
Update README.md 6fd5a28
update model weights to fp16; add bos token to config a3e7bfc
Yucheng Lu committed on
Update README.md fd00afd
Update README.md c559461
update latest model weights e178def
Yucheng Lu committed on
Update README.md: add PG19 evaluation results 62502aa
Update README.md 65e7cb3
Update README.md c9cd6ba
Update README.md: fix the typos for model loading example 0fb7015
Update README: add the details for data collections and fix typos a8043fd
init b71cf75
Yucheng Lu committed on