# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("grimulkan/llama2_70b_longlora_fp16_32k_ROPE8")
model = AutoModelForCausalLM.from_pretrained("grimulkan/llama2_70b_longlora_fp16_32k_ROPE8")

This is the same as Yukang's Llama-2-70b-longlora-32k, except that the extra pad token has been stripped from the tokenizer to match the base Llama model, and the LoRA weights have been merged into the base model. Please refer to that page for more details.
It was created by merging LongAlpaca-70B-lora into Llama-2-70b, replacing the embed and norm layers as described in the LongLoRA repo, and removing the extra row and pad token.
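As a rough illustration of the "removing the extra row" step, the sketch below trims a toy embedding table back to the base vocabulary size. It is not the script used to build this model: the helper name `strip_extra_rows`, the toy row width of 4, and the assumption that the added pad token occupies the last row (index 32000) are all hypothetical.

```python
# Minimal sketch, assuming the added pad token is the final row of the table.
BASE_VOCAB = 32000  # Llama 2 vocabulary size

def strip_extra_rows(matrix, vocab_size=BASE_VOCAB):
    """Drop any rows beyond the base vocabulary (e.g. an appended pad-token row)."""
    return matrix[:vocab_size]

# Toy stand-in for an embedding table: 32001 rows of width 4 (real rows are much wider).
embed = [[float(row)] * 4 for row in range(BASE_VOCAB + 1)]
trimmed = strip_extra_rows(embed)

print(len(embed), len(trimmed))
```

The same slicing idea would apply to the output projection, which has one row per vocabulary entry as well.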
This is not an instruct-tuned model, but a base model for further fine-tuning. It supports a 32K-token context using linear RoPE scaling with a factor of 8.
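To make the scaling factor concrete, here is a small sketch of how linear RoPE scaling stretches the usable context. It assumes Llama 2's 4096-token pretraining context, a head dimension of 128, and a rotary base of 10000; the function name `rope_angles` is illustrative and is not taken from this model's code.

```python
# Minimal sketch of linear RoPE scaling, under the assumptions stated above.
BASE_CONTEXT = 4096  # assumed Llama 2 pretraining context length
ROPE_FACTOR = 8      # linear scaling factor stated for this model

def rope_angles(position, head_dim=128, base=10000.0, factor=1.0):
    """Rotary angles for one position; linear scaling divides the position by `factor`."""
    scaled = position / factor
    return [scaled / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

extended_context = BASE_CONTEXT * ROPE_FACTOR
print(extended_context)  # 32768

# With factor 8, position 32760 yields the same angles as position 4095 unscaled,
# so long-context positions stay inside the range seen during pretraining.
scaled_angles = rope_angles(32760, factor=ROPE_FACTOR)
base_angles = rope_angles(4095)
print(scaled_angles == base_angles)  # True
```

In other words, the factor of 8 compresses 32768 positions into the original 4096-position range, which is where the 32K figure comes from.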