SageMaker deployment failing on ml.g5.2xlarge instance
I am getting the error below in CloudWatch. We are trying to deploy the model on an ml.g5.2xlarge instance. Is there a fix for this, or do we need to deploy on a bigger instance?
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB
The above exception was the direct cause of the following exception:
The model can be deployed on g5.xlarge with torch.bfloat16.
Thanks @senwu. Can you please tell me how to set the torch.bfloat16 configuration in the deployment script? Sorry, I am new to this and don't know many of these configs. Below is the deployment script I am using:
```python
import json  # needed for json.dumps below; missing from the original script

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker there is no execution role, so look it up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230723T133694')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
})
```
Hi @rishisaraf11,
We haven't used SageMaker to deploy the model, and from the docs it doesn't seem like there is much flexibility. The model prefers torch.bfloat16, but you can still use another dtype.
Hi @senwu,
I tried different variations of passing SM_FRAMEWORK_PARAMS in the env for the HuggingFaceModel class in the script shared by @rishisaraf11, but no luck:
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
    'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'bfloat16'}"
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
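One detail worth checking in the attempt above: the value passed as SM_FRAMEWORK_PARAMS is a Python-style dict literal with single quotes, which is not valid JSON. The sketch below only demonstrates that difference and builds the same env dict with `json.dumps`. Whether the Hugging Face LLM container reads SM_FRAMEWORK_PARAMS at all is not confirmed anywhere in this thread, so treat that as an assumption; `is_valid_json` and `build_hub_env` are hypothetical helper names.

```python
import json

# The string used in the attempt above: Python repr style, single quotes.
single_quoted = "{'torch_dtype': 'bfloat16'}"


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def build_hub_env(model_id: str, num_gpus: int, framework_params: dict) -> dict:
    """Hypothetical helper: build the env dict with JSON-encoded string values.

    Assumption: the container parses SM_FRAMEWORK_PARAMS as JSON, if it
    reads the variable at all. The thread does not confirm this.
    """
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "SM_FRAMEWORK_PARAMS": json.dumps(framework_params),
    }


hub = build_hub_env(
    "NumbersStation/nsql-llama-2-7B", 1, {"torch_dtype": "bfloat16"}
)

print(is_valid_json(single_quoted))               # False
print(is_valid_json(hub["SM_FRAMEWORK_PARAMS"]))  # True
print(hub["SM_FRAMEWORK_PARAMS"])                 # {"torch_dtype": "bfloat16"}
```

If the variable is ignored by the container, fixing the quoting will not change the outcome, but it rules out one possible cause.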
It seems like SageMaker doesn't have full Transformers support yet. You can use the default config for the model as well.
You can also use a g5.2xlarge machine, or pass low_cpu_mem_usage=True (see https://huggingface.co/docs/transformers/main_classes/model) to reduce RAM usage when loading the model.
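For anyone loading the model directly with Transformers rather than through SageMaker, the two suggestions in this thread (bfloat16 weights and low_cpu_mem_usage=True) map onto `from_pretrained` arguments roughly as follows. This is a sketch under assumptions, not something the thread confirms was tested: `load_nsql` is a hypothetical helper name, and `low_cpu_mem_usage=True` needs the `accelerate` package installed.

```python
MODEL_ID = "NumbersStation/nsql-llama-2-7B"


def load_nsql(model_id: str = MODEL_ID):
    """Load the tokenizer and model with bfloat16 weights.

    Hypothetical helper for illustration only. Calling it downloads the
    model weights from the Hugging Face Hub.
    """
    # Imported lazily so this file can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half the weight memory of float32
        low_cpu_mem_usage=True,      # lowers peak CPU RAM while loading
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_nsql()` and then moving the model to the GPU with `model.to("cuda")` would be the usual next step. Note that `low_cpu_mem_usage` affects host RAM during loading, not GPU VRAM, so on its own it would not address the CUDA out-of-memory error reported above.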
Thank you for the reply, @senwu.
The problem seems to be an overflow of GPU VRAM: the usable limit is ~22.2 GiB on ml.g5.2xlarge, which has an NVIDIA A10G 24 GB GPU.
Error: SageMaker deployment failed due to a memory error
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB
The torch.float32 version of the model requires around 26 GB of VRAM. We will adjust the default model dtype this week.
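As a rough check on the numbers in this thread, weight memory is approximately the parameter count times the bytes per parameter. The sketch below assumes about 6.74 billion parameters for a Llama 2 7B model (an approximate figure that is not stated in the thread) and ignores activations, the KV cache, and CUDA overhead, so real usage will be higher than these estimates.

```python
GIB = 1024 ** 3  # bytes per GiB, the unit used in the CUDA error log
GB = 1000 ** 3   # bytes per GB

# Assumption: approximate parameter count of a Llama 2 7B model.
N_PARAMS = 6.74e9

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Device limit reported in the error log in this thread.
DEVICE_LIMIT_GIB = 22.20


def weight_bytes(n_params: float, dtype: str) -> float:
    """Estimate the memory taken by the model weights alone, in bytes."""
    return n_params * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    size = weight_bytes(N_PARAMS, dtype)
    fits = size / GIB < DEVICE_LIMIT_GIB
    print(
        f"{dtype}: {size / GB:.1f} GB ({size / GIB:.1f} GiB), "
        f"weights fit under device limit: {fits}"
    )
```

Under these assumptions the float32 weights alone come to roughly 25 GiB, above the 22.20 GiB device limit in the log, while the bfloat16 weights come to roughly 12.6 GiB, which leaves headroom on the same GPU.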