Instructions to use NumbersStation/nsql-llama-2-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use NumbersStation/nsql-llama-2-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NumbersStation/nsql-llama-2-7B")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B")


vLLM

How to use NumbersStation/nsql-llama-2-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NumbersStation/nsql-llama-2-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NumbersStation/nsql-llama-2-7B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/NumbersStation/nsql-llama-2-7B

SGLang

How to use NumbersStation/nsql-llama-2-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NumbersStation/nsql-llama-2-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NumbersStation/nsql-llama-2-7B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NumbersStation/nsql-llama-2-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NumbersStation/nsql-llama-2-7B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use NumbersStation/nsql-llama-2-7B with Docker Model Runner:
```
docker model run hf.co/NumbersStation/nsql-llama-2-7B
```

Sagemaker Deployment Failing in ml.g5.2xlarge instance

by rishisaraf11 - opened Aug 17, 2023

Discussion

rishisaraf11

Aug 17, 2023

I am getting the error below in CloudWatch. We are trying to deploy the model on an ml.g5.2xlarge instance. Is there a resolution for this, or do we need to deploy it on a bigger instance?

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB
The above exception was the direct cause of the following exception:

senwu

NumbersStation org Aug 18, 2023

The model can be deployed on g5.xlarge with torch.bfloat16.
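For readers loading the model directly with Transformers rather than through a managed container, a minimal sketch of selecting the dtype might look like the following. The helper name `load_model` and the `SUPPORTED_DTYPES` tuple are illustrative, not part of any library, and calling the function downloads the full model weights.

```python
MODEL_ID = "NumbersStation/nsql-llama-2-7B"
SUPPORTED_DTYPES = ("bfloat16", "float16", "float32")  # illustrative allow-list


def load_model(dtype_name: str = "bfloat16"):
    """Load the tokenizer and model with the requested torch dtype.

    Hypothetical helper for illustration; it downloads the weights on first use.
    """
    if dtype_name not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype_name!r}")

    # Imported lazily so the dtype check above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=getattr(torch, dtype_name),  # e.g. torch.bfloat16
    )
    return tokenizer, model
```

Calling `load_model()` with the default argument requests bfloat16 weights, which take half the memory of float32 weights.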

rishisaraf11

Aug 18, 2023

Thanks @senwu. Can you please tell me how to set the torch.bfloat16 configuration in the deployment script? Sorry, I am new to this and don't know many of these configs. Below is the deployment script I am using:

import json  # required for json.dumps in the hub config below
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230723T133694')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="0.9.3"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
})

senwu

NumbersStation org Aug 19, 2023

Hi @rishisaraf11,

We haven't used SageMaker to deploy the model, and from the docs it doesn't seem like there is much flexibility. The model prefers torch.bfloat16, but you can still use another dtype.

arviii

Aug 19, 2023

Hi @senwu

I tried different variations of passing `SM_FRAMEWORK_PARAMS` into env for the `HuggingFaceModel` class in the script shared by @rishisaraf11, but no luck:

hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
    'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'bfloat16'}"
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
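
One variation that may be worth trying is sketched below. It rests on an assumption that has not been verified against image version 0.9.3: the text-generation-inference launcher behind the Hugging Face LLM image documents a `--dtype` option that can also be read from a `DTYPE` environment variable, and a recent enough container may honor it. The helper `build_hub_env` is hypothetical and only builds the `env` dictionary. It does not deploy anything.

```python
import json


def build_hub_env(model_id: str, num_gpus: int = 1, dtype: str = "bfloat16") -> dict:
    """Build the env dict passed to HuggingFaceModel (all values are strings).

    Hypothetical helper for illustration only.
    """
    if dtype not in ("float16", "bfloat16"):
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        # Assumption: the container reads DTYPE as the launcher's --dtype option.
        "DTYPE": dtype,
    }


hub = build_hub_env("NumbersStation/nsql-llama-2-7B")
print(hub)
```

The resulting `hub` dictionary would be passed as `env=hub` in the same way as in the script above.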

senwu

NumbersStation org Aug 25, 2023

It seems like SageMaker doesn't have full Transformers support yet. You can use the default config for the model as well.

You can also use a g5.2xlarge machine, or set low_cpu_mem_usage=True (see https://huggingface.co/docs/transformers/main_classes/model) to reduce RAM usage when loading the model.

arviii

Aug 28, 2023

Thank you for the reply @senwu

The problem seems to be an overflow of GPU VRAM, which is limited to `~22.2 GB` on ml.g5.2xlarge (an NVIDIA A10G 24 GB GPU).

Error: Sagemaker deployment failed due to memory error

torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction)
: 22.20 GiB

senwu

NumbersStation org Aug 28, 2023

The torch.float32 version of the model requires around 26 GB of VRAM. We will adjust the default model type this week.
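
The figure above can be checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 6.74 billion parameters for a Llama-2-7B-sized model (an approximation, not a number taken from this thread) and counts weights only. Activations, the KV cache, and CUDA overhead come on top of that.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

APPROX_PARAMS = 6.74e9    # assumption: approximate Llama-2-7B parameter count
DEVICE_LIMIT_GIB = 22.20  # "Device limit" reported in the error log above


def weights_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in ("float32", "bfloat16"):
    size = weights_gib(APPROX_PARAMS, dtype)
    fits = size < DEVICE_LIMIT_GIB
    print(f"{dtype}: {size:.1f} GiB of weights, fits under device limit: {fits}")
```

Under these assumptions the float32 weights come to about 25 GiB (about 27 GB), which is more than the 22.20 GiB device limit, while the bfloat16 weights need about half of that.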
