Instructions to use silma-ai/SILMA-9B-Instruct-v1.0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use silma-ai/SILMA-9B-Instruct-v1.0 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="silma-ai/SILMA-9B-Instruct-v1.0")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("silma-ai/SILMA-9B-Instruct-v1.0")
model = AutoModelForCausalLM.from_pretrained("silma-ai/SILMA-9B-Instruct-v1.0", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use silma-ai/SILMA-9B-Instruct-v1.0 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "silma-ai/SILMA-9B-Instruct-v1.0"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silma-ai/SILMA-9B-Instruct-v1.0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/silma-ai/SILMA-9B-Instruct-v1.0

SGLang

How to use silma-ai/SILMA-9B-Instruct-v1.0 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "silma-ai/SILMA-9B-Instruct-v1.0" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silma-ai/SILMA-9B-Instruct-v1.0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "silma-ai/SILMA-9B-Instruct-v1.0" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "silma-ai/SILMA-9B-Instruct-v1.0",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Model loading taking too much GPU memory

by tehreemfarooqi - opened Sep 23, 2024

Discussion

tehreemfarooqi

Sep 23, 2024

Hey, when I try to load the model using the code given in the repo card, it keeps giving me a CUDA out-of-memory error. I am using an NVIDIA V100 with 16 GB of GPU memory. Given that I have run LLMs with more parameters, as well as speech-to-text models, on this GPU, this doesn't make sense to me. I'm using the exact code given in the repo card. Am I doing something wrong?

karimouda

SILMA AI - Arabic Language Models org Sep 23, 2024

Hello Tehreem, and thanks for trying the model.

Our model will run on 16 GB GPUs only when quantized. You can find the sample code here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#quantized-versions-through-bitsandbytes

You can also find our recommended GPU requirements here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#gpu-requirements

Finally, here is a likely technical explanation of why you got the OOM error:

Our model has 9B parameters, each stored as BF16/FP16 (16-bit floating point).
At 2 bytes (16 bits) per parameter, 9 billion parameters take up 18 billion bytes.
To convert that to gigabytes, divide 18 billion bytes by 1,073,741,824 (since 1 GB = 1,073,741,824 bytes).
Therefore, you need about 16.76 GB of GPU memory just to load the weights, which already exceeds the 16 GB on your card before any memory is used for activations during inference.
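
The arithmetic above can be reproduced with a short script. This is only a rough sketch: it assumes a nominal 9 billion parameters (the exact count may differ), counts the weights alone, and ignores activations, the KV cache, and any quantization overhead. The helper name `weight_memory_gib` and the 8-bit and 4-bit rows are illustrative additions, not figures from the repo card.

```python
GIB = 1024 ** 3  # 1,073,741,824 bytes, the divisor used in the explanation above


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


NUM_PARAMS = 9e9  # nominal 9B parameters (assumption; the exact count may differ)

# Compare the 16-bit case from the post with two common quantized widths.
for label, bits in [("bf16/fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, the 16-bit weights come to roughly 16.76 GiB, while the 8-bit and 4-bit estimates fall well below 16 GiB, which is consistent with the advice to use quantization on a 16 GB card.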

tehreemfarooqi

Sep 24, 2024

Thanks for your reply @karimouda ! I was able to run it using a multi-GPU setup.

tehreemfarooqi changed discussion status to closed Sep 24, 2024
