Model loading taking too much GPU memory
Hey, when trying to load the model using the code given in the repo card, it keeps giving me a CUDA out of memory error. I am using an NVIDIA V100 with 16 GB of memory. Given that I have run LLMs with more parameters, as well as speech-to-text models, on this GPU, this doesn't make sense to me. I'm using the exact code given in the repo card. Am I doing something wrong?
Hello Tehreem, and thanks for trying the model.
Our model will run on 16 GB GPUs only in quantized mode. You can find the sample code here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#quantized-versions-through-bitsandbytes
You can also find our recommended GPU requirements here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#gpu-requirements
Finally, here is a probable technical explanation of why you got the OOM error:
- Our model has 9B parameters, each stored as BF16 or FP16 (16-bit floating point).
- Each parameter therefore takes 2 bytes (16 bits), so 9 billion parameters take 18 billion bytes.
- To convert to GB, divide 18 billion bytes by 1,073,741,824 (since 1 GB = 1,073,741,824 bytes).
- Therefore, you need about 16.76 GB of GPU memory just to load the weights, which already exceeds the 16 GB on your V100 before any activations or CUDA overhead are counted.
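The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: the helper name `weight_memory_gib` is made up for illustration, the 9e9 figure is the nominal parameter count used in the explanation, and the 8-bit and 4-bit rows assume every weight is stored at exactly that width, ignoring quantization metadata, activations, and the KV cache.

```python
GIB = 1024 ** 3  # 1,073,741,824 bytes, the divisor used in the explanation


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in units of 1,073,741,824 bytes."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / GIB


if __name__ == "__main__":
    num_params = 9e9  # nominal 9B parameter count (assumption: exactly 9 billion)
    # Idealized widths; real quantized checkpoints carry some extra overhead.
    for label, bits in [("BF16/FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GB")
```

Under these assumptions, only the quantized rows leave headroom on a 16 GB card, which is consistent with the recommendation to load the model in quantized mode.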