Inference is very slow (about 3 secs/token)
Great to have this model in HF! The inference is super slow - makes it hard to do real-time experiments. Can this be sped up easily?
As measured on Windows 11, CPU: i9-13900KF, 128 GB RAM, GPU: RTX 3090 (24 GB).
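A figure like "about 3 secs/token" can be reproduced with a small timing wrapper. This is a minimal sketch, not the poster's actual measurement code: `seconds_per_token` and the `generate` callable are hypothetical names, and `generate` is assumed to return the number of new tokens it produced.

```python
import time


def seconds_per_token(generate, prompt, clock=time.perf_counter):
    """Time one generation call and return average seconds per new token.

    `generate(prompt)` is a hypothetical callable that runs the model and
    returns the number of newly generated tokens.
    """
    start = clock()
    new_tokens = generate(prompt)
    elapsed = clock() - start
    if new_tokens <= 0:
        raise ValueError("generate() reported no new tokens")
    return elapsed / new_tokens
```

With Transformers, `generate` could for instance wrap `model.generate` and return the difference between the output and input sequence lengths. The first call usually includes warm-up cost, so timing a second call tends to give a fairer number.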
Use a quantized version. Those don't exist yet, though...
@rfernand your best bet is quantization: it should boost speed by a large amount and also take up less VRAM. I think you should use the GPTQ quant format and load it with Hugging Face Transformers for a good speedup. Transformers is the simpler option, but something like ExLlamaV2 should get you the fastest speed.
https://huggingface.co/TheBloke/Orca-2-13B-GPTQ
Use the 8-bit one for maximum quality.
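The VRAM saving from quantization follows from simple arithmetic on weight storage. A rough sketch, assuming about 13 billion parameters and counting only the weights (activations, the KV cache, and quantization metadata such as scales are ignored, so real usage is higher):

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # assumed parameter count for a "13b" model

for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_gigabytes(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone come to about 26 GB, which is more than the 24 GB of the RTX 3090 mentioned in the question. That would be consistent with slow inference if part of the model ends up in CPU memory, while the 8-bit (about 13 GB) and 4-bit (about 6.5 GB) variants leave headroom.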
heh yeah and now they do exist ;)
Thanks @YaTharThShaRma999 and @PsiPi.
This is great - I tried the 4-bit version (https://huggingface.co/TheBloke/Orca-2-13B-GGUF) with the following results:
model loading: 4x faster
inference: 12x faster
TLDR
- pip install ctransformers[cuda]
- python script for inference:
from ctransformers import AutoModelForCausalLM
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Orca-2-13B-GGUF", model_file="orca-2-13b.Q4_K_M.gguf", model_type="llama", gpu_layers=50)
print(llm("AI is going to"))
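For interactive experiments it can help to print tokens as they are produced rather than wait for the full completion. A minimal sketch, assuming the `llm` object from the script above and the `stream=True` and `max_new_tokens` keyword arguments of the ctransformers call interface; `stream_completion` is a hypothetical helper name, not part of the library.

```python
import sys


def stream_completion(llm, prompt, max_new_tokens=128, out=sys.stdout):
    """Write text chunks as the model yields them and return the full text.

    `llm` is assumed to be callable like a ctransformers model:
    llm(prompt, max_new_tokens=..., stream=True) yields strings.
    """
    pieces = []
    for chunk in llm(prompt, max_new_tokens=max_new_tokens, stream=True):
        out.write(chunk)
        out.flush()  # show each chunk immediately
        pieces.append(chunk)
    out.write("\n")
    return "".join(pieces)
```

Usage would look like `stream_completion(llm, "AI is going to")`.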
Yeah LoneStriker offers an excellent version as well
For inference, I get the following error:
`GLIBC_2.29' not found
Anyone know how to resolve this?
Specifically:
OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /local/home/user_name/anaconda3/envs/odi-ds/lib/python3.9/site-packages/ctransformers/lib/cuda/libctransformers.so)
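An error of this kind usually means the system's glibc is older than the version the prebuilt library was compiled against. A quick way to check is sketched below, using the standard library's `platform.libc_ver()`; the helper names `parse_version` and `glibc_is_at_least` are hypothetical, and the required version "2.29" is taken from the error message.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.29' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_at_least(required, found):
    """Return True if the `found` glibc version satisfies `required`."""
    return parse_version(found) >= parse_version(required)


libc_name, libc_version = platform.libc_ver()
if libc_name == "glibc" and libc_version:
    ok = glibc_is_at_least("2.29", libc_version)
    print(f"glibc {libc_version}: {'OK' if ok else 'too old for GLIBC_2.29'}")
else:
    print("Could not detect glibc on this system")
```

If the detected version is older than 2.29, typical options are running inside a newer OS image or container, or building the library from source against the installed glibc.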
Thank you for replying. I think I have the right glibc now, but every time I run the code in Jupyter my kernel dies as soon as I try to download the model from the repo.
wait nevermind the last comment, all good
Is there a recommended EC2 instance that can run this fast? Also, is it faster to use a GPU or a CPU for inference?