Instructions to use microsoft/Orca-2-7b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Orca-2-7b with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="microsoft/Orca-2-7b")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("microsoft/Orca-2-7b") model = AutoModelForCausalLM.from_pretrained("microsoft/Orca-2-7b") - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Orca-2-7b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "microsoft/Orca-2-7b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "microsoft/Orca-2-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/microsoft/Orca-2-7b
- SGLang
How to use microsoft/Orca-2-7b with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "microsoft/Orca-2-7b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "microsoft/Orca-2-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "microsoft/Orca-2-7b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "microsoft/Orca-2-7b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use microsoft/Orca-2-7b with Docker Model Runner:
docker model run hf.co/microsoft/Orca-2-7b
Finetuned model is less in size than original model
I used sft trainer for finetuning this model, but to my surprise while pushing finetuned model 3 files of smaller size is save in total around 15GB but original model is around 25 GB. This is surprising
@hiiamsid the most likely reason is that this model is stored in fp32 format while the output you got is in fp16 format.
fp32 is pretty useless since its 2x the size of fp16 but fp32 is 2x slower then fp16. there is no quality difference as well.
Thats why 99% of the time everyone uses fp16 since it increases speed in inference and training, decreases vram/ram usage in inference and training and is much smaller then fp32.