Extremely slow inference

by TZ20 - opened Sep 15, 2023

TZ20

Sep 15, 2023

edited Sep 15, 2023

Hi, I'm loading this model with 4-bit quantization from Hugging Face. I'm using 4 T4 GPUs:

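import torch
from transformers import LlamaForCausalLM

# Load the 13B model in 4-bit, sharded automatically across the available GPUs
model = LlamaForCausalLM.from_pretrained(
    'Open-Orca/OpenOrca-Platypus2-13B',
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map='auto',
)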

However, when I call model.generate, it is extremely slow compared to the base Llama-2-13b-chat model. For example, where the original Llama 2 model might take 2 min, this one takes 30 min.
Is there any reason for this?

alpindale

OpenOrca org Sep 21, 2023

Try replacing your current configs with the updated config.json and generation_config.json. Looks like the cache was disabled, which usually leads to extreme slowdowns.
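
For anyone who wants to check a locally downloaded copy before re-downloading, here is a minimal sketch. It assumes the setting in question is the `use_cache` key that Transformers reads from config.json and generation_config.json; the reply above does not name the key, so treat that as an assumption. The helper names and any path you pass in are hypothetical.

```python
import json
from pathlib import Path


def cache_enabled(config_path):
    """Return the `use_cache` value from a config file, or None if the key is absent."""
    config = json.loads(Path(config_path).read_text())
    return config.get("use_cache")


def enable_cache(config_path):
    """Set `use_cache` to true in the file and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("use_cache")
    # Assumption: `use_cache` is the flag that disabled the key/value cache.
    config["use_cache"] = True
    path.write_text(json.dumps(config, indent=2) + "\n")
    return previous
```

Pulling the updated files from the repo, as suggested above, is the simpler route. If the model is already loaded, passing `use_cache=True` to `model.generate(...)` should have the same effect for that call.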

TZ20

Sep 22, 2023

Thanks, that seemed to do the trick.

TZ20 changed discussion status to closed Sep 22, 2023