Instructions to use Open-Orca/Mistral-7B-OpenOrca with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Open-Orca/Mistral-7B-OpenOrca with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Open-Orca/Mistral-7B-OpenOrca")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
model = AutoModelForCausalLM.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
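OpenOrca models use the ChatML prompt format, so `apply_chat_template` wraps each turn in `<|im_start|>` / `<|im_end|>` markers. If you want to see the exact prompt string (for example, when debugging stop sequences, as in the streaming discussion below), a quick sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# tokenize=False returns the rendered prompt string instead of token ids,
# which makes the ChatML markers visible.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected shape (ChatML):
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Who are you?<|im_end|>
# <|im_start|>assistant
```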
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Open-Orca/Mistral-7B-OpenOrca with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Open-Orca/Mistral-7B-OpenOrca"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker
```bash
# Run the official vLLM OpenAI-compatible server image:
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "Open-Orca/Mistral-7B-OpenOrca"
```
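Since the server exposes an OpenAI-compatible API, you can also call it from Python. A minimal sketch using the `openai` client package (the base URL matches the `vllm serve` default above; the API key is a placeholder, since vLLM ignores it unless one is configured):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Open-Orca/Mistral-7B-OpenOrca",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```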
- SGLang
How to use Open-Orca/Mistral-7B-OpenOrca with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Open-Orca/Mistral-7B-OpenOrca" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Open-Orca/Mistral-7B-OpenOrca" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
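The same OpenAI-compatible API works from Python; a sketch that streams the response from the SGLang server on port 30000 (assumes the `openai` client package is installed):

```python
from openai import OpenAI

# SGLang also speaks the OpenAI-compatible API, here on port 30000.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# stream=True yields incremental chunks instead of one final message.
stream = client.chat.completions.create(
    model="Open-Orca/Mistral-7B-OpenOrca",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```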
- Docker Model Runner
How to use Open-Orca/Mistral-7B-OpenOrca with Docker Model Runner:
```bash
docker model run hf.co/Open-Orca/Mistral-7B-OpenOrca
```
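Docker Model Runner also exposes an OpenAI-compatible endpoint. A sketch assuming TCP access is enabled on its documented default port 12434 (the base URL below is an assumption; check your Docker Model Runner configuration):

```python
from openai import OpenAI

# Assumption: Docker Model Runner's OpenAI-compatible API is reachable at
# localhost:12434 (TCP access must be enabled in Docker Desktop settings).
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="unused")

response = client.chat.completions.create(
    model="hf.co/Open-Orca/Mistral-7B-OpenOrca",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```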
Problem with streaming support
I'm serving OpenOrca using HF TGI with stream=True. The problem is that the stopping sequence <|im_end|> consists of 10 tokens. If that string is split across two response chunks, it doesn't get automatically removed from the text.
I know this is a very specific case, but I'm wondering if anyone else has encountered this and managed to solve it?
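The best client-side workaround I can think of is to hold back any tail of the streamed text that might be a prefix of <|im_end|>, and only emit it once the next chunk rules it out. A minimal sketch, assuming chunks arrive as plain strings:

```python
STOP = "<|im_end|>"

def strip_streamed_stop(chunks, stop=STOP):
    """Yield text chunks with `stop` removed, even if it spans chunk boundaries."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        idx = buf.find(stop)
        if idx != -1:
            yield buf[:idx]  # emit text before the stop sequence
            return           # stop streaming entirely
        # Hold back the longest tail of buf that could still start `stop`.
        hold = 0
        for k in range(min(len(stop) - 1, len(buf)), 0, -1):
            if buf.endswith(stop[:k]):
                hold = k
                break
        if hold:
            yield buf[:-hold]
            buf = buf[-hold:]
        else:
            yield buf
            buf = ""
    if buf:
        yield buf  # stream ended mid-prefix; flush what's left

# Example: "<|im_end|>" split across two chunks is still removed.
print("".join(strip_streamed_stop(["Hello wor", "ld<|im_", "end|>junk"])))
# -> "Hello world"
```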
You need to upgrade the transformers version: Mistral support was introduced in 4.34.0, while TGI 1.1.0 depends on transformers 4.33.3. After upgrading transformers, my TGI stops without generating '<|im_end|>'.
We built a Docker image if you want to use it: zjuici/mirror.huggingface.text-generation-inference:1.1.0-transformers-4.34.1
Thanks for helping. I haven't tried it yet. I'm running TGI from the official Docker image, so I'll try yours instead. Cheers
Matt
I'm curious where the transformers version change is set in the image? (Docker novice.)
I could not find the Dockerfile right now, but it should be as simple as (IIRC):
```dockerfile
FROM ghcr.io/huggingface/text-generation-inference:1.1.0
RUN python -m pip install transformers==4.34.1
```
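Build it with e.g. `docker build -t tgi-transformers-4.34.1 .` (the tag name is just an example) and run it with the same arguments as the official TGI image.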
Thanks!