Instructions to use Open-Orca/Mistral-7B-OpenOrca with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Open-Orca/Mistral-7B-OpenOrca with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Open-Orca/Mistral-7B-OpenOrca")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
model = AutoModelForCausalLM.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Open-Orca/Mistral-7B-OpenOrca with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Open-Orca/Mistral-7B-OpenOrca"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
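The same OpenAI-compatible endpoint can also be called from Python. A minimal sketch using the openai client (the api_key value is just a placeholder, since the local vLLM server does not require one by default):

# Call the local vLLM server (started above on port 8000) from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Open-Orca/Mistral-7B-OpenOrca",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)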
- SGLang
How to use Open-Orca/Mistral-7B-OpenOrca with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Open-Orca/Mistral-7B-OpenOrca" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
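As with vLLM, the SGLang server exposes an OpenAI-compatible chat API, so the curl call above can be reproduced from Python. A minimal sketch using requests (assumes the server is running locally on port 30000 as started above):

# Call the local SGLang server's OpenAI-compatible endpoint from Python.
import requests

payload = {
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])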
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Open-Orca/Mistral-7B-OpenOrca" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Open-Orca/Mistral-7B-OpenOrca",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use Open-Orca/Mistral-7B-OpenOrca with Docker Model Runner:
docker model run hf.co/Open-Orca/Mistral-7B-OpenOrca
I'm getting an error: <unk> set to 0 in the tokenizer config
I'm having trouble with the provided tokenizer; it's unclear what's happening in that error. (Sorry for not being more helpful!)
I concur. I'm trying to load this model in text-generation-inference. Here's the stack:
2023-10-03T07:17:50.796666Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-03T07:17:50.796964Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-03T07:18:00.806182Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-10-03T07:18:05.539569Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 159, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 252, in get_model
return FlashMistral(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_mistral.py", line 297, in __init__
tokenizer = LlamaTokenizerFast.from_pretrained(
File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1854, in from_pretrained
return cls._from_pretrained(
File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1886, in _from_pretrained
slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
File "/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2073, in _from_pretrained
raise ValueError(
ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary.
You'll need to get into whatever environment you have set up for ooba (e.g. conda) and do:
pip install git+https://github.com/huggingface/transformers
This is because Mistral support in Transformers has not been released to PyPI yet, so you need to install from the development snapshot.
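To double-check that the source install picked up Mistral support, something like this should now run cleanly (a quick sketch; as far as I know Mistral landed in the 4.34 line, so the installed dev version should be at or above that):

# Sanity check after pip install git+https://github.com/huggingface/transformers
import transformers
print(transformers.__version__)  # expect a 4.34+ (dev) version

from transformers import AutoTokenizer

# With Mistral support present, this should no longer raise the
# "Non-consecutive added token '<unk>'" error from the saved vocabulary.
tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")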
Thanks, that worked for me.
I assumed that since text-generation-inference:1.1.0 has support for Mistral, it would work out of the box. Instead I had to create a new image, e.g.:
FROM ghcr.io/huggingface/text-generation-inference:1.1.0
RUN apt-get update -y && \
DEBIAN_FRONTEND=noninteractive apt-get install -y git && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir \
"git+https://github.com/huggingface/transformers"
Is there a way to do this programmatically yet? (I'm trying to host it here, on Hugging Face.)
I seem to still be getting:
raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: mistral isn't supported yet.
even after updating with the given command.
I'm just loading it through AutoTokenizer.from_pretrained
It should be fixed by now; you just have to set the max token lengths when you deploy.
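(If it helps: in the text-generation-inference launcher those limits are set with options along the lines of --max-input-length and --max-total-tokens; double-check the exact flag names against your TGI version.)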