Instructions to use mistralai/Mistral-7B-Instruct-v0.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mistralai/Mistral-7B-Instruct-v0.2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use mistralai/Mistral-7B-Instruct-v0.2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Install mistral-common:
pip install --upgrade mistral-common
# Start the vLLM server:
vllm serve "mistralai/Mistral-7B-Instruct-v0.2" --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2

SGLang

How to use mistralai/Mistral-7B-Instruct-v0.2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mistralai/Mistral-7B-Instruct-v0.2 with Docker Model Runner:
```
docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2
```

deepspeed inference tensor parallelism memory footprint doesn't decrease with deepspeed tp_size increase.

#92

by jiangtaozh - opened Apr 18, 2024

Discussion

jiangtaozh

Apr 18, 2024

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Set pad_token to eos_token

model.eval()

ds_engine = deepspeed.init_inference(model,
                                     tensor_parallel={"tp_size": world_size},
                                     #dtype=torch.float32,
                                     dtype=torch.float16,
                                     replace_with_kernel_inject=True)

model = ds_engine.module
model.eval()

generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
return generator

according to this official link https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/ with tp size increase, the memory allocated on each gpu will be decreased. However for mistral model, it doesn't work. I tested gpt-j 6B and llama model 7B models with exactly the same code, both of them are working (GPU memory allocation decrease with tp size increase). What's wrong with mistral model? I tried the deepspeed-mii too, the conclusion is the same.

anyone can throw some lights on this?

ayu8393

Apr 26, 2024

they do not support mistral models with the old inference engine. so you should try to use the latest inference engine DeepSpeed-MII. Here's an example for running a mistral model:

import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)

jiangtaozh

May 1, 2024

•

edited May 1, 2024

in the backend of mii, the inference is still backed by deepspeed inference engine. I tried, it is the same. no memory footprint reduction along the tensor parallism with the following code.

import argparse
import mii

def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--tensor-parallel", type=int, default=1)
args = parser.parse_args()
mii.serve(args.model, tensor_parallel=args.tensor_parallel)
print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
print(f"Run python client.py --model {args.model} to connect.")
print(f"Run python terminate.py --model {args.model} to terminate.")=
main()

jiangtaozh

May 1, 2024

https://github.com/microsoft/DeepSpeed-MII/issues/329

Somniloquy

Jun 13, 2024

I have the same problem, have you solved it?

jiangtaozh

Jun 14, 2024

Not yet. I don't deepspeed support mistral tensor parallelism.

subhrokomol

Sep 23, 2024

Can we quantize the model and run deepspeed MII ?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment