Instructions to use mistralai/Mistral-7B-Instruct-v0.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mistralai/Mistral-7B-Instruct-v0.2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use mistralai/Mistral-7B-Instruct-v0.2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Install mistral-common:
pip install --upgrade mistral-common
# Start the vLLM server:
vllm serve "mistralai/Mistral-7B-Instruct-v0.2" --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2

SGLang

How to use mistralai/Mistral-7B-Instruct-v0.2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mistralai/Mistral-7B-Instruct-v0.2 with Docker Model Runner:
```
docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2
```

FIne tuned model generating both user and assistant dialogues during inference

#63

by sabber - opened Mar 10, 2024

Discussion

sabber

Mar 10, 2024

Hello there!!
I'm working on fine-tuning a large language model (Mistral-7B-Instruct-v0.2) for conversational tasks using the Hugging Face Transformers library. I processed my conversational dataset by treating each conversation as a list of dialog turns, with separate fields for user utterances and assistant responses. I then formatted my conversational dataset using the setup_chat_format function from the trl library, which prepares the model and tokenizer for chat-style (ChatML) inputs and outputs.

However, during inference, the fine-tuned model generates both user and assistant utterances, instead of just the assistant's response. Additionally, the model tends to generate repeated information. I've tried adjusting the inference settings and exploring different sampling strategies, but the issue persists.

Did anybody face this problem? Appreciate your help/feedback/suggestions.

nepto

Mar 13, 2024

Yeah i faced it when seeing this(thats why i came here),but solved(partially) it after reading ur question ; )

It seems that the inference interface provided in hugging face website uses similar backend approach for this model.

This model works more like completion model to me here.
when is ask "what is 1+1" it just asks that back to me! no change in phrase and stuff.
but then i asked "USER: what is 1+1 ASSISTANT:", and then it completed the answer! i need to make some scripting to just get out the assistant part out of text,separated from user ; )

seems this is implemented so that the same output can be joined with user's input to easily make a chat interface.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment