Instructions to use mistralai/Mistral-7B-Instruct-v0.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mistralai/Mistral-7B-Instruct-v0.2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use mistralai/Mistral-7B-Instruct-v0.2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Install mistral-common:
pip install --upgrade mistral-common
# Start the vLLM server:
vllm serve "mistralai/Mistral-7B-Instruct-v0.2" --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2

SGLang

How to use mistralai/Mistral-7B-Instruct-v0.2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mistralai/Mistral-7B-Instruct-v0.2 with Docker Model Runner:
```
docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2
```

responses are incomplete, greetings are not handled

#150

by dev4sidra - opened Aug 2, 2024

Discussion

dev4sidra

Aug 2, 2024

i have tried each possible way, changed parameters but still responses are incomplete , in some cases it works and of some query it return half ans. Greetings are not handled properly it return un,matched ans

dev4sidra changed discussion status to closed Aug 2, 2024

dev4sidra changed discussion status to open Aug 2, 2024

pandora-s

Mistral AI_ org Aug 2, 2024

Hi dev4sidra, how are you using the model?

DhruvParth

Aug 3, 2024

I am facing the same issue as well.

Sample screenshot:

pandora-s

Mistral AI_ org Aug 3, 2024

I believe you are using the model as raw text completion and not for chat completion, I would recommend using as mentionned in the readme with:

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

DhruvParth

Aug 5, 2024

This is how I got it working:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

def load_quantized_model(model_name: str):
    """
    :param model_name: Name or path of the model to be loaded.
    :return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config
    )

    return model

def initialize_tokenizer(model_name: str):
    """
    Initialize the tokenizer with the specified model_name.

    :param model_name: Name or path of the model for tokenizer initialization.
    :return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer


model = load_quantized_model(model_name)

tokenizer = initialize_tokenizer(model_name)

# Define stop token ids
stop_token_ids = [0]

def generate_response(prompt):
  text = f"[INST] {prompt} [/INST]"
  encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
  model_input = encoded.to(model.device)
  generated_ids = model.generate(**model_input, max_new_tokens=1000, do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)
  return decoded[0].replace(text, '').strip()

# https://stackoverflow.com/questions/77803696/runtimeerror-cutlassf-no-kernel-found-to-launch-when-running-huggingface-tran
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

prompt = "How AI will replace Engineers"
response = generate_response(prompt)
print(response)

dev4sidra

Aug 5, 2024

•

edited Aug 5, 2024

const response = await textGeneration({
accessToken: apiKey,
model: 'mistralai/Mistral-7B-Instruct-v0.2',
inputs: inputText,
parameters: {
max_length: 1024,
repetition_penalty: 1.03,
temperature: 0.2, /// Adjust for balance between creativity and relevance
top_p: 0.9, // Nucleus sampling: consider top 90% probability mass.
top_k: 50, // Limits token choices to the top 50 most probable tokens.

},
}); i am using it this way, but still responses are incomplete, i tried diff ways to change paramteres,

basically i have a vector db, the question i ask find relevant data from database then i pass the query and relevant searches to model, it should generate full response

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment