Instructions to use nvidia/Mistral-NeMo-Minitron-8B-Base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Mistral-NeMo-Minitron-8B-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Mistral-NeMo-Minitron-8B-Base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Mistral-NeMo-Minitron-8B-Base")
model = AutoModelForCausalLM.from_pretrained("nvidia/Mistral-NeMo-Minitron-8B-Base")

Local Apps

vLLM

How to use nvidia/Mistral-NeMo-Minitron-8B-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Mistral-NeMo-Minitron-8B-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Mistral-NeMo-Minitron-8B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
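
For readers who would rather call the server from Python than from curl, the same request can be sketched with the standard library alone. This is a minimal sketch, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_completion_request` and `send_completion_request` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "nvidia/Mistral-NeMo-Minitron-8B-Base"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion_request(request):
    """Send the request and return the first completion's text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request("Once upon a time,")
# With the server running: print(send_completion_request(request))
```

Pointing `base_url` at http://localhost:30000 should work the same way for the SGLang server, since both expose an OpenAI-compatible completions route.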

Use Docker

docker model run hf.co/nvidia/Mistral-NeMo-Minitron-8B-Base

SGLang

How to use nvidia/Mistral-NeMo-Minitron-8B-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Mistral-NeMo-Minitron-8B-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Mistral-NeMo-Minitron-8B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Mistral-NeMo-Minitron-8B-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Mistral-NeMo-Minitron-8B-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use nvidia/Mistral-NeMo-Minitron-8B-Base with Docker Model Runner:
```
docker model run hf.co/nvidia/Mistral-NeMo-Minitron-8B-Base
```

Mistral-NeMo-Minitron-8B-Chat

by rasyosef - opened Aug 26, 2024

Discussion

rasyosef

Aug 26, 2024

https://huggingface.co/rasyosef/Mistral-NeMo-Minitron-8B-Chat

I have created an instruction-tuned version of nvidia/Mistral-NeMo-Minitron-8B-Base that has undergone supervised fine-tuning on 32k instruction-response pairs from the teknium/OpenHermes-2.5 dataset.

How to use

Chat Format

Given the nature of the training data, the model is best suited for prompts that use the chat format.
You can provide the prompt as a question using the following generic template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Question?<|im_end|>
<|im_start|>assistant

For example:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How to explain Internet for a medieval knight?<|im_end|>
<|im_start|>assistant

where the model generates the text after <|im_start|>assistant.
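
The template above can also be assembled by hand, which is useful when sending raw prompts to a completions endpoint. The following is a minimal sketch: `build_chatml_prompt` is a hypothetical helper, and the exact whitespace (a newline after each `<|im_end|>`) is an assumption read off the template as displayed. If the model repository ships a chat template, `tokenizer.apply_chat_template` is the more reliable way to produce the prompt.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the chat format shown above."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```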

Sample inference code

This code snippet shows how to quickly get started with running the model on a GPU:

import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 

torch.random.manual_seed(0) 

model_id = "rasyosef/Mistral-NeMo-Minitron-8B-Chat"
model = AutoModelForCausalLM.from_pretrained( 
    model_id,  
    device_map="auto",  
    torch_dtype=torch.bfloat16 
) 

tokenizer = AutoTokenizer.from_pretrained(model_id) 

messages = [ 
    {"role": "system", "content": "You are a helpful AI assistant."}, 
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, 
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, 
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, 
] 

pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
) 

generation_args = {
    "max_new_tokens": 256,
    "return_full_text": False,
    # Greedy decoding; temperature is omitted because it is ignored when do_sample is False.
    "do_sample": False,
}

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

Note: If you want to use flash attention, call AutoModelForCausalLM.from_pretrained() with attn_implementation="flash_attention_2"
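
One way to apply the note above without hard-coding the backend is to fall back to PyTorch's built-in attention when flash attention is not installed. This is a sketch under assumptions: the helper names are hypothetical, the check only tests whether the `flash_attn` package is importable (it does not verify that the GPU supports it), and `"sdpa"` is used as the fallback value for `attn_implementation`.

```python
import importlib.util


def choose_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def load_model(model_id="rasyosef/Mistral-NeMo-Minitron-8B-Chat"):
    """Load the model in bf16 with the selected attention backend (downloads weights)."""
    # Imported lazily so choose_attn_implementation() works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation=choose_attn_implementation(),
    )
```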

aashish1904

Aug 26, 2024

•

edited Aug 26, 2024

Please find GGUF quants for this model at QuantFactory/Mistral-NeMo-Minitron-8B-Chat-GGUF

pmolchanov

NVIDIA org Aug 27, 2024

This is quite cool, thank you @aashish1904 and @rasyosef. Do you know how this compares to the same experiments with LLaMa-3.1-8B or similar models?

rasyosef

Aug 27, 2024

Hi @pmolchanov, I was going to finetune Llama-3.1-8B with the same 32k instruction dataset and evaluate both models on the IFEval benchmark using lm-evaluation-harness.

I will let you know the results soon.

Kartik305

Sep 25, 2024

@rasyosef Can you please share some insights about the finetuning process itself?
Specifically, I am interested in your multi-GPU settings, the hardware requirements, and whether you used a quantized version of the model or loaded it in bf16 directly for finetuning.

rasyosef

Sep 25, 2024

Hi @Kartik305, I used a single A100 40GB GPU and parameter-efficient finetuning to train a LoRA adapter on top of the model weights, which were loaded in bf16.

It was trained for 2 epochs with an SFT dataset of 32k samples (max length of 512 tokens) and took 3.5 hrs to complete.
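
As a rough back-of-the-envelope check on those figures, the implied throughput can be worked out as below. This is only a sketch: it reads "32k" as 32,000 samples, and the token count is an upper bound that assumes every sample fills the full 512-token limit, which real data will not.

```python
# Figures quoted in the reply above; "32k" is read as 32,000 (assumption).
NUM_SAMPLES = 32_000
EPOCHS = 2
MAX_TOKENS_PER_SAMPLE = 512
TRAINING_HOURS = 3.5

sample_passes = NUM_SAMPLES * EPOCHS
training_seconds = TRAINING_HOURS * 3600
samples_per_second = sample_passes / training_seconds
# Upper bound: assumes every sample is exactly 512 tokens long.
max_tokens_seen = sample_passes * MAX_TOKENS_PER_SAMPLE

print(f"sample passes: {sample_passes}")
print(f"throughput: {samples_per_second:.2f} samples/s")
print(f"token upper bound: {max_tokens_seen}")
```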
