Instructions to use teknium/MPT-7B-Mercury-Experimental with libraries and local apps.

Libraries

How to use teknium/MPT-7B-Mercury-Experimental with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="teknium/MPT-7B-Mercury-Experimental", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("teknium/MPT-7B-Mercury-Experimental", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("teknium/MPT-7B-Mercury-Experimental", trust_remote_code=True)

Local Apps

vLLM

How to use teknium/MPT-7B-Mercury-Experimental with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "teknium/MPT-7B-Mercury-Experimental"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "teknium/MPT-7B-Mercury-Experimental",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
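
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming a server is already listening on the address from the curl call above and that the response follows the OpenAI completions shape (`choices[0].text`). The helper names `build_completion_request` and `complete` are hypothetical, chosen for illustration.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",
    model="teknium/MPT-7B-Mercury-Experimental",
    max_tokens=512,
    temperature=0.5,
):
    """Build (without sending) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumption: the server is running and replies with an OpenAI-style
    JSON body containing choices[0]["text"].
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` would send the same payload as the curl command. Passing `base_url="http://localhost:30000"` should target a server on port 30000 instead, such as the SGLang server in this document.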

SGLang

How to use teknium/MPT-7B-Mercury-Experimental with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "teknium/MPT-7B-Mercury-Experimental" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "teknium/MPT-7B-Mercury-Experimental",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "teknium/MPT-7B-Mercury-Experimental" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "teknium/MPT-7B-Mercury-Experimental",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use teknium/MPT-7B-Mercury-Experimental with Docker Model Runner:
```
docker model run hf.co/teknium/MPT-7B-Mercury-Experimental
```

Base Model: MPT-7B

This is an experimental Hermes Lite version: it excludes the Nous Instruct training data that the Hermes model was also trained on.

Big thanks to the BitTensor foundation for the compute to attempt this experiment!

There seems to have been some problem with the training that I cannot identify. While the model does seem improved over the base model, it does not seem to have learned nearly as much as Llama did when training Hermes.

Typically, the model would respond at length when asked, be much more contextually intelligent, and answer in a thoughtful way. However, for whatever reason (likely something to do with not training with LLM-Foundry), the model does not like longer responses and typically responds quite briefly.

I don't believe this is a base model issue, or at least, if it is, I believe it comes from the interaction between the base model and the trainer: I compared this fine-tune with the MPT-7B Instruct model, which had no problem at all producing extremely long responses. If anyone has the time to investigate, please follow up with me in the community tab or on Twitter, @Teknium1!

I trained Replit 3b with the same trainer and the same settings, and its results were phenomenal. So I would love any hypotheses on what may have made this run different.

You should load the model and tokenizer like so:

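import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# MPT-7B uses the GPT-NeoX-20B tokenizer
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
tokenizer.pad_token = "<|padding|>"
# trust_remote_code=True is needed because the model code ships with the repository
model = AutoModelForCausalLM.from_pretrained(
    "teknium/MPT-7B-Mercury-Experimental",
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True
)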

You should use the eos_token_id parameter in the generate function, and skip_special_tokens=True in the tokenizer decode.

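prompt = "Once upon a time,"  # placeholder prompt (an assumption; replace with your own)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, top_p=0.5, top_k=0, repetition_penalty=1.1, min_new_tokens=100, eos_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)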
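
The slice `generated_ids[0][input_ids.shape[-1]:]` is needed because, for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones. Here is a minimal sketch of that indexing, using plain Python lists in place of tensors. The helper name `strip_prompt` and the token ids are made up for illustration.

```python
def strip_prompt(generated_ids, prompt_length):
    """Return only the newly generated token ids.

    generated_ids: prompt token ids followed by generated token ids.
    prompt_length: number of prompt tokens (input_ids.shape[-1] in the snippet above).
    """
    if prompt_length > len(generated_ids):
        raise ValueError("prompt_length exceeds the number of generated ids")
    return generated_ids[prompt_length:]


# Hypothetical token ids: three prompt tokens, then four generated tokens.
prompt_ids = [11, 22, 33]
generated = prompt_ids + [44, 55, 66, 0]
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Decoding only the stripped ids keeps the prompt text out of the response.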

While the model is not quite where I'd like it to be, it could be useful for learning how MPT models work, and for some uses, so it is uploaded here.
