(Special Stop Token Triggered! ID:2)

#14

by korus69 - opened Feb 19, 2025

Discussion

korus69

Feb 19, 2025

How do I stop this from triggering? My output gets interrupted prematurely. I'm using recent versions of KoboldCPP and SillyTavern.

inflatebot

Owner Feb 19, 2025

Is this in KoboldCPP's logs, or SillyTavern's?

korus69

Feb 19, 2025

It's in KoboldCPP's.

inflatebot

Owner Feb 19, 2025

edited Feb 19, 2025

Hmmm.
The token with ID 2 is "</s>", the end-of-sequence (EOS) token for Mistral Nemo's default format. Are your Context Template and Instruct Template set to ChatML (or a ChatML variant)?
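For anyone who wants to check which token a given ID maps to, here is a minimal sketch. `describe_token` is a hypothetical helper, not part of any library; it only assumes the object passed in exposes `convert_ids_to_tokens`, `eos_token`, and `eos_token_id`, as Hugging Face `transformers` tokenizers do.

```python
def describe_token(tokenizer, token_id):
    """Report the token string for `token_id` and whether it is the EOS token.

    `tokenizer` is assumed to behave like a Hugging Face tokenizer
    (convert_ids_to_tokens, eos_token, eos_token_id).
    """
    return {
        "id": token_id,
        "token": tokenizer.convert_ids_to_tokens(token_id),
        "eos_token": tokenizer.eos_token,
        "is_eos": token_id == tokenizer.eos_token_id,
    }
```

With `transformers` installed, something like `describe_token(AutoTokenizer.from_pretrained("inflatebot/MN-12B-Mag-Mell-R1"), 2)` should show what ID 2 decodes to in the original repo's tokenizer. A GGUF quant carries its own copy of the tokenizer metadata, so this only checks the source repo, not the file KoboldCPP loads.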

inflatebot

Owner Feb 19, 2025

Also, where did you get the GGUF file from, and does their version of the ChatML-ified Mistral Nemo give you similar trouble?

korus69

Feb 19, 2025

I got the GGUF file from mradermacher and I'm using his imatrix quant. When I set the context template to Alpaca, it doesn't trigger as often. I mainly get this interruption when I'm using either the ChatML or the Mistral context template.

inflatebot

Owner Feb 19, 2025

edited Feb 19, 2025

I see.
If you wouldn't mind, we could try using Featherless as the backend. This way we can narrow it down to either your ST setup or KoboldCPP/your quant file.
If that's cool, I can send you a temporary key. Usage won't cost me anything since it's a subscription, but it'll count towards my concurrent requests so I'd revoke it once you're done.

For the record, I looked at the tokenizer files, and they're the same as the ChatML-ified Mistral Nemo's, so if it is a problem with the tokenizer, it's at least not my fault. :P I do suspect the backend, though. This test would definitively rule out one or the other. If Featherless works, I can recommend trying a different quant. (Or you can just do that anyway. Maybe that's a better idea!)

inflatebot

Owner Feb 20, 2025

Did you ever get this figured out? Don't wanna leave you hanging.

annetterunner

Aug 9, 2025

Nice model, but it loves to put [TOOL_CALLS] at the end of every generation. Tried with multiple quants from bartowski and mradermacher. LM Studio, llama.cpp runtime.
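Until the cause is found, a client-side workaround is to trim the marker from the generated text after the fact. The sketch below is plain string handling, not an LM Studio or llama.cpp feature, and `strip_trailing_marker` is a hypothetical helper name. It does not address why the model emits the token in the first place.

```python
def strip_trailing_marker(text, marker="[TOOL_CALLS]"):
    """Remove any trailing copies of `marker`, along with trailing whitespace."""
    cleaned = text.rstrip()
    # Loop in case the marker was emitted more than once at the end.
    while marker and cleaned.endswith(marker):
        cleaned = cleaned[: -len(marker)].rstrip()
    return cleaned
```

Only trailing occurrences are removed, so a marker that appears in the middle of a reply is left alone.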
