(Special Stop Token Triggered! ID:2)
How do I stop this from triggering? My output gets interrupted prematurely. I'm using KoboldCPP and SillyTavern in their recent versions.
Is this in KoboldCPP's logs, or SillyTavern's?
It's in KoboldCPP's.
Hmmm
The token with ID 2 is "</s>", the end-of-sequence (EOS) token for Mistral Nemo's default format. Are your Context Template and Instruct Template set to ChatML (or a ChatML variant)?
Also, where did you get the GGUF file from, and does their version of the ChatML-ified Mistral Nemo give you similar trouble?
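For reference, here is a rough sketch of what a ChatML-formatted prompt looks like. ChatML wraps each turn in `<|im_start|>` and `<|im_end|>`, so a correctly templated prompt does not use "</s>" as a turn terminator. The `format_chatml` helper is hypothetical, written only for illustration; it is not part of SillyTavern or KoboldCPP.

```python
# Minimal ChatML formatter. The function name and structure are illustrative
# assumptions, not an API from SillyTavern, KoboldCPP, or any library.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

If the prompt your frontend sends looks noticeably different from this, the Context Template or Instruct Template is probably not set to ChatML.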
I got the GGUF file from mradermacher, and I'm using his imatrix quant. When I set the context template to Alpaca, it doesn't trigger that much. I mainly get this interruption when I'm using either the ChatML or Mistral context template.
I see.
If you wouldn't mind, we could try using Featherless as the backend. This way we can narrow it down to either your ST setup or KoboldCPP/your quant file.
If that's cool, I can send you a temporary key. Usage won't cost me anything since it's a subscription, but it'll count towards my concurrent requests so I'd revoke it once you're done.
For the record, I looked at the tokenizer files, and they're the same as the ChatMLified Mistral Nemo, so if it is a problem with the tokenizer, it's at least not my fault. :P I do suspect the backend though. This test would definitively eliminate one or the other. If FL works, I can recommend trying a different quant. (Or you can just do that anyway. Maybe that's a better idea!)
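If anyone wants to repeat the tokenizer comparison themselves, one way is to download the tokenizer files from both repos and compare their hashes. This is only a sketch: the helper names are made up, and the paths in the usage comment are placeholders for wherever you saved the files.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True if both files are byte-for-byte identical (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (hypothetical local paths):
# same_file_contents("mag-mell/tokenizer.json", "nemo-chatml/tokenizer.json")
```

Matching hashes mean the files are byte-identical. A mismatch only says the files differ, not where or whether the difference matters.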
Did you ever get this figured out? Don't wanna leave you hanging.
Nice model, but it loves to put [TOOL_CALLS] at the end of every generation. Tried with multiple quants from bartowski and mradermacher. LM Studio, llama.cpp runtime.
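Until the cause is tracked down, a client-side workaround is to strip the marker from the end of each response. The sketch below assumes the marker arrives as the literal text `[TOOL_CALLS]` at the end of the output; `strip_trailing_marker` is a hypothetical helper, not something LM Studio or llama.cpp provides, and it does nothing about why the model emits the marker.

```python
TOOL_CALLS_MARKER = "[TOOL_CALLS]"


def strip_trailing_marker(text, marker=TOOL_CALLS_MARKER):
    """Remove trailing whitespace and any repeated marker at the end of text.

    Occurrences of the marker elsewhere in the text are left untouched.
    """
    stripped = text.rstrip()
    while stripped.endswith(marker):
        stripped = stripped[: -len(marker)].rstrip()
    return stripped


print(strip_trailing_marker("Hello there. [TOOL_CALLS]"))
```

Some frontends also let you add a custom stop string, which may achieve the same thing without post-processing.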