MiniMax-M2.7-REAP-172B-A10B-NVFP4

File size: 1,819 Bytes

2143e89

#include <iostream>
#include <string>
#include <cstdio>    // perror
#include <cstdlib>   // strtod, system
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <GB to allocate and lock>\n";
        return 1;
    }

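    // Parse the size strictly: reject non-numeric, non-positive, or overflowing input.
    char* end = nullptr;
    double gb = std::strtod(argv[1], &end);
    const double max_bytes = static_cast<double>(static_cast<size_t>(-1));
    const double requested = gb * 1024.0 * 1024.0 * 1024.0;
    if (end == argv[1] || *end != '\0' || !(gb > 0.0) || requested >= max_bytes) {
        std::cerr << "Invalid size: " << argv[1] << "\n";
        return 1;
    }
    size_t bytes = static_cast<size_t>(requested);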

    std::cout << "Allocating " << gb << " GB (" << bytes << " bytes) of RAM...\n";
    
    void* ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }

    std::cout << "Memory allocated. Faulting pages and pinning to RAM...\n";
    
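    // Write one byte per page so the kernel backs the mapping with physical
    // pages instead of leaving it lazily allocated.
    long page = sysconf(_SC_PAGESIZE);
    size_t page_size = page > 0 ? static_cast<size_t>(page) : 4096;  // fall back to 4 KiB if the query fails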
    char* char_ptr = static_cast<char*>(ptr);
    for (size_t i = 0; i < bytes; i += page_size) {
        char_ptr[i] = 1;
    }

    // Pin the region with mlock so these pages cannot themselves be swapped out.
    if (mlock(ptr, bytes) != 0) {
        perror("mlock failed (check RLIMIT_MEMLOCK / 'ulimit -l', or run with sudo)");
    } else {
        std::cout << "mlock successful.\n";
    }

    std::cout << "Memory is fully resident. Other inactive processes/caches should be pushed to swap.\n";
    std::cout << "Clearing filesystem caches...\n";
    
    // drop_caches only frees clean pages, so flush dirty pages with sync first.
    int ret = system("sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null");
    if (ret != 0) {
        std::cerr << "Failed to clear caches. (Maybe sudo failed?)\n";
    } else {
        std::cout << "Caches cleared successfully.\n";
    }

    std::cout << "Unlocking and releasing memory...\n";
    munlock(ptr, bytes);
    munmap(ptr, bytes);

    std::cout << "Done! Try starting your model now.\n";
    return 0;
}
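
The program above reports an mlock failure only after every page has already been touched. As a minimal sketch, assuming a Linux system, the check below queries the soft RLIMIT_MEMLOCK limit up front so an unprivileged run can bail out early. The helper name memlock_limit_allows is hypothetical and not part of the original file. On Linux, processes with CAP_IPC_LOCK (for example, when run via sudo) bypass this limit, so a false result matters only for unprivileged runs.

```cpp
#include <cstddef>
#include <sys/resource.h>

// Hypothetical helper (not in the original program): returns true if the
// soft RLIMIT_MEMLOCK limit permits locking `bytes` bytes of memory.
// Privileged processes may still succeed with mlock even when this is false.
bool memlock_limit_allows(std::size_t bytes) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false;  // could not query the limit; be conservative
    }
    return lim.rlim_cur == RLIM_INFINITY || bytes <= lim.rlim_cur;
}
```

One possible use is to call memlock_limit_allows(bytes) right after parsing the size and print a hint to raise the limit (or run with sudo) before allocating anything.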