How to use from llama.cpp
Install via Homebrew (macOS/Linux)
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf cturan/MiniMax-M2-GGUF:
# Run inference directly in the terminal:
llama-cli -hf cturan/MiniMax-M2-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf cturan/MiniMax-M2-GGUF:
# Run inference directly in the terminal:
llama-cli -hf cturan/MiniMax-M2-GGUF:
Use a pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf cturan/MiniMax-M2-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf cturan/MiniMax-M2-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf cturan/MiniMax-M2-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf cturan/MiniMax-M2-GGUF:
Use Docker
docker model run hf.co/cturan/MiniMax-M2-GGUF:
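Whichever method you choose, llama-server exposes an OpenAI-compatible HTTP API (default port 8080). Below is a minimal sketch of a chat-completion request, assuming a locally running server; the prompt and token limit are placeholder values:

```shell
# Sketch: build and validate a chat-completion payload for a local
# llama-server (OpenAI-compatible API, assumed at http://localhost:8080).
PAYLOAD='{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'

# Sanity-check the JSON before sending it:
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send it to a running server (uncomment once llama-server is up):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

The same payload works for every install method above, since all of them start the same server binary.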
Building and Running the Experimental minimax Branch of llama.cpp

Note:
This setup is experimental. The minimax branch is not compatible with standard llama.cpp builds; use it only for testing GGUF models with experimental features.


System Requirements (the commands below target Ubuntu; any supported platform will work)

  • Ubuntu 22.04
  • NVIDIA GPU with CUDA support
  • CUDA Toolkit 12.8 or later
  • CMake

Installation Steps

1. Install CUDA Toolkit 12.8

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8

2. Set Environment Variables

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
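To persist these variables across shells, append the exports to ~/.bashrc. A quick sanity check that the paths resolve as intended, assuming the default /usr/local/cuda install prefix:

```shell
# Verify the CUDA environment variables resolve as intended
# (assumes the default /usr/local/cuda install prefix).
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin

echo "$CUDA_HOME/bin"   # → /usr/local/cuda/bin
# nvcc is only found once the toolkit is actually installed:
command -v nvcc > /dev/null && nvcc --version || echo "nvcc not on PATH yet"
```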

3. Install Build Tools

sudo apt install cmake build-essential

4. Clone the Experimental Branch

git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp

5. Build the Project

mkdir build
cd build
cmake .. -DGGML_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
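A quick check that the build produced the two target binaries (run from the build directory; the loop only inspects the filesystem):

```shell
# Confirm the build produced the expected binaries under build/bin.
for b in bin/llama-server bin/llama-cli; do
  if [ -x "$b" ]; then
    echo "found $b"
  else
    echo "missing $b (did the build succeed?)"
  fi
done
```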

Build Output

After the build is complete, the binaries will be located in:

llama.cpp/build/bin

Running the Model

Example command:

./build/bin/llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto

This configuration offloads the mixture-of-experts weights to the CPU, so approximately 16 GB of VRAM is sufficient.


Notes

  • --cpu-moe enables CPU offloading for mixture-of-experts layers.
  • --jinja enables the Jinja-based chat template engine.
  • Adjust -c (context length) and -ngl (GPU layers) according to your hardware.
  • Ensure the model file (minimax-m2-Q4_K.gguf) is available in the working directory.
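Once llama-server is started with the command above, its health endpoint can be used as a smoke test. A hedged sketch, assuming the default port 8080; it prints a diagnostic rather than failing if the server is not reachable:

```shell
# Probe a local llama-server's /health endpoint (default port 8080).
URL=http://localhost:8080
if command -v curl > /dev/null && curl -sf "$URL/health" > /dev/null; then
  echo "server is up at $URL"
else
  echo "server not reachable at $URL"
fi
```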

All steps complete. The experimental CUDA-enabled build of llama.cpp is ready to use.

Model details

  • Format: GGUF
  • Model size: 229B params
  • Architecture: minimax-m2
  • Available quantizations: 2-bit, 6-bit, 8-bit