---
pipeline_tag: text-generation
license: mit
library_name: transformers
base_model:
- MiniMaxAI/MiniMax-M2
---

# Building and Running the Experimental `minimax` Branch of `llama.cpp`

**Note:** This setup is experimental. The `minimax` branch is not part of standard `llama.cpp`, so these GGUF models will not run on a standard build. Use it only for testing GGUF models with experimental features.

---

## System Requirements

The commands below target Ubuntu; any other platform supported by `llama.cpp` should work with equivalent steps.

- Ubuntu 22.04
- NVIDIA GPU with CUDA support
- CUDA Toolkit 12.8 or later
- CMake and a C/C++ compiler toolchain (e.g. `build-essential`)

---

## Installation Steps

### 1. Install CUDA Toolkit 12.8

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
```

### 2. Set Environment Variables

```bash
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin
```

Add these lines to `~/.bashrc` (or your shell profile) if you want them to persist across sessions.

### 3. Install Build Tools

```bash
# build-essential provides the gcc/g++ toolchain needed to compile llama.cpp
sudo apt install cmake build-essential
```

### 4. Clone the Experimental Branch

```bash
git clone --branch minimax --single-branch https://github.com/cturan/llama.cpp.git
cd llama.cpp
```

### 5. Build the Project

```bash
mkdir build
cd build
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF
cmake --build . --config Release --parallel $(nproc --all)
```

---

## Build Output

After the build completes, the binaries are located in:

```
llama.cpp/build/bin
```

---

## Running the Model

Example command (run from `llama.cpp/build/bin`, with the GGUF file in the working directory):

```bash
./llama-server -m minimax-m2-Q4_K.gguf -ngl 999 --cpu-moe --jinja -fa on -c 32000 --reasoning-format auto
```

This configuration offloads the MoE expert weights to the CPU, so approximately 16 GB of VRAM is sufficient.

---

## Notes

- `--cpu-moe` enables CPU offloading for the mixture-of-experts layers.
- `--jinja` enables Jinja-based chat template processing.
- `-fa on` turns on Flash Attention.
- Adjust `-c` (context length) and `-ngl` (GPU layers) according to your hardware.
- Ensure the model file (`minimax-m2-Q4_K.gguf`) is available in the working directory.

---

All steps complete. The experimental CUDA-enabled build of `llama.cpp` is ready to use.
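
---

## Optional: Quick Test of the Running Server

Once `llama-server` is up, you can exercise it over HTTP. The sketch below assumes the experimental branch keeps mainline `llama-server`'s defaults: listening on `http://127.0.0.1:8080` and exposing a `/health` endpoint plus the OpenAI-compatible `/v1/chat/completions` endpoint. Adjust the address if you start the server with `--host` or `--port`.

```bash
# Check that the server is up (mainline llama-server exposes a /health endpoint).
curl http://127.0.0.1:8080/health

# Send a chat request to the OpenAI-compatible endpoint.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Summarize what you can do in one sentence."}
        ],
        "max_tokens": 128
      }'
```

If both calls return JSON responses, the build and the model are working end to end.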