Instructions to use cyankiwi/MiniMax-M2.7-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained("cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "cyankiwi/MiniMax-M2.7-AWQ-4bit"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'
Use Docker
    docker model run hf.co/cyankiwi/MiniMax-M2.7-AWQ-4bit
- SGLang
How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "cyankiwi/MiniMax-M2.7-AWQ-4bit" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "cyankiwi/MiniMax-M2.7-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "messages": [
          {
            "role": "user",
            "content": "What is the capital of France?"
          }
        ]
      }'
- Docker Model Runner
How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with Docker Model Runner:
    docker model run hf.co/cyankiwi/MiniMax-M2.7-AWQ-4bit
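The curl calls shown for vLLM and SGLang can also be issued from Python. The following is a minimal sketch using only the standard library, assuming a server is already running locally. The helper names `build_chat_request` and `ask` are hypothetical and are not part of vLLM or SGLang; the default port 8000 matches the vLLM example, so pass `base_url="http://localhost:30000"` for the SGLang one.

```python
import json
import urllib.request

MODEL_ID = "cyankiwi/MiniMax-M2.7-AWQ-4bit"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000", timeout=60):
    """Send the request to a running server and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server up, `print(ask("What is the capital of France?"))` sends the same request as the curl example. Keeping request construction separate from sending makes the payload easy to inspect without a live server.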
These are NOT actual AWQ-quantized models.
Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.
AWQ is the algorithm used to optimize this model, whereas compressed-tensors is the format (i.e., the weight_packed, weight_scale, weight_zero_point, and weight_shape tensors) in which the model is saved after quantization.
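To make the algorithm-versus-format distinction concrete, here is a rough pure-Python sketch of group-wise 4-bit weight storage. It only reuses the tensor names mentioned above as dictionary keys. The packing order (eight nibbles per 32-bit word, lowest nibble first), the group size, and all function names are assumptions for illustration and are not taken from the compressed-tensors library. An algorithm such as AWQ would adjust the weights using activation statistics before this rounding step; the stored layout would look the same either way.

```python
def quantize_group(values, num_bits=4):
    """Asymmetric quantization of one group of floats to unsigned integers."""
    qmax = (1 << num_bits) - 1
    lo = min(min(values), 0.0)  # make sure 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = max(0, min(qmax, round(-lo / scale)))
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def pack_int4(q):
    """Pack eight 4-bit values into each 32-bit word, lowest nibble first."""
    if len(q) % 8:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, value in enumerate(q[i:i + 8]):
            word |= (value & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


def quantize_row(weights, group_size=8):
    """Quantize one weight row group by group and collect the stored tensors."""
    if group_size % 8 or len(weights) % group_size:
        raise ValueError("group_size must be a multiple of 8 and divide the row")
    packed, scales, zero_points = [], [], []
    for start in range(0, len(weights), group_size):
        q, scale, zp = quantize_group(weights[start:start + group_size])
        packed.extend(pack_int4(q))
        scales.append(scale)
        zero_points.append(zp)
    return {
        "weight_packed": packed,
        "weight_scale": scales,
        "weight_zero_point": zero_points,
        "weight_shape": [1, len(weights)],
    }


def dequantize_row(tensors, group_size=8):
    """Recover approximate float weights from the stored tensors."""
    q = unpack_int4(tensors["weight_packed"])
    out = []
    pairs = zip(tensors["weight_scale"], tensors["weight_zero_point"])
    for g, (scale, zp) in enumerate(pairs):
        group = q[g * group_size:(g + 1) * group_size]
        out.extend((v - zp) * scale for v in group)
    return out
```

Round-tripping a row through `quantize_row` and `dequantize_row` should recover each weight to within about half a quantization step of its group.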
With regard to the kernels used for inference, vLLM uses the same Marlin kernel for both the compressed-tensors and AutoAWQ formats, but reaches it via different routes.
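For anyone who wants to check what a checkpoint's config.json declares, a small sketch along these lines may help. The key names it reads (`quantization_config`, `quant_method`, `format`, `config_groups`, `weights`, `num_bits`, `bits`, `group_size`) are assumptions based on how compressed-tensors and AutoAWQ checkpoints are commonly laid out, and `describe_quantization` is a hypothetical helper. The config for this repository has not been verified here. Note that `quant_method` identifies the serialization format and loader, so a value of `compressed-tensors` does not by itself say whether AWQ was used during calibration.

```python
import json


def load_config(path):
    """Read a config.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def describe_quantization(config):
    """Summarize the quantization fields declared in a config dictionary."""
    qc = config.get("quantization_config") or {}
    summary = {
        "quant_method": qc.get("quant_method"),
        "format": qc.get("format"),
        # AutoAWQ-style configs are assumed to keep these at the top level
        "num_bits": qc.get("bits"),
        "group_size": qc.get("group_size"),
    }
    # compressed-tensors-style configs are assumed to nest them per group
    groups = qc.get("config_groups") or {}
    for group in groups.values():
        weights = group.get("weights") or {}
        summary["num_bits"] = weights.get("num_bits", summary["num_bits"])
        summary["group_size"] = weights.get("group_size", summary["group_size"])
        break  # the first group is enough for a summary
    return summary
```

Calling `describe_quantization(load_config("config.json"))` on a downloaded checkpoint returns a small dictionary that can be compared across repositories.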
https://github.com/vllm-project/llm-compressor/blob/main/examples/awq/README.md
I have used cpatonn's AWQ-4bit variants for about 7 to 8 months now, and they are definitely quantized. I have built a complete sovereign AI infrastructure on these models, with no cloud dependency at all. I have attempted to serve numerous unquantized models, such as mistral-small-4-119b or qwen3.5-122b, on L40S-180 GPU instances (dual 48GB cards), and the model has to be properly configured in order to shard across multiple GPUs. This is where you come to get guaranteed working models. Hopefully he has time to quantize the new nemotron-3-nano-omni-30-reasoning model (these names are just getting way too long). I had to use a random quantized model from a reputable user/repo/space (drawais), and it works, including all modalities. It's an any-to-any model. I run all my models via vLLM and docker-compose.