Instructions to use cyankiwi/MiniMax-M2.7-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("cyankiwi/MiniMax-M2.7-AWQ-4bit", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cyankiwi/MiniMax-M2.7-AWQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
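
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from the commands above is listening on localhost:8000 and returns an OpenAI-style response body. The helper names `build_chat_request`, `extract_reply`, and `send` are hypothetical, introduced here for illustration.

```python
import json
import urllib.request

MODEL_ID = "cyankiwi/MiniMax-M2.7-AWQ-4bit"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def send(request):
    """Send the request; this needs a running server, so it is not called here."""
    with urllib.request.urlopen(request) as response:
        return extract_reply(response.read())


request = build_chat_request("What is the capital of France?")
print(request.get_method(), request.full_url)
```

With a server running, `send(request)` would return the reply text. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.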

Use Docker

docker model run hf.co/cyankiwi/MiniMax-M2.7-AWQ-4bit

SGLang

How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cyankiwi/MiniMax-M2.7-AWQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cyankiwi/MiniMax-M2.7-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/MiniMax-M2.7-AWQ-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use cyankiwi/MiniMax-M2.7-AWQ-4bit with Docker Model Runner:
```
docker model run hf.co/cyankiwi/MiniMax-M2.7-AWQ-4bit
```

These are NOT actual AWQ-quantized models.

by cai-cai - opened Apr 15

Discussion

cai-cai

Apr 15

Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.
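
The check described in this post can be sketched as follows. This is a minimal illustration, assuming the `quantization_config` layouts commonly written by compressed-tensors (`quant_method`, `format`, `config_groups`) and by AutoAWQ (`quant_method`, `bits`, `group_size`). The sample JSON is made up for the example and is not copied from this repository's config.json. Note that the field only reports how the weights are stored.

```python
import json


def describe_quantization(config_text):
    """Summarize the storage format declared in a config.json string."""
    qcfg = json.loads(config_text).get("quantization_config")
    if qcfg is None:
        return "no quantization_config"
    method = qcfg.get("quant_method", "unknown")
    if method == "compressed-tensors":
        # Assumed layout: the weight scheme is nested under config_groups.
        groups = qcfg.get("config_groups", {})
        bits = sorted(
            {g["weights"]["num_bits"] for g in groups.values() if g.get("weights")}
        )
        return f"compressed-tensors format={qcfg.get('format')} weight_bits={bits}"
    if method == "awq":
        return f"awq bits={qcfg.get('bits')} group_size={qcfg.get('group_size')}"
    return f"other quant_method={method}"


# Hypothetical sample, not this repository's actual config.json.
SAMPLE_CONFIG = json.dumps({
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "format": "pack-quantized",
        "config_groups": {
            "group_0": {
                "weights": {"num_bits": 4, "type": "int", "group_size": 128},
                "input_activations": None,
            }
        },
    }
})

print(describe_quantization(SAMPLE_CONFIG))
```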

cpatonn

cyankiwi org Apr 15

AWQ is the algorithm used to optimize this model, whereas compressed-tensors is the format (i.e., weight_packed, weight_scale, weight_zero_point, weight_shape) in which the model is saved after quantization.

With regard to the kernels used for inference, vLLM uses the same Marlin kernel for both the compressed-tensors and AutoAWQ formats, but via different routes.
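
The distinction between algorithm and storage format can be illustrated with a toy sketch. It applies AWQ-style per-channel scaling before a plain 4-bit quantization, then stores the result under the four field names mentioned above. Everything else is an assumption for illustration: the scale values are hand-picked rather than searched from calibration activations, and the nibble order and single-row layout are not the actual compressed-tensors layout.

```python
def quantize_group(values, bits=4):
    """Asymmetric uniform quantization of one group to `bits`-bit integers."""
    qmax = (1 << bits) - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # keep 0 representable
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = round(-lo / scale)
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def pack_nibbles(q):
    """Pack 4-bit integers, eight per 32-bit word (illustrative order)."""
    packed = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= v << (4 * j)
        packed.append(word)
    return packed


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles for the first `count` values."""
    return [(packed[i // 8] >> (4 * (i % 8))) & 0xF for i in range(count)]


def scale_then_quantize(weights, channel_scales):
    """Algorithm step: scale channels, quantize. Format step: store the fields."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    q, scale, zero_point = quantize_group(scaled)
    return {
        "weight_packed": pack_nibbles(q),
        "weight_scale": scale,
        "weight_zero_point": zero_point,
        "weight_shape": [1, len(weights)],
    }


def dequantize(record, channel_scales):
    """Recover approximate weights, undoing the per-channel scaling."""
    count = record["weight_shape"][1]
    q = unpack_nibbles(record["weight_packed"], count)
    return [
        (v - record["weight_zero_point"]) * record["weight_scale"] / s
        for v, s in zip(q, channel_scales)
    ]


weights = [0.1, -0.2, 0.3, 0.05, -0.4, 0.25, 0.0, 0.15]
channel_scales = [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical values
record = scale_then_quantize(weights, channel_scales)
restored = dequantize(record, channel_scales)
print(record["weight_packed"], [round(x, 3) for x in restored])
```

The stored record has the same fields whether or not the scaling step ran, which is why the saved format by itself does not identify the algorithm that produced the weights. In this sketch the scaling is undone in `dequantize` purely to keep the example self-contained.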

CHNtentes

Apr 15

> Heads up! Despite the "AWQ" tag in the title, the config.json reveals these models are using standard compressed-tensors (W4A16) rather than the AWQ (Activation-aware Weight Quantization) method. Real AWQ requires an activation calibration process and specific scaling factors, which are missing here. This is misleading for users looking for actual AWQ kernels.

https://github.com/vllm-project/llm-compressor/blob/main/examples/awq/README.md

zhuyuzhe1987

Apr 16

This model can be used with Lvllm for hybrid inference: https://github.com/guqiong96/Lvllm/blob/main/README.md

aetherforge

May 2

I have used cpatonn's AWQ-4bit variants for about 7 to 8 months now, and they are definitely quantized. I have built a complete sovereign AI infrastructure on these models, with no cloud dependency at all. I have tried to serve numerous unquantized models, such as mistral-small-4-119b or qwen3.5-122b, on L40S-180 GPU instances (dual 48GB cards); a model has to be properly configured in order to shard across multiple GPUs. This is where you come to get guaranteed working models. Hopefully he has time to quantize the new nemotron-3-nano-omni-30-reasoning model (these names are just getting way too long). I had to use a random quantized model from a reputable user/repo/space (drawais), and it works, including all modalities. It's an any-to-any model. I run all my models via vLLM and docker-compose.
