---
language:
  - en
library_name: mlx
tags:
  - minimax
  - MOE
  - pruning
  - compression
  - reap
  - cerebras
  - code
  - function-calling
  - mlx
license: apache-2.0
pipeline_tag: text-generation
base_model: 0xSero/MiniMax-M2.1-REAP-50
---

# MiniMax REAP-50 MLX 4-bit

This is a 4-bit quantized version of [0xSero/MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50), converted for Apple Silicon using MLX.

## Quantization Details

- **Quantization:** 4-bit
- **Format:** MLX SafeTensors
- **Optimization:** Apple Silicon (M-series chips)

## Usage

### Python

```python
from mlx_lm import load, generate

model, tokenizer = load("minimax-reap50-mlx-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Write a function to calculate fibonacci numbers",
    max_tokens=500,
    verbose=True,
)
print(response)
```

### mlx_lm.server

Start the server:

```bash
mlx_lm.server --model minimax-reap50-mlx-4bit --port 8080
```

Make requests:

```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "prompt": "Write a function to calculate fibonacci numbers",
    "max_tokens": 500
  }'
```

Or use the chat endpoint:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "messages": [
      {"role": "user", "content": "Write a function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500
  }'
```
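The same chat endpoint can also be called from Python with only the standard library. A minimal sketch, assuming the server above is running on port 8080; `build_chat_request` and `chat` are illustrative helper names, not part of mlx_lm:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 500) -> dict:
    # Mirrors the JSON body from the curl example above
    # (OpenAI-compatible chat completions format).
    return {
        "model": "default_model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # OpenAI-style response shape: content of the first choice's message.
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a function to calculate fibonacci numbers"))
```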

## Trade-offs

- **Memory:** lowest footprint of the available quantizations (~65.5 GB)
- **Quality:** acceptable, with minor degradation relative to higher-precision variants
- **Speed:** fastest inference among the available quantizations
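Given the ~65.5 GB weight footprint, it is worth confirming before loading that the machine's unified memory covers the weights plus some headroom for the KV cache and the OS. A small illustrative check; `fits_in_memory` and the 8 GB headroom figure are assumptions, not measured values:

```python
# Approximate size of the 4-bit weights, from the figure above.
REQUIRED_GB = 65.5

def fits_in_memory(total_memory_gb: float,
                   required_gb: float = REQUIRED_GB,
                   headroom_gb: float = 8.0) -> bool:
    """Return True if total unified memory covers the model weights
    plus headroom for the KV cache and the operating system."""
    return total_memory_gb >= required_gb + headroom_gb

# A 128 GB machine passes this check; a 64 GB machine does not.
```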