---
language:
- en
library_name: mlx
tags:
- minimax
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- mlx
license: apache-2.0
pipeline_tag: text-generation
base_model: 0xSero/MiniMax-M2.1-REAP-50
---

# MiniMax REAP-50 MLX 4-bit

This is a 4-bit quantized version of the MiniMax REAP-50 model (`0xSero/MiniMax-M2.1-REAP-50`), converted to MLX for Apple Silicon.

## Quantization Details

- **Quantization**: 4-bit
- **Format**: MLX SafeTensors
- **Optimization**: Apple Silicon (M-series chips)

## Usage

### Python

```python
from mlx_lm import load, generate

# Path to a local copy of this model (or its Hugging Face repo ID).
model, tokenizer = load("minimax-reap50-mlx-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Write a function to calculate fibonacci numbers",
    max_tokens=500,
    verbose=True,  # stream the generated text to stdout
)
# `response` holds the full generated text; with verbose=True it has
# already been printed, so a second print(response) is not needed.
```
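
For chat-style prompts, it usually helps to wrap the request in the model's chat template before generating. The sketch below assumes the tokenizer ships a chat template, which this card does not confirm. `build_chat_prompt` and `run_chat_example` are illustrative helper names, not part of `mlx_lm`.

```python
def build_chat_prompt(tokenizer, user_message):
    """Wrap a single user message in the tokenizer's chat template."""
    messages = [{"role": "user", "content": user_message}]
    # add_generation_prompt=True appends the marker that cues the assistant turn.
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)


def run_chat_example():
    # Imported here so the helper above can be used without MLX installed.
    from mlx_lm import load, generate

    model, tokenizer = load("minimax-reap50-mlx-4bit")
    prompt = build_chat_prompt(
        tokenizer, "Write a function to calculate fibonacci numbers"
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=500, verbose=True)
```

Calling `run_chat_example()` on an Apple Silicon machine with the model available should stream the reply to stdout and return it as a string.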

### mlx_lm.server

Start the server:

```bash
mlx_lm.server --model minimax-reap50-mlx-4bit --port 8080
```

Make requests:

```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "prompt": "Write a function to calculate fibonacci numbers",
    "max_tokens": 500
  }'
```

Or use the chat endpoint:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "messages": [
      {"role": "user", "content": "Write a function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500
  }'
```
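
The same chat request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is running locally on port 8080 as started above and returns an OpenAI-style response body. `build_chat_request` and `chat` are hypothetical helper names introduced for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches the --port used when starting the server


def build_chat_request(prompt, max_tokens=500, base_url=BASE_URL):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": "default_model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=500):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Write a function to calculate fibonacci numbers")` should return the generated reply as a string.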

## Trade-offs

- **Memory**: Lowest memory footprint (~65.5 GB)
- **Quality**: Acceptable quality with minor degradation
- **Speed**: Fastest inference speed
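
A memory figure like the one above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes group-wise quantization with a group size of 64 and a 16-bit scale plus a 16-bit bias stored per group. These are assumptions about the quantization settings, which this card does not state. The parameter count in the example is hypothetical and `quantized_weight_gb` is an illustrative helper.

```python
def quantized_weight_gb(num_params, bits=4, group_size=64, overhead_bits=32):
    """Rough weight size in GB (10^9 bytes) for group-wise quantization.

    Each group of `group_size` weights is assumed to store one 16-bit scale
    and one 16-bit bias (`overhead_bits` = 32 in total).
    """
    bits_per_weight = bits + overhead_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical parameter count, chosen only to illustrate the arithmetic.
print(f"{quantized_weight_gb(100e9):.2f} GB")  # 4.5 bits/weight -> 56.25 GB
```

This covers weights only. Real checkpoints can differ because some tensors may be kept at higher precision, and runtime memory also includes the KV cache and activations.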