---
language:
- en
library_name: mlx
tags:
- minimax
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- mlx
license: apache-2.0
pipeline_tag: text-generation
base_model: 0xSero/MiniMax-M2.1-REAP-50
---

# MiniMax REAP-50 MLX 4-bit

This is a 4-bit quantized version of [0xSero/MiniMax-M2.1-REAP-50](https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50), a REAP expert-pruned MiniMax MoE model, converted to MLX for inference on Apple Silicon.

## Quantization Details
- **Quantization**: 4-bit
- **Format**: MLX SafeTensors
- **Optimization**: Apple Silicon (M-series chips)

## Usage

### Python

```python
from mlx_lm import load, generate

model, tokenizer = load("minimax-reap50-mlx-4bit")

response = generate(
    model, 
    tokenizer, 
    prompt="Write a function to calculate fibonacci numbers",
    max_tokens=500,
    verbose=True
)
print(response)
```

### mlx_lm.server

Start the server:

```bash
mlx_lm.server --model minimax-reap50-mlx-4bit --port 8080
```

Make requests:

```bash
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "prompt": "Write a function to calculate fibonacci numbers",
    "max_tokens": 500
  }'
```

Or use the chat endpoint:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default_model",
    "messages": [
      {"role": "user", "content": "Write a function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500
  }'
```
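The same chat request can also be issued from Python. The sketch below is a minimal standard-library client, assuming the server above is running on port 8080; `build_chat_payload` and `chat` are hypothetical helper names introduced here for illustration:

```python
import json
import urllib.request

# Assumes mlx_lm.server is running locally on port 8080 (see above).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_payload(prompt: str, max_tokens: int = 500) -> dict:
    """Build the JSON body expected by the OpenAI-compatible chat endpoint."""
    return {
        "model": "default_model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, max_tokens: int = 500) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(build_chat_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `chat("Write a function to calculate fibonacci numbers")` returns the assistant's reply; the response schema follows the OpenAI chat completions format.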

## Trade-offs
- **Memory**: Lowest memory footprint of the available quantizations (~65.5 GB)
- **Quality**: Minor degradation relative to the higher-bit variants
- **Speed**: Fastest inference of the available quantizations