README.md · AlexGS74/MiniMax-M2.1-REAP-50-mlx-4bit at main

	---
	language:
	- en
	library_name: mlx
	tags:
	- minimax
	- MOE
	- pruning
	- compression
	- reap
	- cerebras
	- code
	- function-calling
	- mlx
	license: apache-2.0
	pipeline_tag: text-generation
	base_model: 0xSero/MiniMax-M2.1-REAP-50
	---

	# MiniMax REAP-50 MLX 4-bit

	This is a 4-bit quantized version of `0xSero/MiniMax-M2.1-REAP-50`, converted to MLX format for Apple Silicon.

	## Quantization Details
	- Quantization: 4-bit
	- Format: MLX SafeTensors
	- Optimization: Apple Silicon (M-series chips)

	## Usage

	### Python

	```python
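	from mlx_lm import load, generate

	# Path to a local copy of the model (a Hugging Face repo ID also works)
	model, tokenizer = load("minimax-reap50-mlx-4bit")

	# verbose=True streams the generated text to stdout as it is produced
	response = generate(
	    model,
	    tokenizer,
	    prompt="Write a function to calculate fibonacci numbers",
	    max_tokens=500,
	    verbose=True,
	)
	# The generated text is also returned as a string
	print(response)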
	```
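
For chat-style use, the prompt is usually wrapped in the model's chat template before generation. The helper below is a minimal sketch of that step, assuming the tokenizer returned by `load` exposes the Hugging Face-style `apply_chat_template` method and that this model ships a chat template. `build_chat_prompt` and its parameters are hypothetical names introduced here for illustration.

```python
def build_chat_prompt(tokenizer, user_message, system_message=None):
    """Format a single-turn conversation with the tokenizer's chat template.

    `tokenizer` is assumed to provide a Hugging Face-style
    `apply_chat_template` method (an assumption, not confirmed by this README).
    """
    messages = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})

    # add_generation_prompt=True appends the assistant-turn header so the
    # model starts answering; tokenize=False returns a string, not token IDs.
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
```

The returned string can then be passed as `prompt=` to `generate` in place of the raw text.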

	### mlx_lm.server

	Start the server:

	```bash
	mlx_lm.server --model minimax-reap50-mlx-4bit --port 8080
	```

	Make requests:

	```bash
	curl -X POST http://localhost:8080/v1/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "default_model",
	    "prompt": "Write a function to calculate fibonacci numbers",
	    "max_tokens": 500
	  }'
	```

	Or use the chat endpoint:

	```bash
	curl -X POST http://localhost:8080/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "default_model",
	    "messages": [
	      {"role": "user", "content": "Write a function to calculate fibonacci numbers"}
	    ],
	    "max_tokens": 500
	  }'
	```
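
The same chat request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is running on port 8080 as started above and that it returns an OpenAI-style response body with the reply under `choices[0].message.content`. The function names are hypothetical helpers, not part of `mlx_lm`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches the --port 8080 server command


def build_chat_request(content, max_tokens=500, base_url=BASE_URL):
    """Build a POST request for the chat endpoint (nothing is sent yet)."""
    payload = {
        "model": "default_model",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def chat(content, max_tokens=500, base_url=BASE_URL):
    """Send one user message to the running server and return the reply text."""
    request = build_chat_request(content, max_tokens, base_url)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat("Write a function to calculate fibonacci numbers")` returns the generated text as a string.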

	## Trade-offs
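	- Memory: ~65.5 GB, a smaller footprint than higher-bit quantizations of the same model
	- Quality: Acceptable, with minor degradation compared to the unquantized model
	- Speed: Faster inference than higher-bit quantizations of the same model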
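
As a rough sanity check on the ~65.5 GB figure, the arithmetic below relates footprint to parameter count. It is a back-of-the-envelope sketch under assumptions this README does not state: group-wise quantization with a group size of 64 and a 16-bit scale and 16-bit bias per group, 1 GB = 10^9 bytes, and every weight quantized. In practice some tensors may be kept at higher precision, so treat the result as an approximation.

```python
def effective_bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per weight, including per-group scale and bias overhead.

    The defaults are assumptions (group size 64, fp16 scale and bias),
    not values taken from this model card.
    """
    return bits + (scale_bits + bias_bits) / group_size


def approx_quantized_params(footprint_gb, bits_per_weight):
    """Approximate number of weights that fit in `footprint_gb` (1 GB = 1e9 bytes)."""
    return footprint_gb * 1e9 * 8 / bits_per_weight


bpw = effective_bits_per_weight()            # 4 + 32/64 = 4.5 bits per weight
params = approx_quantized_params(65.5, bpw)  # footprint figure from the list above
print(f"{bpw} bits/weight, ~{params / 1e9:.0f}B quantized parameters")
```

Under these assumptions the footprint corresponds to roughly 116 billion quantized weights.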