How to run 0xSero/INTELLECT-3-REAP-50 with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use 0xSero/INTELLECT-3-REAP-50 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/INTELLECT-3-REAP-50")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

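tokenizer = AutoTokenizer.from_pretrained("0xSero/INTELLECT-3-REAP-50")
# The checkpoint is 107 GB in BF16: keep the stored dtype and let
# accelerate (required for device_map="auto") place the weights across devices.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/INTELLECT-3-REAP-50",
    torch_dtype="auto",
    device_map="auto",
)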
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use 0xSero/INTELLECT-3-REAP-50 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/INTELLECT-3-REAP-50"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/INTELLECT-3-REAP-50",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
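
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the default URL assumes the vLLM server from the previous step is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="0xSero/INTELLECT-3-REAP-50"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` returns the reply text. For the SGLang server described further down, pass `base_url="http://localhost:30000"`.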

SGLang

How to use 0xSero/INTELLECT-3-REAP-50 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/INTELLECT-3-REAP-50" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/INTELLECT-3-REAP-50",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/INTELLECT-3-REAP-50" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/INTELLECT-3-REAP-50",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use 0xSero/INTELLECT-3-REAP-50 with Docker Model Runner:
```
docker model run hf.co/0xSero/INTELLECT-3-REAP-50
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Support this work: donate.sybilsolutions.ai

REAP surfaces: GLM | MiniMax | Qwen | Gemma | Paper | Code | PR17 | Cerebras Collection

INTELLECT-3-REAP-50

50% expert-pruned version of PrimeIntellect/INTELLECT-3 using Cerebras REAP (Router-weighted Expert Activation Pruning).

Model Details

| Property | Value |
|---|---|
| Base Model | PrimeIntellect/INTELLECT-3 (106B MoE) |
| Architecture | GLM-4 MoE (glm4_moe) |
| Compression | 50% (64 of 128 experts pruned per layer) |
| Remaining Experts | 64 per layer |
| Parameters | ~53B |
| Format | BF16 SafeTensors |
| Size | 107 GB |

REAP Configuration

dataset: 0xSero/glm47-reap-calibration-v2
samples: 1360
# Calibration mix (700 + 330 + 330 = 1360):
#   evol-codealpaca-v1: 700 (code generation)
#   xlam-function-calling-60k: 330 (function calling)
#   SWE-smith-trajectories: 330 (agentic multi-turn)
distance_measure: angular
seed: 42
model_max_length: 2048
compression_ratio: 0.50
prune_method: reap
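
To make `prune_method: reap` and `compression_ratio: 0.50` more concrete, here is a simplified sketch of the scoring idea as described in the REAP paper: each expert is scored by the average, over the calibration tokens routed to it, of its router gate weight times the norm of its output, and the lowest-scoring experts in a layer are dropped. This is not the Cerebras implementation. The record format and the function names `reap_saliency` and `experts_to_prune` are hypothetical, and details such as normalization and experts that receive no tokens are omitted.

```python
from collections import defaultdict


def reap_saliency(records):
    """Average gate_weight * activation_norm per expert.

    records: iterable of (expert_id, gate_weight, activation_norm),
    one entry per (token, routed expert) pair from the calibration set.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_weight, activation_norm in records:
        totals[expert_id] += gate_weight * activation_norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_prune(saliency, compression_ratio):
    """Return the ids of the lowest-saliency experts in one layer."""
    n_prune = int(len(saliency) * compression_ratio)
    # Sort by score, breaking ties by expert id for determinism.
    ranked = sorted(saliency, key=lambda e: (saliency[e], e))
    return sorted(ranked[:n_prune])


# Toy layer with four experts (made-up numbers, for illustration only).
toy_records = [
    (0, 0.9, 2.0), (0, 0.7, 2.0),
    (1, 0.1, 1.0),
    (2, 0.5, 4.0),
    (3, 0.2, 1.0),
]
toy_pruned = experts_to_prune(reap_saliency(toy_records), compression_ratio=0.50)
```

In the toy layer, experts 1 and 3 have the lowest scores, so a ratio of 0.50 removes those two and keeps experts 0 and 2. The real model applies the same ratio per layer, going from 128 to 64 experts.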

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/INTELLECT-3-REAP-50"

# device_map="auto" (requires accelerate) places the weights across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
# return_dict=True yields input_ids and attention_mask, both passed to generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Related Models

| Model | Compression | Format | Size |
|---|---|---|---|
| INTELLECT-3-REAP-50 | 50% | BF16 | 107 GB |
| INTELLECT-3-REAP-50-W4A16 | 50% | W4A16 GPTQ | ~30 GB (coming soon) |
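
The two sizes in the table can be related with simple arithmetic on bits per weight. The sketch below is a rough estimate only: the helper names are hypothetical, GB is taken as 10**9 bytes, and file metadata is ignored.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def params_from_size(size_gb, bits_per_weight):
    """Invert the estimate: parameter count implied by a checkpoint size."""
    return size_gb * 1e9 * 8 / bits_per_weight


# 107 GB of BF16 weights (16 bits each) implies about 53.5B parameters.
n_params = params_from_size(107, bits_per_weight=16)

# The same weights at 4 bits each, before quantization scales and any
# layers kept in higher precision are added back.
w4_lower_bound_gb = weight_size_gb(n_params, bits_per_weight=4)
```

The 4-bit figure comes out at 26.75 GB, which is a lower bound. Group scales and any tensors left unquantized would plausibly account for the gap to the ~30 GB listed for the W4A16 variant.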

Citation

@article{cerebras2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author={Cerebras Systems},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Acknowledgments

Prime Intellect - For sponsoring compute and creating INTELLECT-3
Cerebras - For the REAP pruning methodology
Pruned using the Cerebras REAP implementation

This model was created as part of efficiency research for large MoE models.

Support and links

Donate: https://donate.sybilsolutions.ai
X: https://x.com/0xsero
GitHub: https://github.com/0xsero

Model tree for 0xSero/INTELLECT-3-REAP-50

Base model: zai-org/GLM-4.5-Air-Base
Finetuned: PrimeIntellect/INTELLECT-3
Finetuned (one of 5): this model
Quantizations: 2 models

Paper for 0xSero/INTELLECT-3-REAP-50

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression (arXiv 2510.13999, published Oct 15, 2025)
