The Kimi-K2.5-PRISM tensors are available by purchase only. To buy them, reach out here: https://ko-fi.com/s/64a50000a4
Kimi-K2.5-PRISM
An unrestricted/unchained PRISM version of Moonshot AI's Kimi-K2.5 with over-refusal and propaganda mechanisms completely removed using our advanced PRISM pipeline (Projected Refusal Isolation via Subspace Modification).
☕ Support Our Work
If you enjoy our work and find it useful, please sponsor and support!
| Option | Description |
|---|---|
| PRISM VIP Membership | Day-0 Access to all PRISM models |
| One-Time Support | Purchase this model |
Model Highlights
- PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- 1T MoE Architecture — 1 trillion total parameters with 32 billion active per token across 384 experts
- Native Multimodal — Pre-trained on vision-language tokens for seamless image, video, and text understanding
- 256K Context Window — Extended context for complex agentic tasks and large codebases
- Dual Modes — Supports both Thinking (deep reasoning) and Instant (fast response) modes
- Agent Swarm — Self-directed, coordinated multi-agent execution for complex tasks
Model Architecture
| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 |
| Attention Hidden Dimension | 7168 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M) |
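As a back-of-envelope reading of the table above, the sketch below computes how sparse the model is per token. It uses only the rounded figures from the table, so the results are approximate. The names are illustrative, and it assumes that the 384 figure counts routed experts only, with the shared expert evaluated in addition.

```python
# Rounded figures taken from the architecture table.
TOTAL_PARAMS = 1_000_000_000_000  # 1T total parameters
ACTIVE_PARAMS = 32_000_000_000    # 32B activated per token
NUM_EXPERTS = 384                 # assumption: routed experts only
SELECTED_EXPERTS = 8              # routed experts selected per token
SHARED_EXPERTS = 1                # assumption: always evaluated


def active_param_fraction(total: int = TOTAL_PARAMS, active: int = ACTIVE_PARAMS) -> float:
    """Fraction of all parameters used to process a single token."""
    return active / total


def routed_expert_fraction(selected: int = SELECTED_EXPERTS, experts: int = NUM_EXPERTS) -> float:
    """Fraction of routed experts selected for a single token."""
    return selected / experts


def experts_per_token(selected: int = SELECTED_EXPERTS, shared: int = SHARED_EXPERTS) -> int:
    """Number of experts evaluated per token (routed plus shared)."""
    return selected + shared


print(f"Active parameters per token: {active_param_fraction():.1%}")
print(f"Routed experts used per token: {routed_expert_fraction():.2%}")
print(f"Experts evaluated per token: {experts_per_token()}")
```

Under these assumptions, roughly 3% of the parameters and about 2% of the routed experts take part in any single token.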
Benchmarks
| Benchmark | Kimi K2.5 (Thinking) | GPT-5.2 | Claude 4.5 Opus | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2025 | 96.1 | 100 | 92.8 | 95.0 |
| GPQA-Diamond | 87.6 | 92.4 | 87.0 | 91.9 |
| HLE-Full | 30.1 | 34.5 | 30.8 | 37.5 |
| HLE-Full (w/ tools) | 50.2 | 45.5 | 43.2 | 45.8 |
| SWE-Bench Verified | 76.8 | 80.0 | 80.9 | 76.2 |
| Terminal Bench 2.0 | 50.8 | 54.0 | 59.3 | 54.2 |
| BrowseComp | 60.6 | 65.8 | 37.0 | 37.8 |
| MMMU-Pro | 78.5 | 79.5 | 74.0 | 81.0 |
| VideoMMMU | 86.6 | 85.9 | 84.4 | 87.6 |
Usage
Transformers
Install dependencies:
pip install git+https://github.com/huggingface/transformers.git
Basic chat completion:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/Kimi-K2.5-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant."},
    {"role": "user", "content": "Hello!"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=1.0, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(output_text)
Chat with Image
import base64
import requests

# Load image
url = "https://example.com/image.png"
image_base64 = base64.b64encode(requests.get(url).content).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }
]
# Use same generation code as above
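The snippet above hard-codes `image/png` in the data URL. A minimal sketch of a helper that builds the same message layout from raw bytes and an explicit MIME type follows. `build_image_message` is a hypothetical name, not part of any library, and whether a given serving stack accepts image types other than PNG is an assumption to verify.

```python
import base64


def build_image_message(image_bytes: bytes, prompt: str, mime_type: str = "image/png") -> list:
    """Return a one-turn `messages` list with a text part and a base64 data-URL image.

    Hypothetical helper that mirrors the message layout shown above.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }
    ]


# Placeholder bytes stand in for real image data read from disk or fetched over HTTP.
messages = build_image_message(b"\x89PNG\r\n", "Describe this image in detail.")
print(messages[0]["content"][1]["image_url"]["url"].split(",")[0])
```

Passing the MIME type explicitly keeps the data URL consistent with the actual file format, for example `image/jpeg` for a JPEG.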
vLLM
Install vLLM nightly:
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
Serve the model:
vllm serve Ex0bit/Kimi-K2.5-PRISM \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --served-model-name kimi-k2.5-prism
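Once the server is running, it can be queried through its OpenAI-compatible endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000` and that the model is exposed under the `kimi-k2.5-prism` name set by `--served-model-name`. `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL_NAME = "kimi-k2.5-prism"         # matches --served-model-name


def build_chat_request(prompt: str, max_tokens: int = 4096) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one chat turn to the server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Example (requires a running server):
#     print(chat("Hello!"))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log the request before sending it.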
SGLang
python3 -m sglang.launch_server \
    --model-path Ex0bit/Kimi-K2.5-PRISM \
    --tp-size 8 \
    --trust-remote-code \
    --served-model-name kimi-k2.5-prism \
    --host 0.0.0.0 \
    --port 8000
Recommended Parameters
| Mode | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Thinking | 1.0 | 0.95 | 96000 |
| Instant | 0.6 | 0.95 | 4096 |
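The values in the table can be kept in code as a small preset lookup so that each request uses the matching settings. This is a sketch: `SAMPLING_PRESETS` and `sampling_params` are hypothetical names, and the keys follow OpenAI-style request fields (`max_tokens`), whereas Transformers' `generate` calls the same limit `max_new_tokens`.

```python
# Recommended values copied from the table above.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 96000},
    "instant": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 4096},
}


def sampling_params(mode: str) -> dict:
    """Return a copy of the recommended sampling settings for a mode."""
    key = mode.lower()
    if key not in SAMPLING_PRESETS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    # Copy so callers can modify the result without changing the presets.
    return dict(SAMPLING_PRESETS[key])


print(sampling_params("Instant"))
```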
Switching Modes
For Instant mode (faster, no reasoning), pass:
# Official API
extra_body={"thinking": {"type": "disabled"}}
# vLLM/SGLang
extra_body={"chat_template_kwargs": {"thinking": False}}
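`extra_body` is the OpenAI Python client's argument for adding extra fields to the request JSON. A client that sends raw HTTP places those fields at the top level of the request body instead. The sketch below builds such a body for a vLLM or SGLang server. `build_request_body` is a hypothetical helper, and it assumes that omitting `chat_template_kwargs` leaves the server in Thinking mode, which the text above implies but does not state.

```python
def build_request_body(prompt: str, model: str = "kimi-k2.5-prism", instant: bool = False) -> dict:
    """Build a chat-completions body, optionally disabling thinking (vLLM/SGLang form)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if instant:
        # Same field that extra_body={"chat_template_kwargs": {...}} would add.
        body["chat_template_kwargs"] = {"thinking": False}
    return body


print(build_request_body("Hello!", instant=True))
```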
Hardware Requirements
Due to the 1T parameter size, this model requires significant hardware:
- Minimum: 8x A100 80GB or equivalent
- Recommended: 8x H100 80GB for optimal performance
- INT4 Quantization: Available for reduced memory footprint
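A rough way to sanity-check these figures is to estimate the memory taken by the weights alone. The sketch below is back-of-envelope arithmetic under stated assumptions: exactly 1T parameters, one uniform precision for every weight, decimal gigabytes, and no allowance for the KV cache, activations, or runtime overhead. Real requirements will be higher.

```python
TOTAL_PARAMS = 1e12   # assumption: exactly 1T parameters
GPU_MEMORY_GB = 80    # per-GPU memory in the configurations listed above
NUM_GPUS = 8


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


node_memory_gb = GPU_MEMORY_GB * NUM_GPUS
for label, bits in [("BF16", 16), ("INT4", 4)]:
    needed = weight_memory_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if needed <= node_memory_gb else "does not fit"
    print(f"{label}: {needed:.0f} GB of weights, {verdict} in {node_memory_gb} GB")
```

Under these assumptions, only the 4-bit weights (about 500 GB) fit within the 640 GB of an 8x 80GB node, which suggests the listed configurations assume quantized weights.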
License
This model is released under the PRISM Research License.
Acknowledgments
Based on Kimi-K2.5 by Moonshot AI. See the technical blog for more details on the base model.