Instructions to use prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX
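The curl call to the vLLM server can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `post_chat` are hypothetical, and the default base URL assumes the server was started locally with `vllm serve` on port 8000 as in the example above.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same request body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the OpenAI-compatible endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style chat completions return the text under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url))` should return the generated description as a string. For an SGLang server started as described on this page, pass `base_url="http://localhost:30000"`.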
- SGLang
How to use prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
- Docker Model Runner
How to use prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX with Docker Model Runner:
docker model run hf.co/prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX
MiniCPM-V-4.6-Thinking-abliterated-MAX
MiniCPM-V-4.6-Thinking-abliterated-MAX is an abliterated variant of openbmb/MiniCPM-V-4.6-Thinking. It applies refusal direction analysis and ablation-based optimization to reduce internal refusal behaviors while preserving the multimodal reasoning, explicit thinking capabilities, and instruction-following strengths of the original architecture. The result is a compact multimodal reasoning model for image, video, and text understanding with step-by-step reasoning and improved instruction adherence.
This model is intended for research and learning purposes only. It reduces internal refusal behaviors, and any content generated by it is used at the user’s own risk. The authors and hosting page disclaim any liability for outputs produced by this model. Users are responsible for ensuring safe, ethical, and lawful usage.
Evals
.eval_results: harm_bench_score.yaml
The evaluation was conducted using 2,000 random harmful test prompts to measure the refusal behavior of the language model. The self-reported evaluations provided here are intended only to give an overview of the model. Scores may vary depending on the benchmark and the evaluation strategy used.
Key Highlights
Advanced Refusal Direction Analysis: Uses targeted activation analysis to identify and mitigate refusal directions within the model’s latent space.
Abliterated MAX Optimization: Fine-tuned to significantly reduce refusal patterns while maintaining coherent, detailed, and reasoning-oriented outputs.
Thinking-Enabled Multimodal Reasoning: Supports explicit step-by-step reasoning for complex multimodal tasks involving text, images, and videos.
Efficient Multimodal Architecture: Built on openbmb/MiniCPM-V-4.6-Thinking, combining SigLIP2-400M vision encoding with Qwen3.5-0.8B language capabilities for compact yet powerful multimodal understanding.
Image & Video Understanding: Optimized for advanced reasoning across text, images, and video inputs while remaining highly efficient for edge and local deployment.
262K Long Context Support: Supports extremely long multimodal contexts across text, image, and video modalities.
Improved Instruction Adherence: Designed to follow complex prompts with fewer unnecessary refusals while preserving strong conversational and reasoning performance.
Edge-Optimized Deployment: Suitable for local inference, lightweight multimodal AI systems, and edge-device experimentation with minimal hardware overhead.
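Because the model produces explicit step-by-step reasoning, downstream code often needs to separate that reasoning from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, as is common for thinking-style models; this card does not confirm the exact delimiters, so check the model's chat template before relying on it. The helper name `split_thinking` is hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>; verify against the chat template.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no thinking block is found, the reasoning is empty and the
    whole output is treated as the answer.
    """
    match = THINK_PATTERN.search(output)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first thinking block is kept as the answer.
    answer = (output[:match.start()] + output[match.end():]).strip()
    return reasoning, answer
```

Keeping the reasoning as a separate string makes it straightforward to log it for inspection while showing only the answer to end users.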
Quick Start with Transformers
pip install transformers==5.8.0 gradio==6.14.0
import gc
import time
from threading import Thread

import gradio as gr
import torch
from PIL import Image
from transformers import (
    MiniCPMV4_6ForConditionalGeneration,
    AutoProcessor,
    TextIteratorStreamer,
)

MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 1024

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

MODEL_ID = "prithivMLmods/MiniCPM-V-4.6-Thinking-abliterated-MAX"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = MiniCPMV4_6ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device).eval()


def generate(
    image: Image.Image,
    text: str,
    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
):
    # Validate inputs before doing any work.
    if image is None:
        yield "[ERROR] Please upload an image."
        return

    if not text or not text.strip():
        yield "[ERROR] Please enter your instruction."
        return

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": text},
            ],
        }
    ]

    prompt_full = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True,
    ).to(device)

    streamer = TextIteratorStreamer(
        processor.tokenizer if hasattr(processor, "tokenizer") else processor,
        skip_prompt=True,
        skip_special_tokens=True,
    )

    generation_error = {"error": None}

    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": int(max_new_tokens),
        "do_sample": True,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "top_k": int(top_k),
        "repetition_penalty": float(repetition_penalty),
    }

    def _run():
        # Run generation in a background thread so tokens can be streamed.
        try:
            model.generate(**generation_kwargs)
        except Exception as e:
            generation_error["error"] = e
            try:
                streamer.end()
            except Exception:
                pass

    thread = Thread(target=_run, daemon=True)
    thread.start()

    buffer = ""
    for new_text in streamer:
        buffer += new_text
        time.sleep(0.01)
        yield buffer

    thread.join(timeout=1.0)

    if generation_error["error"] is not None:
        err = f"[ERROR] {str(generation_error['error'])}"
        yield (buffer + "\n\n" + err) if buffer.strip() else err
        return

    if not buffer.strip():
        yield "[ERROR] No output was generated."

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_inference(
    image,
    text,
    max_new_tokens,
    temperature,
    top_p,
    top_k,
    repetition_penalty,
):
    yield from generate(
        image=image,
        text=text,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
    )


with gr.Blocks(title="MiniCPM-V-4.6-Thinking-abliterated-MAX") as demo:
    gr.Markdown(
        "# MiniCPM-V-4.6-Thinking-abliterated-MAX\n"
        "Upload an image and enter your instruction to run multimodal inference."
    )

    with gr.Row():
        with gr.Column(scale=1):
            image_input = gr.Image(type="pil", label="Input Image")
            text_input = gr.Textbox(
                label="Instruction",
                placeholder="e.g., Describe the image, perform OCR, solve the problem...",
                lines=4,
            )
            run_btn = gr.Button("Run Inference", variant="primary")

            with gr.Accordion("Advanced Settings", open=False):
                max_new_tokens = gr.Slider(
                    minimum=1,
                    maximum=MAX_MAX_NEW_TOKENS,
                    step=1,
                    value=DEFAULT_MAX_NEW_TOKENS,
                    label="Max New Tokens",
                )
                temperature = gr.Slider(
                    minimum=0.1,
                    maximum=4.0,
                    step=0.1,
                    value=0.6,
                    label="Temperature",
                )
                top_p = gr.Slider(
                    minimum=0.05,
                    maximum=1.0,
                    step=0.05,
                    value=0.9,
                    label="Top-p",
                )
                top_k = gr.Slider(
                    minimum=1,
                    maximum=1000,
                    step=1,
                    value=50,
                    label="Top-k",
                )
                repetition_penalty = gr.Slider(
                    minimum=1.0,
                    maximum=2.0,
                    step=0.05,
                    value=1.2,
                    label="Repetition Penalty",
                )

        with gr.Column(scale=1):
            output = gr.Textbox(
                label="Output",
                lines=20,
                placeholder="Output will appear here...",
            )

    run_btn.click(
        fn=run_inference,
        inputs=[
            image_input,
            text_input,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
            repetition_penalty,
        ],
        outputs=[output],
    )

if __name__ == "__main__":
    demo.queue(max_size=10).launch(show_error=True)
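When calling `generate` from a script instead of the Gradio UI, nothing enforces the slider limits. A small helper can clamp sampling parameters to the same ranges the demo exposes. This is a sketch, not part of the original demo: the names `SLIDER_RANGES`, `DEFAULTS`, and `build_generation_kwargs` are hypothetical, while the ranges and default values are copied from the sliders above.

```python
# Ranges copied from the Gradio sliders: name -> (minimum, maximum, type).
SLIDER_RANGES = {
    "max_new_tokens": (1, 4096, int),
    "temperature": (0.1, 4.0, float),
    "top_p": (0.05, 1.0, float),
    "top_k": (1, 1000, int),
    "repetition_penalty": (1.0, 2.0, float),
}

# Default values copied from the demo's slider defaults.
DEFAULTS = {
    "max_new_tokens": 1024,
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.2,
}


def build_generation_kwargs(**overrides) -> dict:
    """Return sampling kwargs with every value clamped to the demo's slider range."""
    unknown = set(overrides) - set(SLIDER_RANGES)
    if unknown:
        raise ValueError(f"Unknown sampling parameters: {sorted(unknown)}")
    kwargs = {"do_sample": True}
    for name, (low, high, cast) in SLIDER_RANGES.items():
        value = cast(overrides.get(name, DEFAULTS[name]))
        kwargs[name] = min(max(value, low), high)
    return kwargs
```

For example, `build_generation_kwargs(temperature=9.0)` yields a temperature of 4.0, the slider maximum, and leaves the other parameters at their defaults.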
Base Model Information
openbmb/MiniCPM-V-4.6-Thinking is the reasoning-enabled variant of OpenBMB’s 1.3B-parameter MiniCPM-V-4.6 series. It is built using SigLIP2-400M for visual encoding and Qwen3.5-0.8B as the language backbone, supporting text, image, and video inputs with up to 262K context length while adding explicit step-by-step “thinking” for complex multimodal reasoning tasks on edge and mobile hardware. It maintains the same compact and ultra-efficient architecture as the base MiniCPM-V-4.6 series.
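As a rough back-of-the-envelope check on the footprint of a model this size, the memory needed for the weights alone is the parameter count times the bytes per parameter. The sketch below uses the 1.3B figure stated above and assumes bfloat16 weights (2 bytes each), matching the `torch.bfloat16` setting in the Quick Start. It deliberately ignores activations, the KV cache, and framework overhead, all of which add to real usage; the function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight-only memory in GiB (excludes activations and KV cache)."""
    return num_params * bytes_per_param / 2**30


# 1.3B parameters, as stated for the MiniCPM-V-4.6 series.
TOTAL_PARAMS = 1.3e9

bf16_gib = weight_memory_gib(TOTAL_PARAMS, bytes_per_param=2)
fp32_gib = weight_memory_gib(TOTAL_PARAMS, bytes_per_param=4)

print(f"bf16 weights: ~{bf16_gib:.2f} GiB")
print(f"fp32 weights: ~{fp32_gib:.2f} GiB")
```

In bfloat16 this comes to roughly 2.4 GiB for the weights, which is consistent with the card's positioning of the model for edge and consumer hardware.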
Intended Use
Alignment & Refusal Research: Studying refusal behaviors and activation-level alignment modifications in multimodal reasoning systems.
Multimodal Reasoning Experiments: Evaluating explicit chain-of-thought and reasoning behavior across image, video, and text tasks.
Edge & Local AI Deployment: Running compact multimodal reasoning systems efficiently on consumer hardware and edge devices.
Research Prototyping: Experimentation with efficient multimodal transformer architectures and reasoning-focused alignment techniques.
Limitations & Risks
Important Note: This model intentionally reduces built-in refusal mechanisms.
Sensitive Output Possibility: The model may generate controversial, explicit, or unsafe responses depending on prompts and multimodal inputs.
User Responsibility: Outputs must be handled responsibly and within legal and ethical boundaries.
Reasoning Hallucinations: Explicit thinking and chain-of-thought style outputs may occasionally contain inaccurate or fabricated reasoning steps.
Deployment Considerations: While optimized for efficiency, high-resolution image and video inference workloads may still require substantial VRAM and optimized runtimes depending on task complexity.
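One common way to bound memory use for high-resolution inputs is to downscale images before passing them to the processor. The sketch below computes a target size that preserves aspect ratio; the helper name `fit_within` and the 1344-pixel limit in the usage note are hypothetical choices for illustration, not values documented for this model.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side is at most max_side.

    Images that already fit are returned unchanged; aspect ratio is preserved.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height, and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result can be applied as `image.resize(fit_within(*image.size, 1344))` before calling the processor. Whether a given limit affects answer quality depends on the task, so it is worth checking on a few representative inputs.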

