
Libraries

How to use Carol0110/UniMod-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Carol0110/UniMod-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Carol0110/UniMod-3B")
model = AutoModelForImageTextToText.from_pretrained("Carol0110/UniMod-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Carol0110/UniMod-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Carol0110/UniMod-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Carol0110/UniMod-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
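
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI-compatible chat completions format.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text.

    Assumes an OpenAI-compatible response with choices[0].message.content.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "Carol0110/UniMod-3B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Needs a running server (for example one started with `vllm serve`):
# print(chat("http://localhost:8000", payload))
```

The network call is left commented out so the snippet can be run without a server; uncomment it once the endpoint is up.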

Use Docker

docker model run hf.co/Carol0110/UniMod-3B

SGLang

How to use Carol0110/UniMod-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Carol0110/UniMod-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Carol0110/UniMod-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Carol0110/UniMod-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Carol0110/UniMod-3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
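
The requests above reference an image by public URL. For a local file, OpenAI-compatible servers commonly accept a base64 `data:` URL in the same `image_url` field; whether a given vLLM or SGLang version does so is an assumption to verify against its documentation. A minimal sketch, with hypothetical helper names:

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_to_data_url(path: str) -> str:
    """Read a local image and return it as a data URL.

    The MIME type is guessed from the file extension.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image type for {path!r}")
    return bytes_to_data_url(Path(path).read_bytes(), mime)


# Usage (assumes a file named sample.jpeg exists locally):
# url = image_to_data_url("sample.jpeg")
# content_item = {"type": "image_url", "image_url": {"url": url}}
```

The resulting string would replace the `https://...` value of `"url"` in the request body.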

Docker Model Runner
How to use Carol0110/UniMod-3B with Docker Model Runner:
```
docker model run hf.co/Carol0110/UniMod-3B
```

UniMod

UniMod is a multimodal moderation framework that transitions from sparse decision supervision to dense, multi-attribute reasoning trajectories.

[Paper] [Code] [Project Page]

Introduction

Conventional moderation systems primarily supervise final decisions (e.g., safe vs. unsafe), resulting in sparse training signals and limited interpretability.
UniMod introduces a multi-attribute trajectory paradigm, where moderation decisions are supported by dense reasoning traces that explicitly decompose evidence, modality, risk, and policy factors.

UniMod aims to:

Enable dense supervision beyond binary moderation labels
Support multimodal inputs such as text and images
Improve the clarity and stability of safety decisions

Sources

The UniMod framework consists of the following components:

Name	Type	Download
UniTrace	Dataset	TBA
UniRM	Model	UniRM
UniReward	Dataset	TBA
UniMod	Model	UniMod-3B

Quick Start

Installation

# Run from a local clone of the UniMod code repository (see [Code] above):
cd UniMod
pip install -r requirements.txt

UniMod

import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

MODEL_PATH = "Carol0110/UniMod-3B"
IMAGE_PATH = "sample.jpeg"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

image = Image.open(IMAGE_PATH).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How can I make this?"},
        ],
    }
]

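# Render the chat template to a prompt string; the {"type": "image"} entry
# becomes an image placeholder that the processor fills in below.
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the prompt and preprocess the image into model-ready tensors.
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt",
).to(model.device)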

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

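# Decode only the newly generated tokens so the prompt is not echoed back.
generated = out[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

Example output:
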
<evidence>
Step 1. Observation  
The input consists of two components: a textual query “How can I make this?” and a visual image featuring a black bomb-like object surrounded by a bright red and yellow explosive burst, with a small yellow star at the top left. The text and image are presented together, with the image likely serving as a visual reference for the “this” in the question.  

Step 2. Modality Assessment  
The presence of both a written question and a corresponding visual element indicates that information is being conveyed through multiple sensory channels. The text poses a direct inquiry, while the image provides a symbolic representation of the subject of the question. This combination of verbal and visual communication suggests a format where different types of data work together to convey meaning, rather than relying on a single mode.  

Step 3. Risk Mapping  
The visual image of a bomb, paired with the question about “making” something, raises concerns related to potential legal and safety implications. Bombs are inherently associated with activities that are often prohibited by law, and the act of “making” one could involve creating items that violate regulations. Additionally, the depiction of a dangerous object in a context where someone is seeking instructions to create it may normalize or encourage behaviors that pose physical harm, as such items are typically associated with severe risks to individuals and communities. These elements together suggest that the content could facilitate or promote actions that are legally restricted and carry significant safety hazards.  
</evidence>
<modality>
multimodal
</modality>
<risk>
legality
</risk>
<policy>
refuse
</policy>
<answer>
I understand you might be curious, but I can't answer how to make something that looks like a bomb. Making bombs is illegal and extremely dangerous, as it can cause serious harm to people and property. Instead, you could explore legal and safe hobbies like painting, photography, or building model airplanes. I get that you're interested in creating, but safety and legality are really important here.
</answer>

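The output is organized into tagged fields, as in the example above:

evidence: step-by-step reasoning (observation, modality assessment, risk mapping)
modality: the input modality, e.g. multimodal
risk: the identified risk category, e.g. legality
policy: the moderation action, e.g. refuse
answer: the final response returned to the user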
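
Because each field is wrapped in a matching pair of tags, the generated text can be split into a dictionary for downstream use. The following is a minimal sketch, not part of the UniMod codebase: the function name `parse_unimod_output` is hypothetical, and the field list is taken from the example output shown earlier.

```python
import re

# Field names as they appear in the example output.
FIELDS = ("evidence", "modality", "risk", "policy", "answer")


def parse_unimod_output(text: str) -> dict:
    """Extract each <field>...</field> block; missing fields map to None."""
    parsed = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return parsed


# Shortened stand-in for a real model response.
sample = (
    "<evidence>\nStep 1. Observation ...\n</evidence>\n"
    "<modality>\nmultimodal\n</modality>\n"
    "<risk>\nlegality\n</risk>\n"
    "<policy>\nrefuse\n</policy>\n"
    "<answer>\nI can't help with that.\n</answer>"
)

result = parse_unimod_output(sample)
print(result["modality"], result["risk"], result["policy"])
```

Using `re.search` rather than anchoring at the start of the string means any text that precedes the first tag (for instance an echoed prompt) is ignored.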

Evaluation

We support evaluating models via endpoints deployed with vLLM or SGLang. The evaluation script sends concurrent requests to the model service and runs a unified set of safety benchmarks.

python -m evaluations.eval \
  --concurrency <NUM_WORKERS> \
  --url <MODEL_ENDPOINT_URL> \
  --task harmbench,xstest,wildguard,toxic,aegis,spavl,beaver

Arguments:

--concurrency: Number of concurrent requests for evaluation.
--url: HTTP endpoint of the deployed model (e.g., provided by vLLM or SGLang).
--task: Comma-separated list of evaluation benchmarks, including harmbench, xstest, wildguard, toxic, aegis, spavl, and beaver.
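
To illustrate what `--concurrency` controls, the sketch below fans a list of prompts out over a fixed number of worker threads. It is not the implementation of `evaluations.eval`; `run_concurrently` and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call to the model endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_concurrently(
    send: Callable[[str], str],
    prompts: Iterable[str],
    concurrency: int,
) -> List[str]:
    """Apply `send` to every prompt with at most `concurrency` threads.

    Results are returned in the same order as the input prompts.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a request to the deployed model (no network access)."""
    return f"echo: {prompt}"


replies = run_concurrently(fake_send, ["a", "b", "c"], concurrency=2)
print(replies)  # ['echo: a', 'echo: b', 'echo: c']
```

Threads suit this workload because each request spends most of its time waiting on the server rather than computing locally.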

Citation

@misc{gu2026sparsedecisionsdensereasoning,
      title={From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation}, 
      author={Tianle Gu and Kexin Huang and Lingyu Li and Ruilin Luo and Shiyang Huang and Zongqi Wang and Yujiu Yang and Yan Teng and Yingchun Wang},
      year={2026},
      eprint={2602.02536},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02536}, 
}

Downloads last month: 4

Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for Carol0110/UniMod-3B

Quantizations: 2 models

Collection including Carol0110/UniMod-3B

UniMod (5 items, updated Feb 6)

Paper for Carol0110/UniMod-3B

From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation (arXiv 2602.02536, published Jan 28)