Instructions for using Cubex11/Solari with libraries and local apps.

Libraries

How to use Cubex11/Solari with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Cubex11/Solari")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Cubex11/Solari")
model = AutoModelForMultimodalLM.from_pretrained("Cubex11/Solari")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Cubex11/Solari with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Cubex11/Solari"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Cubex11/Solari

SGLang

How to use Cubex11/Solari with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Cubex11/Solari" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Cubex11/Solari" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cubex11/Solari",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Cubex11/Solari with Docker Model Runner:
```
docker model run hf.co/Cubex11/Solari
```


	---
	library_name: transformers
	tags:
	- smolvlm
	- vlm
	- dpo
	- hallucination-reduction
	- accessibility
	- qlora
	- rlaif
	license: apache-2.0
	language:
	- en
	pipeline_tag: image-text-to-text
	base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
	datasets:
	- HuggingFaceH4/rlaif-v_formatted
	---

	![Model Logo](thumbnail.png)

	# Solari: Hallucination-Reduced Vision Language Model

	Solari is a 500M-parameter vision-language model fine-tuned to reduce hallucination on real-world images. Built on [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct), Solari uses QLoRA + Direct Preference Optimization (DPO) on the [RLAIF-V](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) dataset to align the model toward more faithful visual descriptions.

	## Model Details

	### Model Description

	Solari targets hallucination reduction in vision-language tasks, with a focus on improving reliability for accessibility applications (e.g., assisting visually impaired users). The model was trained using parameter-efficient fine-tuning (QLoRA) with DPO to learn preferences between accurate and hallucinated image descriptions. It scores higher than the base model on the hallucination benchmarks reported below while largely preserving general VLM capabilities, with a regression on MME.

	- Developed by: Cubex11
	- Model type: Vision-Language Model (Image-Text-to-Text)
	- Language(s): English
	- License: Apache-2.0
	- Finetuned from: [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)

	### Model Sources

	- Base Model: [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
	- Training Dataset: [RLAIF-V (Formatted)](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) — 72K AI-generated preference pairs for hallucination reduction

	## Uses

	### Direct Use

	Solari can be used for image understanding tasks where factual accuracy is critical:

	- Describing real-world scenes for visually impaired users
	- Visual question answering with reduced hallucination
	- Image captioning with improved object recognition reliability

	### Out-of-Scope Use

	- Tasks requiring strong mathematical reasoning or code understanding (performance is degraded relative to the base model)
	- Non-English language tasks
	- Medical or safety-critical applications without additional validation

	## How to Get Started with the Model

	```python
	import torch
	from PIL import Image
	from transformers import AutoModelForImageTextToText, AutoProcessor

	model_id = "Cubex11/Solari"
	model = AutoModelForImageTextToText.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained(model_id)

	# Load a local image (replace the path with your own file)
	image = Image.open("your_image.jpg").convert("RGB")

	# Build a single-turn chat prompt with one image placeholder
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "Describe this image in detail."},
	        ],
	    }
	]

	# Render the chat template to text, then tokenize it together with the image
	text = processor.apply_chat_template(messages, add_generation_prompt=True)
	inputs = processor(text=text, images=[[image]], return_tensors="pt").to(model.device)
	output = model.generate(**inputs, max_new_tokens=256)
	# Drop the prompt tokens so that only the generated answer is decoded
	trimmed = output[0][len(inputs.input_ids[0]):]
	print(processor.decode(trimmed, skip_special_tokens=True))
	```

	## Training Details

	### Training Data

	[RLAIF-V (Formatted)](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) — a large-scale multimodal preference dataset containing ~72K preference pairs. Each sample includes an image, a prompt, a chosen response (more accurate), and a rejected response (more hallucinated). Preferences are generated by open-source AI models following the RLAIF-V methodology.

	### Training Procedure

	Method: QLoRA + Direct Preference Optimization (DPO)

	The base model was quantized to 4-bit (NF4) and fine-tuned using Low-Rank Adaptation (LoRA) with DPO to learn preferences between accurate and hallucinated responses.
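
To make the DPO objective concrete, here is a minimal pure-Python sketch of the per-pair loss. It is not the training code behind Solari (the card lists TRL's DPO trainer for that). The function name `dpo_loss` and its arguments are hypothetical, and the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trainable policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Hypothetical helper for illustration only; beta=0.1 matches the value
    reported in the hyperparameter table.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy favors the chosen (more accurate) answer more than the reference does,
# so the loss falls below log(2), its value when policy and reference agree.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss shrinks as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. `beta` controls how strongly that margin is weighted.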

	#### Training Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Training regime | bf16 mixed precision |
	| Quantization | 4-bit NF4 (double quantization) |
	| LoRA rank (r) | 16 |
	| LoRA alpha | 16 |
	| LoRA dropout | 0.1 |
	| DoRA | Enabled |
	| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
	| Trainable params | ~1.9% of total |
	| Learning rate | 5e-5 |
	| DPO beta | 0.1 |
	| Batch size | 8 (per device) |
	| Gradient accumulation | 4 (effective batch = 32) |
	| Epochs | 2 (best checkpoint at ~1 epoch / step 2500) |
	| Warmup ratio | 0.1 |
	| Optimizer | AdamW |
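
The batch-size and checkpoint figures in the table can be cross-checked with a little arithmetic. The sketch below assumes a single GPU (the card mentions one NVIDIA L4) and assumes that roughly all of the ~72K preference pairs were used for training. Both are assumptions, since the card does not state the exact train/validation split.

```python
per_device_batch = 8
grad_accum_steps = 4
num_devices = 1              # assumption: one L4 GPU
approx_train_pairs = 72_000  # assumption: "~72K" pairs, all used for training

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Optimizer steps needed to see every pair once
steps_per_epoch = approx_train_pairs // effective_batch

best_step = 2500
epochs_at_best = best_step / steps_per_epoch

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"best checkpoint at about {epochs_at_best:.2f} epochs")
```

Under these assumptions one epoch is about 2,250 optimizer steps, so step 2500 falls shortly after the first epoch, consistent with the "~1 epoch" note in the table.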

	#### Speeds, Sizes, Times

	- Training time: ~9 hours on NVIDIA L4 (24GB)
	- Best checkpoint: Step 2500 (selected by lowest validation loss)
	- Model size: ~1 GB (bf16 safetensors)

	## Evaluation

	### Testing Data, Factors & Metrics

	Evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) on 8 standard benchmarks covering hallucination, general VLM capability, and real-world understanding.

	#### Metrics

	- POPE: F1 score across random/popular/adversarial splits (object hallucination)
	- AMBER: Attribute, Existence, Relation accuracy (multi-dimensional hallucination)
	- HallusionBench: aAcc, fAcc, qAcc (hallucination detection)
	- A-OKVQA: Accuracy on outside-knowledge VQA
	- MME: Perception and Reasoning scores
	- MMStar: Multi-modal reasoning accuracy
	- MMBench: General multi-modal understanding
	- RealWorldQA: Real-world image understanding accuracy
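
POPE-style benchmarks ask yes/no questions about whether an object is present, so their scores reduce to binary classification metrics. The sketch below shows how precision, recall, F1, and accuracy follow from such answers. It is an illustration with a hypothetical helper name (`yes_no_metrics`) and toy data, not the VLMEvalKit implementation.

```python
def yes_no_metrics(predictions, labels):
    """Binary metrics where True means "yes, the object is present".

    Hypothetical helper for illustration; inputs are equal-length
    sequences of booleans.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)          # present, model said yes
    fp = sum(1 for p, y in pairs if p and not y)      # absent, model said yes (hallucination)
    fn = sum(1 for p, y in pairs if not p and y)      # present, model said no (miss)
    tn = sum(1 for p, y in pairs if not p and not y)  # absent, model said no

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy data: three questions about present objects, three about absent ones
labels = [True, True, True, False, False, False]
predictions = [True, True, False, True, False, False]
print(yes_no_metrics(predictions, labels))
```

Recall measures how often the model confirms objects that are really there, while precision penalizes "yes" answers for absent objects. A model can raise recall simply by answering "yes" more often, which is why F1 is reported alongside it.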

	### Results

	| Benchmark | Metric | Base Model | Solari | Change |
	|-----------|--------|------------|--------|--------|
	| POPE | Overall | 82.67 | 85.08 | +2.41 |
	| POPE | Recall | 76.73 | 85.33 | +8.60 |
	| AMBER | Avg ACC | 79.38 | 79.77 | +0.39 |
	| AMBER | Relation | 72.36 | 75.42 | +3.06 |
	| HallusionBench | Overall | 27.58 | 28.14 | +0.56 |
	| A-OKVQA | Overall | 68.12 | 69.00 | +0.88 |
	| MMStar | Overall | 38.33 | 39.60 | +1.27 |
	| MMBench | Test | 53.14 | 53.42 | +0.28 |
	| RealWorldQA | Overall | 49.80 | 50.59 | +0.78 |
	| MME | Perception | 1216.19 | 1118.51 | -97.68 |
	| MME | Reasoning | 237.50 | 211.79 | -25.71 |

	#### Summary

	Solari improves on 7 out of 8 benchmarks compared to the base model:

	- POPE recall rose by 8.60 points (76.73 to 85.33): the model more often recognizes objects that are actually present in images
	- All hallucination benchmarks improved: POPE, AMBER, and HallusionBench
	- General capabilities preserved or improved: A-OKVQA, MMStar, MMBench, and RealWorldQA all show small gains
	- Trade-off on MME: the perception score dropped ~98 points and the reasoning score ~26 points, with the largest subtask losses on counting (-26.7), position (-26.7), and code reasoning (-27.5), attributed to the model becoming more conservative
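
The "7 out of 8" count can be reproduced from the published scores. The sketch below copies the base and Solari numbers from the results table, recomputes each change, and counts a benchmark as improved only if every reported metric for it went up. The `summarize` helper and that counting rule are assumptions made for illustration.

```python
# (benchmark, metric, base score, Solari score), copied from the results table
RESULTS = [
    ("POPE", "Overall", 82.67, 85.08),
    ("POPE", "Recall", 76.73, 85.33),
    ("AMBER", "Avg ACC", 79.38, 79.77),
    ("AMBER", "Relation", 72.36, 75.42),
    ("HallusionBench", "Overall", 27.58, 28.14),
    ("A-OKVQA", "Overall", 68.12, 69.00),
    ("MMStar", "Overall", 38.33, 39.60),
    ("MMBench", "Test", 53.14, 53.42),
    ("RealWorldQA", "Overall", 49.80, 50.59),
    ("MME", "Perception", 1216.19, 1118.51),
    ("MME", "Reasoning", 237.50, 211.79),
]


def summarize(results):
    """Return per-metric deltas plus lists of improved and regressed benchmarks."""
    deltas = {}
    per_benchmark = {}
    for benchmark, metric, base, tuned in results:
        delta = round(tuned - base, 2)
        deltas[(benchmark, metric)] = delta
        per_benchmark.setdefault(benchmark, []).append(delta)
    # Assumed rule: a benchmark counts as improved only if all its metrics rose
    improved = [b for b, ds in per_benchmark.items() if all(d > 0 for d in ds)]
    regressed = [b for b in per_benchmark if b not in improved]
    return deltas, improved, regressed


deltas, improved, regressed = summarize(RESULTS)
print(f"{len(improved)} of {len(improved) + len(regressed)} benchmarks improved")
print(f"regressed: {regressed}")
```

Deltas recomputed from the rounded table values can differ from the published Change column by 0.01 (RealWorldQA comes out as 0.79 here versus the published +0.78), presumably because the published changes were computed from unrounded scores.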

	## Bias, Risks, and Limitations

	- Counting and spatial reasoning degraded: The DPO alignment made the model more conservative, reducing performance on fine-grained counting and positional reasoning tasks (reflected in MME scores).
	- Small model capacity: At 500M parameters, the model has inherent limitations on complex reasoning tasks.
	- English only: The model was trained and evaluated only on English-language tasks.
	- Training data bias: RLAIF-V preferences are AI-generated, which may introduce systematic biases.

	### Recommendations

	- Best suited for binary object recognition tasks ("Is there an X?") and general scene description
	- For tasks requiring precise counting or spatial reasoning, consider using the base model or a larger VLM
	- Always validate outputs in safety-critical applications
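
For the binary object-recognition use case recommended above, a prompt can be built in the same message format as the getting-started snippet, and the reply reduced to a yes/no decision. Both helpers below (`object_presence_messages` and `parse_yes_no`) are hypothetical names introduced for illustration. The exact prompt wording is an assumption and has not been evaluated against the model.

```python
def object_presence_messages(object_name):
    """Build a single-turn chat message asking whether an object is in the image."""
    article = "an" if object_name[:1].lower() in ("a", "e", "i", "o", "u") else "a"
    question = f"Is there {article} {object_name} in this image? Answer yes or no."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


def parse_yes_no(reply):
    """Map a free-text reply to True, False, or None when it is ambiguous."""
    words = reply.strip().lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # leave ambiguous replies for a human or a fallback to review


messages = object_presence_messages("umbrella")
print(messages[0]["content"][1]["text"])
print(parse_yes_no("Yes, there is an umbrella."))
```

Returning `None` for anything that does not start with "yes" or "no" keeps ambiguous replies visible instead of silently treating them as a negative, which matters for the validation step recommended in safety-critical settings.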

	## Environmental Impact

	- Hardware Type: NVIDIA L4 (24GB)
	- Hours used: ~9 hours
	- Cloud Provider: Lightning AI
	- Compute Region: US

	## Technical Specifications

	### Model Architecture and Objective

	- Architecture: SmolVLM2 (ViT vision encoder + LLM decoder with multi-modal projector)
	- Parameters: ~500M total
	- Objective: Direct Preference Optimization (DPO) — learns to prefer accurate descriptions over hallucinated ones

	### Compute Infrastructure

	#### Hardware

	NVIDIA L4 GPU (24GB VRAM) on Lightning AI

	#### Software

	- Transformers
	- TRL (DPO Trainer)
	- PEFT (QLoRA)
	- BitsAndBytes (4-bit quantization)

	## Citation

	BibTeX:

	```bibtex
	@misc{solari2026,
	title={Solari: Hallucination-Reduced Vision Language Model via QLoRA DPO on RLAIF-V},
	author={Cubex11},
	year={2026},
	url={https://huggingface.co/Cubex11/Solari}
	}
	```

	## Acknowledgments

	- [Hugging Face](https://huggingface.co/) for SmolVLM2 and the RLAIF-V formatted dataset
	- [OpenBMB](https://github.com/OpenBMB) for the RLAIF-V and RLHF-V research
	- [Lightning AI](https://lightning.ai/) for compute resources
	- [OpenCompass](https://github.com/open-compass/VLMEvalKit) for the VLMEvalKit evaluation toolkit