LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning
LaViT is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.
Overview
LaViT (Latent Visual Thoughts) addresses a critical Perception Gap in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.
To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
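The card does not spell out how the curriculum sensory gating schedule works. As a purely illustrative sketch (the function names and the cosine schedule are our assumptions, not the paper's formulation), a gate that starts fully open and gradually closes, forcing the student to rely on its own reconstructed visual thoughts, could look like:

```python
import math

def sensory_gate(step: int, total_steps: int) -> float:
    """Curriculum gate: 1.0 at the start of training (teacher visual signal
    fully available) decaying to 0.0 on a cosine schedule (student must rely
    on its own latent visual thoughts). Illustrative only."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def gated_mix(teacher_feat, student_feat, step, total_steps):
    """Blend teacher and student visual features under the current gate."""
    g = sensory_gate(step, total_steps)
    return [g * t + (1.0 - g) * s for t, s in zip(teacher_feat, student_feat)]
```

Closing the gate over training is one common way to prevent the shortcut of copying the teacher's signal verbatim; the actual mechanism in LaViT may differ.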
Key Features
- Visual Grounding: Significantly enhanced visual grounding capabilities
- Multi-modal Reasoning: Improved performance on complex reasoning tasks
- Efficient: Compact 3B model that outperforms larger open-source variants
- State-of-the-art: Achieves up to +16.9% gains on complex reasoning tasks
Paper
Title: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
Paper Link: arXiv:2601.10129
Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
Usage
Installation

```bash
pip install transformers torch pillow
```
Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor. The base model is Qwen2.5-VL, a vision-language
# model, so the image-text-to-text auto class is used rather than
# AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained("Svard/LaViT-3B")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare prompt
prompt = "What is in this image?"

# Process inputs
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Generate, then decode only the newly generated tokens
# (outputs include the prompt tokens as a prefix)
outputs = model.generate(**inputs, max_new_tokens=512)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(new_tokens, skip_special_tokens=True)
print(response)
```
Advanced Usage with Visual Reasoning
For tasks requiring visual reasoning, you can use the <lvr> (Latent Visual Reasoning) tokens:

```python
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
```
Performance
LaViT-3B achieves significant improvements on various benchmarks:
- MMVP: Enhanced performance on multi-modal visual perception tasks
- BLINK: Improved results on visual reasoning benchmarks
- Visual Grounding: Up to +16.9% gains on complex reasoning tasks
Model Architecture
- Base Model: Qwen2.5-VL-3B-Instruct
- Parameters: 3B
- Training Method: Visual thought trajectory supervision
- Key Innovation: Latent visual thought alignment with curriculum sensory gating
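To make "visual thought trajectory supervision" concrete, here is a deliberately simplified toy objective (our construction, not the paper's loss): a mean-squared-error alignment between the student's and teacher's attention maps at each reasoning step. The real objective operates on latent visual tokens inside the model; this sketch only conveys the trajectory-matching idea.

```python
def attention_alignment_loss(student_traj, teacher_traj):
    """Toy trajectory-alignment objective: mean squared error between the
    student's and teacher's attention maps, averaged over every cell of
    every step in the trajectory. Simplified stand-in for the paper's
    visual-thought supervision."""
    assert len(student_traj) == len(teacher_traj), "trajectories must align step-for-step"
    total, count = 0.0, 0
    for s_map, t_map in zip(student_traj, teacher_traj):
        for s, t in zip(s_map, t_map):
            total += (s - t) ** 2
            count += 1
    return total / count
```

A loss of zero means the student attends exactly where the teacher does at every step, which is the grounding property the Perception Gap analysis says plain text distillation fails to deliver.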
Citation
If you find this model useful in your research, please cite:

```bibtex
@misc{wu2026lavitaligninglatentvisual,
  title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
  author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
  year={2026},
  eprint={2601.10129},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.10129},
}
```
License
This model is licensed under the Apache-2.0 License.
Acknowledgments
This model is built upon Qwen2.5-VL and inspired by the LVR (Latent Visual Reasoning) framework. We thank the open-source community for their valuable contributions.
Related Links
- Paper: arXiv:2601.10129
- Code Repository: GitHub
- Base Model: Qwen2.5-VL-3B-Instruct