[Urgent Suggestion] Complete Deployment Guide: SDPA Patch (2x speed), 4-bit Fix for 8GB GPUs & Visual Examples

#6
by NodeLinker - opened

Hello Tencent Team,

Youtu-VL-4B is a powerful SOTA model, but the current Hugging Face README and codebase are missing critical elements for community adoption. Based on extensive reverse-engineering and benchmarking on an RTX 5060 Ti 16GB, here is a comprehensive guide along with a list of requested improvements for the official repository.

1. πŸ›  Strict Environment Requirements (Pre-built Path)

To ensure the model runs out-of-the-box on Linux (Python 3.12), use this exact configuration:

# Recommended for RTX 30xx, 40xx, and Blackwell
pip install "transformers>=4.56.0,<=4.57.1" torch==2.9.0 accelerate pillow torchvision git+https://github.com/lucasb-eyer/pydensecrf.git opencv-python-headless 
pip install flash-attn==2.8.3 --no-build-isolation

Note: Ensure your system CUDA version is compatible with torch 2.9.0 (CUDA 12.8 or earlier).
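As a quick sanity check before loading the model, the pinned transformers range can be verified programmatically. A minimal sketch (in practice the version string would come from importlib.metadata.version("transformers"); the helper name is ours):

```python
def in_pinned_range(version: str, low: str = "4.56.0", high: str = "4.57.1") -> bool:
    """Check that a dotted version string falls inside the pinned transformers range."""
    def parse(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))
    return parse(low) <= parse(version) <= parse(high)

print(in_pinned_range("4.57.1"))  # True
print(in_pinned_range("4.58.0"))  # False
```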


2. ⚑ Performance Benchmarks (The SDPA Impact)

The current implementation falls back to eager attention on many setups, which is unusably slow. Our tests show that adding SDPA support doubles throughput on older architectures and provides a stable fallback for all users.

| Implementation | Precision | Speed (tokens/s) | VRAM (peak) | Note |
| --- | --- | --- | --- | --- |
| Eager | BF16 | 1.95 | ~13.0 GB | Very slow default. |
| SDPA (patched) | BF16 | 3.89 | ~11.7 GB | 2x speedup; critical for non-Ampere GPUs. |
| Flash Attention 2 | Mixed 4-bit | 8.69 | ~5.9 GB | Optimal setup; 4.5x faster than default. |
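For reference, the speedup factors quoted in the table are plain ratios over the eager baseline, using the measured numbers above:

```python
# Measured throughput in tokens/s (from the table above)
baseline = 1.95   # Eager, BF16
sdpa = 3.89       # SDPA (patched), BF16
fa2_4bit = 8.69   # Flash Attention 2 + mixed 4-bit

print(round(sdpa / baseline, 2))      # 1.99 -> ~2x
print(round(fa2_4bit / baseline, 2))  # 4.46 -> ~4.5x
```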

3. 🩹 The SDPA Patch (Mandatory for pre-RTX-30-series GPUs & Stability)

The official code hardcodes a check for flash_attention_2 and throws KeyError: 'sdpa' when any other attention implementation is requested. We propose adding a native SDPA class to modeling_siglip2.py to enable support for any 8GB+ GPU.

Action: Inject this class into modeling_siglip2.py:

class Vision_SDPAAttention(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = dim // self.num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.dropout = getattr(config, "attention_dropout", 0.0)

    def forward(self, hidden_states, cu_seqlens, rotary_pos_emb=None, position_embeddings=None):
        seq_length = hidden_states.shape[0]
        # Project and split into heads: (seq, num_heads, head_dim)
        q = self.q_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        k = self.k_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        v = self.v_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        if position_embeddings is None:
            emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
            cos, sin = emb.cos(), emb.sin()
        else:
            cos, sin = position_embeddings
        q, k = apply_rotary_pos_emb_vision(q, k, cos, sin)
        # Block-diagonal additive mask: tokens attend only within their own segment
        mask = torch.full(
            [1, 1, seq_length, seq_length],
            torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype,
        )
        for i in range(1, len(cu_seqlens)):
            mask[..., cu_seqlens[i - 1]:cu_seqlens[i], cu_seqlens[i - 1]:cu_seqlens[i]] = 0
        # Reshape to (1, num_heads, seq, head_dim) for SDPA
        q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        attn_output = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        attn_output = attn_output.squeeze(0).transpose(0, 1).reshape(seq_length, -1)
        return self.out_proj(attn_output.to(hidden_states.dtype)), None

Then update VISION_ATTENTION_CLASSES to include the entry 'sdpa': Vision_SDPAAttention.
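The core of the patch is the block-diagonal mask built from cu_seqlens: each token may only attend within its own image segment. The same logic in isolation, as a plain-Python boolean sketch of the additive mask (the helper name is ours; cu_seqlens holds cumulative segment boundaries):

```python
def block_diagonal_allowed(cu_seqlens):
    """Boolean (seq x seq) matrix: True where attention is permitted.

    Mirrors the additive mask in the SDPA patch, where allowed positions
    get 0 and all other positions get the dtype minimum.
    """
    seq_len = cu_seqlens[-1]
    allowed = [[False] * seq_len for _ in range(seq_len)]
    for i in range(1, len(cu_seqlens)):
        start, end = cu_seqlens[i - 1], cu_seqlens[i]
        for r in range(start, end):
            for c in range(start, end):
                allowed[r][c] = True
    return allowed

# Two segments of lengths 2 and 3: tokens 0-1 and 2-4 attend only internally
mask = block_diagonal_allowed([0, 2, 5])
print(mask[0][1], mask[0][2])  # True False
```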


4. πŸ“‰ Working 4-bit Quantization (Preventing "Blindness")

Standard 4-bit quantization breaks the vision tower. To run this model on 6-8 GB VRAM cards without losing visual capabilities, use this mixed-precision config:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # CRITICAL: Exclude Vision Tower and Heads to keep model "seeing"
    llm_int8_skip_modules=["siglip2", "merger", "embed_tokens", "lm_head"]
)
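With that config, loading looks roughly like this. A sketch under assumptions: the repo id and the AutoModelForCausalLM class are placeholders (use whatever loader the model card specifies), and trust_remote_code is needed for the custom modeling files:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["siglip2", "merger", "embed_tokens", "lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/Youtu-VL-4B",  # placeholder repo id -- check the model card
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # or "sdpa" with the patch from section 3
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("tencent/Youtu-VL-4B", trust_remote_code=True)
```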

5. 🎨 Requested README Improvements

To make Youtu-VL more discoverable and user-friendly, please update the Hugging Face page:

  • ## 🎨 Examples Section: Add a visual gallery showing:
    • Object Detection (image with boxes).
    • Referring Segmentation (image with masks).
    • Pose Estimation.
    • Show these at the top of the README. People need to see the "Vision-Centric" power immediately.
  • Integrated Demo: Don't force users to go to GitHub for basic tasks. Include a collapsible section with the Jupyter/Demo prompts or a direct link to the /demo folder.
  • Machine-Readable Benchmarks: Please provide benchmarks as Markdown Tables, not just PNG images. This is essential for AI agents, scrapers, and accessibility.
  • Radar Charts: A radar chart comparing Youtu-VL 4B with PaliGemma or Qwen-VL would visually emphasize your SOTA performance in vision tasks.

Please refer to the community feedback in Discussion #5 and the Reddit thread in r/LocalLLaMA (40k+ views). We want this model to succeed, but it needs a production-ready README.

We have received and acknowledged your Comment 3 and Comment 5.

Regarding Comments 1, 2, and 4: since different users run different GPU architectures, we are still evaluating the best way to integrate these suggestions and are conducting further testing.

Thank you very much for your constructive feedback!

You are amazing my good friend!
