[Urgent Suggestion] Complete Deployment Guide: SDPA Patch (2x speed), 4-bit Fix for 8GB GPUs & Visual Examples

#6
by NodeLinker - opened

Hello Tencent Team,

Youtu-VL-4B is a powerful SOTA model, but the current Hugging Face README and codebase are missing critical elements for community adoption. Based on extensive reverse-engineering and benchmarking on an RTX 5060 Ti 16GB, here is a comprehensive guide along with a list of requested improvements for the official repository.

1. πŸ›  Strict Environment Requirements (Pre-built Path)

To ensure the model runs out-of-the-box on Linux (Python 3.12), use this exact configuration:

# Recommended for RTX 30xx, 40xx, and Blackwell
pip install "transformers>=4.56.0,<=4.57.1" torch==2.9.0 accelerate pillow torchvision git+https://github.com/lucasb-eyer/pydensecrf.git opencv-python-headless 
pip install flash-attn==2.8.3 --no-build-isolation

Note: Ensure your system CUDA version is compatible with torch 2.9.0 (CUDA 12.8 or earlier).
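As a quick sanity check before loading the model, the pinned transformers range can be verified programmatically. A minimal sketch (in practice the version string would come from importlib.metadata.version("transformers"); the helper name is ours):

```python
def in_pinned_range(version: str, low: str = "4.56.0", high: str = "4.57.1") -> bool:
    """Check that a dotted version string falls inside the pinned transformers range."""
    def parse(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))
    return parse(low) <= parse(version) <= parse(high)

print(in_pinned_range("4.57.1"))  # True
print(in_pinned_range("4.58.0"))  # False
```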


2. ⚑ Performance Benchmarks (The SDPA Impact)

The current implementation falls back to eager attention on many setups, which is unusably slow. Our tests show that adding SDPA support doubles throughput on older architectures and provides a stable fallback for all users.

| Implementation | Precision | Speed (tokens/s) | VRAM (peak) | Note |
| --- | --- | --- | --- | --- |
| Eager | BF16 | 1.95 | ~13.0 GB | Very slow default. |
| SDPA (patched) | BF16 | 3.89 | ~11.7 GB | 2x speedup; critical for non-Ampere GPUs. |
| Flash Attention 2 | Mixed 4-bit | 8.69 | ~5.9 GB | Optimal setup; 4.5x faster than default. |
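For reference, the speedup factors quoted in the table are plain ratios over the eager baseline, using the measured numbers above:

```python
# Measured throughput in tokens/s (from the table above)
baseline = 1.95   # Eager, BF16
sdpa = 3.89       # SDPA (patched), BF16
fa2_4bit = 8.69   # Flash Attention 2 + mixed 4-bit

print(round(sdpa / baseline, 2))      # 1.99 -> ~2x
print(round(fa2_4bit / baseline, 2))  # 4.46 -> ~4.5x
```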

3. 🩹 The SDPA Patch (Mandatory for pre-RTX-30-series GPUs & Stability)

The official code hardcodes a check for flash_attention_2 and throws KeyError: 'sdpa' when any other attention implementation is requested. We propose adding a native SDPA class to modeling_siglip2.py to enable support for any 8GB+ GPU.

Action: Inject this class into modeling_siglip2.py:

class Vision_SDPAAttention(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = dim // self.num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.dropout = getattr(config, "attention_dropout", 0.0)

    def forward(self, hidden_states, cu_seqlens, rotary_pos_emb=None, position_embeddings=None):
        seq_length = hidden_states.shape[0]
        # Project and split into heads: (seq, num_heads, head_dim)
        q = self.q_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        k = self.k_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        v = self.v_proj(hidden_states).view(seq_length, self.num_heads, self.head_dim)
        if position_embeddings is None:
            emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
            cos, sin = emb.cos(), emb.sin()
        else:
            cos, sin = position_embeddings
        q, k = apply_rotary_pos_emb_vision(q, k, cos, sin)
        # Block-diagonal additive mask: tokens attend only within their own segment
        mask = torch.full(
            [1, 1, seq_length, seq_length],
            torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype,
        )
        for i in range(1, len(cu_seqlens)):
            mask[..., cu_seqlens[i - 1]:cu_seqlens[i], cu_seqlens[i - 1]:cu_seqlens[i]] = 0
        # Reshape to (1, num_heads, seq, head_dim) for SDPA
        q, k, v = (t.transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        attn_output = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        attn_output = attn_output.squeeze(0).transpose(0, 1).reshape(seq_length, -1)
        return self.out_proj(attn_output.to(hidden_states.dtype)), None

Then update VISION_ATTENTION_CLASSES to include the entry 'sdpa': Vision_SDPAAttention.
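The core of the patch is the block-diagonal mask built from cu_seqlens: each token may only attend within its own image segment. The same logic in isolation, as a plain-Python boolean sketch of the additive mask (the helper name is ours; cu_seqlens holds cumulative segment boundaries):

```python
def block_diagonal_allowed(cu_seqlens):
    """Boolean (seq x seq) matrix: True where attention is permitted.

    Mirrors the additive mask in the SDPA patch, where allowed positions
    get 0 and all other positions get the dtype minimum.
    """
    seq_len = cu_seqlens[-1]
    allowed = [[False] * seq_len for _ in range(seq_len)]
    for i in range(1, len(cu_seqlens)):
        start, end = cu_seqlens[i - 1], cu_seqlens[i]
        for r in range(start, end):
            for c in range(start, end):
                allowed[r][c] = True
    return allowed

# Two segments of lengths 2 and 3: tokens 0-1 and 2-4 attend only internally
mask = block_diagonal_allowed([0, 2, 5])
print(mask[0][1], mask[0][2])  # True False
```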


4. πŸ“‰ Working 4-bit Quantization (Preventing "Blindness")

Standard 4-bit quantization breaks the vision tower. To run this model on 6-8 GB VRAM cards without losing visual capabilities, use this mixed-precision config:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # CRITICAL: Exclude Vision Tower and Heads to keep model "seeing"
    llm_int8_skip_modules=["siglip2", "merger", "embed_tokens", "lm_head"]
)
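With that config, loading looks roughly like this. A sketch under assumptions: the repo id and the AutoModelForCausalLM class are placeholders (use whatever loader the model card specifies), and trust_remote_code is needed for the custom modeling files:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["siglip2", "merger", "embed_tokens", "lm_head"],
)

model = AutoModelForCausalLM.from_pretrained(
    "tencent/Youtu-VL-4B",  # placeholder repo id -- check the model card
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # or "sdpa" with the patch from section 3
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("tencent/Youtu-VL-4B", trust_remote_code=True)
```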

5. 🎨 Requested README Improvements

To make Youtu-VL more discoverable and user-friendly, please update the Hugging Face page:

  • ## 🎨 Examples Section: Add a visual gallery showing:
    • Object Detection (image with boxes).
    • Referring Segmentation (image with masks).
    • Pose Estimation.
    • Show these at the top of the README. People need to see the "Vision-Centric" power immediately.
  • Integrated Demo: Don't force users to go to GitHub for basic tasks. Include a collapsible section with the Jupyter/Demo prompts or a direct link to the /demo folder.
  • Machine-Readable Benchmarks: Please provide benchmarks as Markdown Tables, not just PNG images. This is essential for AI agents, scrapers, and accessibility.
  • Radar Charts: A radar chart comparing Youtu-VL 4B with PaliGemma or Qwen-VL would visually emphasize your SOTA performance in vision tasks.

Please refer to the community feedback in Discussion #5 and the Reddit thread in r/LocalLLaMA (40k+ views). We want this model to succeed, but it needs a production-ready README.

We have received and acknowledged your Comment 3 and Comment 5.

Regarding Comments 1, 2, and 4: since different users run different GPU architectures, we are still evaluating the best way to integrate these suggestions and are conducting further testing.

Thank you very much for your constructive feedback!

You are amazing my good friend!
