De-mystifying Multimodal Learning: The Hidden Inefficiency in Vision Language Modelling

Community Article Published March 4, 2026

Introduction

In the shift from text-only models to Vision Language Models (VLMs), we often talk about "parameters" and "emergent reasoning." However, there is a hidden currency that governs the performance, cost, and feasibility of these systems: Visual Tokens (VT).

Figure 1: Visual Token count vs. image resolution. We report the Visual Token count as a function of image resolution (in pixels) for four models: LLaVA-1.5, LLaVA-OneVision, Qwen3-VL, and Gemma3. The estimates assume a Spatial Merge Size of 2, Any-Resolution with 3x3 windows, and Spatial Average Pooling of size 4. For the P&S method we assume the number of crops increases as a step function, starting from the same Any-Resolution configuration.*

In our previous blogpost, we explored the architectural anatomy of VLMs and how images are converted into language-compatible vectors. In this second installment of De-mystifying Multimodal Learning, we focus on the mathematics and operational impact of that conversion. Specifically, we present a practical guide to calculating Visual Token counts across different SOTA strategies, from Qwen's dynamic merging (QwenTeam, 2025) and LLaVA's Any-Res grids (Li et al., 2024a) to Gemma3's Pan&Scan (Gemma-Team, 2025), without running a single line of inference.

Understanding the computational overhead of these tokens is no longer just an academic exerciseβ€”it is a production necessity.

* Please note that for LLaVA-OneVision and Gemma3 P&S the total number of visual tokens depends highly on the number of crops.

Calculating #Visual Tokens

As discussed in our previous blogpost, VTs are the fundamental units that allow LLMs to perceive visual data. Now that we understand the "what", we must address the "how much", answering the following:

How many Visual Tokens do VLMs produce given an image input size?

Original Recipe

Within first-generation VLM architectures like LLaVA-1.5 (Liu et al., 2023), this estimation is straightforward. These models relied on a Vision Encoder with a fixed input resolution and patch size ($PS$). Mathematically, let $H$ and $W$ be the original image's height and width, and $X$ and $Y$ the resized dimensions of the Vision Processor. In LLaVA-1.5, whatever the input resolution, the picture is always re-scaled to $X \times Y$. This means the final number of VTs, i.e. dimension $V$ from our previous blogpost, is:

$$V_{\text{LLaVA-1.5}} = \frac{X}{PS} \times \frac{Y}{PS}$$

The Resolution Trap

This comes with several problems:

  1. Image resolutions are completely disregarded. Producing the same number of tokens for a 336Β² image and a 1024Β² image does not make sense.
  2. Beyond not making sense, it also does not work well in practice. For OCR, visual compositional reasoning, and small object detection, task accuracy is particularly low (Yuksekgonul et al., 2023; Nulli et al., 2024; Tong et al., 2024; Nulli et al., 2025).

The simple solution of building vision encoders with higher native resolution support is also not feasible. Tripling the resizing dimension of the processor, $X \times Y: 336\times336 \rightarrow 1024\times1024$, would result in almost 10x the number of VTs, making this approach prohibitive.
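To make the trap concrete, here is a quick back-of-the-envelope calculation (assuming a patch size of 14, as in the CLIP ViT-L/14 encoder used by LLaVA-1.5; the function name is ours):

```python
# Visual token count for a fixed-resolution encoder: V = (X/PS) * (Y/PS)
PS = 14  # patch size of CLIP ViT-L/14 (LLaVA-1.5's vision encoder)

def fixed_res_tokens(x: int, y: int, ps: int = PS) -> int:
    """Token count when the image is squashed to x*y pixels."""
    return (x // ps) * (y // ps)

v_336 = fixed_res_tokens(336, 336)     # 24 * 24 = 576 tokens
v_1024 = fixed_res_tokens(1024, 1024)  # 73 * 73 = 5329 tokens (whole patches only)
print(v_336, v_1024, round(v_1024 / v_336, 1))  # 576 5329 9.3
```

Roughly a 9.3x blow-up in tokens (and far more in attention FLOPs, which grow quadratically with sequence length).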

Modern Approaches

Let's now look at newer approaches from Qwen2.5/3/3.5-VL, LLaVA-OneVision, and Gemma3 that overcome these issues.

Strategy A: The Dynamic Merger
We have to start with the game-changers: the Qwen2.5/3/3.5-VL series (Bai et al., 2025; QwenTeam, 2025). These models ditched the "fixed resolution" rule entirely. Instead of squashing every image into a square, they process images at their native resolution. This sounds great, but it complicates our math: if the image size varies, so does the token count. To calculate it, we need a specific value from the model's config.json called the Spatial Merge Size ($SMS$). Think of $SMS$ as a compression factor: it tells the model how many raw image patches to pool together into one VT. With this in mind, our formula becomes a bit more dynamic:

$$V_{\text{Qwen3}} = \frac{H}{PS \cdot SMS} \times \frac{W}{PS \cdot SMS}$$

Upside: perfect aspect ratios without distortion.
Downside: large images (or several of them) can silently eat up your context window much faster than you expect.
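A minimal sketch of this estimate (assuming each side is padded up to a whole multiple of $PS \cdot SMS$; the real Qwen processors additionally clamp images to a min/max pixel budget, which we ignore here):

```python
import math

def qwen_style_tokens(h: int, w: int, ps: int = 14, sms: int = 2) -> int:
    """Estimate visual tokens for a native-resolution encoder with spatial merging.

    ps:  vision encoder patch size
    sms: spatial merge size from config.json (pools sms*sms patches into 1 VT)
    Rounding up assumes the processor pads each side to a multiple of ps*sms.
    """
    unit = ps * sms
    return math.ceil(h / unit) * math.ceil(w / unit)

# A 1344x1008 image: (1344/28) * (1008/28) = 48 * 36 = 1728 tokens
print(qwen_style_tokens(1344, 1008))  # 1728
```

Note how the count grows quadratically with resolution: doubling both sides quadruples the token count.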

Strategy B: The Multi-Grid / AnyRes
Around the same period, LLaVA-NeXT/OneVision (Liu et al., 2024; Li et al., 2024a) came up with a clever, yet expensive encoding technique called "Dynamic High Resolution" / "Any Resolution". Depicted in Figure 2, it consists of splitting the image into $k \times k$ grids, with $k \in \{1, \dots, 3\}$, before the vision encoding. This means repeating the encoding process $(k \times k) + 1$ times, with the extra 1 being the picture in its entirety.

Figure 2: Illustration of Dynamic High Resolution on a 2x2 grid, from the LLaVA-NeXT paper.

Although this yields higher-detail understanding, since the encoder can focus entirely on smaller portions of the image, it crucially implies an enormous increase in Visual Token count. Given the calculations in the original recipe, we have

$$V_{\text{LLaVA-OneVision}} = V_{\text{LLaVA-1.5}} \times [(k \times k) + 1]$$

Upside: good quality for high-resolution inputs.
Downside: massive increase in token count. Prohibitive for very high-resolution multi-image settings.
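The multiplier is easy to verify in code (a sketch; the function name is ours, and we use the 576-token base count of a 336x336 CLIP ViT-L/14 encoder):

```python
def anyres_tokens(base_tokens: int, k: int) -> int:
    """AnyRes / Dynamic High Resolution: k*k grid crops plus one overview image.

    base_tokens: tokens produced by the encoder for a single crop
                 (e.g. 576 for a 336x336 image with patch size 14).
    """
    return base_tokens * (k * k + 1)

# A 3x3 grid on a 576-token base encoder:
print(anyres_tokens(576, 3))  # 576 * 10 = 5760 tokens
```

So a single high-resolution image can cost an order of magnitude more than the fixed-resolution baseline, which is why this strategy becomes prohibitive in multi-image settings.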

Strategy C: Pan&Scan and Fixed Downsampler
The Gemma3 family of models (Gemma-Team, 2025), the most recent open-source VLM from Google DeepMind, also employs a Vision Encoder with a fixed input size, SigLIP (Zhai et al., 2024). Refer to this blogpost for a nice architectural overview of Gemma3.

To handle higher-resolution images without using prohibitive amounts of VTs, Gemma3 increases $X = Y \rightarrow 896$ while applying spatial average pooling. This helps reduce the total number of visual tokens, allowing the vision encoder to operate at a higher resolution and only later heavily compressing the information. Thanks to the pooling, this yields a fixed number of visual tokens, which corresponds to

$$V_{\text{Gemma3}} = \frac{X}{PS \cdot \text{pooling}} \times \frac{Y}{PS \cdot \text{pooling}} = \frac{896}{14 \cdot 4} \times \frac{896}{14 \cdot 4} = 256$$

with the pooling being applied within the modality connector.
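The fixed count can be checked in a couple of lines (the function name is ours):

```python
def gemma3_tokens(side: int = 896, ps: int = 14, pooling: int = 4) -> int:
    """Fixed token count: resize to side*side, patchify, then average-pool.

    896 / (14 * 4) = 16 merged patches per side -> 16 * 16 = 256 tokens,
    regardless of the input image's original resolution.
    """
    return (side // (ps * pooling)) ** 2

print(gemma3_tokens())  # 256
```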

A built-in alternative strategy for handling high-quality inputs is Pan&Scan (P&S). Similar to Strategy B, it adaptively segments the image into different parts and encodes them separately. The main differences between Strategy B and P&S are:
(a.) the latter can be turned on/off by the user at inference time and is more customisable (see code below),
(b.) P&S can have overlapping crops, with $p = \text{number of crops}$.**

$$V_{\text{Gemma3-P\&S}} = V_{\text{Gemma3}} \times [p + 1]$$

```python
class Gemma3ProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: Gemma3ImagesKwargs
    _defaults = {
        "text_kwargs": {
            "padding": False,
        },
        "images_kwargs": {
            "do_pan_and_scan": False,
            "pan_and_scan_min_crop_size": 256,
            "pan_and_scan_max_num_crops": 4,
            "pan_and_scan_min_ratio_to_activate": 1.2,
        },
    }
```
Code 1: Overview of Gemma3 Pan&Scan customizable parameters from source code.

Upside: adaptive handling of high resolution input with lower token count.
Downside: inconvenient for evaluation settings with alternating high-low input resolutions.
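Combining the P&S formula with the crop cap from Code 1 gives a simple worst-case estimator (a sketch; the function name is ours):

```python
def gemma3_pas_tokens(num_crops: int, base_tokens: int = 256,
                      max_num_crops: int = 4) -> int:
    """Pan & Scan: each crop is encoded like a full image (256 tokens each),
    plus one pass over the whole picture.

    num_crops is capped by pan_and_scan_max_num_crops (default 4 in the
    Gemma3 processor), so the worst case here is 256 * (4 + 1) = 1280 tokens.
    """
    p = min(num_crops, max_num_crops)
    return base_tokens * (p + 1)

print(gemma3_pas_tokens(4))  # 1280
```

Even in the worst case, this stays well below the 5760 tokens of a 3x3 AnyRes grid, which is the point of the downsample-then-crop design.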

** See this blogpost for more on Pan&Scan.

A small sidenote: the total visual token count should also take special tokens into account. Used across all strategies, these simply signal the beginning and end of visual content and are added for every picture.
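Putting the formulas together, we can compare all four strategies on one example image. The two extra tokens per image stand in for the begin/end-of-image special tokens; the exact count varies per model and is an assumption here, as are the function and key names:

```python
import math

SPECIAL = 2  # illustrative begin/end-of-image tokens; model-dependent in practice

def totals(h: int, w: int) -> dict:
    """Estimated visual tokens for an h*w image under each strategy."""
    fixed = (336 // 14) ** 2                        # LLaVA-1.5: always 576
    merged = math.ceil(h / 28) * math.ceil(w / 28)  # Qwen-style, PS*SMS = 28
    anyres = fixed * (3 * 3 + 1)                    # 3x3 grid + overview
    gemma = (896 // (14 * 4)) ** 2                  # fixed downsample: 256
    return {name: v + SPECIAL for name, v in
            [("LLaVA-1.5", fixed), ("Qwen-style", merged),
             ("AnyRes 3x3", anyres), ("Gemma3", gemma)]}

print(totals(1344, 1008))
```

For a 1344x1008 input this yields roughly 578, 1730, 5762, and 258 tokens respectively, mirroring the ordering in Figure 1.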

Conclusions & Key Takeaways

| Representative Models | Strategy | Resolution Logic | Token Efficiency |
|---|---|---|---|
| LLaVA-1.5 | Standard Resize | Squash to fixed $X \times Y$ | Fixed Count |
| Qwen3-VL | Dynamic Merger | Native (preserves aspect ratio) | Quadratic Growth |
| LLaVA-OneVision | AnyRes / Multi-Grid | Grid split ($k \times k$) + overview | Massive Cost |
| Gemma3 | Fixed Downsample | Resize + spatial pooling | Highly Compact |
| Gemma3 (P&S) | Pan and Scan | Adaptive grid split + downsample | Adaptively Expensive |

Table 1: Comparison of Visual Token calculation strategies across modern SOTA VLM architectures.

Visual Tokens are the bridge between the image and language worlds, but they are also the primary bottleneck in VLM deployment. As we have seen, moving from a fixed-resolution model (LLaVA-1.5) to a dynamic one (Qwen3-VL or LLaVA-OneVision) or a hybrid one (Gemma3) can considerably increase your input size (Figure 1).

Here are some key takeaways to keep in mind when building multimodal systems:

  • Calculate, Don't Guess: Use the formulas provided above to pre-calculate token counts. This allows you to dynamically resize images and/or adjust batch sizes to prevent OOM errors in production (more on this in our next blogpost).
  • Tokens β‰  \neq Pixels: High resolution doesn't always mean high cost. It depends entirely on the architecture (e.g., Fixed Downsampler vs. Multi-Grid).

Multimodal learning is evolving rapidly, but compute is finite. Mastering the math of Visual Tokens is the first step toward correctly exploiting VLM efficiency.

In the next blogpost (coming soon), we will dive deep into the impact of visual token counts on Context Windows, Latency, VRAM.

Citation

If you use this work, please cite:

```bibtex
@misc{nulli2026thehidden,
  title={De-mystifying Multimodal Learning: The Hidden Inefficiency in Vision Language Modelling},
  author={Nulli, Matteo},
  year={2026},
  url={https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-hidden-ineff},
  howpublished={Available at \url{https://matteonulli.github.io/blog/2026/demystifying1/} and \url{https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-hidden-ineff}},
  note={Hugging Face Blog}
}
```

References

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. LLaVA-OneVision: Easy Visual Task Transfer. Preprint, arXiv:2408.03326.

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. Blog post (January 2024). https://llava-vl.github.io/blog/2024-01-30-llava-next/.

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2024. Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343.

Gemma-Team. 2025. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

QwenTeam. 2025. Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action.

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. 2023. When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? arXiv preprint arXiv:2210.01936.

Matteo Nulli, Anesa Ibrahimi, Avik Pal, Hoshe Lee, and Ivona Najdenkoska. 2024. In-Context Learning Improves Compositional Understanding of Vision-Language Models. In ICML 2024 Workshop on Foundation Models in the Wild. arXiv preprint arXiv:2407.15487.

Matteo Nulli, Ivona Najdenkoska, Mohammad Mahdi Derakhshani, and Yuki M. Asano. 2025. Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models. In EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM).

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. arXiv preprint arXiv:2401.06209.
