De-mystifying Multimodal Learning: Enabling Vision in Language Models

Community Article Published February 17, 2026
🤗 Community Article, 📝 Blogpost

Introduction

In this first installment of our series, De-mystifying Multimodal Learning, we break down the mechanics of how images become language-compatible vectors. To truly understand how a Large Language Model (LLM) "sees", we must look at the mathematics defining the problem, the training objectives that align vision and text, and the specific architectural steps that process raw pixels, introducing Vision Language Models (VLMs).

VLM Architecture
Figure 1: Adaptation of a figure from LLaVA-OneVision (Li et al., 2024), serving as an overview of the VLM architectural process.

We will therefore cover:

Mathematical Formulation: The theoretical foundation and formal definitions of VLMs.

Vision Encoder Breakdown: A detailed overview of the image processing performed by ViT-CLIP based Vision Encoders.

Contrastive Learning: Uncovering how CLIP models learn to align image and text representations into the same space.

VLM Architecture and Flow: Putting it all together, diving deep into the architectural components of VLMs and detailing the birth of Visual Tokens, the source of sight for LLMs.

Mathematical Formulation

To understand Vision-Language Models (VLMs), we first need to define the notation and the transformation pipeline formally.

Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ be an image and $t \in \Sigma$ be a language instruction input, where $\Sigma$ is the input space of character sequences. Let $s_{\theta, \gamma, \phi}$ be a VLM parametrized by $\theta, \gamma, \phi$. We define $f_{v\theta}$ as a contrastively pre-trained Vision Encoder model:

$$f_{v\theta}: \mathbb{R}^{C \times H \times W} \rightarrow \mathbb{R}^{V \times F},$$

where $V$ is the number of visual tokens and $F$ is their hidden size. $f_{t\theta'}$ represents the corresponding Text Encoder used during the pre-training phase.

To bridge the gap between vision and language, we use a connector $m_\gamma: \mathbb{R}^{V \times F} \rightarrow \mathbb{R}^{V \times D}$, typically a Multi-Layer Perceptron (MLP). The token vocabulary for the model is defined as:

$$\mathcal{V} = \mathcal{V}_{\text{vision}} \cup \mathcal{V}_{\text{text}}$$

The Large Language Model itself is defined as:

$$g_{\phi} := \mathcal{D}_d \circ \operatorname{softmax} \circ F_{\phi'} : \mathbb{R}^{J \times D} \longrightarrow \mathcal{V}^{J}, \qquad \phi = (\phi', d),$$

where $F_{\phi'}$ is the transformer that produces logits, and $\mathcal{D}_d$ is a decoding operator (such as greedy, top-$k$, or nucleus sampling) with hyperparameters $d$. Thus, $g_{\phi}$ maps an embedded input token sequence to an output token sequence.
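To make the decoding operator $\mathcal{D}_d$ concrete, here is a minimal NumPy sketch of greedy and top-$k$ decoding over a single logit vector. The vocabulary size and logits are arbitrary placeholders, not values from any real model:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# F_phi' would produce logits over the vocabulary; D_d turns the resulting
# probabilities into a concrete token id. Values here are illustrative.
rng = np.random.default_rng(0)
vocab_size = 10
logits = rng.standard_normal(vocab_size)
probs = softmax(logits)

greedy = int(np.argmax(probs))              # greedy decoding: argmax token

top_k = 3
k_ids = np.argsort(probs)[-top_k:]          # top-k: keep the k best tokens,
k_probs = probs[k_ids] / probs[k_ids].sum() # renormalize their probabilities,
sampled = int(rng.choice(k_ids, p=k_probs)) # then sample among them
```

Nucleus (top-$p$) sampling follows the same pattern, except the kept set is the smallest prefix of tokens whose cumulative probability exceeds $p$.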

Vision Encoder Breakdown

Now that we have established the mathematical setting, let's look at the architectural implementation of the Vision Encoder $f_{v\theta}$, visually represented in Figure 2. Practically, the processing flow of $f_{v\theta}$ is broken down into the following steps:

1. Patch Partitioning

The first step is breaking the high-resolution image $\mathbf{X}$ into a grid of fixed-size patches. Assuming our image has $336 \times 336$ pixels and we use a patch size of $P=14$, standard$^{*}$ vision encoders divide the image into $24 \times 24 = 576$ distinct squares. Mathematically, the image is reshaped from $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $N$ is the total number of patches.

$^*$ "Standard" refers to CLIP-like Vision Encoders (Radford et al., 2021; Zhai et al., 2023).
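The reshape above can be sketched in a few lines of NumPy. The $336 \times 336$ image and $P = 14$ follow the example in the text, while the tensor values themselves are placeholders:

```python
import numpy as np

# Partition a 336x336 RGB image into non-overlapping 14x14 patches.
C, H, W, P = 3, 336, 336, 14
X = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# (C, H, W) -> (C, H/P, P, W/P, P) -> (H/P, W/P, C, P, P) -> (N, P^2 * C)
patches = X.reshape(C, H // P, P, W // P, P)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, P * P * C)

print(patches.shape)  # (576, 588): N = 24*24 patches, each 14*14*3 values
```

Each row of `patches` is one flattened square of the original image; for instance, the first row is exactly the top-left $14 \times 14$ crop across all three channels.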

2. Linear Projection and Position Embeddings

These patches are simply raw pixel values. To convert them into vectors, $f_{v\theta}$ projects each flattened patch into a latent representation through a linear layer. Given the lack of spatial priors in Vision Transformers (ViT) (Dosovitskiy et al., 2021), these vectors are equipped with learnable positional encodings, injecting "GPS-like" coordinates so the model knows where each patch belongs in the original image.
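A minimal sketch of this step, with random matrices standing in for the learned projection and positional parameters:

```python
import numpy as np

# Project each flattened patch into an F-dim vector, then add position embeddings.
# Shapes follow the 336px / P=14 example; the weights are random placeholders
# for parameters that would be learned during training.
rng = np.random.default_rng(0)
N, P, C, F = 576, 14, 3, 1024

x_p = rng.standard_normal((N, P * P * C))            # flattened patches
W_proj = rng.standard_normal((P * P * C, F)) * 0.02  # linear projection layer
pos_emb = rng.standard_normal((N, F)) * 0.02         # learnable positional encodings

z0 = x_p @ W_proj + pos_emb  # input sequence for the Transformer layers
print(z0.shape)  # (576, 1024)
```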

ViT Architecture
Figure 2: Architecture of Vision Transformers (ViT) (Dosovitskiy et al., 2021), serving as an overview of the Vision Encoder's internal processing.

3. Transformer Layers

The resulting vectors are passed through several Transformer Layers consisting of Multi-Head Self-Attention and MLPs. The output is a sequence of vectors where each vector represents a patch within the context of the whole image. This full process produces the representations $\mathbf{X}' = f_{v\theta}(\mathbf{X}) \in \mathbb{R}^{V \times F}$.
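As a rough illustration, one such layer (single-head, with layer normalization omitted for brevity) can be sketched as follows; all weights are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# One simplified Transformer layer over the V patch vectors.
rng = np.random.default_rng(0)
V, F = 576, 1024
z = rng.standard_normal((V, F))

Wq, Wk, Wv = (rng.standard_normal((F, F)) * 0.02 for _ in range(3))
W1 = rng.standard_normal((F, 4 * F)) * 0.02
W2 = rng.standard_normal((4 * F, F)) * 0.02

q, k, v = z @ Wq, z @ Wk, z @ Wv
attn = softmax(q @ k.T / np.sqrt(F)) @ v   # every patch attends to all others
z = z + attn                               # residual connection
z = z + np.maximum(z @ W1, 0) @ W2         # MLP (ReLU) with residual
print(z.shape)  # (576, 1024)
```

The $V \times V$ attention matrix is what lets each output vector summarize its patch "within the context of the whole image", and it is also the quadratic cost we will return to in the next post.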

Contrastive Learning

Before the Vision Encoder $f_{v\theta}$ can be used in the VLM pipeline, it must learn to extract features that are semantically aligned with text. This is achieved through Contrastive Learning, a training process through which Vision Encoders become powerful feature extractors, compressing visual information into vectors (tokens) semantically aligned with language.
Mathematically, during this pre-training phase, each encoder ($f_{v\theta}$, $f_{t\theta'}$) extracts feature representations for a batch of image-text pairs. Let $t' = f_{t\theta'}(t)$ be the text features and $\mathbf{X}' = f_{v\theta}(\mathbf{X})$ be the image features. These are normalized as follows:

$$\mathbf{X}'_{e} = \frac{\mathbf{X}'}{\|\mathbf{X}'\|_2}, \quad t'_{e} = \frac{t'}{\|t'\|_2}$$

These normalized features are used to compute the pairwise cosine similarities:

$$\textit{logits} = (\mathbf{X}'_e \cdot {t'_e}^{T}) \cdot e^{\tau}$$

where ${t'_e}^{T}$ is the transpose of $t'_e$, and $\tau$ is a learnable temperature parameter. These logits are finally used to compute the joint loss function using cross-entropy (CE). The model attempts to maximize the similarity of correct image-text pairs (the diagonal of the matrix) while minimizing the similarity of all other pairs:

$$\begin{aligned} \mathcal{L}_{\mathbf{X}} &= \operatorname{CE}(\textit{logits}, \textit{labels}, \text{axis}=0), \\ \mathcal{L}_{t} &= \operatorname{CE}(\textit{logits}, \textit{labels}, \text{axis}=1), \\ \mathcal{L} &= \tfrac{1}{2}\left(\mathcal{L}_{\mathbf{X}} + \mathcal{L}_{t}\right). \end{aligned}$$

Here, $\textit{labels}$ are the ground truths for that sample, and $\text{axis}=i$, with $i \in \{0,1\}$, is the dimension along which the loss is computed.
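The symmetric loss above can be reproduced on a toy batch. The features below are random stand-ins for encoder outputs, and the temperature is fixed here rather than learned:

```python
import numpy as np

def cross_entropy(logits, labels, axis):
    # Softmax cross-entropy along the given axis of the logit matrix.
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    if axis == 0:  # softmax over images, one target per text column
        return -log_probs[labels, np.arange(len(labels))].mean()
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch of B image/text pairs with E-dim features.
rng = np.random.default_rng(0)
B, E = 8, 64
img = rng.standard_normal((B, E))
txt = rng.standard_normal((B, E))

# L2-normalize, then compute scaled pairwise cosine similarities.
img_e = img / np.linalg.norm(img, axis=1, keepdims=True)
txt_e = txt / np.linalg.norm(txt, axis=1, keepdims=True)
tau = 0.07
logits = (img_e @ txt_e.T) * np.exp(tau)

labels = np.arange(B)  # matching pairs lie on the diagonal
loss = 0.5 * (cross_entropy(logits, labels, axis=0)
              + cross_entropy(logits, labels, axis=1))
print(float(loss) > 0)  # True
```

Training drives the diagonal entries of `logits` up and the off-diagonal entries down, which is exactly the alignment property the VLM pipeline relies on.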

VLM Architecture and Flow

Once the Vision Encoder is pre-trained, we can assemble the full model. Architecturally, Vision Language Models consist of three major components: the Vision Encoder $f_{v\theta}$, the connector $m_\gamma$, and the Large Language Model $g_\phi$.

Vision-Language Modeling Pipeline

Putting everything together, we can finally describe the classic VLM pipeline during inference, as depicted in Figure 1. In our calculations below we assume:

  • A fixed token count. We defer the analysis of image pre-processing (Li et al., 2024) and other kinds of spatial merging (QwenTeam, 2025; Gemma-Team, 2025) that impact the total visual token count to our next blogpost, "The Hidden Inefficiency in Vision Language Modelling" (coming soon).
  • A batch size of 1.

As established earlier, Vision Encoders $f_{v\theta}$ are used to encode an image $\mathbf{X}$ into a representation:

$$\mathbf{X}' = f_{v\theta}(\mathbf{X}) \in \mathbb{R}^{V \times F}$$

Here, $F$ is the feature dimension and $V$ is the number of visual tokens, calculated as
$V = \left(\frac{\textit{image resolution}}{\textit{patch size}}\right)^2$ $^{**}$.

Subsequently, $\mathbf{X}'$ is transformed through the connector $m_\gamma$ into Visual Tokens ($\mathbf{VT}$):

$$\mathbf{VT} = m_\gamma(\mathbf{X}') \in \mathbb{R}^{V \times D}$$

Crucially, these tokens now exist in the input embedding space of the Large Language Model. In parallel, a Tokenizer $\mathcal{T}: \Sigma \rightarrow \mathcal{V}^{J}$ and a learned embedding $E: \mathcal{V} \rightarrow \mathbb{R}^{D}$ turn the text input $t$ into textual tokens: $\mathit{TT} = E^{\otimes}(\mathcal{T}(t)) \in \mathbb{R}^{J \times D}$, where $E^{\otimes}: \mathcal{V}^{J} \rightarrow \mathbb{R}^{J \times D}$ is the sequence-wise lifting of the operator $E$. Lastly, the visual tokens $\mathbf{VT}$ are concatenated with the textual tokens $\mathit{TT}$ and provided as input to the LLM $g_\phi$ to obtain the output tokens $\mathbf{T}_a$:

$$\mathbf{T}_a = g_{\phi}(\mathbf{VT} \oplus \mathit{TT}) \in \mathcal{V}^{J}.$$

$^{**}$ A crucial approximation, which we'll tackle in our blogpost "The Hidden Inefficiency in Vision Language Modelling" (coming soon).
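Under these assumptions, the whole inference flow can be traced at the level of tensor shapes. The encoder width $F$, LLM width $D$, and text length $J$ below are illustrative, and random tensors stand in for the actual model outputs:

```python
import numpy as np

# End-to-end shape trace of the inference pipeline (batch size 1, fixed tokens).
rng = np.random.default_rng(0)
resolution, patch = 336, 14
V = (resolution // patch) ** 2      # number of visual tokens: (336/14)^2
F, D, J = 1024, 4096, 32            # encoder width, LLM width, text length

X_feat = rng.standard_normal((V, F))          # X' = f_vθ(X)       ∈ R^{V×F}
W_conn = rng.standard_normal((F, D)) * 0.02   # connector weights (placeholder)
VT = X_feat @ W_conn                          # VT = m_γ(X')        ∈ R^{V×D}
TT = rng.standard_normal((J, D))              # TT = E⊗(T(t))       ∈ R^{J×D}

llm_input = np.concatenate([VT, TT], axis=0)  # VT ⊕ TT, input to g_φ
print(V, llm_input.shape)  # 576 (608, 4096)
```

Note how the image alone contributes 576 of the 608 input tokens, which is precisely the imbalance the next post examines.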

Conclusions

Through the pipeline we've explored, we have witnessed a transformation: raw pixels, once just a grid of intensity values, have been flattened, projected, and semantically aligned to emerge as Visual Tokens. These tokens are the "universal language" that allows an LLM to treat an image not as a foreign file type, but as a sequence of concepts, no different from the words in this sentence. By projecting visual data into the same $D$-dimensional embedding space as text, we have effectively given the LLM a pair of eyes.

What’s Next: The Efficiency Bottleneck

While we have successfully "digitized" sight for our models, a massive challenge remains: the impact of the number of Visual Tokens created by the vision encoding pipeline.

In our next post, "The Hidden Inefficiency in Vision Language Modelling" (coming soon), we will dive deep into the cost that producing Visual Tokens imposes on inference time and memory requirements. We will break down how token count impacts the $O(N^2)$ cost of self-attention, and explore why reducing the visual token count is the secret to building faster, leaner, and more capable multimodal systems.

Citation

If you use this work, please cite:

@misc{nulli2026enabling,
  title={De-mystifying Multimodal Learning: Enabling Vision in Language Models},
  author={Nulli, Matteo},
  year={2026},
  url={https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision},
  howpublished={Available at \url{https://matteonulli.github.io/blog/2026/demystifying0/} and \url{https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision}},
  note={Hugging Face Blog}
}

References

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. Llava-onevision: Easy visual task transfer. Preprint, arXiv:2408.03326.

Gemma-Team. (2025). Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, Sutskever Ilya. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020.

Zhai Xiaohua, Mustafa Basil, Kolesnikov Alexander, Beyer Lucas. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343.

Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. (2025). Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning.

Tong Shengbang, Brown Ellis, Wu Penghao, Woo Sanghyun, Middepogu Manoj, Akula Sai Charitha, Yang Jihan, Yang Shusheng, Iyer Adithya, Pan Xichen, Wang Austin, Fergus Rob, LeCun Yann, Xie Saining. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv preprint arXiv:2406.16860.

Matteo Nulli, Ivona Najdenkoska, Mohammad Mahdi Derakhshani, and Yuki M Asano. 2025. Objectguided visual tokens: Eliciting compositional reasoning in multimodal language models. In EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)

QwenTeam. 2025. Qwen3-vl: Sharper vision, deeper thought, broader action.

Yang An, et al. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

Dubey Abhimanyu, et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

Chiang Wei-Lin, Li Zhuohan, Lin Zi, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E., Stoica Ion, Xing Eric P. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org Blog. https://lmsys.org/blog/2023-03-30-vicuna/
