De-mystifying Multimodal Learning: Enabling Vision in Language Models

Community Article Published February 17, 2026
🤗 Community Article, 📝 Blogpost

Introduction

In this first installment of our series, De-mystifying Multimodal Learning, we break down the mechanics of how images become language-compatible vectors. To truly understand how a Large Language Model (LLM) "sees", we must look at the mathematics defining the problem, the training objectives that align vision and text, and the specific architectural steps that process raw pixels, introducing Vision Language Models (VLMs).

VLM Architecture
Figure 1: Adaptation of a figure from LLaVA-OneVision (Li et al., 2024), serving as an overview of the VLM architectural process.

We will therefore cover:

Mathematical Formulation: The theoretical foundation and formal definitions of VLMs.

Vision Encoder Breakdown: A detailed overview of the image processing performed by ViT-CLIP based Vision Encoders.

Contrastive Learning: Uncovering how CLIP models learn to align image and text representations into the same space.

VLM Architecture and Flow: Putting it all together, diving deep into the architectural components of VLMs and detailing the birth of Visual Tokens, the source of sight for LLMs.

Mathematical Formulation

To understand Vision-Language Models (VLMs), we first need to define the notation and the transformation pipeline formally.

Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ be an image and $t \in \Sigma$ be a language instruction input, where $\Sigma$ is the input space of character sequences. Let $s_{\theta, \gamma, \phi}$ be a VLM parametrized by $\theta, \gamma, \phi$. We define $f_{v\theta}$ as a contrastively pre-trained Vision Encoder model:

$$f_{v\theta}: \mathbb{R}^{C \times H \times W} \rightarrow \mathbb{R}^{V \times F},$$

where $V$ is the number of visual tokens and $F$ is their hidden size. $f_{t\theta'}$ represents the corresponding Text Encoder used during the pre-training phase.

To bridge the gap between vision and language, we use a connector $m_\gamma: \mathbb{R}^{V \times F} \rightarrow \mathbb{R}^{V \times D}$, typically a Multi-Layer Perceptron (MLP). The token vocabulary for the model is defined as:

$$\mathcal{V} = \mathcal{V}_{\text{vision}} \cup \mathcal{V}_{\text{text}}$$

The Large Language Model itself is defined as:

$$g_{\phi} := \mathcal{D}_d \circ \operatorname{softmax} \circ F_{\phi'} : \mathbb{R}^{J \times D} \longrightarrow \mathcal{V}^{J}, \qquad \phi = (\phi', d),$$

where $F_{\phi'}$ is the transformer that produces logits, and $\mathcal{D}_d$ is a decoding operator (such as greedy, top-$k$, or nucleus sampling) with hyperparameters $d$. Thus, $g_{\phi}$ maps an embedded input token sequence to an output token sequence.
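To make the decoding operator $\mathcal{D}_d$ concrete, here is a minimal NumPy sketch of greedy and top-$k$ decoding over a single logit vector. The vocabulary size and logits are arbitrary placeholders, not values from any real model:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# F_phi' would produce logits over the vocabulary; D_d turns the resulting
# probabilities into a concrete token id. Values here are illustrative.
rng = np.random.default_rng(0)
vocab_size = 10
logits = rng.standard_normal(vocab_size)
probs = softmax(logits)

greedy = int(np.argmax(probs))              # greedy decoding: argmax token

top_k = 3
k_ids = np.argsort(probs)[-top_k:]          # top-k: keep the k best tokens,
k_probs = probs[k_ids] / probs[k_ids].sum() # renormalize their probabilities,
sampled = int(rng.choice(k_ids, p=k_probs)) # then sample among them
```

Nucleus (top-$p$) sampling follows the same pattern, except the kept set is the smallest prefix of tokens whose cumulative probability exceeds $p$.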

Vision Encoder Breakdown

Now that we have established the mathematical setting, let's look at the architectural implementation of the Vision Encoder $f_{v\theta}$, visually represented in Figure 2. Practically, the processing flow of $f_{v\theta}$ is broken down into the following steps:

1. Patch Partitioning

The first step is breaking the high-resolution image $\mathbf{X}$ into a grid of fixed-size patches. Assuming our image has $336 \times 336$ pixels and we use a patch size of $P=14$, standard$^{*}$ vision encoders divide the image into $24 \times 24 = 576$ distinct squares. Mathematically, the image is reshaped from $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $N$ is the total number of patches.

$^*$ "Standard" refers to CLIP-like Vision Encoders (Radford et al., 2021; Zhai et al., 2023).
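The reshape above can be sketched in a few lines of NumPy. The $336 \times 336$ image and $P = 14$ follow the example in the text, while the tensor values themselves are placeholders:

```python
import numpy as np

# Partition a 336x336 RGB image into non-overlapping 14x14 patches.
C, H, W, P = 3, 336, 336, 14
X = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

# (C, H, W) -> (C, H/P, P, W/P, P) -> (H/P, W/P, C, P, P) -> (N, P^2 * C)
patches = X.reshape(C, H // P, P, W // P, P)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, P * P * C)

print(patches.shape)  # (576, 588): N = 24*24 patches, each 14*14*3 values
```

Each row of `patches` is one flattened square of the original image; for instance, the first row is exactly the top-left $14 \times 14$ crop across all three channels.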

2. Linear Projection and Position Embeddings

These patches are simply raw pixel values. To convert them into vectors, $f_{v\theta}$ projects each flattened patch into a latent representation through a linear layer. Given the lack of spatial priors in Vision Transformers (ViT) (Dosovitskiy et al., 2021), these vectors are equipped with learnable positional encodings, injecting "GPS-like" coordinates so the model knows where each patch belongs in the original image.
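A minimal sketch of this step, with random matrices standing in for the learned projection and positional parameters:

```python
import numpy as np

# Project each flattened patch into an F-dim vector, then add position embeddings.
# Shapes follow the 336px / P=14 example; the weights are random placeholders
# for parameters that would be learned during training.
rng = np.random.default_rng(0)
N, P, C, F = 576, 14, 3, 1024

x_p = rng.standard_normal((N, P * P * C))            # flattened patches
W_proj = rng.standard_normal((P * P * C, F)) * 0.02  # linear projection layer
pos_emb = rng.standard_normal((N, F)) * 0.02         # learnable positional encodings

z0 = x_p @ W_proj + pos_emb  # input sequence for the Transformer layers
print(z0.shape)  # (576, 1024)
```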

ViT Architecture
Figure 2: Architecture of Vision Transformers (ViT) (Dosovitskiy et al., 2021), serving as an overview of the Vision Encoder's internal processing.

3. Transformer Layers

The resulting vectors are passed through several Transformer Layers consisting of Multi-Head Self-Attention and MLPs. The output is a sequence of vectors where each vector represents a patch within the context of the whole image. This full process produces the representations $\mathbf{X}' = f_{v\theta}(\mathbf{X}) \in \mathbb{R}^{V \times F}$.
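As a rough illustration, one such layer (single-head, with layer normalization omitted for brevity) can be sketched as follows; all weights are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# One simplified Transformer layer over the V patch vectors.
rng = np.random.default_rng(0)
V, F = 576, 1024
z = rng.standard_normal((V, F))

Wq, Wk, Wv = (rng.standard_normal((F, F)) * 0.02 for _ in range(3))
W1 = rng.standard_normal((F, 4 * F)) * 0.02
W2 = rng.standard_normal((4 * F, F)) * 0.02

q, k, v = z @ Wq, z @ Wk, z @ Wv
attn = softmax(q @ k.T / np.sqrt(F)) @ v   # every patch attends to all others
z = z + attn                               # residual connection
z = z + np.maximum(z @ W1, 0) @ W2         # MLP (ReLU) with residual
print(z.shape)  # (576, 1024)
```

The $V \times V$ attention matrix is what lets each output vector summarize its patch "within the context of the whole image", and it is also the quadratic cost we will return to in the next post.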

Contrastive Learning

Before the Vision Encoder $f_{v\theta}$ can be used in the VLM pipeline, it must learn to extract features that are semantically aligned with text. This is achieved through Contrastive Learning, a training process through which Vision Encoders become powerful feature extractors, compressing visual information into vectors (tokens) semantically aligned with language.
Mathematically, during this pre-training phase, each encoder ($f_{v\theta}$, $f_{t\theta'}$) extracts feature representations for a batch of image-text pairs. Let $t' = f_{t\theta'}(t)$ be the text features and $\mathbf{X}' = f_{v\theta}(\mathbf{X})$ be the image features. These are normalized as follows:

$$\mathbf{X}'_{e} = \frac{\mathbf{X}'}{\|\mathbf{X}'\|_2}, \quad t'_{e} = \frac{t'}{\|t'\|_2}$$

These normalized features are used to compute the pairwise cosine similarities:

$$\textit{logits} = (\mathbf{X}'_e \cdot {t'_e}^{T}) \cdot e^{\tau}$$

where ${t'_e}^{T}$ is the transpose of $t'_e$, and $\tau$ is a learnable temperature parameter. These logits are finally used to compute the joint loss function using cross-entropy (CE). The model attempts to maximize the similarity of correct image-text pairs (the diagonal of the matrix) while minimizing the similarity of all other pairs:

$$\begin{aligned} \mathcal{L}_{\mathbf{X}} &= \operatorname{CE}(\textit{logits}, \textit{labels}, \text{axis}=0), \\ \mathcal{L}_{t} &= \operatorname{CE}(\textit{logits}, \textit{labels}, \text{axis}=1), \\ \mathcal{L} &= \tfrac{1}{2}\left(\mathcal{L}_{\mathbf{X}} + \mathcal{L}_{t}\right). \end{aligned}$$

Here, $\textit{labels}$ are the ground truths for that sample, and $\text{axis}=i$, with $i \in \{0,1\}$, is the dimension along which the loss is computed.
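The symmetric loss above can be reproduced on a toy batch. The features below are random stand-ins for encoder outputs, and the temperature is fixed here rather than learned:

```python
import numpy as np

def cross_entropy(logits, labels, axis):
    # Softmax cross-entropy along the given axis of the logit matrix.
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    if axis == 0:  # softmax over images, one target per text column
        return -log_probs[labels, np.arange(len(labels))].mean()
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch of B image/text pairs with E-dim features.
rng = np.random.default_rng(0)
B, E = 8, 64
img = rng.standard_normal((B, E))
txt = rng.standard_normal((B, E))

# L2-normalize, then compute scaled pairwise cosine similarities.
img_e = img / np.linalg.norm(img, axis=1, keepdims=True)
txt_e = txt / np.linalg.norm(txt, axis=1, keepdims=True)
tau = 0.07
logits = (img_e @ txt_e.T) * np.exp(tau)

labels = np.arange(B)  # matching pairs lie on the diagonal
loss = 0.5 * (cross_entropy(logits, labels, axis=0)
              + cross_entropy(logits, labels, axis=1))
print(float(loss) > 0)  # True
```

Training drives the diagonal entries of `logits` up and the off-diagonal entries down, which is exactly the alignment property the VLM pipeline relies on.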

VLM Architecture and Flow

Once the Vision Encoder is pre-trained, we can assemble the full model. Architecturally, Vision Language Models consist of three major components: the Vision Encoder $f_{v\theta}$, the connector $m_\gamma$, and the Large Language Model $g_\phi$.

Vision-Language Modeling Pipeline

Putting everything together, we can finally describe the classic VLM pipeline during inference, as depicted in Figure 1. In our calculations below we assume:

  • A fixed token count. We defer the analysis of image pre-processing (Li et al., 2024) and other kinds of spatial merging (QwenTeam, 2025; Gemma-Team, 2025) that impact the total visual token count to our next blogpost, "The Hidden Inefficiency in Vision Language Modelling" (coming soon).
  • A batch size of 1.

As established earlier, Vision Encoders $f_{v\theta}$ are used to encode an image $\mathbf{X}$ into a representation:

$$\mathbf{X}' = f_{v\theta}(\mathbf{X}) \in \mathbb{R}^{V \times F}$$

Here, $F$ is the feature dimension and $V$ is the number of visual tokens, calculated as
$V = \left(\frac{\textit{image resolution}}{\textit{patch size}}\right)^2$ $^{**}$.

Subsequently, $\mathbf{X}'$ is transformed through the connector $m_\gamma$ into Visual Tokens ($\mathbf{VT}$):

$$\mathbf{VT} = m_\gamma(\mathbf{X}') \in \mathbb{R}^{V \times D}$$

Crucially, these tokens now exist in the input embedding space of the Large Language Model. In parallel, a Tokenizer $\mathcal{T}: \Sigma \rightarrow \mathcal{V}^{J}$ and a learned embedding $E: \mathcal{V} \rightarrow \mathbb{R}^{D}$ turn the text input $t$ into textual tokens: $\mathit{TT} = E^{\otimes}(\mathcal{T}(t)) \in \mathbb{R}^{J \times D}$, where $E^{\otimes}: \mathcal{V}^{J} \rightarrow \mathbb{R}^{J \times D}$ is the sequence-wise lifting of the operator $E$. Lastly, the visual tokens $\mathbf{VT}$ are concatenated with the textual tokens $\mathit{TT}$ and provided as input to the LLM $g_\phi$ to obtain the output tokens $\mathbf{T}_a$:

$$\mathbf{T}_a = g_{\phi}(\mathbf{VT} \oplus \mathit{TT}) \in \mathcal{V}^{J}.$$

$^{**}$ A crucial approximation, which we'll tackle in our blogpost "The Hidden Inefficiency in Vision Language Modelling" (coming soon).
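Under these assumptions, the whole inference flow can be traced at the level of tensor shapes. The encoder width $F$, LLM width $D$, and text length $J$ below are illustrative, and random tensors stand in for the actual model outputs:

```python
import numpy as np

# End-to-end shape trace of the inference pipeline (batch size 1, fixed tokens).
rng = np.random.default_rng(0)
resolution, patch = 336, 14
V = (resolution // patch) ** 2      # number of visual tokens: (336/14)^2
F, D, J = 1024, 4096, 32            # encoder width, LLM width, text length

X_feat = rng.standard_normal((V, F))          # X' = f_vθ(X)       ∈ R^{V×F}
W_conn = rng.standard_normal((F, D)) * 0.02   # connector weights (placeholder)
VT = X_feat @ W_conn                          # VT = m_γ(X')        ∈ R^{V×D}
TT = rng.standard_normal((J, D))              # TT = E⊗(T(t))       ∈ R^{J×D}

llm_input = np.concatenate([VT, TT], axis=0)  # VT ⊕ TT, input to g_φ
print(V, llm_input.shape)  # 576 (608, 4096)
```

Note how the image alone contributes 576 of the 608 input tokens, which is precisely the imbalance the next post examines.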

Conclusions

Through the pipeline we've explored, we have witnessed a transformation: raw pixels, once just a grid of intensity values, have been flattened, projected, and semantically aligned to emerge as Visual Tokens. These tokens are the "universal language" that allows an LLM to treat an image not as a foreign file type, but as a sequence of concepts, no different from the words in this sentence. By projecting visual data into the same $D$-dimensional embedding space as text, we have effectively given the LLM a pair of eyes.

What’s Next: The Efficiency Bottleneck

While we have successfully "digitized" sight for our models, a massive challenge remains: the impact of the number of Visual Tokens created by the vision encoding pipeline.

In our next post, "The Hidden Inefficiency in Vision Language Modelling" (coming soon), we will dive deep into the cost that producing Visual Tokens imposes on inference time and memory requirements. We will break down how token count impacts the $O(N^2)$ cost of self-attention, and explore why reducing the visual token count is the secret to building faster, leaner, and more capable multimodal systems.

Citation

If you use this work, please cite:

@misc{nulli2026enabling,
  title={De-mystifying Multimodal Learning: Enabling Vision in Language Models},
  author={Nulli, Matteo},
  year={2026},
  url={https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision},
  howpublished={Available at \url{https://matteonulli.github.io/blog/2026/demystifying0/} and \url{https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision}},
  note={Hugging Face Blog}
}

References

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. Llava-onevision: Easy visual task transfer. Preprint, arXiv:2408.03326.

Gemma-Team. (2025). Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, Sutskever Ilya. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020.

Zhai Xiaohua, Mustafa Basil, Kolesnikov Alexander, Beyer Lucas. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343.

Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. (2025). Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

Li Junnan, Li Dongxu, Savarese Silvio, Hoi Steven. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning.

Tong Shengbang, Brown Ellis, Wu Penghao, Woo Sanghyun, Middepogu Manoj, Akula Sai Charitha, Yang Jihan, Yang Shusheng, Iyer Adithya, Pan Xichen, Wang Austin, Fergus Rob, LeCun Yann, Xie Saining. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv preprint arXiv:2406.16860.

Matteo Nulli, Ivona Najdenkoska, Mohammad Mahdi Derakhshani, and Yuki M Asano. 2025. Objectguided visual tokens: Eliciting compositional reasoning in multimodal language models. In EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)

QwenTeam. 2025. Qwen3-vl: Sharper vision, deeper thought, broader action.

Yang An, et al. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

Dubey Abhimanyu, et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

Chiang Wei-Lin, Li Zhuohan, Lin Zi, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E., Stoica Ion, Xing Eric P. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org Blog. https://lmsys.org/blog/2023-03-30-vicuna/
