De-mystifying Multimodal Learning: Enabling Vision in Language Models
Community Article · Blogpost
Introduction
In this first installment of our series, De-mystifying Multimodal Learning, we break down the mechanics of how images become language-compatible vectors. To truly understand how a Large Language Model (LLM) "sees", we must look at the mathematics defining the problem, the training objectives that align vision and text, and the specific architectural steps that process raw pixels, introducing Vision Language Models (VLMs).
We will therefore cover:
Mathematical Formulation: The theoretical foundation and formal definitions of VLMs.
Vision Encoder Breakdown: A detailed overview of the image processing performed by ViT-CLIP based Vision Encoders.
Contrastive Learning: Uncovering how CLIP models learn to align image and text representations into the same space.
VLM Architecture and Flow: Putting it all together, diving deep into the architectural components of VLMs and detailing the birth of Visual Tokens, the source of sight for LLMs.
Mathematical Formulation
To understand Vision-Language Models (VLMs), we first need to define the notation and the transformation pipeline formally.
Let $x_v \in \mathbb{R}^{H \times W \times C}$ be an image and $x_t \in \mathcal{T}$ be a language instruction input, where $\mathcal{T}$ is the input space of character sequences. Let $F_\theta$ be a VLM parametrized by $\theta$. We define $E_v$ as a contrastively pre-trained Vision Encoder model:

$$E_v : \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{N_v \times d_v},$$
where $N_v$ is the number of visual tokens and $d_v$ is their hidden size. $E_t$ represents the corresponding Text Encoder used during the pre-training phase.
To bridge the gap between vision and language, we use a connector $W : \mathbb{R}^{N_v \times d_v} \rightarrow \mathbb{R}^{N_v \times d}$, typically a Multi-Layer Perceptron (MLP). The token vocabulary for the model is defined as:

$$\mathcal{V} = \{1, \dots, V\}.$$
The Large Language Model itself is defined as:

$$\Phi = D_\lambda \circ f,$$

where $f$ is the transformer that produces logits, and $D_\lambda$ is a decoding operator (such as greedy, top-$k$, or nucleus sampling) with hyper-parameters $\lambda$. Thus, $\Phi$ maps an embedded input token sequence to an output token sequence.
Vision Encoder Breakdown
Now that we have established the mathematical setting, let's look at the architectural implementation of the Vision Encoder $E_v$, visually represented in Figure 2. Practically, the processing flow of $E_v$ is broken down into the following steps:
1. Patch Partitioning
The first step is breaking the high-resolution image into a grid of fixed-size patches. Assuming our image has $H \times W$ pixels and we use a patch size of $P \times P$, standard vision encoders divide the image into $N_v = HW / P^2$ distinct squares. Mathematically, the image is reshaped from $x_v \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N_v \times (P^2 \cdot C)}$, where $N_v$ is the total number of patches.
Here, standard refers to CLIP-like Vision Encoders (Radford et al., 2021; Zhai et al., 2023).
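The patch-partitioning step above can be sketched in a few lines of NumPy. The 224×224 RGB image and 16×16 patch size are illustrative choices (common defaults for CLIP-style ViTs), not fixed requirements:

```python
import numpy as np

# Hypothetical sizes: a 224x224 RGB image split into 16x16 patches.
H, W, C, P = 224, 224, 3, 16

image = np.random.rand(H, W, C)  # x_v, a raw pixel grid

# Carve the image into a (H/P) x (W/P) grid of P x P patches,
# then flatten each patch into a single vector of length P*P*C.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4)   # group the two grid axes first
patches = patches.reshape(-1, P * P * C)     # (N_v, P^2 * C)

print(patches.shape)  # (196, 768): N_v = HW / P^2 = 196 patches
```

Note that no information is lost here: this is purely a rearrangement of the pixel values into a sequence.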
2. Linear Projection and Position Embeddings
These patches are simply raw pixel values. To convert them into vectors, $E_v$ projects each flattened patch into a latent representation through a linear layer. Given the lack of spatial priors in Vision Transformers (ViT) (Dosovitskiy et al., 2021), these vectors are equipped with learnable positional encodings, injecting "GPS-like" coordinates so the model knows where each patch belongs in the original image.
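A minimal sketch of this projection step, with random weights standing in for the trained linear layer and positional table; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_v, patch_dim, d_v = 196, 768, 1024  # hypothetical: 196 patches, width 1024

flat_patches = rng.standard_normal((N_v, patch_dim))

# Linear projection into the encoder's latent space
# (random stand-ins for learned parameters).
W_proj = rng.standard_normal((patch_dim, d_v)) * 0.02
b_proj = np.zeros(d_v)

# One learnable positional vector per patch position.
pos_emb = rng.standard_normal((N_v, d_v)) * 0.02

tokens = flat_patches @ W_proj + b_proj + pos_emb
print(tokens.shape)  # (196, 1024)
```

Without `pos_emb`, shuffling the patch order would leave the model's view of the image unchanged; the positional term is what breaks that symmetry.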
3. Transformer Layers
The resulting vectors are passed through several Transformer Layers consisting of Multi-Head Self-Attention and MLPs. The output is a sequence of vectors in which each vector represents a patch within the context of the whole image. This full process produces the representations $Z_v = E_v(x_v) \in \mathbb{R}^{N_v \times d_v}$.
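A toy, single-head version of one such layer in NumPy. Real encoders use multi-head attention, trained weights, and dozens of stacked blocks; everything here is a random stand-in, kept small for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 196, 64  # hypothetical: 196 patch vectors of width 64

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x):
    # Single-head self-attention: every patch attends to every other patch.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    x = x + softmax(q @ k.T / np.sqrt(d)) @ v  # residual connection
    # Position-wise MLP with a 4x hidden expansion.
    W1 = rng.standard_normal((d, 4 * d)) * 0.02
    W2 = rng.standard_normal((4 * d, d)) * 0.02
    return x + np.maximum(layer_norm(x) @ W1, 0.0) @ W2

out = rng.standard_normal((N, d))
for _ in range(2):  # stack two layers
    out = transformer_block(out)
print(out.shape)  # (196, 64)
```

The key property to notice is that the sequence length never changes: each patch vector leaves the stack contextualized by the full image.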
Contrastive Learning
Before the Vision Encoder can be used in the VLM pipeline, it must learn to extract features that are semantically aligned with text.
This is achieved through Contrastive Learning, a learning process through which Vision Encoders learn to become powerful feature extractors, compressing visual information into vectors (tokens) semantically aligned with language.
Mathematically, during this pre-training phase, each encoder ($E_v$, $E_t$) extracts feature representations for a batch of $N$ image-text pairs. Let $T_f \in \mathbb{R}^{N \times d}$ be the text features and $I_f \in \mathbb{R}^{N \times d}$ be the image features.
These are normalized as follows:

$$I_e = \frac{I_f}{\lVert I_f \rVert_2}, \qquad T_e = \frac{T_f}{\lVert T_f \rVert_2}.$$
These normalized features are used to compute the pairwise cosine similarities:

$$\text{logits} = \left(I_e \, T_e^\top\right) \cdot e^{t},$$
where $T_e^\top$ is the transpose of $T_e$, and $t$ is a learnable temperature parameter. These logits are finally used to compute the joint loss function using cross-entropy (CE). The model attempts to maximize the similarity of correct image-text pairs (the diagonal of the matrix) while minimizing the others:

$$\mathcal{L} = \frac{1}{2}\Big(\text{CE}(\text{logits}, \text{labels}, \text{axis}=0) + \text{CE}(\text{logits}, \text{labels}, \text{axis}=1)\Big).$$
Here, $\text{labels} = (1, \dots, N)$ are the ground truths for that sample, and $\text{axis}$ represents the dimension along which the loss is computed.
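The whole contrastive objective fits in a short NumPy sketch, in the spirit of the well-known CLIP pseudocode. Two simplifications to flag: the features are random stand-ins for encoder outputs, and the temperature is fixed rather than learnable (dividing by a constant τ plays the role of multiplying by the learnable exp(t)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 32  # hypothetical: a batch of 8 image-text pairs, feature dim 32

I_f = rng.standard_normal((N, d))  # image features from the vision encoder
T_f = rng.standard_normal((N, d))  # text features from the text encoder

# L2-normalize rows so that dot products become cosine similarities.
I_e = I_f / np.linalg.norm(I_f, axis=1, keepdims=True)
T_e = T_f / np.linalg.norm(T_f, axis=1, keepdims=True)

tau = 0.07                   # fixed temperature (learnable in real CLIP)
logits = (I_e @ T_e.T) / tau  # N x N pairwise similarity matrix

def ce(lg):
    # Cross-entropy with labels = arange(N): the matched pair of each
    # row sits on the diagonal of the similarity matrix.
    m = lg.max(axis=1, keepdims=True)
    logp = lg - m - np.log(np.exp(lg - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

# Symmetric loss: image-to-text (rows) plus text-to-image (columns).
loss = 0.5 * (ce(logits) + ce(logits.T))
print(float(loss))
```

Minimizing this loss pulls matched image-text pairs together on the diagonal while pushing every mismatched pair in the batch apart.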
VLM Architecture and Flow
Once the Vision Encoder is pre-trained, we can assemble the full model. Architecturally, Vision Language Models are composed of three major components:
- Vision Encoders ($E_v$), usually a CLIP-like image encoder (Dosovitskiy et al., 2021; Radford et al., 2021; Zhai et al., 2023; Bai et al., 2025), though the architecture and training style can vary. See this extensive survey for more information.
- Modality Connectors ($W$), often a simple Multi-Layer Perceptron, with some architectures employing attention blocks (Li et al., 2023) and other alternatives (Tong et al., 2024; Nulli et al., 2025).
- Large Language Models ($\Phi$) like Qwen3 (Yang et al., 2025), Llama 3 (Dubey et al., 2024), Vicuna (Chiang et al., 2023), and more.
Vision-Language Modeling Pipeline
Putting everything together, we can finally describe the classic VLM pipeline during inference, as depicted in Figure 1. In our calculations below we assume:
- A fixed token count. We defer to our next blogpost, "The Hidden Inefficiency in Vision Language Modelling" (coming soon), for an analysis of image pre-processing (Li et al., 2024) and other kinds of spatial merging (QwenTeam, 2025; Gemma-Team, 2025) that impact the total visual token count.
- A batch size of 1.
As per earlier, Vision Encoders are used to encode an image into a representation:

$$Z_v = E_v(x_v) \in \mathbb{R}^{N_v \times d_v}.$$
Here, $N_v$ is the number of visual tokens, calculated as $N_v = HW / P^2$, and $d_v$ is the vision encoder hidden dimension.
Subsequently, $Z_v$ is transformed through the connector $W$ into Visual Tokens ($H_v$):

$$H_v = W(Z_v) \in \mathbb{R}^{N_v \times d}.$$
Crucially, these tokens now exist in the input embedding space of the Large Language Model. In parallel, a Tokenizer $\tau$ and a learned embedding $\mathrm{Emb}$ turn the text input into textual tokens: $H_t = \mathrm{Emb}^*(\tau(x_t)) \in \mathbb{R}^{N_t \times d}$, where $\mathrm{Emb}^*$ is the sequence-wise lifting of the operator $\mathrm{Emb}$. Lastly, the visual tokens are concatenated with the textual tokens and provided as input to the LLM to obtain the output tokens $y$:

$$y = \Phi\left(\left[H_v ; H_t\right]\right).$$
A crucial approximation, which we'll tackle in our blogpost "The Hidden Inefficiency in Vision Language Modelling" (coming soon).
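Under the two assumptions above, the inference-time flow can be traced end to end as a shape-level NumPy sketch. The dimensions and the 2-layer MLP connector are illustrative, with random weights standing in for trained ones and a random matrix standing in for the embedded text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 196 visual tokens of encoder width 1024,
# an LLM embedding size of 2048, and a 12-token text instruction.
N_v, d_v, d, N_t = 196, 1024, 2048, 12

Z_v = rng.standard_normal((N_v, d_v))  # vision encoder output E_v(x_v)

# Connector: a 2-layer MLP projecting into the LLM embedding space.
W1 = rng.standard_normal((d_v, d)) * 0.02
W2 = rng.standard_normal((d, d)) * 0.02
H_v = np.maximum(Z_v @ W1, 0.0) @ W2   # visual tokens, shape (N_v, d)

# Stand-in for the embedded text tokens Emb(tau(x_t)).
H_t = rng.standard_normal((N_t, d))

# Concatenate along the sequence axis and hand the result to the LLM.
llm_input = np.concatenate([H_v, H_t], axis=0)
print(llm_input.shape)  # (208, 2048): 196 visual + 12 textual tokens
```

Notice how the visual tokens dominate the sequence even for a short image-plus-question prompt, which is exactly the cost we analyze in the follow-up post.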
Conclusions
Through the pipeline we've explored, we have witnessed a transformation: raw pixels, once just a grid of intensity values, have been flattened, projected, and semantically aligned to emerge as Visual Tokens. These tokens are the "universal language" that allows an LLM to treat an image not as a foreign file type, but as a sequence of concepts, no different from the words in this sentence. By projecting visual data into the same $d$-dimensional embedding space as text, we have effectively given the LLM a pair of eyes.
What's Next: The Efficiency Bottleneck
While we have successfully "digitized" sight for our models, a massive challenge remains: the impact of the number of Visual Tokens created by the vision encoding pipeline.
In our next post, "The Hidden Inefficiency in Vision Language Modelling" (coming soon), we will dive deep into the cost of producing Visual Tokens on Inference Time & Memory Requirements. We will break down how token count impacts self-attention and explore why reducing the visual token count is the secret to building faster, leaner, and more capable multimodal systems.
Citation
If you use this work, please cite:
@misc{nulli2026enabling,
title={De-mystifying Multimodal Learning: Enabling Vision in Language Models},
author={Nulli, Matteo},
year={2026},
url={https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision},
howpublished={Available at \url{https://matteonulli.github.io/blog/2026/demystifying0/} and \url{https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-enabiling-vision}},
note={Hugging Face Blog}
}
References
Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
Gemma Team. (2025). Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.
Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597.
Qwen Team. (2025). Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
Yang, A., et al. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343.