Meridian: Hyperbolic Multimodal Representation Learning
Hyperbolic multimodal retrieval built on frozen CLIP representations, combining Lorentz and Euclidean embedding spaces for compact and hierarchy-aware semantic search.
Meridian is a multimodal retrieval model built on top of CLIP ViT-B/16 that learns both hyperbolic (Lorentz) and Euclidean representations for images and text.
Unlike conventional retrieval systems that operate entirely in Euclidean space, Meridian maps semantic information onto a Lorentz manifold where hierarchical structure emerges naturally. This geometry enables cleaner semantic organization, improved hierarchy preservation, and compact multimodal representations while retaining strong retrieval performance.
Highlights
- ~4× Embedding Compression compared to the CLIP baseline
- ~1.5× Faster Retrieval through compact representations
- Hierarchical Multimodal Retrieval
- Hyperbolic + Euclidean Dual Representations
- Adaptive Geometry Gating
- Layer-Weighted Transformer Aggregation
- Built on OpenAI CLIP ViT-B/16
Performance
Evaluation on MS-COCO retrieval.
| Variant | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 |
|---|---|---|---|---|---|---|
| Meridian (64d) | 29.66 | 55.20 | 67.18 | 25.29 | 51.00 | 63.02 |
Where:
- i2t = Image → Text Retrieval
- t2i = Text → Image Retrieval
Relative to the full-dimensional CLIP zero-shot baseline, Meridian retains ~70–78% of I2T recall and ~87–91% of T2I recall (R@5/10), with a ~1.6× retrieval speedup on a 1.7M-item index.
Loading the Model
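from transformers import AutoModel

# trust_remote_code=True lets transformers run the custom Meridian model code from the Hub repo.
model = AutoModel.from_pretrained(
    "kaustuk000/meridian",
    trust_remote_code=True,
)
model.eval()  # inference mode: disables dropout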
The CLIP backbone is downloaded automatically.
Loading the Retrieval Index
The retrieval index loader is built directly into the model.
index = model.load_index("kaustuk000/meridian")
Available tensors:
index["tensors"]["h_image"]
index["tensors"]["e_image"]
index["tensors"]["h_text"]
index["tensors"]["e_text"]
Index tensors are stored on the Hub in FP16 format for storage efficiency and automatically converted to FP32 when loaded.
Encoding Text
import torch

# Tokenize and pad to CLIP's 77-token context length.
inputs = model.processor(
    text=["a photo of a dog running on a beach"],
    return_tensors="pt",
    padding="max_length",
    max_length=77,
)
# Index of the last non-padding token (the EOS token) in each sequence.
eos_indices = inputs["attention_mask"].sum(dim=1) - 1

with torch.no_grad():
    out = model.encode_text(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        eos_indices=eos_indices,
    )

h_text = out["h_text"]  # hyperbolic (Lorentz) embedding
e_text = out["e_text"]  # Euclidean embedding
These embeddings can be compared against:
index["tensors"]["h_text"]
index["tensors"]["e_text"]
Encoding Images
from PIL import Image
import torch

image = Image.open("image.jpg").convert("RGB")
# Resize and normalize the image into the pixel tensor CLIP expects.
inputs = model.processor(
    images=image,
    return_tensors="pt",
)

with torch.no_grad():
    out = model.encode_image(
        pixel_values=inputs["pixel_values"]
    )

h_image = out["h_image"]  # hyperbolic (Lorentz) embedding
e_image = out["e_image"]  # Euclidean embedding
These embeddings can be compared against:
index["tensors"]["h_image"]
index["tensors"]["e_image"]
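One way to use these embeddings for retrieval is to rank index entries by cosine similarity. The sketch below assumes the Euclidean tensors are 2-D (`[num_items, dim]`) and that cosine similarity is a suitable score for them; the model card confirms neither. Random tensors stand in for a query and an index tensor, and `topk_cosine` is a hypothetical helper, not part of the Meridian API.

```python
import torch
import torch.nn.functional as F


def topk_cosine(query: torch.Tensor, index_emb: torch.Tensor, k: int = 5):
    """Return (scores, indices) of the k index rows most similar to each query row."""
    # L2-normalise so the dot product equals cosine similarity.
    q = F.normalize(query, dim=-1)
    d = F.normalize(index_emb, dim=-1)
    sims = q @ d.T  # [num_queries, num_items]
    return sims.topk(k, dim=-1)


# Stand-in data: swap in a query embedding such as out["e_image"] and an
# index tensor such as index["tensors"]["e_image"].
torch.manual_seed(0)
index_emb = torch.randn(1000, 64)
query = index_emb[42:43] + 0.01 * torch.randn(1, 64)  # near-duplicate of item 42

scores, indices = topk_cosine(query, index_emb, k=5)
print(indices[0].tolist())  # item 42 should rank first
```

Normalising both sides up front keeps the scoring a single matrix multiply, which is what makes compact embeddings cheap to search.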
Architecture
Meridian consists of five major components:
1. CLIP Vision Encoder
OpenAI CLIP ViT-B/16 image encoder.
2. CLIP Text Encoder
OpenAI CLIP text transformer.
3. Layer Aggregation
Instead of relying solely on the final transformer layer, Meridian learns weighted combinations across all transformer blocks.
4. Dual Representation Heads
Two parallel embedding spaces are learned:
- Hyperbolic Lorentz embeddings
- Euclidean embeddings
The hyperbolic branch maps features onto a Lorentz manifold using exponential-map operations.
5. Adaptive Gating
A learned gating mechanism dynamically combines hyperbolic and Euclidean similarities for each sample.
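The model card does not spell out how layer aggregation or gating are parameterised. As a rough illustration only, the sketch below assumes softmax-normalised scalar weights over per-layer features and a sigmoid gate, computed from the query feature, that mixes two similarity scores. `LayerAggregator` and `SimilarityGate` are hypothetical names, and the actual Meridian modules may differ.

```python
import torch
import torch.nn as nn


class LayerAggregator(nn.Module):
    """Learned convex combination of per-layer features (assumed form)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One logit per transformer block; zeros give uniform weights at init.
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: [num_layers, batch, dim]
        weights = self.logits.softmax(dim=0)  # non-negative, sums to 1
        return torch.einsum("l,lbd->bd", weights, layer_feats)


class SimilarityGate(nn.Module):
    """Per-sample gate in (0, 1) mixing hyperbolic and Euclidean scores (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, feat, sim_hyp, sim_euc):
        # feat: [batch, dim]; sim_hyp, sim_euc: [batch, num_items]
        g = torch.sigmoid(self.proj(feat))  # [batch, 1], broadcast over items
        return g * sim_hyp + (1.0 - g) * sim_euc


torch.manual_seed(0)
feats = torch.randn(12, 4, 768)      # 12 blocks, batch of 4, width 768 (ViT-B/16 vision)
pooled = LayerAggregator(12)(feats)  # [4, 768]
mixed = SimilarityGate(768)(pooled, torch.randn(4, 10), torch.randn(4, 10))
print(pooled.shape, mixed.shape)
```

With zero-initialised logits the aggregator starts as a plain average over layers, so training only has to learn deviations from that baseline.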
Training Note
The CLIP vision and text encoders remain completely frozen during training.
Meridian learns:
- Layer aggregation weights
- Hyperbolic projection heads
- Euclidean projection heads
- Adaptive gating modules
This enables improved hierarchical organization and retrieval behavior without modifying the pretrained CLIP backbone.
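Freezing a backbone while training small heads is typically done by disabling gradients on the backbone parameters and passing only the remaining parameters to the optimiser. A minimal sketch follows, using small stand-in `nn.Linear` modules rather than the real CLIP encoders; the module names and the AdamW settings are illustrative assumptions, not the released training recipe.

```python
import torch
import torch.nn as nn

# Stand-ins: `backbone` plays the role of a frozen CLIP encoder,
# `head` the role of a trainable projection head.
backbone = nn.Linear(768, 768)
head = nn.Linear(768, 64)

# Freeze the backbone: no gradients, and eval mode for deterministic behaviour.
backbone.requires_grad_(False)
backbone.eval()

# Only parameters that still require gradients go to the optimiser.
trainable = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(8, 768)
with torch.no_grad():  # the backbone forward pass needs no autograd graph
    feats = backbone(x)
loss = head(feats).pow(2).mean()  # dummy loss, for illustration only
loss.backward()
optimizer.step()

print(backbone.weight.grad is None, head.weight.grad is not None)
```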
Hierarchy Comparison
Meridian is designed to preserve hierarchical structure more effectively than conventional Euclidean embeddings.
The examples below compare hierarchical organization produced by the original frozen CLIP embeddings against the representations learned by Meridian. The CLIP vision and text encoders remain frozen throughout training; improvements arise from Meridian's learned layer aggregation, projection heads, and adaptive gating modules.
OpenAI CLIP ViT-B/16
Mixed Animal Region
├── Dog
├── Cat
├── Flower
├── Lion
├── Tiger
├── Tree
└── Elephant
Meridian
Feline Region
├── Domestic Cat
├── Tabby Cat
├── Kitten
├── Tiger
├── Tiger Cub
├── Lion
├── Lioness
└── Big Cat Cub
Meridian naturally organizes semantically related concepts into coherent neighborhoods while reducing cross-category mixing.
Geometric Intuition
Hyperbolic space grows exponentially with distance from the origin.
This makes it particularly well suited for representing hierarchical data.
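A standard way to make the growth claim concrete is to compare the circumference of a circle of radius $r$ in the Euclidean plane with one in a hyperbolic plane of curvature $-c$ (with $c > 0$). This is textbook hyperbolic geometry, not something specific to Meridian:

```latex
C_{\text{Euclidean}}(r) = 2\pi r,
\qquad
C_{\text{hyperbolic}}(r) = \frac{2\pi}{\sqrt{c}} \sinh\!\left(\sqrt{c}\, r\right)
\approx \frac{\pi}{\sqrt{c}}\, e^{\sqrt{c}\, r} \quad \text{for large } r .
```

The room available at distance $r$ therefore grows exponentially, much as the number of nodes at depth $r$ in a tree does, which is why trees embed in hyperbolic space with low distortion.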
In Meridian:
- General concepts occupy central regions.
- Fine-grained concepts move toward the boundary.
- Hierarchical relationships emerge naturally from geometry.
The hyperbolic branch uses Lorentz manifold operations rather than standard Euclidean distance metrics.
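The geodesic distance between two points $x$ and $y$ on a Lorentz manifold of curvature $-c$ (with $c > 0$) is:

$$d_L(x, y) = \frac{1}{\sqrt{c}} \operatorname{arcosh}\left(-c \, \langle x, y \rangle_L\right)$$

where the Lorentzian inner product is:

$$\langle x, y \rangle_L = -x_0 y_0 + \sum_{i} x_i y_i$$

Here $x_0$ is the time-like coordinate and the sum runs over the remaining space-like coordinates ($i \geq 1$).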
This geometry provides exponentially increasing representational capacity as embeddings move away from the origin.
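The Lorentz operations can be sketched in a few lines of PyTorch. The code below uses the standard exponential map at the origin and the geodesic distance for curvature $-c$; the function names are hypothetical and are not taken from the Meridian codebase, which may handle curvature, clamping, and coordinate layout differently.

```python
import torch


def exp_map0(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Map tangent vectors at the origin to Lorentz points laid out as [x0, x_space]."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)  # avoid division by zero
    x_space = torch.sinh(sqrt_c * norm) * v / (sqrt_c * norm)
    # The time coordinate follows from the constraint <x, x>_L = -1/c.
    x_time = torch.sqrt(1.0 / c + x_space.pow(2).sum(dim=-1, keepdim=True))
    return torch.cat([x_time, x_space], dim=-1)


def lorentz_inner(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pairwise <x, y>_L = -x0*y0 + sum_i xi*yi, shape [len(x), len(y)]."""
    return -x[:, :1] @ y[:, :1].T + x[:, 1:] @ y[:, 1:].T


def lorentz_distance(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Pairwise geodesic distance arcosh(-c <x, y>_L) / sqrt(c)."""
    # The clamp guards against arguments slightly below 1 caused by rounding.
    arg = (-c * lorentz_inner(x, y)).clamp_min(1.0)
    return torch.acosh(arg) / c ** 0.5


torch.manual_seed(0)
v = torch.randn(5, 8, dtype=torch.float64)
points = exp_map0(v)
origin = exp_map0(torch.zeros(1, 8, dtype=torch.float64))

# The distance from the origin equals the norm of the tangent vector.
print(lorentz_distance(points, origin).squeeze(-1))
print(v.norm(dim=-1))
```

For retrieval, candidates can then be ranked by ascending distance, for example with `lorentz_distance(query, index_h).argsort(dim=-1)`. Whether the stored `h_*` tensors include the time coordinate is not stated in the model card, so check that before applying this sketch to the released index.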
Training Configuration
The released checkpoint was trained using:
- Dataset: CC3M (Conceptual Captions 3M)
- Surviving Training Pairs: ~1.7M
- Backbone: CLIP ViT-B/16
- Training Steps: 100,000
- Batch Size: 196
- Warmup Steps: 5,000
- Frozen CLIP Encoders
- Learnable Layer Aggregation
- Learnable Projection Heads
- Learnable Adaptive Gating
Intended Uses
Meridian is suitable for:
- Text-to-image retrieval
- Image-to-text retrieval
- Semantic search
- Multimodal indexing
- Hierarchical clustering
- Embedding generation
- Dataset exploration
- Semantic visualization
Limitations
- Designed for retrieval rather than image generation.
- Inherits biases present in CLIP and web-scale datasets.
- Retrieval quality may decrease for domains far outside CC3M.
- Hyperbolic embeddings require specialized similarity calculations for best performance.
Repository
GitHub:
https://github.com/kaustuk000/Meridian
Citation
@misc{singh2026meridian,
  title  = {Meridian: Hyperbolic Image--Text Representations},
  author = {Kaustuk Pratap Singh},
  year   = {2026},
  url    = {https://github.com/kaustuk000/Meridian}
}
If you use concepts originating from MERU, please also cite:
@inproceedings{desai2023meru,
  title     = {Hyperbolic Image-Text Representations},
  author    = {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Ramakrishna},
  booktitle = {International Conference on Machine Learning},
  year      = {2023}
}
Acknowledgements
Meridian builds upon several foundational projects:
- OpenAI CLIP
- OpenCLIP
- MERU (Hyperbolic Image-Text Representations)
Special thanks to the authors of these projects for their contributions to multimodal representation learning.