Meridian: Hyperbolic Multimodal Representation Learning

Hyperbolic multimodal retrieval built on frozen CLIP representations, combining Lorentz and Euclidean embedding spaces for compact and hierarchy-aware semantic search.

Meridian is a multimodal retrieval model built on top of CLIP ViT-B/16 that learns both hyperbolic (Lorentz) and Euclidean representations for images and text.

Unlike conventional retrieval systems that operate entirely in Euclidean space, Meridian maps semantic information onto a Lorentz manifold where hierarchical structure emerges naturally. This geometry enables cleaner semantic organization, improved hierarchy preservation, and compact multimodal representations while retaining strong retrieval performance.

Highlights

  • ~4× Embedding Compression compared to the CLIP baseline
  • ~1.5× Faster Retrieval through compact representations
  • Hierarchical Multimodal Retrieval
  • Hyperbolic + Euclidean Dual Representations
  • Adaptive Geometry Gating
  • Layer-Weighted Transformer Aggregation
  • Built on OpenAI CLIP ViT-B/16

Performance

Evaluation on MS-COCO retrieval.

Variant          i2t R@1   i2t R@5   i2t R@10   t2i R@1   t2i R@5   t2i R@10
Meridian (64d)   29.66     55.20     67.18      25.29     51.00     63.02

Where:

  • i2t = Image → Text Retrieval
  • t2i = Text → Image Retrieval
  • R@K = percentage of queries whose correct match appears in the top K results

Relative to the full-dimensional CLIP zero-shot baseline, Meridian retains ~70–78% of I2T recall and ~87–91% of T2I recall (R@5/10), with a ~1.6× retrieval speedup on a 1.7M-item index.
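As a rough illustration of how a Recall@K figure like those in the table can be computed, the sketch below counts a query as a hit when any of its ground-truth items appears in the top K results. The function name `recall_at_k` and the toy rankings are hypothetical and are not taken from the Meridian evaluation code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   one result list per query, best match first.
    relevant_ids: one set of ground-truth item ids per query.
    """
    if not ranked_ids or len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one ground-truth set per query")
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if relevant.intersection(ranking[:k])
    )
    return 100.0 * hits / len(ranked_ids)


# Toy data (illustrative only): three image queries, caption ids as results.
rankings = [[7, 2, 9], [4, 1, 3], [5, 8, 6]]
ground_truth = [{7}, {3}, {0}]

r_at_1 = recall_at_k(rankings, ground_truth, k=1)  # only the first query hits
r_at_3 = recall_at_k(rankings, ground_truth, k=3)  # first and second queries hit
```

Using a set of relevant ids per query accommodates datasets where an image has several reference captions.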

Loading the Model

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "kaustuk000/meridian",
    trust_remote_code=True,
)

model.eval()

The CLIP backbone is downloaded automatically.


Loading the Retrieval Index

The retrieval index loader is built directly into the model.

index = model.load_index("kaustuk000/meridian")

Available tensors:

index["tensors"]["h_image"]
index["tensors"]["e_image"]

index["tensors"]["h_text"]
index["tensors"]["e_text"]

Index tensors are stored on the Hub in FP16 format for storage efficiency and automatically converted to FP32 when loaded.


Encoding Text

import torch

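# Pad and truncate to CLIP's fixed context length of 77 tokens.
inputs = model.processor(
    text=["a photo of a dog running on a beach"],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)

# Index of the last non-padding token (EOS) in each sequence.
eos_indices = inputs["attention_mask"].sum(dim=1) - 1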

with torch.no_grad():
    out = model.encode_text(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        eos_indices=eos_indices,
    )

h_text = out["h_text"]
e_text = out["e_text"]

These embeddings can be compared against:

index["tensors"]["h_text"]
index["tensors"]["e_text"]

Encoding Images

from PIL import Image
import torch

image = Image.open("image.jpg").convert("RGB")

inputs = model.processor(
    images=image,
    return_tensors="pt",
)

with torch.no_grad():
    out = model.encode_image(
        pixel_values=inputs["pixel_values"]
    )

h_image = out["h_image"]
e_image = out["e_image"]

These embeddings can be compared against:

index["tensors"]["h_image"]
index["tensors"]["e_image"]
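To make "compared against" concrete, here is a minimal, dependency-free sketch of ranking index rows against a query embedding. It assumes the Euclidean branch is scored with cosine similarity, which the card does not state; `cosine` and `top_k` are hypothetical helpers and the 3-d vectors are toy stand-ins for real embeddings.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, index_rows, k):
    """Return (row_id, score) pairs for the k rows most similar to the query."""
    scored = ((i, cosine(query, row)) for i, row in enumerate(index_rows))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for a query embedding and the rows of an index tensor.
query = [1.0, 0.0, 0.0]
index_rows = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]
results = top_k(query, index_rows, k=2)
```

With the real tensors, the same ranking would normally be done in batch with a matrix product followed by `torch.topk`. The hyperbolic embeddings need a Lorentz distance instead of cosine similarity.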

Architecture

Meridian consists of five major components:

1. CLIP Vision Encoder

OpenAI CLIP ViT-B/16 image encoder.

2. CLIP Text Encoder

OpenAI CLIP text transformer.

3. Layer Aggregation

Instead of relying solely on the final transformer layer, Meridian learns weighted combinations across all transformer blocks.
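One common way to realize such a weighted combination is to keep one learnable scalar per block and normalize the scalars with a softmax. The sketch below assumes that scheme; the card does not specify how Meridian normalizes its weights, and `aggregate_layers` and `layer_logits` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Weighted sum of per-layer feature vectors.

    layer_features: one feature vector per transformer block.
    layer_logits:   one learnable scalar per block (hypothetical parameter).
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


# Toy example: three "layers" with 2-d features and equal logits,
# so the result is the plain average of the layers.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled = aggregate_layers(features, layer_logits=[0.0, 0.0, 0.0])
```

Raising one logit shifts the pooled feature toward that layer, which lets training decide how much intermediate blocks contribute.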

4. Dual Representation Heads

Two parallel embedding spaces are learned:

  • Hyperbolic Lorentz embeddings
  • Euclidean embeddings

The hyperbolic branch maps features onto a Lorentz manifold using exponential-map operations.
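For intuition, the sketch below lifts a Euclidean vector onto the hyperboloid with the exponential map at the origin, in the parameterization used by MERU. It assumes a curvature of -c with c > 0 and treats the input as a tangent vector at the origin; whether Meridian applies exactly this form is an assumption, and `exp_map_origin` is a hypothetical name.

```python
import math


def exp_map_origin(v, c=1.0):
    """Lift a Euclidean vector v onto the hyperboloid of curvature -c.

    Returns (time, space): the time-like coordinate x0 and the space-like
    coordinates, chosen so that -x0^2 + sum(xi^2) = -1/c.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in v))
    # sinh(t)/t tends to 1 as t -> 0, so the zero vector maps to the origin.
    scale = 1.0 if norm == 0.0 else math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    space = [scale * a for a in v]
    time = math.sqrt(1.0 / c + sum(a * a for a in space))
    return time, space


time, space = exp_map_origin([0.3, -0.4], c=1.0)
# Lorentzian norm of the lifted point; on the manifold it equals -1/c.
lorentz_norm = -time * time + sum(a * a for a in space)
```

Because `sinh` grows exponentially, practical implementations usually clamp or rescale the input norm before applying the map.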

5. Adaptive Gating

A learned gating mechanism dynamically combines hyperbolic and Euclidean similarities for each sample.
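A simple form such a gate could take is a sigmoid-weighted blend of the two similarity scores. The sketch below assumes that form; the card does not describe how Meridian computes its gate, so `gated_similarity` and `gate_logit` are hypothetical names.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_similarity(hyp_sim, euc_sim, gate_logit):
    """Blend two similarity scores with a per-sample gate in (0, 1).

    hyp_sim:    similarity from the hyperbolic branch (e.g. negative distance).
    euc_sim:    similarity from the Euclidean branch.
    gate_logit: per-sample scalar produced by a learned module (hypothetical).
    """
    gate = sigmoid(gate_logit)
    return gate * hyp_sim + (1.0 - gate) * euc_sim


# A zero logit weights both branches equally.
balanced = gated_similarity(hyp_sim=-0.8, euc_sim=0.6, gate_logit=0.0)
```

Large positive logits make the score follow the hyperbolic branch, and large negative logits make it follow the Euclidean branch.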

Training Note

The CLIP vision and text encoders remain completely frozen during training.

Meridian learns:

  • Layer aggregation weights
  • Hyperbolic projection heads
  • Euclidean projection heads
  • Adaptive gating modules

This enables improved hierarchical organization and retrieval behavior without modifying the pretrained CLIP backbone.


Hierarchy Comparison

Meridian is designed to preserve hierarchical structure more effectively than conventional Euclidean embeddings.

The examples below compare hierarchical organization produced by the original frozen CLIP embeddings against the representations learned by Meridian. The CLIP vision and text encoders remain frozen throughout training; improvements arise from Meridian's learned layer aggregation, projection heads, and adaptive gating modules.

OpenAI CLIP ViT-B/16

Mixed Animal Region
├── Dog
├── Cat
├── Flower
├── Lion
├── Tiger
├── Tree
└── Elephant

Meridian

Feline Region
├── Domestic Cat
├── Tabby Cat
├── Kitten
├── Tiger
├── Tiger Cub
├── Lion
├── Lioness
└── Big Cat Cub

Meridian naturally organizes semantically related concepts into coherent neighborhoods while reducing cross-category mixing.


Geometric Intuition

The volume of hyperbolic space grows exponentially with distance from the origin, much as the number of nodes in a tree grows with depth. This makes it particularly well suited for representing hierarchical data.

In Meridian:

  • General concepts occupy central regions, close to the origin.
  • Fine-grained concepts lie farther from the origin.
  • Hierarchical relationships emerge naturally from geometry.

The hyperbolic branch uses Lorentz manifold operations rather than standard Euclidean distance metrics.

The geodesic distance between two points is:

d_L(x, y) = (1/√c) arcosh(-c ⟨x, y⟩_L)

where:

⟨x, y⟩_L = -x₀y₀ + Σᵢ xᵢyᵢ

This geometry provides exponentially increasing representational capacity as embeddings move away from the origin.
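The distance formula above can be checked numerically with a few lines of Python. The sketch assumes c > 0 (a manifold of curvature -c) and that each point stores its time coordinate x₀ first; how Meridian's released tensors are laid out is not stated here, and `point_at` is a hypothetical helper used only to build test points.

```python
import math


def lorentz_inner(x, y):
    """<x, y>_L = -x0*y0 + sum_i xi*yi, with index 0 as the time coordinate."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y, c=1.0):
    """Geodesic distance (1/sqrt(c)) * arcosh(-c * <x, y>_L)."""
    # Rounding can push the argument slightly below 1, where acosh is undefined.
    arg = max(1.0, -c * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(c)


def point_at(radius, c=1.0):
    """Hypothetical helper: the point at the given geodesic distance from the
    origin along the first spatial axis of a hyperboloid with curvature -c."""
    s = math.sqrt(c)
    return [math.cosh(s * radius) / s, math.sinh(s * radius) / s, 0.0]


origin = point_at(0.0)
near = point_at(0.5)
far = point_at(2.0)

d_origin_far = lorentz_distance(origin, far)  # radius of `far`
d_near_far = lorentz_distance(near, far)      # same geodesic, so radii subtract
```

The clamp before `acosh` matters in practice: for two nearly identical points the argument can round to just under 1 and would otherwise raise a domain error.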


Training Configuration

The released checkpoint was trained using:

  • Dataset: CC3M (Conceptual Captions 3M)
  • Surviving Training Pairs: ~1.7M
  • Backbone: CLIP ViT-B/16
  • Training Steps: 100,000
  • Batch Size: 196
  • Warmup Steps: 5,000
  • Frozen CLIP Encoders
  • Learnable Layer Aggregation
  • Learnable Projection Heads
  • Learnable Adaptive Gating

Intended Uses

Meridian is suitable for:

  • Text-to-image retrieval
  • Image-to-text retrieval
  • Semantic search
  • Multimodal indexing
  • Hierarchical clustering
  • Embedding generation
  • Dataset exploration
  • Semantic visualization

Limitations

  • Designed for retrieval rather than image generation.
  • Inherits biases present in CLIP and web-scale datasets.
  • Retrieval quality may decrease for domains far outside CC3M.
  • Hyperbolic embeddings should be compared with the Lorentz geodesic distance rather than cosine or Euclidean distance for best performance.

Repository

GitHub:

https://github.com/kaustuk000/Meridian


Citation

@misc{singh2026meridian,
  title  = {Meridian: Hyperbolic Image--Text Representations},
  author = {Kaustuk Pratap Singh},
  year   = {2026},
  url    = {https://github.com/kaustuk000/Meridian}
}

If you use concepts originating from MERU, please also cite:

@inproceedings{desai2023meru,
    title     = {Hyperbolic Image-Text Representations},
    author    = {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Ramakrishna},
    booktitle = {International Conference on Machine Learning},
    year      = {2023}
}

Acknowledgements

Meridian builds upon several foundational projects:

  • OpenAI CLIP
  • OpenCLIP
  • MERU (Hyperbolic Image-Text Representations)

Special thanks to the authors of these projects for their contributions to multimodal representation learning.
