Meridian: Hyperbolic Multimodal Representation Learning

Hyperbolic multimodal retrieval built on frozen CLIP representations, combining Lorentz and Euclidean embedding spaces for compact and hierarchy-aware semantic search.

Meridian is a multimodal retrieval model built on top of CLIP ViT-B/16 that learns both hyperbolic (Lorentz) and Euclidean representations for images and text.

Unlike conventional retrieval systems that operate entirely in Euclidean space, Meridian maps semantic information onto a Lorentz manifold where hierarchical structure emerges naturally. This geometry enables cleaner semantic organization, improved hierarchy preservation, and compact multimodal representations while retaining strong retrieval performance.

Highlights

~4× Embedding Compression compared to the CLIP baseline
~1.5× Faster Retrieval through compact representations
Hierarchical Multimodal Retrieval
Hyperbolic + Euclidean Dual Representations
Adaptive Geometry Gating
Layer-Weighted Transformer Aggregation
Built on OpenAI CLIP ViT-B/16

Performance

Retrieval results on MS-COCO (Recall@K, in percent):

Variant	i2t R@1	i2t R@5	i2t R@10	t2i R@1	t2i R@5	t2i R@10
Meridian (64d)	29.66	55.20	67.18	25.29	51.00	63.02

Where:

i2t = Image → Text Retrieval
t2i = Text → Image Retrieval

At R@5 and R@10, Meridian retains roughly 70–78% of image-to-text recall and 87–91% of text-to-image recall relative to the full-dimensional CLIP zero-shot baseline, with a ~1.6× retrieval speedup on a 1.7M-item index.
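
For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is usually computed for retrieval benchmarks. It is not taken from the Meridian evaluation code; the function name `recall_at_k` and the toy ids are illustrative assumptions. A query counts as a hit if any of its ground-truth items appears in the top K results, which matters for image-to-text retrieval because each MS-COCO image has several reference captions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_ids:   per query, candidate ids sorted best-first
    relevant_ids: per query, the set of ground-truth ids
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_ids)


# Toy example (hypothetical ids): 3 queries, each with a set of correct matches.
ranked = [[4, 0, 7], [2, 9, 1], [5, 3, 8]]
relevant = [{0, 1}, {2}, {6}]

r1 = recall_at_k(ranked, relevant, k=1)  # only the second query hits at rank 1
r3 = recall_at_k(ranked, relevant, k=3)  # the first and second queries hit
print(f"R@1 = {r1:.2f}, R@3 = {r3:.2f}")
```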

Loading the Model

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "kaustuk000/meridian",
    trust_remote_code=True,
)

model.eval()

The CLIP backbone is downloaded automatically.

Loading the Retrieval Index

The retrieval index loader is built directly into the model.

index = model.load_index("kaustuk000/meridian")

Available tensors:

index["tensors"]["h_image"]
index["tensors"]["e_image"]

index["tensors"]["h_text"]
index["tensors"]["e_text"]

Index tensors are stored on the Hub in FP16 format for storage efficiency and automatically converted to FP32 when loaded.

Encoding Text

import torch

inputs = model.processor(
    text=["a photo of a dog running on a beach"],
    return_tensors="pt",
    padding="max_length",
    max_length=77,
)

# Index of the last non-padding token in each sequence (the EOS position)
eos_indices = inputs["attention_mask"].sum(dim=1) - 1

with torch.no_grad():
    out = model.encode_text(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        eos_indices=eos_indices,
    )

h_text = out["h_text"]
e_text = out["e_text"]

These embeddings can be compared against:

index["tensors"]["h_text"]
index["tensors"]["e_text"]

Encoding Images

from PIL import Image
import torch

image = Image.open("image.jpg").convert("RGB")

inputs = model.processor(
    images=image,
    return_tensors="pt",
)

with torch.no_grad():
    out = model.encode_image(
        pixel_values=inputs["pixel_values"]
    )

h_image = out["h_image"]
e_image = out["e_image"]

These embeddings can be compared against:

index["tensors"]["h_image"]
index["tensors"]["e_image"]
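
As a rough illustration of what "comparing against" the index can look like for the Euclidean branch, the sketch below ranks index rows by cosine similarity to a query vector. This is an assumption, not the documented Meridian scoring rule: the card does not state which similarity the Euclidean embeddings use, and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical. Plain Python lists are used so the example runs without the model; with the real index tensors the same ranking would normally be a normalized matrix product followed by a top-k selection.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index_rows, k):
    """Return (row_index, score) pairs for the k most similar rows, best first."""
    scores = [cosine_similarity(query, row) for row in index_rows]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy stand-ins for index rows and an encoded query (assumed values).
toy_index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]

results = top_k(query, toy_index, k=2)
for row, score in results:
    print(f"row {row}: cosine similarity {score:.3f}")
```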

Architecture

Meridian consists of five major components:

1. CLIP Vision Encoder

OpenAI CLIP ViT-B/16 image encoder.

2. CLIP Text Encoder

OpenAI CLIP text transformer.

3. Layer Aggregation

Instead of relying solely on the final transformer layer, Meridian learns weighted combinations across all transformer blocks.
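
One common way to realize such a learned combination, assumed here for illustration only, is to keep one scalar logit per transformer block, normalize the logits with a softmax, and take the weighted sum of the per-layer features. The card does not specify the exact parameterization, so `aggregate_layers` and its inputs are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Weighted sum of per-layer feature vectors.

    layer_features: one feature vector per transformer block
    layer_logits:   one (learnable) scalar per block, softmax-normalized here
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


# Toy features from three "layers" (assumed values).
layer_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = aggregate_layers(layer_features, [0.0, 0.0, 0.0])    # equal weights
last_only = aggregate_layers(layer_features, [0.0, 0.0, 50.0])  # ~final layer
print(uniform, last_only)
```

With equal logits the result is the plain average of the layers; a very large logit on the last block recovers the usual final-layer representation.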

4. Dual Representation Heads

Two parallel embedding spaces are learned:

Hyperbolic Lorentz embeddings
Euclidean embeddings

The hyperbolic branch maps features onto a Lorentz manifold using exponential-map operations.
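
The sketch below shows the standard exponential map at the origin of the Lorentz model, the form used in MERU. It is an assumption that Meridian uses exactly this form; any scaling applied before the map is not described in this card. The function names and the sample vector are illustrative. A Euclidean vector is treated as a tangent vector at the origin, its spatial part is rescaled with sinh, and the time component is then chosen so that the point satisfies the hyperboloid constraint ⟨x,x⟩L = -1/c.

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a tangent vector v at the origin onto the hyperboloid <x, x>_L = -1/c.

    Returns [x0, x1, ..., xn], where x0 is the time component.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(t * t for t in v))
    if norm < 1e-12:
        # The zero vector maps to the hyperboloid origin.
        return [1.0 / sqrt_c] + [0.0] * len(v)
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    space = [scale * t for t in v]
    time = math.sqrt(1.0 / c + sum(s * s for s in space))
    return [time] + space


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


point = exp_map_origin([0.3, -0.4], c=1.0)
constraint = lorentz_inner(point, point)  # should be close to -1/c
print(point, constraint)
```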

5. Adaptive Gating

A learned gating mechanism dynamically combines hyperbolic and Euclidean similarities for each sample.
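
The card does not describe the gate's exact form. A simple design, assumed here purely for illustration, is a per-sample sigmoid gate g in [0, 1] that interpolates between the two similarity scores. In the sketch, `gate_logit` stands in for the output of the learned gating module, and all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_similarity(sim_hyperbolic, sim_euclidean, gate_logit):
    """Blend two similarity scores with a per-sample gate.

    gate_logit is assumed to come from a small learned module; a large positive
    value favors the hyperbolic score, a large negative value the Euclidean one.
    """
    g = sigmoid(gate_logit)
    return g * sim_hyperbolic + (1.0 - g) * sim_euclidean


balanced = gated_similarity(0.8, 0.4, gate_logit=0.0)        # g = 0.5
mostly_hyp = gated_similarity(0.8, 0.4, gate_logit=6.0)      # g close to 1
mostly_euc = gated_similarity(0.8, 0.4, gate_logit=-6.0)     # g close to 0
print(balanced, mostly_hyp, mostly_euc)
```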

Training Note

The CLIP vision and text encoders remain completely frozen during training.

Meridian learns:

Layer aggregation weights
Hyperbolic projection heads
Euclidean projection heads
Adaptive gating modules

This enables improved hierarchical organization and retrieval behavior without modifying the pretrained CLIP backbone.

Hierarchy Comparison

Meridian is designed to preserve hierarchical structure more effectively than conventional Euclidean embeddings.

The examples below compare the hierarchical organization of the original frozen CLIP embeddings with that of the representations learned by Meridian. Because the CLIP encoders stay frozen, any improvement comes from Meridian's learned layer aggregation, projection heads, and adaptive gating modules.

OpenAI CLIP ViT-B/16

Mixed Animal Region
├── Dog
├── Cat
├── Flower
├── Lion
├── Tiger
├── Tree
└── Elephant

Meridian

Feline Region
├── Domestic Cat
├── Tabby Cat
├── Kitten
├── Tiger
├── Tiger Cub
├── Lion
├── Lioness
└── Big Cat Cub

Meridian naturally organizes semantically related concepts into coherent neighborhoods while reducing cross-category mixing.

Geometric Intuition

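In hyperbolic space, the volume of a ball grows exponentially with its radius, whereas in Euclidean space it grows only polynomially.

This makes hyperbolic space particularly well suited to hierarchical data, where the number of nodes in a tree also grows exponentially with depth.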
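
A small numeric illustration of this growth, assuming the hyperbolic plane with curvature -1: a circle of radius r has circumference 2π·sinh(r) there, compared with 2π·r in the Euclidean plane. The function names are illustrative.

```python
import math


def euclidean_circumference(r):
    return 2.0 * math.pi * r


def hyperbolic_circumference(r):
    # Hyperbolic plane with curvature -1.
    return 2.0 * math.pi * math.sinh(r)


for r in (1.0, 5.0, 10.0):
    ratio = hyperbolic_circumference(r) / euclidean_circumference(r)
    print(f"r={r:>4}: hyperbolic / Euclidean circumference = {ratio:.1f}")
```

The ratio grows rapidly with the radius, which is the extra "room" that lets many fine-grained concepts sit far from the origin without crowding each other.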

In Meridian:

General concepts occupy central regions.
Fine-grained concepts move toward the boundary.
Hierarchical relationships emerge naturally from geometry.

The hyperbolic branch uses Lorentz manifold operations rather than standard Euclidean distance metrics.

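The geodesic distance between two points x and y on a Lorentz manifold of curvature -c (with c > 0) is:

$$d_L(x, y) = \frac{1}{\sqrt{c}} \, \operatorname{arcosh}\left(-c \, \langle x, y \rangle_L\right)$$

where ⟨·,·⟩_L is the Lorentzian inner product, with x₀ the time component and the sum running over the spatial components:

$$\langle x, y \rangle_L = -x_0 y_0 + \sum_{i=1}^{n} x_i y_i$$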

This geometry provides exponentially increasing representational capacity as embeddings move away from the origin.
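
The distance can be sketched in a few lines of plain Python. This is an illustrative implementation of the formula, not code from the Meridian repository; the helper `lift`, which computes the time component for a given spatial part, and the sample points are assumptions. The clamp guards against rounding that would push the arcosh argument slightly below 1.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y, c=1.0):
    """Geodesic distance on the hyperboloid <x, x>_L = -1/c."""
    # Clamp to 1.0: rounding can push the argument just outside acosh's domain.
    arg = max(1.0, -c * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(c)


def lift(space, c=1.0):
    """Prepend the time component so that the point lies on the hyperboloid."""
    time = math.sqrt(1.0 / c + sum(s * s for s in space))
    return [time] + list(space)


origin = lift([0.0, 0.0])
near = lift([0.5, 0.0])
far = lift([3.0, 0.0])

d_near = lorentz_distance(origin, near)
d_far = lorentz_distance(origin, far)
d_self = lorentz_distance(far, far)  # approximately zero
print(d_near, d_far, d_self)
```

Because retrieval needs a score where larger means more similar, a natural choice is to rank by negative distance; whether Meridian does exactly this is not stated here.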

Training Configuration

The released checkpoint was trained using:

Dataset: CC3M (Conceptual Captions 3M)
Surviving Training Pairs: ~1.7M
Backbone: CLIP ViT-B/16
Training Steps: 100,000
Batch Size: 196
Warmup Steps: 5,000
Frozen CLIP Encoders
Learnable Layer Aggregation
Learnable Projection Heads
Learnable Adaptive Gating
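
As a back-of-the-envelope check on these numbers, the snippet below estimates how many passes over the data the schedule corresponds to. It assumes the batch size of 196 is the global batch per optimizer step, with no gradient accumulation, and treats the pair count as exactly 1.7M; neither assumption is confirmed by this card.

```python
steps = 100_000
batch_size = 196             # assumed to be the global batch per step
training_pairs = 1_700_000   # approximate ("~1.7M")

samples_seen = steps * batch_size
approx_epochs = samples_seen / training_pairs

print(f"samples seen: {samples_seen:,}")
print(f"approximate passes over the data: {approx_epochs:.1f}")
```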

Intended Uses

Meridian is suitable for:

Text-to-image retrieval
Image-to-text retrieval
Semantic search
Multimodal indexing
Hierarchical clustering
Embedding generation
Dataset exploration
Semantic visualization

Limitations

Designed for retrieval rather than image generation.
Inherits biases present in CLIP and web-scale datasets.
Retrieval quality may decrease for domains far outside CC3M.
Hyperbolic embeddings require specialized similarity calculations for best performance.

Repository

GitHub:

https://github.com/kaustuk000/Meridian

Citation

@misc{singh2026meridian,
  title  = {Meridian: Hyperbolic Image--Text Representations},
  author = {Kaustuk Pratap Singh},
  year   = {2026},
  url    = {https://github.com/kaustuk000/Meridian}
}

If you use concepts originating from MERU, please also cite:

@inproceedings{desai2023meru,
    title     = {Hyperbolic Image-Text Representations},
    author    = {Desai, Karan and Nickel, Maximilian and Rajpurohit, Tanmay and Johnson, Justin and Vedantam, Ramakrishna},
    booktitle = {International Conference on Machine Learning},
    year      = {2023}
}

Acknowledgements

Meridian builds upon several foundational projects:

OpenAI CLIP
OpenCLIP
MERU (Hyperbolic Image-Text Representations)

Special thanks to the authors of these projects for their contributions to multimodal representation learning.

Model size: 41.5M params (Safetensors, F32)

Base model: openai/clip-vit-base-patch16