Draft integration with Sentence Transformers v5.5.0
Hello!
Pull Request overview
- Integrate `multi-modal-embed-large` with the upcoming Sentence Transformers v5.5 via a Router-based pipeline plus two small custom Module classes (each a thin `Transformer` subclass).
Details
The integration adds a Sentence Transformers entry point alongside the existing hf_st_mm packaged source. Loading via SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True) now produces a Router model with three routes that mirror the original tri-encoder:
- `text` route: `Transformer(mmbert) -> Pooling(mean) -> Normalize`
- `image` route: `SiglipVisionTransformer -> Pooling(mean) -> Dense(1152, 768) -> Normalize`
- `audio` route: `WhisperEncoderTransformer -> Pooling(mean) -> Dense(1024, 768) -> Normalize`
Routes are picked automatically by ST's modality inference (plain string -> text, image path / URL / PIL image -> image, audio path / URL / NumPy array -> audio). The task argument is optional and only needed when the input doesn't carry a recognizable extension or PIL/array type.
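To make the routing rule concrete, here is a minimal sketch of extension-based modality inference for string inputs. It is not the Sentence Transformers implementation: the function name `infer_modality` and both extension sets are hypothetical, and PIL images or NumPy arrays are left out for brevity.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension sets for illustration; the real library may recognize others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg", ".m4a"}


def infer_modality(item, task=None):
    """Guess which route a string input belongs to (illustrative only)."""
    if task is not None:
        return task  # an explicit task overrides any guessing
    if not isinstance(item, str):
        raise TypeError("this sketch only handles strings; pass task= for other inputs")
    # For URLs, inspect the path component so query strings do not hide the extension.
    path = urlparse(item).path if "://" in item else item
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "text"  # plain strings fall through to the text route


print(infer_modality("two cats sleeping side by side on a pink couch"))  # text
print(infer_modality("call.wav"))  # audio
print(infer_modality("recording_without_extension", task="audio"))  # audio
```

The last call shows the case the `task` argument exists for: an input whose name carries no recognizable extension.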
A few non-obvious choices:
- The image route uses a tiny `Transformer` subclass (`SiglipVisionTransformer`) because the trained projection was fit against `vision_model(...).last_hidden_state[:, 1:].mean(dim=1)` (i.e. the mean of all patch tokens with the first one dropped). The standard `Transformer` would not be able to use the `[:, 1:]` slicing. Beyond that, the model loading takes only the vision model. Everything else (preprocess, save, load, processor wiring) is inherited from the parent `Transformer`.
- The audio route uses an even smaller `Transformer` subclass (`WhisperEncoderTransformer`) whose only job is to decode audio file paths/URLs into raw waveforms via `transformers.audio_utils.load_audio` before delegating to the parent's preprocess. `WhisperFeatureExtractor` (and audio feature extractors generally) in `transformers` only accept arrays, not paths, so without this `model.encode("call.wav")` would fail. This subclass should be removable once transformers grows native path-loading at the feature-extractor / processor layer.
- The integration is config-only: `model.pt` is left untouched, and the existing `hf_st_mm` usage path continues to work unchanged, but I would recommend switching to a fully Sentence Transformers format. However, I'm a bit biased.
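As a rough illustration of why the `[:, 1:]` slicing matters, the sketch below reproduces the image route's post-encoder steps in NumPy on random data. It assumes the `Dense` step is a plain linear layer without an activation, and the weight and bias here are random stand-ins, not the trained `image_proj` parameters.

```python
import numpy as np


def image_route_embedding(last_hidden_state, proj_weight, proj_bias):
    """Pool, project, and normalize encoder output as the image route is described."""
    # Drop the first token, then average the rest: last_hidden_state[:, 1:].mean(dim=1).
    pooled = last_hidden_state[:, 1:].mean(axis=1)   # (batch, 1152)
    # Assumed linear projection 1152 -> 768 (no activation).
    projected = pooled @ proj_weight.T + proj_bias   # (batch, 768)
    # L2-normalize so that dot products equal cosine similarities.
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 1152))   # toy input: 2 images, 5 tokens each
weight = rng.normal(size=(768, 1152))    # random stand-in for the projection weights
bias = np.zeros(768)

embeddings = image_route_embedding(hidden, weight, bias)
print(embeddings.shape)  # (2, 768)

# Pooling over all tokens gives a different vector, so the slice must be reproduced exactly.
print(np.allclose(hidden.mean(axis=1), hidden[:, 1:].mean(axis=1)))  # False
```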
Outputs were verified against the existing MultiModalSentenceEmbedder reference: the pairwise cosine-similarity matrix across two text inputs, two images, and two audio clips matches identically (to six decimals) on the same hardware. The text route returns bfloat16 (mmbert ships in bf16); image and audio routes return float32. Users wanting a uniform dtype can pass model_kwargs={"torch_dtype": torch.float32} at load time.
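A check of this kind can be sketched in a few lines of plain Python. The helper names `cosine_similarity_matrix` and `matrices_match` are hypothetical, the vectors are toy data rather than real model outputs, and the actual verification script may differ.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine-similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def matrices_match(left, right, decimals=6):
    """True when every pair of entries differs by less than 10**-decimals."""
    tolerance = 10 ** -decimals
    return all(
        abs(a - b) < tolerance
        for row_l, row_r in zip(left, right)
        for a, b in zip(row_l, row_r)
    )


# Toy stand-ins for "reference" and "new" embeddings. Cosine similarity ignores
# vector length, so uniformly scaled copies should produce the same matrix.
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
candidate = [[2.0 * x for x in v] for v in reference]

print(matrices_match(cosine_similarity_matrix(reference),
                     cosine_similarity_matrix(candidate)))  # True
```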
Added files:
- `modeling_multimodal_embed.py`: `SiglipVisionTransformer` and `WhisperEncoderTransformer`, thin `Transformer` subclasses used by the image and audio routes.
- `modules.json`: top-level ST module list, pointing at the built-in `Router`.
- `router_config.json`: per-route module structure and types (text / image / audio).
- `config_sentence_transformers.json`: ST model-level config (`model_type`, `similarity_fn_name`, etc.).
- `text_0_Transformer/`: ModernBERT (mmbert) weights, tokenizer, and ST module config.
- `text_1_Pooling/`: mean pooling config for the text route.
- `image_0_SiglipVisionTransformer/`: SigLIP2 vision-tower weights and image preprocessor.
- `image_1_Pooling/` + `image_2_Dense/`: mean pooling and 1152 -> 768 projection (`image_proj`).
- `audio_0_WhisperEncoderTransformer/`: Whisper encoder weights, feature extractor, and ST module config.
- `audio_1_Pooling/` + `audio_2_Dense/`: mean pooling and 1024 -> 768 projection (`audio_proj`).
Modified files:
- `README.md`: switched `library_name` to `sentence-transformers`, added a "Using Sentence Transformers" subsection at the top of "How To Use It", and kept the existing `hf_st_mm` snippet under a sibling subsection for backward compatibility.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True)

text_embeddings = model.encode(
    [
        "Martin Luther King Jr. delivering his I have a dream speech",
        "two cats sleeping side by side on a pink couch",
    ]
)
image_embeddings = model.encode(
    [
        "http://images.cocodataset.org/val2017/000000039769.jpg",  # two cats on a pink couch
        "http://images.cocodataset.org/val2017/000000000139.jpg",  # distractor
    ]
)
audio_embeddings = model.encode(
    [
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",  # MLK speech
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/i-know-kung-fu.mp3",  # distractor
    ]
)
print(text_embeddings.shape, image_embeddings.shape, audio_embeddings.shape)
# (2, 768) (2, 768) (2, 768)

print(model.similarity(text_embeddings, image_embeddings))
# tensor([[0.0704, 0.0121],   # MLK text: neither image matches
#         [0.5532, 0.3070]])  # cats text: the cats photo wins

print(model.similarity(text_embeddings, audio_embeddings))
# tensor([[ 0.2186, 0.1428],   # MLK text: the MLK audio wins
#         [-0.0625, 0.0667]])  # cats text: neither audio matches
```
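One way to turn such a similarity matrix into retrieval decisions is to take the top-scoring column per row and reject it when the score is too low. The sketch below does that on the text-to-image scores printed above. The helper `best_matches` is hypothetical, and the `0.2` cutoff is an arbitrary value chosen only to illustrate the idea, not a recommended threshold.

```python
# Scores copied from the printed text-to-image similarity matrix.
text_to_image = [
    [0.0704, 0.0121],  # MLK text vs. (cats photo, distractor)
    [0.5532, 0.3070],  # cats text vs. (cats photo, distractor)
]


def best_matches(similarities, min_score):
    """For each row, return the index of the top-scoring column, or None if below min_score."""
    results = []
    for row in similarities:
        best = max(range(len(row)), key=row.__getitem__)
        results.append(best if row[best] >= min_score else None)
    return results


# min_score=0.2 is an illustrative assumption only.
print(best_matches(text_to_image, min_score=0.2))  # [None, 0]
```

With this cutoff, the MLK text matches no image and the cats text matches the first image, in line with the comments on the printed matrix.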
Note that none of the existing behaviour is affected. model.pt, config.json, and src/hf_st_mm/... are all unchanged, so the original MultiModalSentenceEmbedder snippet in the README continues to work as before. This PR only adds an additional way to run the model through the Sentence Transformers API.
Note that this requires https://github.com/huggingface/sentence-transformers/pull/3749, which I intend to release very soon in Sentence Transformers v5.5.0, so this'll be a draft until then, also to show you what Sentence Transformers can do nowadays.
Happy to tweak anything you'd like changed. Please let me know if you have any questions or feedback!
- Tom Aarsen
Thanks @tomaarsen Tom! PTAL @HuaminChen