Draft integration with Sentence Transformers v5.5.0

by tomaarsen HF Staff - opened May 4

base: refs/heads/main ← from: refs/pr/1

Diff: +700 -5

tomaarsen, May 4 (edited May 4)

Hello!

Pull Request overview

Integrate multi-modal-embed-large with the upcoming Sentence Transformers v5.5 via a Router-based pipeline plus two small custom Module classes (each a thin Transformer subclass).

Details

The integration adds a Sentence Transformers entry point alongside the existing hf_st_mm packaged source. Loading via SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True) now produces a Router model with three routes that mirror the original tri-encoder:

text route: Transformer(mmbert) -> Pooling(mean) -> Normalize
image route: SiglipVisionTransformer -> Pooling(mean) -> Dense(1152, 768) -> Normalize
audio route: WhisperEncoderTransformer -> Pooling(mean) -> Dense(1024, 768) -> Normalize

Routes are picked automatically by ST's modality inference (plain string -> text, image path / URL / PIL image -> image, audio path / URL / NumPy array -> audio). The task argument is optional and only needed when the input doesn't carry a recognizable extension or PIL/array type.
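To make the routing rule above concrete, here is a minimal sketch of extension-based modality inference for string inputs. It is not the Sentence Transformers implementation: `infer_modality` and both extension sets are hypothetical names picked for illustration, and non-string inputs (PIL images, NumPy arrays) are left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension tables, chosen for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def infer_modality(value, task=None):
    """Guess a route name ("text", "image" or "audio") for one string input.

    An explicit task always wins. Otherwise the string is classified by the
    file extension of its path or URL, falling back to "text".
    """
    if task is not None:
        return task
    # urlparse handles both URLs and plain paths; only the path part matters.
    suffix = PurePosixPath(urlparse(value).path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return "text"


for example in [
    "two cats sleeping side by side on a pink couch",
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "call.wav",
]:
    print(infer_modality(example), "<-", example)
```

The `task` override in the sketch corresponds to the case described above, where an input has no recognizable extension and the route has to be named explicitly.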

A few non-obvious choices:

The image route uses a tiny Transformer subclass (SiglipVisionTransformer) because the trained projection was fit against vision_model(...).last_hidden_state[:, 1:].mean(dim=1) (i.e. the mean of all patch tokens with the first one dropped). The standard Transformer module has no way to apply the [:, 1:] slicing before pooling. Beyond that, the subclass loads only the vision model. Everything else (preprocess, save, load, processor wiring) is inherited from the parent Transformer.
The audio route uses an even smaller Transformer subclass (WhisperEncoderTransformer) whose only job is to decode audio file paths/URLs into raw waveforms via transformers.audio_utils.load_audio before delegating to the parent's preprocess. WhisperFeatureExtractor (and audio feature extractors generally) in transformers only accept arrays, not paths, so without this subclass, model.encode("call.wav") would fail. The subclass should be removable once transformers gains native path loading at the feature-extractor / processor layer.
The integration is config-only: model.pt is left untouched, and the existing hf_st_mm usage path continues to work unchanged. That said, I would recommend switching fully to the Sentence Transformers format, though I'm a bit biased.
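The difference between plain mean pooling and the `[:, 1:]` variant from the first point can be shown with a framework-free sketch. The real code operates on batched tensors; here a single image is a plain list of token vectors, and all numbers are made up for illustration.

```python
from statistics import fmean


def mean_pool(tokens):
    """Average a list of equal-length token vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*tokens)]


# Toy "last_hidden_state" for one image: 4 tokens, hidden size 2.
last_hidden_state = [
    [10.0, 10.0],  # first token, excluded when the projection was trained
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

pooled_all = mean_pool(last_hidden_state)          # plain mean over every token
pooled_dropped = mean_pool(last_hidden_state[1:])  # mirrors [:, 1:].mean(dim=1)

print(pooled_all)      # [4.75, 5.5]
print(pooled_dropped)  # [3.0, 4.0]
```

The two results differ, which is why a projection fit on the sliced mean cannot simply be fed the output of plain mean pooling.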

Outputs were verified against the existing MultiModalSentenceEmbedder reference: the pairwise cosine-similarity matrix across two text inputs, two images, and two audio clips matches to six decimal places on the same hardware. The text route returns bfloat16 (mmbert ships in bf16); the image and audio routes return float32. Users wanting a uniform dtype can pass model_kwargs={"torch_dtype": torch.float32} at load time.
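A check of this kind can be sketched in plain Python as follows. This is not the script used for the PR: the helper names and the toy embeddings are hypothetical, and the real comparison would use the outputs of the two model implementations.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine-similarity matrix of a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in vector)) for vector in embeddings]
    return [
        [sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
         for v, norm_v in zip(embeddings, norms)]
        for u, norm_u in zip(embeddings, norms)
    ]


def matrices_match(left, right, decimals=6):
    """True if every pair of entries agrees to the given number of decimals."""
    tolerance = 0.5 * 10 ** -decimals
    return len(left) == len(right) and all(
        len(row_l) == len(row_r)
        and all(abs(a - b) <= tolerance for a, b in zip(row_l, row_r))
        for row_l, row_r in zip(left, right)
    )


# Toy stand-ins for the reference and the new implementation's embeddings.
reference = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
candidate = [[x + 1e-9 for x in vector] for vector in reference]  # tiny drift
drifted = [[x + 1e-3 for x in vector] for vector in reference]    # visible drift

reference_matrix = cosine_similarity_matrix(reference)
print(matrices_match(reference_matrix, cosine_similarity_matrix(candidate)))
print(matrices_match(reference_matrix, cosine_similarity_matrix(drifted)))
```

Comparing similarity matrices rather than raw embeddings keeps the check focused on the behaviour users rely on, namely relative similarity between inputs.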

Added files:

modeling_multimodal_embed.py: SiglipVisionTransformer and WhisperEncoderTransformer — thin Transformer subclasses used by the image and audio routes.
modules.json: top-level ST module list, pointing at the built-in Router.
router_config.json: per-route module structure and types (text / image / audio).
config_sentence_transformers.json: ST model-level config (model_type, similarity_fn_name, etc.).
text_0_Transformer/: ModernBERT (mmbert) weights, tokenizer, and ST module config.
text_1_Pooling/: mean pooling config for the text route.
image_0_SiglipVisionTransformer/: SigLIP2 vision-tower weights and image preprocessor.
image_1_Pooling/ + image_2_Dense/: mean pooling and 1152 -> 768 projection (image_proj).
audio_0_WhisperEncoderTransformer/: Whisper encoder weights, feature extractor, and ST module config.
audio_1_Pooling/ + audio_2_Dense/: mean pooling and 1024 -> 768 projection (audio_proj).
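The audio subclass listed above follows a decode-then-delegate pattern: turn path or URL strings into waveforms, then hand everything to the parent's preprocess. The sketch below illustrates that pattern only. Every class and function in it is a hypothetical stand-in, not the code in modeling_multimodal_embed.py, and the fake decoder replaces the transformers.audio_utils.load_audio call mentioned earlier.

```python
class ArrayOnlyEncoder:
    """Stand-in for a parent module whose preprocess only accepts waveforms."""

    def preprocess(self, inputs):
        for item in inputs:
            if isinstance(item, str):
                raise TypeError("expected a waveform, got a path")
        return {"num_clips": len(inputs), "num_samples": [len(w) for w in inputs]}


class PathAwareEncoder(ArrayOnlyEncoder):
    """Decodes path/URL strings into waveforms, then defers to the parent."""

    def __init__(self, load_audio):
        # load_audio: callable mapping a path or URL to a sequence of samples.
        self.load_audio = load_audio

    def preprocess(self, inputs):
        decoded = [self.load_audio(x) if isinstance(x, str) else x for x in inputs]
        return super().preprocess(decoded)


def fake_load_audio(path):
    """Hypothetical decoder: returns one second of silence at 16 kHz."""
    return [0.0] * 16000


encoder = PathAwareEncoder(fake_load_audio)
features = encoder.preprocess(["call.wav", [0.1, 0.2, 0.3]])
print(features)  # {'num_clips': 2, 'num_samples': [16000, 3]}
```

Because the subclass only overrides preprocess, it can be dropped without touching anything else once the parent accepts paths itself.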

Modified files:

README.md: switched library_name to sentence-transformers, added a "Using Sentence Transformers" subsection at the top of "How To Use It", and kept the existing hf_st_mm snippet under a sibling subsection for backward compatibility.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("llm-semantic-router/multi-modal-embed-large", trust_remote_code=True)

text_embeddings = model.encode(
    [
        "Martin Luther King Jr. delivering his I have a dream speech",
        "two cats sleeping side by side on a pink couch",
    ]
)
image_embeddings = model.encode(
    [
        "http://images.cocodataset.org/val2017/000000039769.jpg",  # two cats on a pink couch
        "http://images.cocodataset.org/val2017/000000000139.jpg",  # distractor
    ]
)
audio_embeddings = model.encode(
    [
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac",            # MLK speech
        "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/i-know-kung-fu.mp3",  # distractor
    ]
)

print(text_embeddings.shape, image_embeddings.shape, audio_embeddings.shape)
# (2, 768) (2, 768) (2, 768)

print(model.similarity(text_embeddings, image_embeddings))
# tensor([[0.0704, 0.0121],   # MLK text:  neither image matches
#         [0.5532, 0.3070]])  # cats text: the cats photo wins

print(model.similarity(text_embeddings, audio_embeddings))
# tensor([[ 0.2186,  0.1428],   # MLK text:  the MLK audio wins
#         [-0.0625,  0.0667]])  # cats text: neither audio matches
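To read the matrices above programmatically, one option is to take the best-scoring column per row and discard weak matches. The sketch below reuses the printed scores. The `best_matches` helper and the 0.15 cutoff are assumptions made for illustration; the cutoff is arbitrary and not a recommended value for this model.

```python
# Scores copied from the printed outputs above (rows: MLK text, cats text).
text_to_image = [[0.0704, 0.0121], [0.5532, 0.3070]]
text_to_audio = [[0.2186, 0.1428], [-0.0625, 0.0667]]


def best_matches(similarity_rows, labels, min_score=0.15):
    """Return the best label per row, or None if its score is below min_score."""
    results = []
    for row in similarity_rows:
        best_index = max(range(len(row)), key=row.__getitem__)
        results.append(labels[best_index] if row[best_index] >= min_score else None)
    return results


image_matches = best_matches(text_to_image, ["cats photo", "distractor image"])
audio_matches = best_matches(text_to_audio, ["MLK speech", "distractor audio"])

print(image_matches)  # [None, 'cats photo']
print(audio_matches)  # ['MLK speech', None]
```

Note that the winning text-to-audio score (0.2186) is much lower than the winning text-to-image score (0.5532), so a single cutoff across modality pairs should be treated with caution.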

Note that none of the existing behaviour is affected. model.pt, config.json, and src/hf_st_mm/... are all unchanged, so the original MultiModalSentenceEmbedder snippet in the README continues to work as before. This PR only adds an additional way to run the model through the Sentence Transformers API.

This requires https://github.com/huggingface/sentence-transformers/pull/3749, which I intend to release very soon as part of Sentence Transformers v5.5.0, so this PR will stay a draft until then. In the meantime, it also shows what Sentence Transformers can do nowadays.
Happy to tweak anything you'd like changed. Please let me know if you have any questions or feedback!

cc @Xunzhuo @HuaminChen

Tom Aarsen

Integrate with upcoming Sentence Transformers v5.5.0 (1a5fa14d)

tomaarsen changed pull request status to open May 12

Xunzhuo

vLLM Semantic Router org May 18

Thanks @tomaarsen Tom! PTAL @HuaminChen


Ready to merge

This branch is ready to get merged automatically.
