Integrate with Transformers v5 and Sentence Transformers v5.4

#2
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate eager-embed-v1 with Sentence Transformers v5.4
  • Restore training-consistent embeddings under transformers 5.x via a thin custom modeling class, fully backwards compatible with existing user code

Details

While integrating, I found a subtle but important behavioral change in transformers. The model was trained with transformers==4.57.1 using model_outputs.hidden_states[-1] for embedding extraction. In 4.57.1, that field on Qwen3VLForConditionalGeneration was the state of the text decoder BEFORE the model's final RMSNorm. In transformers 5.x the capture logic was changed (the tie_last_hidden_states default in the new capture_outputs decorator), so hidden_states[-1] now equals the post-norm last_hidden_state. Anyone running the README example on a modern transformers install silently gets a different embedding space than the one the model was optimized for (ranking is preserved, but absolute scores differ materially: 0.29 vs 0.23 on the README's query/text-1 example).
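To illustrate why the pre-norm and post-norm states are not interchangeable, here is a minimal sketch in plain Python with made-up numbers. Cosine similarity ignores the per-vector rescaling that RMSNorm applies, but not its per-dimension gain, so scores shift as soon as that gain is non-uniform. The vectors and the learned_gain values are assumptions for illustration only; they are not taken from the model.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale x to unit root-mean-square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

# Toy "pre-norm" hidden states for a query and a document (made-up numbers).
query = [0.5, -1.0, 2.0, 0.1]
doc = [0.4, -0.8, 1.0, 1.5]

uniform_gain = [1.0, 1.0, 1.0, 1.0]  # norm only rescales each vector
learned_gain = [0.2, 3.0, 0.5, 1.7]  # hypothetical learned gain, changes direction

pre_norm_score = cosine(query, doc)
uniform_score = cosine(rms_norm(query, uniform_gain), rms_norm(doc, uniform_gain))
learned_score = cosine(rms_norm(query, learned_gain), rms_norm(doc, learned_gain))

print(f"pre-norm:            {pre_norm_score:.4f}")
print(f"post-norm (uniform): {uniform_score:.4f}")
print(f"post-norm (learned): {learned_score:.4f}")
```

With a uniform gain the score matches the pre-norm score; with a non-uniform gain it does not, which is the kind of shift described above.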

To fix this, I added a small modeling_eager_embed.py with two thin subclasses (EagerEmbedModel and EagerEmbedForConditionalGeneration) that replace the text model's final RMSNorm with nn.Identity(). They are registered via auto_map for AutoModel and AutoModelForImageTextToText, so both the Sentence Transformers path and the AutoModelForImageTextToText path produce the exact pre-norm representation the model was trained on, under any transformers version. The swap only takes effect with trust_remote_code=True. The architectures field in config.json is unchanged (still Qwen3VLForConditionalGeneration), so existing code that calls Qwen3VLForConditionalGeneration.from_pretrained(...) or AutoModel.from_pretrained(...) without trust_remote_code keeps its original behaviour.
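As a rough sketch of the opt-in, the snippet below builds an assumed excerpt of config.json and a toy selector. The "module.ClassName" values are inferred from the file and class names in this PR, and pick_custom_class is a hypothetical helper that simplifies the behaviour described above; it is not the resolution logic inside transformers.

```python
import json

# Assumed excerpt of config.json after this PR. The class paths are inferred
# from the names in this PR; the real file contains many more fields.
config_excerpt = {
    "architectures": ["Qwen3VLForConditionalGeneration"],  # unchanged
    "auto_map": {
        "AutoModel": "modeling_eager_embed.EagerEmbedModel",
        "AutoModelForImageTextToText": "modeling_eager_embed.EagerEmbedForConditionalGeneration",
    },
}

def pick_custom_class(config, auto_class, trust_remote_code=False):
    """Hypothetical helper: return the auto_map target only when remote code is trusted.

    Returning None stands for "fall back to the library's built-in class".
    """
    if not trust_remote_code:
        return None
    return config.get("auto_map", {}).get(auto_class)

print(json.dumps(config_excerpt["auto_map"], indent=2))
print(pick_custom_class(config_excerpt, "AutoModel", trust_remote_code=True))
print(pick_custom_class(config_excerpt, "AutoModel"))
```

The point of the design is that the custom classes are only reachable through an explicit opt-in, so loading without trust_remote_code never touches them.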

I also extended the chat template with a new optional add_embedding_token flag that, when set, appends <|endoftext|> after the generation prompt. The existing add_generation_prompt conditional is unchanged, so code that does apply_chat_template(... add_generation_prompt=True) + "<|endoftext|>" continues to produce the exact same string. Sentence Transformers opts into the new flag via processing_kwargs.chat_template.add_embedding_token=true in sentence_bert_config.json.
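The equivalence between the old manual suffix and the new flag can be sketched in plain Python. render_prompt is a hand-written stand-in for the template's plain-user path, not the actual Jinja template, and the ChatML-style layout it produces is an assumption.

```python
def render_prompt(user_text, add_generation_prompt=False, add_embedding_token=False):
    """Hypothetical stand-in for the chat template's plain-user path."""
    prompt = f"<|im_start|>user\n{user_text}<|im_end|>\n"
    if add_generation_prompt:
        # Existing conditional, unchanged by this PR.
        prompt += "<|im_start|>assistant\n"
    if add_embedding_token:
        # New optional conditional: append the token used for last-token pooling.
        prompt += "<|endoftext|>"
    return prompt

text = "Query: What is the capital city of Uruguay?"

# Existing user code: append the token by hand.
legacy = render_prompt(text, add_generation_prompt=True) + "<|endoftext|>"

# New opt-in path: let the template append it.
opt_in = render_prompt(text, add_generation_prompt=True, add_embedding_token=True)

print(legacy == opt_in)
```

Because the flag defaults to off in this sketch, callers that never pass it get the same string as before.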

Finally, I added {"query": "Query: ", "document": ""} to the prompts dict in config_sentence_transformers.json so Sentence Transformers users can call model.encode_query(...) / model.encode_document(...) without manually prefixing "Query: " to their queries. For message-format models, ST injects prompts as a system-role message, so I also adjusted the chat template: when a system message is present (and no tools are declared), its text is inlined into the user message instead of being rendered as a separate <|im_start|>system\n...<|im_end|>\n block. This matches the flat-text tokenization the model was trained on. The plain-user-only path (no system message) is bit-for-bit identical to the original template.
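A minimal sketch of the inlining step, operating on message dicts rather than on the Jinja template itself. inline_system is a hypothetical helper, and joining the system text and user text by direct concatenation (no separator) is an assumption based on the "Query: " prefix already ending in a space.

```python
def inline_system(messages):
    """Hypothetical helper: fold a leading system message into the first user message."""
    if len(messages) >= 2 and messages[0]["role"] == "system" and messages[1]["role"] == "user":
        merged = {
            "role": "user",
            # Assumption: plain concatenation, matching the flat "Query: ..." text.
            "content": messages[0]["content"] + messages[1]["content"],
        }
        return [merged] + messages[2:]
    # No leading system message: leave the conversation untouched.
    return messages

# What the prompt injection described above would look like for a query.
with_prompt = [
    {"role": "system", "content": "Query: "},
    {"role": "user", "content": "What is the capital city of Uruguay?"},
]
plain = [{"role": "user", "content": "What is the capital city of Uruguay?"}]

print(inline_system(with_prompt))
print(inline_system(plain) == plain)
```

The second call shows the no-op path: without a system message the input is returned as-is.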

Added files:

  • modeling_eager_embed.py: custom EagerEmbedModel / EagerEmbedForConditionalGeneration classes that make the text model's final RMSNorm a no-op to restore pre-norm embeddings
  • modules.json: ST module pipeline (Transformer, Pooling, Normalize)
  • config_sentence_transformers.json: ST model config with cosine similarity and {"query": "Query: ", "document": ""} prompts so model.encode_query(...) prepends the training-time prefix automatically
  • sentence_bert_config.json: transformer task, modality config for text/image/video/message, processing kwargs with add_generation_prompt and add_embedding_token for the chat template, unpad_inputs: false
  • 1_Pooling/config.json: lasttoken pooling at dimension 2560
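To make the Pooling and Normalize stages of that pipeline concrete, here is a small plain-Python sketch of last-token pooling followed by L2 normalization. It uses toy 2-dimensional vectors instead of the model's 2560 dimensions, and the function names are illustrative, not the Sentence Transformers implementation.

```python
import math

def last_token_pool(token_embeddings, attention_mask):
    """Return the embedding of the last non-padding token."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_embeddings[last]

def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy per-token outputs for one right-padded sequence (last position is padding).
token_embeddings = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = last_token_pool(token_embeddings, attention_mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Picking the last index where the mask is 1 keeps the sketch correct for both left-padded and right-padded inputs.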

Modified files:

  • config.json: added auto_map pointing AutoModel and AutoModelForImageTextToText at the new classes; architectures unchanged for backwards compatibility
  • chat_template.jinja: added an optional add_embedding_token conditional that appends <|endoftext|>; the original add_generation_prompt conditional is unchanged; in the non-tools path, a leading system message is inlined into the user message instead of being wrapped in a <|im_start|>system\n...<|im_end|>\n block (no-op when no system message is present, so plain-user inputs are bit-for-bit identical to before)
  • README.md: added sentence-transformers tag; added a "Using Sentence Transformers" section with text and image retrieval examples; updated the "Using transformers" example to use AutoModelForImageTextToText with trust_remote_code=True so it loads the no-op-norm subclass and therefore also produces training-consistent embeddings under transformers 5.x; added expected-output comments to every code snippet
```python
import requests
from io import BytesIO
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("eagerworks/eager-embed-v1", revision="refs/pr/2", trust_remote_code=True)

# Multilingual text retrieval
# `encode_query` automatically prepends the "Query: " prefix the model was trained on.
queries = ["What is the capital city of Uruguay?"]
documents = [
    "Montevideo es la capital y la ciudad más poblada de la República Oriental del Uruguay, así como la capital del departamento homónimo",
    "El río Uruguay es un río internacional que forma parte de la cuenca del Plata. Nace en Brasil, recorre unos 1.800 km y desemboca en el Río de la Plata",
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 2560) (2, 2560)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.2907, 0.1573]])

# Image document retrieval
MAX_IMAGE_SIZE = 784

def fetch_image(url):
    img = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
    return img.resize((MAX_IMAGE_SIZE, MAX_IMAGE_SIZE))

queries = ["Where can we find the animal llama?"]
documents = [
    fetch_image("https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/animal-llama.png"),
    fetch_image("https://huggingface.co/Tevatron/dse-phi3-docmatix-v2/resolve/main/meta-llama.png"),
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 2560) (2, 2560)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.2709, 0.0930]])
```

Note that none of the existing behaviour changes. Existing code paths (direct Qwen3VLForConditionalGeneration imports, AutoModel without trust_remote_code, apply_chat_template(... add_generation_prompt=True) + "<|endoftext|>") all produce bit-for-bit identical output to before. The custom modeling class and the new template flag are purely additive opt-ins that restore training-consistent behaviour for Sentence Transformers users and for users on transformers 5.x.

  • Tom Aarsen
tomaarsen changed pull request status to open
