Divergence with original model nearing max. sequence length?

by tomaarsen - opened Apr 7, 2025

Apr 7, 2025

Hello!

In my quick tests with the original model, it has some very specific rules when it comes to maximum sequence lengths and truncation that make it tricky to implement an AutoTokenizer that matches 100%. My understanding is that this model technically also differs when you reach the maximum sequence length, is that true?

Granted, the maximum sequence length is really high: something like 32k tokens, presumably based on the position embeddings.

Also, congratulations on your BEI release with Baseten! It looks very solid.

Tom Aarsen

michaelfeil

Owner Apr 7, 2025

Hey Tom,

Agreed! There is the following rule: the max_length of query + document is set to 8192 (or 32768 if none is set). https://github.com/mixedbread-ai/mxbai-rerank/blob/7592f2d37db7d2dcc9627ad6dd002e8f4d4cc82b/mxbai_rerank/mxbai_rerank_v2.py#L79

The system prompt is maybe 100 tokens, the query is truncated to 6k tokens, and the rest of the budget goes to the document.
Also, query and document are split by the \n token, which is tokenized and then appended (i.e., if you tokenize, detokenize, and tokenize again, the result would be different).
BAAI's LLM reranker has similar policies: https://github.com/FlagOpen/FlagEmbedding/blob/d5292b68758f41c7911fe85596cdd0329901a3a5/FlagEmbedding/inference/reranker/decoder_only/base.py#L385
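
Roughly, the budgeting above could be sketched like this. This is only an illustration, not the actual mxbai-rerank code: the function name `build_input_ids`, the piece order (prompt, query, separator, document), and reading "6k" as exactly 6000 tokens are all assumptions on my side. It works on plain lists of token IDs, so no tokenizer is needed to try it.

```python
from typing import List


def build_input_ids(
    prompt_ids: List[int],
    query_ids: List[int],
    doc_ids: List[int],
    sep_ids: List[int],
    max_length: int = 8192,        # total budget mentioned in the thread
    max_query_length: int = 6000,  # assumption: "6k" read as 6000 tokens
) -> List[int]:
    """Hypothetical helper: truncate the query first, then give the rest to the document."""
    # 1) The query is capped independently of the total budget.
    query_ids = query_ids[:max_query_length]

    # 2) Whatever remains after prompt, query, and separator goes to the document.
    used = len(prompt_ids) + len(query_ids) + len(sep_ids)
    doc_budget = max(max_length - used, 0)
    doc_ids = doc_ids[:doc_budget]

    # 3) Pieces are concatenated as token IDs, not as text, which is why
    #    decoding and re-tokenizing the result may give different IDs.
    return prompt_ids + query_ids + sep_ids + doc_ids


# Toy example with made-up token IDs.
prompt = [0] * 100           # stand-in for a ~100 token system prompt
query = list(range(7000))    # longer than the query cap
document = [1] * 10000       # longer than the remaining budget
separator = [9]              # stand-in for the separately tokenized "\n"

input_ids = build_input_ids(prompt, query, document, separator)
print(len(input_ids))
```

Since truncation happens per piece before concatenation, a single `tokenizer(text, truncation=True, max_length=...)` call over the joined string would not necessarily cut at the same places.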

Not sure if any of this helps, let me know!

Thanks!

tomaarsen

Apr 7, 2025

Very interesting! Thanks for sharing. These LLM-style reranker formats are quite tricky in terms of tokenization.

Tom Aarsen

michaelfeil

Owner Apr 7, 2025

Yeah, at least with the "ForSequenceClassification" variants they should be quite easy to run, e.g. with vLLM or TEI.
