Reranker

#30

by Totole - opened Mar 20, 2024

Hi, thanks a lot for your work!

Two questions:

1. Does model.compute_score(sentence_pairs, max_passage_length, weights_for_different_modes) simply compute a score (e.g. cosine similarity) from the embeddings (dense, sparse, colbert) produced by the model? In other words, is it cross-encoding or bi-encoding?
2. Why does the max_length_token of this model seem to be 514 and not 8000?

Beijing Academy of Artificial Intelligence org Mar 21, 2024

Thanks for your interest in our work!

1. bge-m3 is a bi-encoder model. Its compute_score function combines the scores from the different embedding modes (dense, sparse, colbert).
2. The max length is 8192. You can see it in the config: https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json
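
To make the bi-encoding point concrete, here is a minimal pure-Python sketch of how per-mode scores could be computed from separately encoded query and passage representations and then combined. It is an illustration under stated assumptions, not the library's actual implementation: the function names, the toy vectors, the default weights, and the weighted-average formula are all hypothetical.

```python
import math

def dense_score(q, p):
    """Cosine similarity between one query vector and one passage vector."""
    dot = sum(a * b for a, b in zip(q, p))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_p = math.sqrt(sum(b * b for b in p))
    return dot / (norm_q * norm_p)

def sparse_score(q_weights, p_weights):
    """Lexical match: sum of weight products over tokens present on both sides."""
    return sum(w * p_weights[tok] for tok, w in q_weights.items() if tok in p_weights)

def colbert_score(q_vecs, p_vecs):
    """Late interaction: best passage-token match per query token, averaged."""
    best = [max(dense_score(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)

def combined_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three scores (assumed formula and weights)."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense * dense + w_sparse * sparse + w_colbert * colbert
    return total / (w_dense + w_sparse + w_colbert)

# Toy inputs standing in for model outputs (hypothetical values).
d = dense_score([1.0, 0.0], [1.0, 0.0])
s = sparse_score({"what": 0.1, "bge": 0.5}, {"bge": 0.4, "model": 0.3})
c = colbert_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(d, s, c, combined_score(d, s, c))
```

The query and the passage are encoded independently and only their outputs are compared, which is what makes this bi-encoding. A cross-encoder (reranker) would instead feed the query and passage through the model together as a single input.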

In addition, we have released some new rerankers (cross-encoders): https://huggingface.co/BAAI/bge-reranker-v2-m3#model-list . Feel free to use them and share your feedback.

Thanks!
I get an error when computing scores for queries above 514 tokens with the model.compute_score function (but not with model.encode).

Here is my code

And the call

I get better results (in French) with the Embedder than with the Reranker :) I'll keep you posted.

Beijing Academy of Artificial Intelligence org Mar 21, 2024

Hello, I need more detailed information about the error.

Can you run the code here successfully?
Maybe you can paste your full code here, and then I will test it to see if this error can be reproduced.

For some very strange reason, it works on Colab but not on Azure ML...
