Instructions to use minishlab/M2V_multilingual_output with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Model2Vec
How to use minishlab/M2V_multilingual_output with Model2Vec:
```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_multilingual_output")
```
- sentence-transformers
How to use minishlab/M2V_multilingual_output with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("minishlab/M2V_multilingual_output")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
- Notebooks
- Google Colab
- Kaggle
Question on max_seq_length
Does this model have the same max_seq_length as LaBSE (256) or can you go beyond this?
Thank you.
Hi, this model does not have a max_seq_length limit. It uses static embeddings, so you can process documents of arbitrary length with it. To do this, set max_length to None, e.g. `embeddings = model.encode(["Example sentence"], max_length=None)`, and it will process input of whatever length you provide.
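To see why a static-embedding model has no inherent length limit, here is a toy sketch (an illustration of the principle with made-up NumPy data, not model2vec's actual internals): each token maps to a fixed vector, and the document embedding is just a mean pool over those vectors, so any number of tokens yields the same fixed-size output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static-embedding table: one fixed vector per vocabulary id.
# (Illustrative only -- not the real model2vec lookup table.)
vocab_size, dim = 1000, 8
embedding_table = rng.normal(size=(vocab_size, dim))

def embed(token_ids):
    """Mean-pool the per-token static vectors.

    There is no positional encoding, so nothing caps the input length.
    """
    return embedding_table[token_ids].mean(axis=0)

short_doc = embed(rng.integers(0, vocab_size, size=10))       # 10 tokens
long_doc = embed(rng.integers(0, vocab_size, size=100_000))   # 100k tokens

# Both inputs produce a vector of the same dimensionality.
print(short_doc.shape, long_doc.shape)  # (8,) (8,)
```

This is what distinguishes static models from transformer encoders like LaBSE, whose positional embeddings impose a hard max_seq_length.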
Thank you for the quick response.
Will this affect the quality of the embedding?
That's hard to say; we have not yet run extensive experiments on long documents. Most of our benchmarks (MTEB) used documents under 512 tokens. We plan to experiment with this in the future.
It does affect the quality. Very long inputs with millions of tokens produce almost useless embeddings (as with standard models, the longer the input, the poorer the quality). I wrote a bit about it in the comments here: https://www.linkedin.com/posts/dominik-weckm%C3%BCller_from-days-to-seconds-creating-embeddings-activity-7255095750496321537-WwI2?utm_source=share&utm_medium=member_desktop. I will write up my findings soon.
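The dilution effect described above can be sketched with a deterministic toy example (made-up two-dimensional vectors, not real model outputs): when a mean-pooled document grows with off-topic filler tokens, the few on-topic token vectors get averaged away, and the document's cosine similarity to the topic direction collapses toward zero.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy setup: an on-topic direction and an orthogonal "filler" direction.
topic = np.array([1.0, 0.0])
filler = np.array([0.0, 1.0])

# A mean-pooled document containing the same 10 on-topic tokens,
# buried under a growing number of filler tokens.
for n_filler in (0, 100, 10_000):
    doc = (10 * topic + n_filler * filler) / (10 + n_filler)
    print(n_filler, round(cosine(topic, doc), 4))
```

With no filler the similarity is 1.0; with 10,000 filler tokens the on-topic signal is almost entirely washed out, which is the same mechanism that degrades mean-pooled embeddings of very long real documents.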