Instructions to use nvidia/NV-Embed-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
  - sentence-transformers
How to use nvidia/NV-Embed-v1 with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/NV-Embed-v1", trust_remote_code=True)

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```

- Notebooks
- Google Colab
- Kaggle
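For reference, `model.similarity` in sentence-transformers defaults to cosine similarity between embedding rows. A minimal sketch of that computation, using small made-up vectors rather than real NV-Embed-v1 outputs:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

# Toy 3 x 4 "embeddings" standing in for model.encode(sentences).
emb = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.1, 0.8, 0.3, 0.1],
    [0.9, 0.1, 0.0, 0.2],
])

sims = cosine_similarity_matrix(emb)
print(sims.shape)  # (3, 3), one row/column per sentence
```

Each diagonal entry is 1.0 (every vector is maximally similar to itself), and off-diagonal entries rank how close the sentence pairs are.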
Multi-Lingual? #10
by dejanseo - opened
The tokenizer suggests a multilingual vocabulary. It would be interesting to hear more details about how much of your training data was non-English, and whether this is all just identical to the original Mistral. I will put it to the test soon on a large multilingual website, finding related pages for internal link recommendations.
I did test it, but honestly I can't tell the difference in embedding quality between NV-Embed-v1 and LaBSE. In fact, I think LaBSE is a little better at similarity mapping.