List of supported languages

#34
by patlilt - opened

Hi!

Is there a list of languages supported by the model?

Thanks

Hi @patlilt ,

Welcome to Gemma models, thanks for reaching out to us.
The google/embeddinggemma-300m model is designed for broad application as a multilingual embedding model. It has been trained on data covering 100+ spoken languages globally, making it suitable for tasks like cross-lingual retrieval and semantic search.
However, an exact, exhaustive list of all 100+ languages is not available. You can find more technical details and usage examples on the developer docs page.

Thanks.

Thanks @BalakrishnaCh, any reason that list can't be shared?

Have interest in it as well. People may spend significant resources without realizing their embedding pipeline has been malfunctioning all along.

Hey,

Thank you for raising this, the concern about investing resources without clear language coverage is completely valid, especially for production systems where silent failures can be costly.
For EmbeddingGemma-300m, we don't provide a fixed supported languages list because language support isn't binary. The model is multilingual and was trained on a mixture covering 100+ languages, but performance varies along a spectrum depending on the quantity and nature of training data for a given language.

Rather than relying on a static list, it's recommended to validate performance in ways that directly reflect your production requirements:

  1. Check Multilingual Benchmark Results: The model has been evaluated on multilingual benchmarks such as MTEB Multilingual v2, which include language- and task-specific breakdowns. These results can help estimate expected performance for languages similar to yours.
  2. Inspect Tokeniser Coverage: The model uses a large shared multilingual tokeniser designed to handle a wide range of scripts. Reviewing the tokeniser vocabulary can help confirm that your language's characters and common subwords are well represented, a necessary foundation for good performance.
  3. Run a Lightweight Semantic Validation: Before scaling, it's recommended to embed a small validation set in your target language and evaluate cosine similarity or retrieval quality. Instead of relying on a fixed similarity threshold, compare relative alignment between clearly related and clearly unrelated pairs. This inexpensive check can quickly surface potential issues prior to full deployment.
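
The relative-alignment check in step 3 can be sketched in a few lines of plain Python. The vectors below are toy placeholders standing in for real model output; in practice the embeddings would come from the model itself (for example via a sentence-embedding library, which is an assumption here and not shown):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def relative_alignment(related_sims, unrelated_sims):
    """Compare related vs. unrelated pairs instead of a fixed threshold:
    the check passes when the *lowest* related-pair similarity still
    exceeds the *highest* unrelated-pair similarity."""
    return min(related_sims) > max(unrelated_sims)

# Toy embeddings in place of real model output (placeholder values).
related_pairs = [cosine([1.0, 0.1], [0.9, 0.2]),   # paraphrase pair
                 cosine([0.2, 1.0], [0.1, 0.9])]   # paraphrase pair
unrelated_pairs = [cosine([1.0, 0.0], [0.0, 1.0])] # unrelated topics

print(relative_alignment(related_pairs, unrelated_pairs))  # True
```

If the separation between related and unrelated pairs collapses in your target language, that's exactly the kind of silent failure worth catching before deployment.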

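For step 2, a minimal character-level coverage check looks like the sketch below. The vocabulary here is a deliberately tiny placeholder; with the real model you would build the character set from the actual tokeniser vocabulary instead (an assumption, not shown):

```python
def char_coverage(text: str, vocab_chars: set) -> float:
    """Fraction of distinct non-space characters in `text` that appear
    in the vocabulary's character set. Low coverage suggests the script
    is poorly represented and tokens will fall back to byte pieces."""
    chars = {c for c in text if not c.isspace()}
    if not chars:
        return 1.0
    return len(chars & vocab_chars) / len(chars)

# Toy vocabulary covering Latin letters only (placeholder).
toy_vocab = set("abcdefghijklmnopqrstuvwxyz")

print(char_coverage("hello world", toy_vocab))  # 1.0
print(char_coverage("नमस्ते", toy_vocab))        # 0.0
```

This is only a coarse first filter: full coverage of a script's characters is necessary but not sufficient for good embedding quality, which is why the semantic validation in step 3 still matters.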
We recognise that greater transparency around training mixtures is helpful for planning and we are continuing to improve documentation in that direction.
Thank you!
