Update README.md
README.md CHANGED
````diff
@@ -13,7 +13,7 @@ task_categories:
 - retrieval
 - clustering
 tags:
--
+- Modotte
 - SparkEmbedding
 - Embedding
 - embedding
@@ -27,7 +27,7 @@ annotations_creators:
 - machine-generated
 - expert-verified
 source_datasets:
--
+- Modotte internal synthetic generation
 multilinguality:
 - multilingual
 ---
@@ -46,7 +46,7 @@ multilinguality:
 
 
 ### Description
-SparkEmbedding-300m is a 300 million parameter multilingual text embedding model with **SoTA cross-lingual retrieval** developed by the
+SparkEmbedding-300m is a 300 million parameter multilingual text embedding model with **SoTA cross-lingual retrieval** developed by the Modotte team. Fine-tuned from Google's EmbeddingGemma-300m, it incorporates an additional 1 million curated samples across 119 languages (all 22 Indian languages included), emphasizing data complexity, linguistic diversity, and deep language understanding. This optimization enhances cross-lingual retrieval, producing embeddings with superior semantic alignment and efficacy in multilingual settings.
 
 The model generates high-dimensional vector representations capturing rich semantic and contextual information, excelling in bridging linguistic gaps for applications like global information retrieval, multilingual question answering, and cross-language semantic search. With a native 2048-token context window, it handles extended inputs (e.g., full articles or documents) while preserving long-range dependencies.
 
@@ -86,7 +86,7 @@ from sentence_transformers import SentenceTransformer
 import torch
 import numpy as np
 
-model = SentenceTransformer("
+model = SentenceTransformer("Modotte/SparkEmbedding-300m", device='cuda' if torch.cuda.is_available() else 'cpu')
 
 query = "How does artificial intelligence impact global economies?"  # English
 corpus = [
@@ -193,12 +193,12 @@ Qualitative: Tight t-SNE clustering for parallels; excels in complex/mixed-language queries.
 
 ### Citation
 ```bibtex
-@misc{
+@misc{Modotte_sparkembedding_2025,
   title={SparkEmbedding-300m: A Fine-Tuned Multilingual Embedding Model for Cross-Lingual Retrieval},
   author={Parvesh Rawal},
   publisher={Hugging Face},
   year={2025},
-  url={https://huggingface.co/
+  url={https://huggingface.co/Modotte/SparkEmbedding-300m}
 }
 ```
````
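The usage snippet touched by the fourth hunk embeds a query and a corpus and then ranks corpus entries by similarity to the query. That ranking step can be sketched without downloading the model — a minimal illustration using toy NumPy vectors in place of real SparkEmbedding outputs (the `cosine_rank` helper and the 4-dimensional vectors are illustrative assumptions, not part of the README):

```python
import numpy as np

def cosine_rank(query_vec, corpus_vecs):
    """Return corpus indices sorted best-first by cosine similarity, plus the scores."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q              # one cosine score per corpus row
    order = np.argsort(-scores)  # highest similarity first
    return order, scores

# Toy 4-dimensional "embeddings" standing in for model output
# (real SparkEmbedding vectors are much higher-dimensional).
query = np.array([1.0, 0.0, 1.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 1.1, 0.0],  # near-duplicate of the query
    [0.0, 1.0, 0.0, 1.0],  # orthogonal, unrelated
    [1.0, 1.0, 1.0, 1.0],  # partial overlap
])
order, scores = cosine_rank(query, corpus)
print(order[0])  # → 0 (the near-duplicate ranks first)
```

With real embeddings the flow is the same: `model.encode(query)` and `model.encode(corpus)` produce the vectors, and cosine ranking (or `sentence_transformers.util.cos_sim`) selects the best cross-lingual matches.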