---
datasets:
- unicamp-dl/mmarco
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
widget: []
base_model:
- BAAI/bge-m3
---

# BGE-m3 RU mMARCO/v2 Native Queries

This is a [BGE-M3](https://huggingface.co/BAAI/bge-m3) model post-trained on the Russian subset of the mMARCO/v2 dataset.

The model was used for the SIGIR 2025 short paper "Lost in Transliteration: Bridging the Script Gap in Neural IR".

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) (Russian subset)
- **Language:** Russian
- **License:** MIT
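The similarity function listed above is cosine similarity over the 1024-dimensional embeddings. A minimal NumPy sketch of that scoring step (the vectors here are random stand-ins for real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors: dot product over the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
q = rng.standard_normal(1024)  # stand-in for a query embedding
d = rng.standard_normal(1024)  # stand-in for a document embedding

score = cosine_sim(q, d)       # a value in [-1, 1]
assert -1.0 <= score <= 1.0
assert abs(cosine_sim(q, q) - 1.0) < 1e-9
```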

## Training Details

### Framework Versions

- Python: 3.10.13
- Sentence Transformers: 3.1.1
- Transformers: 4.45.1
- PyTorch: 2.4.1
- Accelerate: 0.34.2
- Datasets: 3.0.1
- Tokenizers: 0.20.3