---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- tiny
- feature-extraction
- sentence-similarity
- retrieval
- sentence-transformers
- transformers
- mteb
datasets:
- IlyaGusev/gazeta
- zloelias/lenta-ru
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb
license: mit
base_model: sergeyzh/BERTA
---
A BERT model for text retrieval tasks. It was obtained by distilling [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) embeddings of Russian and English texts into [BERTA](https://huggingface.co/sergeyzh/BERTA).

Main characteristics of the model:

- embedding size: 768,
- context length: 512 tokens,
- layers: 12,
- prefixes: not required.
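
The card does not state which training objective was used for the distillation mentioned above. As a rough illustration only, the sketch below shows one common option, matching the pairwise cosine-similarity matrices of teacher and student, which works even though bge-m3 dense vectors are 1024-dimensional and this model's are 768-dimensional. The loss, the helper names, and the random placeholder embeddings are all assumptions, not the author's recipe.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def similarity_distillation_loss(student, teacher):
    """Hypothetical loss: mean squared difference between the
    pairwise cosine-similarity matrices of student and teacher.
    The two embedding widths do not have to match."""
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    return float(np.mean((s @ s.T - t @ t.T) ** 2))

# Random placeholders standing in for real embeddings of a batch of 4 texts
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 1024))  # teacher width (bge-m3 dense vectors)
student = rng.normal(size=(4, 768))   # student width (this model)

loss = similarity_distillation_loss(student, teacher)
print(loss)
```

In a real training loop the same quantity would be computed on framework tensors and minimized with respect to the student's weights.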
## Usage

```Python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer('sergeyzh/rubert-base-retriever')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
# Encode the raw texts; no query/passage prefixes are needed
embeddings = model.encode(sentences)
# Print the 3 x 3 matrix of pairwise similarities
print(model.similarity(embeddings, embeddings))
```
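
For retrieval, the similarity scores are typically used to rank documents against a query. The sketch below illustrates that ranking step with NumPy. To stay runnable without downloading the model, it uses small hand-written vectors in place of real `model.encode(...)` output; `cosine_top_k` is a hypothetical helper, not part of sentence-transformers.

```python
import numpy as np

def cosine_top_k(query_emb, doc_embs, k=2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:k]      # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]

# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings
doc_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_emb = np.array([0.9, 0.1, 0.0])

results = cosine_top_k(query_emb, doc_embs, k=2)
print(results)
```

With real data, `doc_embs` would be the encoded corpus and `query_emb` a single encoded query, both produced without prefixes.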
## Metrics

Model scores on text retrieval tasks for Russian:

| Model Name | MIRACL Reranking | MIRACL Retrieval | RiaNews Retrieval | RuBQ Reranking | RuBQ Retrieval | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| bge-m3 | 0.654 | 0.702 | 0.830 | 0.740 | 0.712 | 0.728 |
| BERTA | 0.643 | 0.676 | 0.816 | 0.752 | 0.710 | 0.719 |
| **rubert-base-retriever** | 0.635 | 0.660 | 0.787 | 0.735 | 0.699 | 0.703 |
| multilingual-e5-base | 0.605 | 0.616 | 0.702 | 0.720 | 0.696 | 0.668 |
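
The Average column is consistent with an unweighted mean of the five task scores, rounded to three decimals. The sketch below recomputes it from the values in the table above; the variable names are illustrative.

```python
# Per-task scores copied from the Russian-language table above, in column order:
# MIRACL Reranking, MIRACL Retrieval, RiaNews Retrieval, RuBQ Reranking, RuBQ Retrieval
scores = {
    "bge-m3": [0.654, 0.702, 0.830, 0.740, 0.712],
    "BERTA": [0.643, 0.676, 0.816, 0.752, 0.710],
    "rubert-base-retriever": [0.635, 0.660, 0.787, 0.735, 0.699],
    "multilingual-e5-base": [0.605, 0.616, 0.702, 0.720, 0.696],
}

# Unweighted mean over tasks, rounded to three decimals as in the table
averages = {name: round(sum(v) / len(v), 3) for name, v in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
```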

Model scores on text retrieval tasks for English:

| Model Name | AILA Statutes | ArguAna | Legal Bench Corporate Lobbying | SCIDOCS | Stack Overflow QA | Statcan Dialogue Dataset Retrieval | Wikipedia Retrieval Multilingual | Average |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| bge-m3 | 0.298 | 0.539 | 0.904 | 0.164 | 0.806 | 0.284 | 0.924 | 0.560 |
| **rubert-base-retriever** | 0.249 | 0.528 | 0.912 | 0.154 | 0.703 | 0.346 | 0.928 | 0.546 |
| multilingual-e5-large | 0.208 | 0.544 | 0.897 | 0.174 | 0.889 | 0.106 | 0.911 | 0.533 |
| multilingual-e5-base | 0.204 | 0.442 | 0.890 | 0.172 | 0.851 | 0.137 | 0.888 | 0.512 |
| BERTA | 0.188 | 0.414 | 0.907 | 0.112 | 0.493 | 0.304 | 0.888 | 0.472 |