---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:15298
- loss:CachedMultipleNegativesSymmetricRankingLoss
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large-instruct
widget:
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Петр так и замер.'
  sentences:
  - NP-Nom так и VP-Pfv
  - VP вокруг да около
  - NP-Nom в гробу видать NP-Acc
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Мы, мягко говоря, совсем не ладили.'
  sentences:
  - VP по всем правилам (NP-Gen)
  - как насчёт XP?
  - мягко говоря, Cl
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Не беспокойтесь, всё будет сделано в лучшем виде.'
  sentences:
  - быть может, XP/Cl
  - вот было бы здорово, если бы Cl
  - всё будет Adv/Adj-Short
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Самолет до Саратова уже год как отменили.'
  sentences:
  - показать, где раки зимуют NP-Dat
  - VP как угорелый
  - (вот) (уже) (NumCrd-Nom/NumCrd-Acc) NP Cop как Cl/NP-Nom (вот) (уже) (NumCrd-Acc)
    NP как XP
- source_sentence: 'Instruct: Given a sentence, find the constructions of the Russian
    Constructicon that it contains

    Query: Срочно делай уроки, а не то будешь иметь дело с раздраженным отцом!'
  sentences:
  - Cl, (а) не то Aux-Fut иметь дело с NP-Ins
  - VP (NP-Acc) с ног на голову
  - VP под NP-Acc
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- ru
---

# Russian Constructicon Embedder

This is a specialized [sentence-transformers](https://www.SBERT.net) model fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for finding Russian Constructicon patterns in text. The model is trained to compare Russian text examples with construction patterns from the Russian Constructicon database, enabling semantic search for linguistic constructions.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer specialized for Russian Constructicon patterns
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Russian
- **Training Dataset:** Russian Constructicon examples and patterns

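Because the model ends with a `Normalize()` module (see Model Architecture below), every output embedding is unit-length, so cosine similarity reduces to a plain dot product. A quick sanity check with toy unit vectors (not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=(2, 1024))   # two toy 1024-dim vectors
a /= np.linalg.norm(a)              # normalize, as the model's Normalize() module does
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
print(np.isclose(cosine, dot))  # True: on unit vectors the two coincide
```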
### Model Purpose

This model is specifically designed to encode Russian text examples and Constructicon patterns into a shared embedding space where similar constructions are close together. It enables:

- Finding Constructicon patterns that match given Russian text examples
- Semantic search through Russian construction databases
- Similarity comparison between text examples and linguistic patterns
- Construction pattern retrieval and ranking

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed to be used with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library for automatic Russian Constructicon pattern extraction:

```python
from ruscxnpipe import SemanticSearch

# Initialize with this specific model
search = SemanticSearch(
    model_name="Futyn-Maker/ruscxn-embedder",
    query_prefix="Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: ",
    pattern_prefix=""
)

# Find construction candidates
examples = ["Петр так и замер.", "Мы, мягко говоря, совсем не ладили."]
results = search.find_candidates(queries=examples, n=5)

for result in results:
    print(f"Example: {result['query']}")
    for candidate in result['candidates']:
        print(f"  Pattern: {candidate['pattern']} (similarity: {candidate['similarity']:.3f})")
```

### Direct Usage (Sentence Transformers)

For advanced users who want to use the model directly:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Futyn-Maker/ruscxn-embedder")

# Note: use the correct prefixes for optimal performance
query_prefix = "Instruct: Given a sentence, find the constructions of the Russian Constructicon that it contains\nQuery: "
pattern_prefix = ""

# Encode a Russian example
example = query_prefix + "Петр так и замер."
example_embedding = model.encode(example)

# Encode construction patterns (no prefix needed)
patterns = [
    "NP-Nom так и VP-Pfv",
    "VP вокруг да около",
    "мягко говоря, Cl",
]
pattern_embeddings = model.encode(patterns)

# Calculate cosine similarities
similarities = cos_sim(example_embedding, pattern_embeddings)
print(similarities)
```

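Once embeddings are computed, retrieval is just a dot product followed by a top-n sort. The helper below is an illustrative sketch over synthetic unit vectors (the function name and data are hypothetical, not part of any library API):

```python
import numpy as np

def rank_patterns(query_emb, pattern_embs, patterns, n=3):
    """Return the n patterns most similar to the query embedding."""
    sims = pattern_embs @ query_emb          # dot product == cosine on unit vectors
    order = np.argsort(-sims)[:n]            # indices of the n highest similarities
    return [(patterns[i], float(sims[i])) for i in order]

# Synthetic stand-ins for model outputs, normalized to unit length
rng = np.random.default_rng(7)
q = rng.normal(size=16)
q /= np.linalg.norm(q)
P = rng.normal(size=(5, 16))
P /= np.linalg.norm(P, axis=1, keepdims=True)

names = [f"pattern_{i}" for i in range(5)]
top = rank_patterns(q, P, names, n=3)        # best-matching patterns first
```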
## Out-of-Scope Use

While this model is optimized for Russian Constructicon pattern matching, it may also be useful for other tasks involving Russian linguistic patterns, such as:

- Clustering of similar constructions
- Classification of constructions

However, performance on these tasks has not been systematically evaluated.

## Training Details

### Training Dataset

The model was trained on **15,298 examples** from the Russian Constructicon database, where each training sample consists of:

- **Query:** a Russian text example with the instruction prefix
- **Pattern:** the corresponding Constructicon pattern

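Schematically, one training pair has the following shape (the field names here are illustrative, not the actual dataset schema):

```python
# One query/pattern training pair, as described above
pair = {
    "query": (
        "Instruct: Given a sentence, find the constructions of the Russian "
        "Constructicon that it contains\nQuery: Петр так и замер."
    ),
    "pattern": "NP-Nom так и VP-Pfv",
}
```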
### Training Objective

The model was fine-tuned using **CachedMultipleNegativesSymmetricRankingLoss** to learn embeddings where:

- Examples containing a construction are similar to that construction's pattern
- The embedding space preserves semantic relationships between related constructions

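The symmetric in-batch objective treats each query's paired pattern as its positive and every other pattern in the batch as a negative, averaging the cross-entropy loss over both directions (query→pattern and pattern→query). A simplified numpy illustration of this idea, omitting the gradient caching that the `Cached` variant adds for large batches:

```python
import numpy as np

def symmetric_mnr_loss(queries, patterns, scale=20.0):
    """Symmetric multiple-negatives ranking loss on L2-normalized embeddings.

    Row i of `patterns` is the positive for row i of `queries`; all other
    rows serve as in-batch negatives, and the loss is averaged over both
    retrieval directions.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
    sim = scale * (q @ p.T)                      # (batch, batch) scaled cosine similarities

    def ce(logits):                              # cross-entropy with diagonal targets
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (ce(sim) + ce(sim.T))           # average both directions

# Aligned pairs (pattern ~= query) should score a much lower loss than mismatched ones
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(4, 8))
aligned = symmetric_mnr_loss(q_emb, q_emb + 0.01 * rng.normal(size=(4, 8)))
shuffled = symmetric_mnr_loss(q_emb, q_emb[::-1])
print(aligned < shuffled)  # True
```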
### Training Hyperparameters

- **Learning rate:** 2e-05
- **Batch size:** 1024
- **Training epochs:** 10 (best model from epoch 5)
- **Warmup ratio:** 0.1
- **Weight decay:** 0.01
- **Loss function:** CachedMultipleNegativesSymmetricRankingLoss

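Assuming the sentence-transformers v3+ training API, these settings would map onto `SentenceTransformerTrainingArguments` roughly as follows (a sketch, not the exact training script; `output_dir` is illustrative):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Hyperparameters from the list above, expressed as training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="ruscxn-embedder",        # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=1024,    # large batches are feasible thanks to the cached loss
    num_train_epochs=10,
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```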
### Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

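The `Pooling` and `Normalize` steps above can be sketched in numpy: mask-aware mean pooling over token embeddings, followed by L2 normalization (toy inputs stand in for real transformer outputs):

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Mean pooling over non-padding tokens, then L2 normalization,
    mirroring the Pooling and Normalize modules above."""
    mask = attention_mask[..., None].astype(float)     # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)     # sum only real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # guard against empty masks
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy (batch, seq, dim) tensor with the model's 1024-dim embeddings
tokens = np.random.default_rng(1).normal(size=(2, 5, 1024))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])    # second sentence has no padding
sentence_embeddings = mean_pool_and_normalize(tokens, mask)
print(sentence_embeddings.shape)  # (2, 1024)
```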
## Performance

The model achieved its best validation performance at epoch 5, with a validation loss of **0.1145**.

## Framework Versions

- Python: 3.10.12
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126