---
tags:
- transformers
- text-classification
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large
language:
- ru
pipeline_tag: text-classification
widget:
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
  example_title: "Positive example"
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Мы хорошо поработали."
  example_title: "Negative example"
- text: "passage: мягко говоря, Cl[Sep]query: Мягко говоря, это была ошибка."
  example_title: "Positive example"
---

# Russian Constructicon Classifier

A binary classification model that determines whether a given Russian Constructicon pattern is present in a text example. It was fine-tuned from [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) in two stages: first as a semantic embedding model on Russian Constructicon data, then as a binary classifier.

## Model Details

- **Base model:** intfloat/multilingual-e5-large
- **Task:** Binary text classification
- **Language:** Russian
- **Training:** Two-stage fine-tuning on Russian Constructicon data

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed for use with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library:

```python
from ruscxnpipe import ConstructionClassifier

classifier = ConstructionClassifier(
    model_name="Futyn-Maker/ruscxn-classifier"
)

# Classify candidates (output from semantic search)
queries = ["Петр так и замер."]
candidates = [[{"id": "pattern1", "pattern": "NP-Nom так и VP-Pfv"}]]

results = classifier.classify_candidates(queries, candidates)
print(results[0][0]['is_present'])  # 1 if present, 0 if absent
```

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("Futyn-Maker/ruscxn-classifier")
tokenizer = AutoTokenizer.from_pretrained("Futyn-Maker/ruscxn-classifier")

# Format: "passage: [pattern][Sep]query: [example]"
text = "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.softmax(outputs.logits, dim=-1)
    is_present = torch.argmax(prediction, dim=-1).item()

print(f"Construction present: {is_present}")  # 1 = present, 0 = absent
```
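
In practice, one sentence is often checked against several candidate patterns at once. The sketch below shows one way to batch that with the same `transformers` calls. `format_pairs` and `score_patterns` are hypothetical helper names, not part of the model or of RusCxnPipe, and reading index 1 of the softmax as the "present" probability assumes the label convention given in this card.

```python
def format_pairs(patterns: list[str], example: str) -> list[str]:
    """Build one model input per candidate pattern for a single example."""
    return [f"passage: {pattern}[Sep]query: {example}" for pattern in patterns]


def score_patterns(
    patterns: list[str],
    example: str,
    model_name: str = "Futyn-Maker/ruscxn-classifier",
) -> dict[str, float]:
    """Return an estimated P(present) for each pattern (illustrative helper)."""
    # Imported lazily so the formatting helper works without the heavy libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # padding=True aligns inputs of different lengths within the batch.
    inputs = tokenizer(
        format_pairs(patterns, example),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Column 1 is assumed to be the "present" class (Label 1).
    present_probs = torch.softmax(logits, dim=-1)[:, 1].tolist()
    return dict(zip(patterns, present_probs))


# Example call (downloads the model on first use):
# scores = score_patterns(["NP-Nom так и VP-Pfv", "мягко говоря, Cl"], "Петр так и замер.")
```

Loading the model inside the function keeps the sketch short; for repeated calls it would be cheaper to load the model and tokenizer once and pass them in.
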

## Input Format

The model expects input in the format `"passage: [pattern][Sep]query: [example]"`, where `[Sep]` is written literally, as in the examples above:

- **passage:** The Constructicon pattern to check for
- **query:** The Russian text to analyze
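
A small helper can keep the prefixes and separator consistent across calls. This is an illustrative sketch: `build_input` and `split_input` are hypothetical names, and the separator is assumed to be the literal string `[Sep]`, as it appears in the examples in this card.

```python
SEPARATOR = "[Sep]"  # assumed literal separator, taken from the examples in this card
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def build_input(pattern: str, example: str) -> str:
    """Join a Constructicon pattern and a Russian example into one model input."""
    return f"{PASSAGE_PREFIX}{pattern}{SEPARATOR}{QUERY_PREFIX}{example}"


def split_input(text: str) -> tuple[str, str]:
    """Recover (pattern, example) from a formatted input, or raise ValueError."""
    passage, sep, query = text.partition(SEPARATOR)
    if not sep or not passage.startswith(PASSAGE_PREFIX) or not query.startswith(QUERY_PREFIX):
        raise ValueError(f"Input is not in the expected format: {text!r}")
    return passage[len(PASSAGE_PREFIX):], query[len(QUERY_PREFIX):]


text = build_input("NP-Nom так и VP-Pfv", "Петр так и замер.")
print(text)  # passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер.
```
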

## Training

1. **Stage 1:** Semantic embedding training on Russian Constructicon examples and patterns
2. **Stage 2:** Binary classification fine-tuning to predict construction presence

## Output

- **Label 0:** Construction is NOT present in the text
- **Label 1:** Construction IS present in the text
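
To show how a pair of raw scores maps onto these two labels, here is a minimal pure-Python sketch. The logit values are made up for illustration and `interpret_logits` is a hypothetical helper. It applies the same softmax and argmax steps used in the Direct Usage example.

```python
import math

LABELS = {0: "absent", 1: "present"}


def interpret_logits(logits: list[float]) -> tuple[int, float]:
    """Return (label, probability of that label) for one pair of logits."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# Hypothetical logits in the order [Label 0, Label 1]
label, confidence = interpret_logits([-1.2, 2.3])
print(f"{LABELS[label]} (p = {confidence:.3f})")
```

Keeping the probability alongside the label makes it possible to apply a stricter threshold than 0.5 if false positives are costly.
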

## Framework Versions

- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Python: 3.10.12