---
tags:
- transformers
- text-classification
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large
language:
- ru
pipeline_tag: text-classification
widget:
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
example_title: "Positive example"
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Мы хорошо поработали."
example_title: "Negative example"
- text: "passage: мягко говоря, Cl[Sep]query: Мягко говоря, это была ошибка."
example_title: "Positive example"
---
# Russian Constructicon Classifier
A binary classification model for determining whether a Russian Constructicon pattern is present in a given text example. Fine-tuned from [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) in two stages: first as a semantic model on Russian Constructicon data, then for binary classification.
## Model Details
- **Base model:** intfloat/multilingual-e5-large
- **Task:** Binary text classification
- **Language:** Russian
- **Training:** Two-stage fine-tuning on Russian Constructicon data
## Usage
### Primary Usage (RusCxnPipe Library)
This model is designed for use with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library:
```python
from ruscxnpipe import ConstructionClassifier
classifier = ConstructionClassifier(
    model_name="Futyn-Maker/ruscxn-classifier"
)
# Classify candidates (output from semantic search)
queries = ["Петр так и замер."]
candidates = [[{"id": "pattern1", "pattern": "NP-Nom так и VP-Pfv"}]]
results = classifier.classify_candidates(queries, candidates)
print(results[0][0]['is_present']) # 1 if present, 0 if absent
```
### Direct Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained("Futyn-Maker/ruscxn-classifier")
tokenizer = AutoTokenizer.from_pretrained("Futyn-Maker/ruscxn-classifier")
# Format: "passage: [pattern][Sep]query: [example]"
text = "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.softmax(outputs.logits, dim=-1)
is_present = torch.argmax(prediction, dim=-1).item()
print(f"Construction present: {is_present}") # 1 = present, 0 = absent
```
## Input Format
The model expects input in the format `"passage: [pattern][Sep]query: [example]"`, where:
- **passage:** the Russian Constructicon pattern to check for
- **query:** the Russian text to analyze
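Building this string by hand is error-prone (the `[Sep]` marker and the `passage:`/`query:` prefixes must match exactly), so a small helper can keep it consistent. The helper below is a hypothetical convenience function, not part of the RusCxnPipe API:

```python
def build_input(pattern: str, example: str) -> str:
    """Format a pattern/example pair into the classifier's expected input.

    Note: `build_input` is a hypothetical helper for illustration,
    not a function shipped with RusCxnPipe or this model.
    """
    return f"passage: {pattern}[Sep]query: {example}"

text = build_input("NP-Nom так и VP-Pfv", "Петр так и замер.")
print(text)  # passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер.
```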
## Training
1. **Stage 1:** Semantic embedding training on Russian Constructicon examples and patterns
2. **Stage 2:** Binary classification fine-tuning to predict construction presence
## Output
- **Label 0:** Construction is NOT present in the text
- **Label 1:** Construction IS present in the text
## Framework Versions
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Python: 3.10.12