# ParlLangID-UA-RU

## Model description
ParlLangID-UA-RU is a token-level language identification model designed for Ukrainian–Russian code-switched parliamentary texts. The model assigns a language label to each token in mixed-language sentences.
The model is based on pretrained multilingual BERT (cased) and fine-tuned for token classification on manually annotated parliamentary proceedings containing Ukrainian–Russian code-switching.
This model is intended to support research on:
- multilingual language processing
- code-switching detection
- linguistic annotation of political discourse
## Intended use
The model can be used for:
- token-level language identification and code-switching detection
- annotation of Ukrainian–Russian code-switched texts and corpus annotation
- preprocessing for NLP pipelines, e.g. ahead of machine translation or text analysis systems
- linguistic research on parliamentary discourse
- analysis of multilingual communication in political texts
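For preprocessing and annotation use cases, the per-token labels are often grouped into contiguous monolingual spans before being passed to downstream, language-specific tools. A minimal sketch of such grouping — the tokens and labels below are illustrative, not actual model output:

```python
def group_spans(tokens, labels):
    """Merge consecutive tokens that share a language label into spans."""
    spans = []
    for token, label in zip(tokens, labels):
        if spans and spans[-1][0] == label:
            spans[-1][1].append(token)          # extend the current span
        else:
            spans.append((label, [token]))      # start a new span
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Шановні", "колеги", "сегодня", "мы", "обсуждаем", "законопроект"]
labels = ["uk", "uk", "ru", "ru", "ru", "ru"]
print(group_spans(tokens, labels))
# [('uk', 'Шановні колеги'), ('ru', 'сегодня мы обсуждаем законопроект')]
```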
## Training data

The model was trained on a manually annotated dataset of Ukrainian–Russian parliamentary texts containing code-switching, in which each token is labeled with its language.
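The exact file format of the dataset is not described here; assuming a common CoNLL-style layout (one token and label per line, separated by a tab, with a blank line between sentences), a loader might look like the following. Check the actual files on Zenodo before relying on this sketch:

```python
def read_conll(lines):
    """Parse CoNLL-style 'token<TAB>label' lines into per-sentence lists.

    The format is an assumption about the dataset, not a documented fact.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    if current:                           # flush a trailing sentence
        sentences.append(current)
    return sentences

sample = ["Шановні\tuk", "колеги\tuk", "сегодня\tru", ""]
print(read_conll(sample))
# [[('Шановні', 'uk'), ('колеги', 'uk'), ('сегодня', 'ru')]]
```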
Training data is publicly available at:
https://zenodo.org/records/14724542
### Labels used in the dataset
| Label | Language |
|---|---|
| uk | Ukrainian |
| ru | Russian |
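In a token-classification model these labels correspond to integer class ids stored in the model configuration. The actual id order for ParlLangID-UA-RU lives in `model.config.id2label` and may differ from this illustrative mapping:

```python
# Hypothetical label mapping for the two classes; the authoritative
# version is model.config.id2label on the released checkpoint.
label2id = {"uk": 0, "ru": 1}
id2label = {i: label for label, i in label2id.items()}

print(id2label[0], id2label[1])  # uk ru
```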
## Model architecture
- Base model: BERT
- Variant: BERT Multilingual Cased
- Task: Token classification
- Architecture: Transformer encoder with classification head
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("USERNAME/ParlLangID-UA-RU")
model = AutoModelForTokenClassification.from_pretrained("USERNAME/ParlLangID-UA-RU")

text = "Шановні колеги сегодня мы обсуждаем законопроект"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():                     # inference only, no gradients needed
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map class ids back to label names ("uk"/"ru") via the model config
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[label_id] for label_id in predictions[0].tolist()]

for token, label in zip(tokens, labels):
    print(token, label)
```
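The snippet above prints one label per subword piece, not per word, because BERT's WordPiece tokenizer splits words into several pieces. A common post-processing step maps subword predictions back to whole words, e.g. by keeping the label of each word's first subword. A model-free sketch — the `word_ids` sequence mimics what a fast tokenizer's `word_ids()` method returns, and the labels are illustrative, not real model output:

```python
def word_level_labels(word_ids, subword_labels):
    """Keep the label of each word's first subword; skip special tokens (None)."""
    labels = {}
    for wid, label in zip(word_ids, subword_labels):
        if wid is not None and wid not in labels:
            labels[wid] = label
    return [labels[i] for i in sorted(labels)]

# word_ids as a fast tokenizer might produce for three words,
# where the first word is split into two subword pieces:
word_ids = [None, 0, 0, 1, 2, 2, None]   # None = [CLS]/[SEP]
subword_labels = ["uk", "uk", "uk", "uk", "ru", "ru", "uk"]
print(word_level_labels(word_ids, subword_labels))
# ['uk', 'uk', 'ru']
```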