ParlLangID-UA-RU

Model description

ParlLangID-UA-RU is a token-level language identification model designed for Ukrainian–Russian code-switched parliamentary texts. The model assigns a language label to each token in mixed-language sentences.

The model is based on multilingual BERT (bert-base-multilingual-cased) and fine-tuned for token classification on manually annotated parliamentary proceedings containing Ukrainian–Russian code-switching.

This model is intended to support research on:

  • multilingual language processing
  • code-switching detection
  • linguistic annotation of political discourse

Intended use

The model can be used for:

  • token-level language identification
  • annotation of Ukrainian–Russian code-switched texts
  • preprocessing for NLP pipelines
  • linguistic research on parliamentary discourse
  • analysis of multilingual communication in political texts

Typical applications include:

  • corpus annotation
  • linguistic research
  • code-switching detection
  • preprocessing for machine translation or text analysis systems

Training data

The model was trained on a manually annotated dataset of Ukrainian–Russian parliamentary texts containing code-switching.

Each token in the dataset is labeled with its language.

Training data is publicly available at:

https://zenodo.org/records/14724542

Labels used in the dataset

Label   Language
uk      Ukrainian
ru      Russian
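
To make the annotation scheme concrete, here is a hedged sketch of what one labeled record might look like: each token is paired with a label from the table above. The sentence and its token–label pairing are our own illustrative guess, not an actual record from the Zenodo dataset.

```python
# Hypothetical illustration of the token-level annotation scheme: each
# token in a code-switched sentence carries a "uk" or "ru" label.
# ("Dear colleagues, today we are discussing the bill")
annotated = [
    ("Шановні", "uk"),       # "Dear" (Ukrainian)
    ("колеги", "uk"),        # "colleagues" (Ukrainian)
    ("сегодня", "ru"),       # "today" (Russian)
    ("мы", "ru"),            # "we" (Russian)
    ("обсуждаем", "ru"),     # "are discussing" (Russian)
    ("законопроект", "ru"),  # "the bill" (Russian spelling)
]

used_labels = {label for _, label in annotated}
print(used_labels)
```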

Model architecture

  • Base model: BERT
  • Variant: BERT Multilingual Cased
  • Task: Token classification
  • Architecture: Transformer encoder with classification head
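
As a sketch of how the pieces above fit together, the snippet below builds a BERT encoder with a two-way token-classification head via the transformers API. The hidden sizes are shrunk to toy values purely for illustration; the released model uses bert-base-multilingual-cased dimensions, and the label mapping shown is assumed from the table above.

```python
from transformers import BertConfig, BertForTokenClassification

# Toy-sized BERT encoder + linear token-classification head.
# hidden_size/num_hidden_layers are deliberately tiny for illustration;
# num_labels and id2label mirror the uk/ru label inventory.
config = BertConfig(
    vocab_size=1000,            # toy vocabulary, not the real mBERT one
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
    id2label={0: "uk", 1: "ru"},
    label2id={"uk": 0, "ru": 1},
)
model = BertForTokenClassification(config)
```

The head is a single linear layer over each token's final hidden state, so the model emits one uk/ru logit pair per subword position.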

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("USERNAME/ParlLangID-UA-RU")
model = AutoModelForTokenClassification.from_pretrained("USERNAME/ParlLangID-UA-RU")
model.eval()

# Mixed Ukrainian–Russian sentence:
# "Dear colleagues, today we are discussing the bill"
text = "Шановні колеги сегодня мы обсуждаем законопроект"

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One predicted label id per subword token
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i] for i in predictions[0].tolist()]

for token, label in zip(tokens, labels):
    print(token, label)
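
Because the BERT tokenizer splits words into subword pieces, the per-token predictions above often need to be collapsed back to word level. Below is a minimal sketch of that alignment step, using a hand-written `word_ids` sequence that mimics the output of a fast tokenizer's `word_ids()` method (None marks special tokens such as [CLS] and [SEP]) rather than a live tokenizer; the first-subword labeling strategy is one common convention, not necessarily the one used during training.

```python
def collapse_to_words(word_ids, labels):
    """Keep the label of the first subword of each word."""
    word_labels = []
    prev = None
    for wid, lab in zip(word_ids, labels):
        if wid is None or wid == prev:
            continue  # skip special tokens and continuation subwords
        word_labels.append(lab)
        prev = wid
    return word_labels

# Example: 4 words, the second split into two subwords;
# positions 0 and 6 are [CLS]/[SEP].
word_ids = [None, 0, 1, 1, 2, 3, None]
labels = ["uk", "uk", "uk", "uk", "ru", "ru", "ru"]
print(collapse_to_words(word_ids, labels))  # → ['uk', 'uk', 'ru', 'ru']
```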
Model size: 0.2B parameters (F32, safetensors)