ParlLangID-UA-RU

Model description

ParlLangID-UA-RU is a token-level language identification model designed for Ukrainian–Russian code-switched parliamentary texts. The model assigns a language label to each token in mixed-language sentences.

The model is based on multilingual BERT (bert-base-multilingual-cased) and fine-tuned for token classification on manually annotated parliamentary proceedings containing Ukrainian–Russian code-switching.

This model is intended to support research on:

  • multilingual language processing
  • code-switching detection
  • linguistic annotation of political discourse

Intended use

The model can be used for:

  • token-level language identification
  • annotation of Ukrainian–Russian code-switched texts
  • preprocessing for NLP pipelines
  • linguistic research on parliamentary discourse
  • analysis of multilingual communication in political texts

Typical applications include:

  • corpus annotation
  • linguistic research
  • code-switching detection
  • preprocessing for machine translation or text analysis systems

Training data

The model was trained on a manually annotated dataset of Ukrainian–Russian parliamentary texts containing code-switching.

Each token in the dataset is labeled with its language.

Training data is publicly available at:

https://zenodo.org/records/14724542

Labels used in the dataset

Label   Language
uk      Ukrainian
ru      Russian
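
To make the annotation scheme concrete, here is a hedged sketch of what one labeled record might look like: each token is paired with a label from the table above. The sentence and its token–label pairing are our own illustrative guess, not an actual record from the Zenodo dataset.

```python
# Hypothetical illustration of the token-level annotation scheme: each
# token in a code-switched sentence carries a "uk" or "ru" label.
# ("Dear colleagues, today we are discussing the bill")
annotated = [
    ("Шановні", "uk"),       # "Dear" (Ukrainian)
    ("колеги", "uk"),        # "colleagues" (Ukrainian)
    ("сегодня", "ru"),       # "today" (Russian)
    ("мы", "ru"),            # "we" (Russian)
    ("обсуждаем", "ru"),     # "are discussing" (Russian)
    ("законопроект", "ru"),  # "the bill" (Russian spelling)
]

used_labels = {label for _, label in annotated}
print(used_labels)
```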

Model architecture

  • Base model: BERT
  • Variant: BERT Multilingual Cased
  • Task: Token classification
  • Architecture: Transformer encoder with classification head
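
As a sketch of how the pieces above fit together, the snippet below builds a BERT encoder with a two-way token-classification head via the transformers API. The hidden sizes are shrunk to toy values purely for illustration; the released model uses bert-base-multilingual-cased dimensions, and the label mapping shown is assumed from the table above.

```python
from transformers import BertConfig, BertForTokenClassification

# Toy-sized BERT encoder + linear token-classification head.
# hidden_size/num_hidden_layers are deliberately tiny for illustration;
# num_labels and id2label mirror the uk/ru label inventory.
config = BertConfig(
    vocab_size=1000,            # toy vocabulary, not the real mBERT one
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,
    id2label={0: "uk", 1: "ru"},
    label2id={"uk": 0, "ru": 1},
)
model = BertForTokenClassification(config)
```

The head is a single linear layer over each token's final hidden state, so the model emits one uk/ru logit pair per subword position.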

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("USERNAME/ParlLangID-UA-RU")
model = AutoModelForTokenClassification.from_pretrained("USERNAME/ParlLangID-UA-RU")
model.eval()

# Mixed Ukrainian–Russian sentence:
# "Dear colleagues, today we are discussing the bill"
text = "Шановні колеги сегодня мы обсуждаем законопроект"

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One predicted label id per subword token
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i] for i in predictions[0].tolist()]

for token, label in zip(tokens, labels):
    print(token, label)
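
Because the BERT tokenizer splits words into subword pieces, the per-token predictions above often need to be collapsed back to word level. Below is a minimal sketch of that alignment step, using a hand-written `word_ids` sequence that mimics the output of a fast tokenizer's `word_ids()` method (None marks special tokens such as [CLS] and [SEP]) rather than a live tokenizer; the first-subword labeling strategy is one common convention, not necessarily the one used during training.

```python
def collapse_to_words(word_ids, labels):
    """Keep the label of the first subword of each word."""
    word_labels = []
    prev = None
    for wid, lab in zip(word_ids, labels):
        if wid is None or wid == prev:
            continue  # skip special tokens and continuation subwords
        word_labels.append(lab)
        prev = wid
    return word_labels

# Example: 4 words, the second split into two subwords;
# positions 0 and 6 are [CLS]/[SEP].
word_ids = [None, 0, 1, 1, 2, 3, None]
labels = ["uk", "uk", "uk", "uk", "ru", "ru", "ru"]
print(collapse_to_words(word_ids, labels))  # → ['uk', 'uk', 'ru', 'ru']
```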
Model size: 0.2B parameters (F32, safetensors)