kiselyovd/grnti-text-classifier

grnti-text-classifier - Russian scientific-text classification across 28 GRNTI sections

Production-grade Russian scientific-text classifier over the 28 top-level sections of GRNTI (the State Rubricator of Scientific and Technical Information). The main model is XLM-RoBERTa-base fine-tuned on Russian scientific abstracts; a monolingual ruBERT-base-cased baseline is reported alongside it for comparison. The id2label map embeds the real GRNTI codes and Russian section names, so the inference widget returns human-readable predictions such as 290000: Физика.

The model takes a Russian abstract or title and returns a probability distribution over the 28 sections. It is intended for cataloguing and triage of Russian scientific text, not for high-stakes decisions or non-Russian input.
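As a rough illustration of what "a probability distribution over the sections" means, the sketch below turns raw classifier scores (logits) into probabilities with a softmax. It is plain Python and does not load the model; the logit values are invented, and the label strings other than 290000: Физика are assumed by analogy with the documented format.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three sections (the real model emits 28 values).
example_logits = {
    "290000: Физика": 3.2,
    "270000: Математика": 1.1,  # assumed label string
    "310000: Химия": 0.4,       # assumed label string
}

probs = dict(zip(example_logits, softmax(list(example_logits.values()))))
best_label = max(probs, key=probs.get)
print(best_label, round(probs[best_label], 3))
```

The highest probability identifies the top-1 prediction; the remaining mass is what a top-5 view would expose.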

Metrics

Test split: n = 2772 abstracts, 28 GRNTI sections, balanced.

| Model | Top-1 accuracy | Top-5 accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| FacebookAI/xlm-roberta-base (main) | 72.4% | 96.8% | 72.3% | 72.3% |
| DeepPavlov/rubert-base-cased (baseline) | 72.9% | 95.9% | 72.8% | 72.8% |

The baseline is marginally ahead on top-1 accuracy (+0.5 pp), while the multilingual XLM-RoBERTa is ahead on top-5 accuracy (+0.9 pp); one plausible explanation is that its broader pre-training spreads probability mass more usefully across related sections. The per-class metrics below are computed by re-running the published main model on the held-out test split; the resulting macro-F1 (0.723) matches the reported value.
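For readers comparing the top-1 and top-5 columns, here is a minimal sketch of how top-k accuracy can be computed from ranked predictions. The function name and the toy data are illustrative assumptions, not the evaluation code used for this model.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Share of examples whose true label is among the k highest-ranked predictions."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)

# Toy data (hypothetical): each row lists labels sorted by descending probability.
ranked = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["A", "C", "B"],
]
truth = ["A", "A", "B", "A"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
print(f"top-1={top1:.2f} top-2={top2:.2f}")
```

Top-k accuracy can only grow with k, which is why a model can trail on top-1 yet lead on top-5.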

Visualizations

Per-class top-1 F1 across the 28 GRNTI sections. Computed from real predictions of this model on the test split, sorted descending, with the macro-F1 reference line. Performance is strongest on well-separated domains (sport, food industry, literature) and weakest where section boundaries overlap (mechanical engineering, agriculture).

[Figure: per-class top-1 F1]
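The per-class F1 values behind a chart like this can be derived from true and predicted labels alone. The sketch below is one way to do it in plain Python; the helper name and the three-class toy data are assumptions for illustration, not the script that produced the figure.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that occurs in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy balanced data (hypothetical labels, two examples per class).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]

f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)

# Sort descending, as in the chart, and report the macro-F1 reference value.
for label, score in sorted(f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {score:.3f}")
print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

When every class has the same number of test examples, as in the balanced split reported here, macro and weighted F1 coincide, which is consistent with the identical values in the metrics table.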

Confusion matrix (28 x 28). Row-normalised over the test split, so each diagonal cell is the recall for that section. The dominant diagonal shows that most abstracts are assigned to their true section; off-diagonal mass concentrates between thematically adjacent sections.

[Figure: confusion matrix]
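To make "row-normalised" concrete, the following sketch builds a small confusion matrix and divides each row by its total. It uses invented three-class data and a hypothetical helper name; the real matrix has 28 rows and columns.

```python
def row_normalised_confusion(y_true, y_pred, labels):
    """Confusion matrix whose rows (true labels) each sum to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Guard against labels with no test examples.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

labels = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]

cm = row_normalised_confusion(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(label, [round(v, 2) for v in row])
```

Reading along a row shows where examples of one true section end up; a large off-diagonal cell marks a pair of sections the classifier tends to confuse.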

Usage

from transformers import pipeline

# top_k=5 returns the five most probable sections with their scores
clf = pipeline("text-classification", model="kiselyovd/grnti-text-classifier", top_k=5)
print(clf("Исследование квантовой электродинамики в кристаллах."))

Each returned label is formatted as <GRNTI code>: <section name> (for example 290000: Физика).
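Because the code and the section name share one string, downstream code usually needs to split them. The sketch below shows one way, assuming a flat list of {"label", "score"} dicts; depending on the transformers version and how top_k is passed, the pipeline result may be wrapped in an extra outer list. The scores and the second label string are invented for illustration.

```python
def parse_label(label):
    """Split '<GRNTI code>: <section name>' into (code, name)."""
    # maxsplit=1 keeps any later ': ' inside the section name intact.
    code, name = label.split(": ", 1)
    return code, name

# Hypothetical pipeline output for a single input text.
predictions = [
    {"label": "290000: Физика", "score": 0.91},
    {"label": "270000: Математика", "score": 0.05},  # assumed label string
]

parsed = [(*parse_label(item["label"]), item["score"]) for item in predictions]
for code, name, score in parsed:
    print(code, name, f"{score:.2f}")
```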

The 28 GRNTI sections

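The label space is the 28 top-level GRNTI sections present in the dataset, spanning the humanities, social sciences, natural sciences, and engineering: Философия (Philosophy); История. Исторические науки (History. Historical sciences); Социология (Sociology); Экономика. Экономические науки (Economics. Economic sciences); Государство и право. Юридические науки (State and law. Legal sciences); Политика. Политические науки (Politics. Political sciences); Культура. Культурология (Culture. Cultural studies); Народное образование. Педагогика (Public education. Pedagogy); Психология (Psychology); Языкознание (Linguistics); Литература. Литературоведение. Устное народное творчество (Literature. Literary studies. Folklore); Искусство (Art); Математика (Mathematics); Физика (Physics); Химия (Chemistry); Биология (Biology); Геология (Geology); Энергетика (Energy); Автоматика. Вычислительная техника (Automation. Computer engineering); Горное дело (Mining); Машиностроение (Mechanical engineering); Пищевая промышленность (Food industry); Строительство. Архитектура (Construction. Architecture); Сельское и лесное хозяйство (Agriculture and forestry; codes 680000 and 683500); Транспорт (Transport); Медицина и здравоохранение (Medicine and health care); Физическая культура и спорт (Physical culture and sport). The Russian names are the label strings the model returns; the English glosses are for reference only.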
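Since Сельское и лесное хозяйство appears under two codes, a consumer that cares only about the section name may want to combine the two scores. The sketch below sums scores for labels that share a name. It assumes both codes are emitted with exactly the same section name, which this card implies but does not spell out; the scores and the third label string are made up.

```python
from collections import defaultdict

def scores_by_section_name(predictions):
    """Sum scores of labels that share a section name but differ in code."""
    totals = defaultdict(float)
    for item in predictions:
        _code, name = item["label"].split(": ", 1)
        totals[name] += item["score"]
    return dict(totals)

# Hypothetical scores; the exact label strings are an assumption.
predictions = [
    {"label": "680000: Сельское и лесное хозяйство", "score": 0.40},
    {"label": "683500: Сельское и лесное хозяйство", "score": 0.35},
    {"label": "340000: Биология", "score": 0.25},
]

merged = scores_by_section_name(predictions)
top_section = max(merged, key=merged.get)
print(top_section, round(merged[top_section], 2))
```

Merging is only meaningful when scores for all labels are available, for example when the pipeline is asked for every class rather than the top 5.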

Intended use and limitations

This model is trained for Russian-language top-level GRNTI section classification. It has not been evaluated outside Russian scientific text and should not be used for generic multilingual classification. Outputs are probabilistic and subject to training-data biases; do not rely on this model for high-stakes decisions.
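One common way to act on probabilistic outputs in a triage setting is to accept only confident predictions and route the rest to a person. The sketch below illustrates that pattern; the function name and the 0.6 threshold are hypothetical and have not been tuned or validated for this model.

```python
def triage(predictions, min_confidence=0.6):
    """Return the top label if it is confident enough, else None for manual review."""
    best = max(predictions, key=lambda item: item["score"])
    if best["score"] >= min_confidence:
        return best["label"]
    return None

# Hypothetical outputs for two abstracts.
confident = [
    {"label": "290000: Физика", "score": 0.91},
    {"label": "section B", "score": 0.05},
]
uncertain = [
    {"label": "section A", "score": 0.40},
    {"label": "section B", "score": 0.35},
]

print(triage(confident))  # accepted automatically
print(triage(uncertain))  # None: send to a human cataloguer
```

A suitable threshold would need to be chosen on held-out data, trading automatic coverage against error rate.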

Training

  • Dataset: ai-forever/ru-scibench-grnti-classification (MIT, 28 476 train + 2 772 test).
  • Base model: FacebookAI/xlm-roberta-base; baseline DeepPavlov/rubert-base-cased.
  • Tokenizer: xlm-roberta-base, max_length=256 (median sequence ~120 tokens).
  • Precision: bf16-mixed on CUDA. Optimizer: AdamW with linear warmup and decay.
  • Hyperparameters tuned with an Optuna sweep (val macro-F1), then a final training run with the best parameters (lr=3.1e-5, weight_decay=0.012, warmup_ratio=0.147, val macro-F1 = 73.1%).
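To visualise what "linear warmup and decay" does with the reported warmup_ratio and learning rate, the sketch below computes the schedule in plain Python. It mirrors the usual shape of such schedules rather than the exact training code; the total step count is hypothetical, and rounding the warmup length is an assumption.

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.147):
    """Learning-rate multiplier in [0, 1]: linear ramp up, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

peak_lr = 3.1e-5     # best learning rate reported above
total_steps = 1000   # hypothetical; depends on batch size and number of epochs

for step in (0, 73, 147, 500, 1000):
    lr = peak_lr * linear_warmup_decay(step, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

With these numbers the learning rate climbs for the first 147 steps, peaks at 3.1e-5, and then falls linearly to zero at the final step.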

Source and license
