kiselyovd/grnti-text-classifier
Production-grade Russian scientific-text classifier over the 28 top-level GRNTI sections (State Rubricator of Scientific and Technical Information). The main model is XLM-RoBERTa-base fine-tuned on Russian scientific abstracts; a single-language ruBERT-base-cased baseline is reported alongside it for comparison. The id2label map embeds the real GRNTI codes and Russian section names, so the inference widget returns human-readable predictions such as 290000: Физика.
The model takes a Russian abstract or title and returns a probability distribution over the 28 sections. It is intended for cataloguing and triage of Russian scientific text, not for high-stakes or non-Russian inputs.
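As a rough illustration of what "a probability distribution over the 28 sections" means, the sketch below turns raw classifier scores into probabilities with a softmax and keeps the most probable sections. The logits and the two placeholder labels are hypothetical; only the `290000: Физика` label comes from this card, and the real model emits 28 scores rather than 3.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for three sections (illustration only).
labels = ["290000: Физика", "placeholder section A", "placeholder section B"]
probs = softmax([2.0, 1.0, 0.1])

for label, p in top_k(probs, labels, k=2):
    print(f"{label}\t{p:.3f}")
```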
Metrics
Test split: n = 2772 abstracts, 28 GRNTI sections, balanced (about 99 abstracts per section).
| Model | Top-1 accuracy | Top-5 accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| FacebookAI/xlm-roberta-base (main) | 72.4% | 96.8% | 72.3% | 72.3% |
| DeepPavlov/rubert-base-cased (baseline) | 72.9% | 95.9% | 72.8% | 72.8% |
The baseline is marginally better on top-1 accuracy (+0.5pp), while the multilingual XLM-RoBERTa is better on top-5 accuracy (+0.9pp), plausibly because its broader pre-training spreads probability mass more usefully across related sections. The per-class metrics below are computed by re-running the published main model on the held-out test split; the resulting macro-F1 (0.723) matches the reported value.
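For readers unfamiliar with the two accuracy columns, here is a minimal sketch of how top-k accuracy is usually computed: a prediction counts as a hit when the true class is among the k highest-probability classes. The toy probabilities and class ids are made up for illustration and are not taken from the model.

```python
def top_k_accuracy(prob_rows, true_ids, k):
    """Fraction of examples whose true class is among the k most probable."""
    hits = 0
    for probs, true_id in zip(prob_rows, true_ids):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if true_id in ranked[:k]:
            hits += 1
    return hits / len(true_ids)

# Toy data: 3 examples, 4 classes (hypothetical values).
prob_rows = [
    [0.60, 0.20, 0.10, 0.10],  # true class 0: hit at k=1
    [0.40, 0.35, 0.15, 0.10],  # true class 1: miss at k=1, hit at k=2
    [0.50, 0.30, 0.15, 0.05],  # true class 3: miss at k=1 and k=2
]
true_ids = [0, 1, 3]

top1 = top_k_accuracy(prob_rows, true_ids, k=1)
top2 = top_k_accuracy(prob_rows, true_ids, k=2)
```

On a 2772-abstract test split, a 0.5pp difference in top-1 accuracy corresponds to roughly 14 abstracts, so the gap between the two models is small in absolute terms.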
Visualizations
Per-class top-1 F1 across the 28 GRNTI sections. Computed from real predictions of this model on the test split, sorted descending, with the macro-F1 reference line. Performance is strongest on well-separated domains (sport, food industry, literature) and weakest where section boundaries overlap (mechanical engineering, agriculture).
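The per-class F1 and its macro and weighted averages can be sketched in plain Python as follows. This is an illustrative implementation of the standard definitions, not the evaluation code from the repository; the toy labels are hypothetical. It also shows why macro and weighted F1 coincide on a balanced split, which is consistent with the identical 72.3% values reported for the main model.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per class from true/false positives and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores

def macro_and_weighted_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, and mean weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted

# Toy balanced example with two classes (hypothetical labels).
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
scores = per_class_f1(y_true, y_pred)
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
```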
Confusion matrix (28 x 28). Row-normalised over the test split. The dominant diagonal shows that, for most sections, the majority of abstracts are assigned to their true section; off-diagonal mass concentrates between thematically adjacent sections.
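Row normalisation divides each row of the raw count matrix by the number of true examples in that class, so each row sums to 1 and the diagonal entry equals that class's recall. A minimal sketch, using hypothetical integer class ids on a 3-class toy problem rather than the 28 real sections:

```python
def row_normalised_confusion(y_true, y_pred, n_classes):
    """Confusion matrix where row i is the distribution of predictions for true class i."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalised = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row for classes with no true examples.
        normalised.append([c / total if total else 0.0 for c in row])
    return normalised

# Toy data (hypothetical): rows are true classes, columns are predictions.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
matrix = row_normalised_confusion(y_true, y_pred, n_classes=3)
```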
Usage
```python
from transformers import pipeline

clf = pipeline("text-classification", model="kiselyovd/grnti-text-classifier", top_k=5)
clf("Исследование квантовой электродинамики в кристаллах.")
```
Each returned label is formatted as <GRNTI code>: <section name> (for example 290000: Физика).
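If downstream code needs the GRNTI code and the section name separately, the label string can be split on the first `": "`. The sketch below assumes the pipeline output is a list of dicts with `label` and `score` keys; the `parse_label` helper and the score value are hypothetical and only illustrate the format.

```python
def parse_label(label):
    """Split '<GRNTI code>: <section name>' into (code, name)."""
    code, _, name = label.partition(": ")  # split on the first ': ' only
    return code, name

# Hypothetical prediction in the shape a text-classification pipeline typically returns.
predictions = [{"label": "290000: Физика", "score": 0.91}]

parsed = [
    {"code": parse_label(p["label"])[0],
     "name": parse_label(p["label"])[1],
     "score": p["score"]}
    for p in predictions
]
```

Splitting on the first separator only keeps section names that themselves contain punctuation, such as "История. Исторические науки", intact.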
The 28 GRNTI sections
The label space is the 28 top-level GRNTI sections present in the dataset, spanning the humanities, social sciences, natural sciences, and engineering: Философия; История. Исторические науки; Социология; Экономика. Экономические науки; Государство и право. Юридические науки; Политика. Политические науки; Культура. Культурология; Народное образование. Педагогика; Психология; Языкознание; Литература. Литературоведение. Устное народное творчество; Искусство; Математика; Физика; Химия; Биология; Геология; Энергетика; Автоматика. Вычислительная техника; Горное дело; Машиностроение; Пищевая промышленность; Строительство. Архитектура; Сельское и лесное хозяйство (codes 680000 and 683500); Транспорт; Медицина и здравоохранение; Физическая культура и спорт.
Intended use and limitations
This model is trained for Russian-language top-level GRNTI section classification. It is not evaluated outside Russian scientific text and should not be used for generic multilingual classification. Outputs are probabilistic and subject to training-data biases; do not rely on this model for high-stakes decisions.
Training
- Dataset: ai-forever/ru-scibench-grnti-classification (MIT, 28 476 train + 2 772 test).
- Base model: FacebookAI/xlm-roberta-base; baseline DeepPavlov/rubert-base-cased.
- Tokenizer: xlm-roberta-base, max_length=256 (median sequence ~120 tokens).
- Precision: bf16-mixed on CUDA. Optimizer: AdamW with linear warmup and decay.
- Hyperparameters tuned with an Optuna sweep (val macro-F1), then a final training run with the best parameters (lr=3.1e-5, weight_decay=0.012, warmup_ratio=0.147, val macro-F1 = 73.1%).
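To make the "linear warmup and decay" schedule concrete, the sketch below computes the learning rate at a given optimizer step using the reported `lr` and `warmup_ratio`. The function name, the rounding of the warmup step count, and the total of 1000 steps are assumptions for illustration; the actual training code may count steps differently.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=3.1e-5, warmup_ratio=0.147):
    """Learning rate that rises linearly to base_lr, then decays linearly to 0."""
    warmup_steps = round(total_steps * warmup_ratio)  # rounding is an assumption
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * (step / max(1, warmup_steps))
    # Decay phase: scale from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))

# Hypothetical run length of 1000 optimizer steps, so warmup lasts 147 steps.
total = 1000
schedule = [linear_warmup_decay_lr(s, total) for s in (0, 50, 147, 500, 1000)]
```

With these assumed numbers the rate peaks at 3.1e-5 at step 147 and reaches zero at step 1000.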
Source and license
- Source code, training pipeline, and documentation: https://github.com/kiselyovd/grnti-text-classifier
- License: MIT.

