---
language: ru
license: apache-2.0
tags:
- russian
- discourse-markers
- pragmatics
- classification
- rubert
- linguistics
base_model: DeepPavlov/rubert-base-cased
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: DiMa_new
  results:
  - task:
      type: text-classification
      name: Binary classification (discourse marker usage)
    dataset:
      name: Internal spreadsheet (Russian DM candidates)
      type: private
      split: stratified hold-out (15%)
    metrics:
    - type: accuracy
      value: 0.9859
    - type: precision
      value: 0.9906
    - type: recall
      value: 0.9893
    - type: f1
      value: 0.9900
    - type: eval_loss
      value: 0.1074
---
# DiMa_new — Russian Discourse Marker Candidate Classifier (with `<cand>…</cand>` markers)
**Task.** Given a Russian sentence and a *known candidate span*, decide whether the candidate functions as a **discourse marker** (DM) in that context (binary: `dm` vs `not_dm`).
**Key design.** The candidate span is explicitly wrapped with special tokens `<cand>` and `</cand>`. The model is fine-tuned for **sequence classification** (2 labels). This design focuses the encoder’s attention on the pragmatic contribution of the marked span, avoiding BIO/NER complexity.
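
As an illustration of the input format described above, here is a minimal sketch of wrapping a known candidate span before classification. The helper name `mark_candidate` and the use of character offsets are assumptions for this example; the exact preprocessing used in training (for instance, whether spaces surround the markers) is not documented here.

```python
def mark_candidate(text, start, end, open_tag="<cand>", close_tag="</cand>"):
    """Wrap text[start:end] in the candidate markers (hypothetical helper)."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("candidate span must be a non-empty range inside the text")
    # Insert the markers directly around the span, leaving the rest untouched.
    return text[:start] + open_tag + text[start:end] + close_tag + text[end:]


marked = mark_candidate("Ну, я не знаю.", 0, 2)
print(marked)
```

The marked string is what a text-classification model of this kind would receive as input; the span offsets can come from manual annotation or from a candidate detector.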
---
# Optional: detect candidates with the built-in gazetteer
This repository ships a simple gazetteer (`assets/gazetteer.json` and `.txt`) built from the training spreadsheet to help detect candidate spans in raw text. Matching is greedy: longest entries first, with no overlapping matches.
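
The following is a sketch of one possible reading of "longest-first, non-overlapping greedy" matching. It assumes the gazetteer can be loaded as a plain list of marker strings; the actual file schema and the matching code shipped with the repository may differ, and `find_candidates` is a hypothetical name.

```python
import re


def find_candidates(text, gazetteer):
    """Return non-overlapping (start, end, surface) matches, longer entries first."""
    # Longer entries are tried first so multiword markers win over their parts.
    entries = sorted({e.strip() for e in gazetteer if e.strip()}, key=len, reverse=True)
    taken = [False] * len(text)
    matches = []
    for entry in entries:
        # Lookarounds act as word boundaries; \w covers Cyrillic letters for str patterns.
        pattern = re.compile(r"(?<!\w)" + re.escape(entry) + r"(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            # Skip a match if any of its characters belong to an earlier match.
            if not any(taken[m.start():m.end()]):
                for i in range(m.start(), m.end()):
                    taken[i] = True
                matches.append((m.start(), m.end(), m.group(0)))
    return sorted(matches)


spans = find_candidates("Так сказать, это так.", ["так", "так сказать"])
print(spans)
```

Each returned span can then be wrapped with the candidate markers and classified separately, since the classifier judges one marked candidate per input.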
# Evaluation
Results on the stratified hold-out split (15%) of the internal dataset:

| Metric    | Value  |
| --------- | ------ |
| Accuracy  | 0.9859 |
| Precision | 0.9906 |
| Recall    | 0.9893 |
| F1        | 0.9900 |
| Eval loss | 0.1074 |
# Training details
- Base model: `DeepPavlov/rubert-base-cased`
- Task: binary sequence classification (`dm` vs `not_dm`)
- Special tokens: `<cand>` and `</cand>`, added to the tokenizer
- Max length: 256 tokens
- Optimizer/schedule: Hugging Face `TrainingArguments` defaults, with warmup ratio 0.1
- Epochs: 4
- Batch size: 16
- Learning rate (initial): 2e-5
- Weight decay: 0.01
- Loss: cross-entropy with class weights (to mitigate moderate class imbalance)
- Hardware: not specified
- Reproducibility: fixed seed
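
The setup above can be sketched roughly as follows. This is not the original training script: the helper names are hypothetical, and the inverse-frequency weighting is an assumption, since the card does not say how the class weights were derived. The loss helper mirrors the normalisation that `torch.nn.CrossEntropyLoss(weight=...)` uses with mean reduction, written in plain Python so the arithmetic is visible.

```python
import math
from collections import Counter

CAND_TOKENS = ["<cand>", "</cand>"]


def load_with_cand_tokens(base="DeepPavlov/rubert-base-cased"):
    """Load the base encoder with a 2-label head and register the markers."""
    # Imported lazily so the loss helpers below work without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.add_special_tokens({"additional_special_tokens": CAND_TOKENS})
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
    # The new tokens need embedding rows, otherwise their ids are out of range.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_cross_entropy(logits, labels, weights):
    """Weighted CE averaged over the summed weights of the target classes."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm
```

With weights like these, errors on the rarer class contribute more to the loss, which is the usual motivation for class weighting under moderate imbalance.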
# Intended use & limitations
Intended use: research and demos in pragmatics/discourse for Russian; downstream tasks like discourse parsing, summarization, or language teaching where DM usage matters.
Non-goals: general NER; token-level BIO tagging.
Limitations:
- The gazetteer is not exhaustive; it’s a convenience for demos.
- The model decides on usage given a known candidate; it does not discover novel DM forms by itself.
- Validation is internal; external generalization is unmeasured here.
# Citation
If you use this model, please cite:
@misc{DiMa_new,
  title = {DiMa\_new: Russian Discourse Marker Candidate Classifier},
  author = {Maria Ols},
  year = {2025},
  note = {Hugging Face model: MariaOls/DiMa_new}
}