---
language: ru
license: apache-2.0
tags:
- russian
- discourse-markers
- pragmatics
- classification
- rubert
- linguistics
base_model: DeepPavlov/rubert-base-cased
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: DiMa_new
  results:
  - task:
      type: text-classification
      name: Binary classification (discourse marker usage)
    dataset:
      name: Internal spreadsheet (Russian DM candidates)
      type: private
      split: stratified hold-out (15%)
    metrics:
    - type: accuracy
      value: 0.9859
    - type: precision
      value: 0.9906
    - type: recall
      value: 0.9893
    - type: f1
      value: 0.9900
    - type: eval_loss
      value: 0.1074
---

# DiMa_new — Russian Discourse Marker Candidate Classifier (with `<cand>…</cand>` markers)

**Task.** Given a Russian sentence and a *known candidate span*, decide whether the candidate functions as a **discourse marker** (DM) in that context (binary: `dm` vs `not_dm`).

**Key design.** The candidate span is explicitly wrapped in the special tokens `<cand>` and `</cand>`, and the model is fine-tuned for **sequence classification** (2 labels). Marking the span tells the encoder which expression is being judged, so the task stays a single sentence-level decision and avoids BIO/NER-style token tagging.
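A minimal sketch of how an input could be prepared, assuming the candidate is given as character offsets into the sentence. The helper name `mark_candidate` is hypothetical, and whether the training data put whitespace around the markers is not stated here, so this version inserts none.

```python
def mark_candidate(sentence: str, start: int, end: int) -> str:
    """Wrap sentence[start:end] in <cand>...</cand> markers (hypothetical helper)."""
    if not (0 <= start < end <= len(sentence)):
        raise ValueError("candidate span must be a non-empty range inside the sentence")
    # Assumption: markers are placed directly around the span, with no extra spaces.
    return f"{sentence[:start]}<cand>{sentence[start:end]}</cand>{sentence[end:]}"


sentence = "Кстати, я уже купил билеты."
marked = mark_candidate(sentence, 0, len("Кстати"))
print(marked)  # <cand>Кстати</cand>, я уже купил билеты.
```

The marked string is what would then be passed to the tokenizer and classifier.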

---

# Optional: detect candidates with the built-in gazetteer

This repository ships a simple gazetteer (`assets/gazetteer.json` and `.txt`), built from the training spreadsheet, to help detect candidate spans in raw text. Matching is greedy: longest entries first, with no overlapping spans.
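The longest-first, non-overlapping greedy strategy can be sketched as follows. This is an illustration rather than the repository's own code: the function name `find_candidates` is hypothetical, the gazetteer is passed as an in-memory list because the file layout of `assets/gazetteer.json` is not described here, and the case-insensitive word-boundary matching is an assumption.

```python
import re


def find_candidates(text, gazetteer):
    """Return sorted, non-overlapping (start, end, surface) spans, longest entries first."""
    taken = [False] * len(text)  # characters already claimed by an accepted span
    spans = []
    # Longest entries first so multiword markers win over their sub-parts;
    # the secondary alphabetical key keeps the order deterministic.
    entries = sorted({e for e in gazetteer if e}, key=lambda e: (-len(e), e))
    for entry in entries:
        # Assumption: lookarounds act as word boundaries, so "но" does not match inside "окно".
        pattern = re.compile(rf"(?<!\w){re.escape(entry)}(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            if not any(taken[m.start():m.end()]):
                taken[m.start():m.end()] = [True] * (m.end() - m.start())
                spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)


gazetteer = ["так", "так что", "кстати"]  # hypothetical entries
text = "Кстати, уже поздно, так что пойдём."
spans = find_candidates(text, gazetteer)
print([surface for _, _, surface in spans])  # ['Кстати', 'так что']
```

Each returned span can then be wrapped in `<cand>…</cand>` and classified separately.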

# Evaluation

Metrics on the stratified hold-out split (15% of the internal dataset):

| Metric | Value |
| --------- | ------ |
| Accuracy | 0.9859 |
| Precision | 0.9906 |
| Recall | 0.9893 |
| F1 | 0.9900 |
| Eval loss | 0.1074 |
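As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded table values, so agreement is only expected up to rounding.

```python
precision = 0.9906
recall = 0.9893

# F1 = 2PR / (P + R), computed from the rounded values in the table.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.99
```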

# Training details

- Base model: `DeepPavlov/rubert-base-cased`
- Task: binary sequence classification (`dm` vs `not_dm`)
- Special tokens: `<cand>` and `</cand>`, added to the tokenizer
- Max length: 256 tokens
- Optimizer/schedule: Hugging Face `TrainingArguments` defaults, with warmup ratio 0.1
- Epochs: 4
- Batch size: 16
- Learning rate (init): 2e-5
- Weight decay: 0.01
- Loss: cross-entropy with class weights (to mitigate moderate class imbalance)
- Hardware: not specified
- Reproducibility: fixed seed
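To make the schedule concrete, here is a sketch of the learning rate over training, assuming "default" means the Trainer's linear scheduler: linear warmup to the peak rate, then linear decay to zero. The function name `lr_at_step` and the total step count are hypothetical, since the dataset size is not given.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by linear decay (assumed schedule)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        scale = step / max(1, warmup_steps)
    else:
        # Decay: fall linearly from peak_lr to 0 at the final step.
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * scale


total_steps = 1000  # hypothetical; depends on dataset size, batch size 16, and 4 epochs
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```

With these assumptions the rate peaks at 2e-5 once 10% of the steps have run and reaches zero at the last step.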

# Intended use & limitations

Intended use: research and demos in pragmatics/discourse for Russian; downstream tasks like discourse parsing, summarization, or language teaching where DM usage matters.

Non-goals: general NER; token-level BIO tagging.

Limitations:

- The gazetteer is not exhaustive; it’s a convenience for demos.
- The model decides on usage given a known candidate; it does not discover novel DM forms by itself.
- Validation is internal; external generalization is unmeasured here.

# Citation

If you use this model, please cite:

@misc {DiMa_new,
title = {DiMa\_new: Russian Discourse Marker Candidate Classifier},
author = {Maria Ols},
year = {2025},
note = {Hugging Face model: MariaOls/DiMa_new}
}
