---
language: ru
license: apache-2.0
library_name: transformers
tags:
- russian
- morpheme-segmentation
- token-classification
- morphbert
- lightweight
pipeline_tag: token-classification
---

# MorphBERT-Tiny: Russian Morpheme Segmentation

This repository contains the `CrabInHoney/morphbert-tiny-morpheme-segmentation-ru` model, a highly compact transformer-based system fine-tuned for morpheme segmentation of Russian words. The model classifies each character of a given word into one of four morpheme categories: Prefix (PREF), Root (ROOT), Suffix (SUFF), or Ending (END).

## Model Description

`morphbert-tiny-morpheme-segmentation-ru` uses a lightweight transformer architecture, enabling efficient deployment and inference while maintaining high performance on character-level morphological analysis. Despite its small size, the model identifies the constituent morphemes of Russian words with considerable accuracy.

The model was derived through logit distillation from a larger teacher model comparable in complexity to `bert-base`.

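The distillation recipe itself is not included in this repository. As a rough illustration only, logit distillation of this kind typically combines a KL-divergence term between temperature-softened teacher and student logits with the usual cross-entropy loss on gold labels; the sketch below follows that pattern, and the `temperature` and `alpha` values are placeholders, not the actual training hyperparameters.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Illustrative logit-distillation objective (not the actual training code).

    student_logits, teacher_logits: (batch, seq_len, num_labels)
    labels: (batch, seq_len), with -100 marking positions to ignore.
    """
    # Soft targets: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy against gold morpheme labels
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * soft_loss + (1 - alpha) * hard_loss
```
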
**Key Features:**

* **Task:** Morpheme Segmentation (Token Classification at Character Level)
* **Language:** Russian (ru)
* **Architecture:** Transformer (BERT-like, optimized for size)
* **Labels:** PREF, ROOT, SUFF, END

**Model Size & Specifications:**

* **Parameters:** ~3.58 Million
* **Tensor Type:** F32
* **Disk Footprint:** ~14.3 MB

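The disk footprint follows from the parameter count: roughly 3.58 million float32 parameters × 4 bytes ≈ 14.3 MB. A minimal sanity check of these figures (the expected values in the comments are the numbers listed above, not guaranteed output):

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "CrabInHoney/morphbert-tiny-morpheme-segmentation-ru"
)

# Count parameters and estimate the float32 storage size
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.2f}M")                        # expected ≈ 3.58M
print(f"Approx. FP32 footprint: {n_params * 4 / 1024**2:.1f} MB")  # expected ≈ 14.3 MB

# Label inventory (should cover the PREF / ROOT / SUFF / END classes)
print(model.config.id2label)
```
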
## Usage

The model can be easily used with the Hugging Face `transformers` library. It processes words character by character.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "CrabInHoney/morphbert-tiny-morpheme-segmentation-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

def analyze(word):
    # Treat each character of the word as a separate token
    tokens = list(word)
    encoded = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=34)
    with torch.no_grad():
        logits = model(**encoded).logits
    predictions = logits.argmax(dim=-1)[0]

    # Map sub-token predictions back to the original characters
    word_ids = encoded.word_ids()
    output = []
    for i, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx < len(tokens):
            label_id = predictions[i].item()
            label = model.config.id2label[label_id]
            output.append(f"{tokens[word_idx]}:{label}")
    return " / ".join(output)

# Example words
for word in ["масляный", "предчувствий", "тарковский", "кот", "подгон"]:
    print(f"{word} → {analyze(word)}")
```

## Example Predictions

```
масляный → м:ROOT / а:ROOT / с:ROOT / л:ROOT / я:SUFF / н:SUFF / ы:END / й:END
предчувствий → п:PREF / р:PREF / е:PREF / д:PREF / ч:ROOT / у:ROOT / в:SUFF / с:SUFF / т:SUFF / в:SUFF / и:END / й:END
тарковский → т:ROOT / а:ROOT / р:ROOT / к:ROOT / о:SUFF / в:SUFF / с:SUFF / к:SUFF / и:END / й:END
кот → к:ROOT / о:ROOT / т:ROOT
подгон → п:PREF / о:PREF / д:PREF / г:ROOT / о:ROOT / н:ROOT
```

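Since the model assigns a class to every character rather than predicting boundaries directly, adjacent characters with the same label can be merged into morpheme spans as a post-processing step. A minimal sketch, assuming the `analyze` function from the usage example above; note that two consecutive morphemes of the same type (e.g. two suffixes in a row) would be merged by this approach:

```python
from itertools import groupby

def group_morphemes(analysis: str):
    """Collapse per-character output like 'к:ROOT / о:ROOT / т:ROOT'
    into (morpheme, label) spans, e.g. [('кот', 'ROOT')]."""
    pairs = [item.split(":") for item in analysis.split(" / ")]
    spans = []
    for label, chunk in groupby(pairs, key=lambda p: p[1]):
        spans.append(("".join(ch for ch, _ in chunk), label))
    return spans

print(group_morphemes(analyze("подгон")))
# Based on the example output above: [('под', 'PREF'), ('гон', 'ROOT')]
```
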
## Performance

The model achieves an approximate character-level accuracy of **0.975** on its evaluation dataset.

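Here, character-level accuracy means the fraction of characters whose predicted class matches the gold annotation. A minimal sketch of how such a score could be computed, assuming the `analyze` function from the usage example and a hypothetical list `eval_data` of gold-annotated words (the actual evaluation dataset is not included in this repository):

```python
def char_accuracy(eval_data):
    """eval_data: list of (word, gold_labels) pairs, where gold_labels is a
    list of per-character tags such as ['PREF', 'ROOT', 'ROOT', ...]."""
    correct = total = 0
    for word, gold in eval_data:
        predicted = [item.split(":")[1] for item in analyze(word).split(" / ")]
        for p, g in zip(predicted, gold):
            correct += int(p == g)
            total += 1
    return correct / total if total else 0.0

# Hypothetical gold annotation, for illustration only:
sample = [("кот", ["ROOT", "ROOT", "ROOT"])]
print(char_accuracy(sample))
```
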
## Limitations

* Performance may vary on out-of-vocabulary words, neologisms, or highly complex morphological structures not sufficiently represented in the training data.
* The model operates strictly at the character level; it does not incorporate broader lexical or syntactic context.
* Ambiguous morpheme boundaries are resolved according to patterns learned during training, which may not always align with linguistic conventions in edge cases.