Dariachup committed on
Commit eef8f3b · verified · 1 Parent(s): c2e75f8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +187 -0
README.md ADDED
@@ -0,0 +1,187 @@
+ ---
+ library_name: dormouse
+ tags:
+ - ukrainian
+ - nlp
+ - tokenization
+ - text-optimization
+ - seq2seq
+ - translation
+ - ua-en
+ language:
+ - uk
+ - en
+ license: mit
+ pipeline_tag: translation
+ datasets:
+ - Dariachup/dormouse-corpus
+ ---
+
+ # dormouse: Ukrainian Text Optimizer for LLMs
+
+ **Seq2seq expression translator (UA→EN)** trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.
+
+ This repository contains model weights and lexicon data for the [dormouse-ua](https://pypi.org/project/dormouse-ua/) Python library.
+
+ ## What this model does
+
+ Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:
+
+ ```
+ "немає резюме" → "no summary given"
+ "запустити програму" → "execute the program"
+ "повна синхронізація" → "full synchronization"
+ "горить дедлайн" → "deadline approaching"
+ "зберегти закладки" → "save bookmarks"
+ ```
+
+ This is **not** a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.
+
+ ## Model Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Architecture | GRU Encoder-Decoder with Attention |
+ | Parameters | **7.3M** |
+ | Encoder | Bidirectional GRU, hidden=256, embed=128 |
+ | Decoder | GRU with Bahdanau attention |
+ | Source vocab | 15,679 tokens (Ukrainian) |
+ | Target vocab | 9,608 tokens (English) |
+ | Dropout | 0.0 (inference) |
+ | Training pairs | 28,149 |
+ | Validation set | 500 pairs |
+ | Framework | PyTorch |
+
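As a rough sanity check on the table above, the parameter count can be estimated from the listed sizes. This is only a sketch: the README gives the vocabulary, embedding, and hidden sizes, but the exact layer layout (decoder input, attention projections, the encoder-to-decoder bridge, the output layer) is assumed here, so agreement with the reported figure does not confirm the real architecture.

```python
def gru_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one unidirectional GRU layer in the PyTorch layout:
    3 gates, input and hidden weight matrices, and two bias vectors."""
    return 3 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


# Sizes taken from the Model Details table.
SRC_VOCAB, TGT_VOCAB = 15_679, 9_608
EMBED, HIDDEN = 128, 256

counts = {
    "src_embedding": SRC_VOCAB * EMBED,
    "tgt_embedding": TGT_VOCAB * EMBED,
    "encoder_bigru": 2 * gru_params(EMBED, HIDDEN),
    # Assumption: decoder input is [embedding ; context], context = 2 * HIDDEN.
    "decoder_gru": gru_params(EMBED + 2 * HIDDEN, HIDDEN),
    # Assumption: additive attention with W_enc, W_dec and a scoring vector v.
    "attention": (
        linear_params(2 * HIDDEN, HIDDEN)
        + linear_params(HIDDEN, HIDDEN, bias=False)
        + linear_params(HIDDEN, 1, bias=False)
    ),
    # Assumption: a dense bridge maps the encoder state to the decoder state.
    "bridge": linear_params(2 * HIDDEN, HIDDEN),
    # Assumption: the output layer projects the decoder state to the vocab.
    "output_projection": linear_params(HIDDEN, TGT_VOCAB),
}

total = sum(counts.values())
size_mib = total * 4 / 2**20  # float32 weights, 4 bytes each

print(f"{total / 1e6:.1f}M parameters, about {size_mib:.0f} MiB as float32")
```

Under these assumptions the estimate comes out close to the 7.3M parameters in the table and to the 28MB size listed for `expr_seq2seq.pt`.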
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | Exact match (val) | **98.2%** |
+ | Word overlap (val) | **99.33%** |
+ | Token savings (full pipeline) | **73%** |
+ | GPT quality preservation | **150%** (squeezed text scores higher than the original) |
+
+ Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline (lexicon + seq2seq) achieves a 73% token reduction, and GPT-4 understands the squeezed text **better** than the original Ukrainian (100% vs 67% accuracy on IT prompts).
+
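The README does not define "exact match" or "word overlap", so the following is a minimal sketch of one plausible reading: exact match compares whole strings after case and whitespace normalization, and word overlap is the mean fraction of reference words recovered by the prediction. Both function names and both definitions are assumptions, not the project's evaluation code.

```python
from collections import Counter


def exact_match(predictions, references):
    """Share of predictions equal to their reference (case-insensitive)."""
    pairs = list(zip(predictions, references))
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return hits / len(pairs)


def word_overlap(predictions, references):
    """Mean fraction of reference words that also occur in the prediction.
    Words are matched as a multiset, so repeats must be repeated."""
    scores = []
    for pred, ref in zip(predictions, references):
        ref_words = Counter(ref.lower().split())
        pred_words = Counter(pred.lower().split())
        matched = sum((ref_words & pred_words).values())
        scores.append(matched / max(sum(ref_words.values()), 1))
    return sum(scores) / len(scores)


preds = ["execute the program", "save files"]
refs = ["execute the program", "save bookmarks"]
print(exact_match(preds, refs), word_overlap(preds, refs))
```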
+ ## Training
+
+ **Data sources:**
+ - OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
+ - Auto-generated expression pairs via LLM: 7.7K entries
+ - Telegram slang/surzhyk: 802 entries
+ - Manual UA→EN mappings: 208 entries
+
+ **Training configuration:**
+ - Optimizer: Adam
+ - Loss: CrossEntropyLoss (ignore padding)
+ - Label smoothing: applied during training
+ - Anti-overfitting: dropout in encoder/decoder during training, smaller model size
+ - Hardware: Hugging Face Spaces (free tier CPU)
+
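To make the loss setup in the list above concrete, here is a dependency-free sketch of label-smoothed cross-entropy that skips padding positions. It is meant to follow the same definition as PyTorch's `CrossEntropyLoss` with `ignore_index` and `label_smoothing`; the padding index `0` and the smoothing value `0.1` are assumptions, since the README states neither.

```python
import math

PAD_ID = 0  # assumed padding index


def smoothed_cross_entropy(logits, targets, smoothing=0.1, pad_id=PAD_ID):
    """Mean label-smoothed cross-entropy over non-padding positions.

    logits:  list of per-position score lists (one score per class)
    targets: list of class ids, with pad_id marking positions to ignore
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_id:
            continue  # padding contributes nothing to the loss
        # Numerically stable log-softmax.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        log_probs = [s - log_z for s in scores]
        nll = -log_probs[target]                     # one-hot target term
        uniform = -sum(log_probs) / len(log_probs)   # uniform target term
        total += (1 - smoothing) * nll + smoothing * uniform
        count += 1
    return total / count if count else 0.0


# The second position is padding and is ignored.
loss = smoothed_cross_entropy([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, PAD_ID])
print(loss)
```

Smoothing moves a small share of the target probability onto all classes, which discourages overconfident predictions on a small training set.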
+ **Data pipeline:**
+ ```
+ Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
+ ```
+
+ ## Files
+
+ | File | Size | Description |
+ |------|------|-------------|
+ | `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
+ | `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.7K tokens) |
+ | `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
+ | `expr_config.json` | 108B | Model hyperparameters |
+ | `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
+
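The README describes `lexicon.db` only as a SQLite file of UA→EN word mappings and does not document its schema. The sketch below therefore builds a tiny in-memory database with a hypothetical `lexicon(ua, en)` table to show what a word-by-word lookup could look like; the table name, column names, and `translate_words` helper are all illustrative.

```python
import sqlite3

# Hypothetical schema: the real lexicon.db layout is not documented here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (ua TEXT PRIMARY KEY, en TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO lexicon (ua, en) VALUES (?, ?)",
    [("зберегти", "save"), ("закладки", "bookmarks")],
)


def translate_words(text: str, db: sqlite3.Connection) -> str:
    """Translate word by word; words missing from the lexicon pass through."""
    out = []
    for word in text.lower().split():
        row = db.execute(
            "SELECT en FROM lexicon WHERE ua = ?", (word,)
        ).fetchone()
        out.append(row[0] if row else word)
    return " ".join(out)


print(translate_words("зберегти закладки", conn))
```

Passing unknown words through unchanged keeps the lookup lossless, at the cost of leaving some Cyrillic tokens in the output.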
+ ## Usage
+
+ ### Via pip (recommended)
+
+ ```bash
+ pip install dormouse-ua
+ ```
+
+ ```python
+ from dormouse import squeeze
+
+ # Full pipeline: normalize → compress → translate (uses this model)
+ squeeze("блін продакшн впав після деплою", target="cloud")
+ # → "damn production crashed after deploy"
+ # Tokens: 45 → 12 (-73%)
+ ```
+
+ Assets are downloaded automatically on first use to `~/.cache/dormouse/v0.3.0/`.
+
+ ### Direct model usage
+
+ ```python
+ import torch
+ from dormouse.seq2seq import wake_up_expr
+
+ # Load the model together with its source and target vocabularies
+ model, src_vocab, tgt_vocab = wake_up_expr()
+
+ # Encode the Ukrainian expression as a tensor of token ids, then translate
+ text = "запустити програму"
+ src_ids = torch.tensor(src_vocab.encode(text))
+ result = model.translate(src_ids, tgt_vocab)
+ print(result) # "execute the program"
+ ```
+
+ ## Use Cases
+
+ 1. **LLM token optimization**: Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of tokens.
+
+ 2. **Chatbot preprocessing**: Normalize surzhyk/slang before sending text to GPT/Claude. In the evaluation reported under Performance, accuracy on IT prompts rose from 67% to 100%.
+
+ 3. **Cost reduction**: For 10K Ukrainian prompts/day sent through GPT, expect 60-73% savings on input token costs.
+
+ 4. **AI agents**: Compress Ukrainian context for longer agent memory. A 73% reduction means the same context window holds roughly 3.7x as much text (1 / 0.27).
+
+ 5. **Local search & classification**: The lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.
+
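The arithmetic behind the token and cost figures can be worked through in a few lines. The token counts (45 and 12) and the 10K prompts/day volume come from the README; the helper names are illustrative, and the price per million tokens is a placeholder, not a real rate for any provider.

```python
def token_savings(before: int, after: int) -> float:
    """Fraction of tokens removed by compression."""
    return 1 - after / before


def context_multiplier(savings: float) -> float:
    """How many times more source text fits in a fixed context window."""
    return 1 / (1 - savings)


def daily_input_cost(prompts_per_day, tokens_per_prompt, usd_per_million):
    """Input-token cost per day; usd_per_million is a placeholder price."""
    return prompts_per_day * tokens_per_prompt * usd_per_million / 1_000_000


savings = token_savings(45, 12)        # the squeeze() example: 45 -> 12 tokens
multiplier = context_multiplier(0.73)  # the headline 73% reduction

PLACEHOLDER_PRICE = 1.0  # hypothetical USD per million input tokens
cost_before = daily_input_cost(10_000, 45, PLACEHOLDER_PRICE)
cost_after = daily_input_cost(10_000, 12, PLACEHOLDER_PRICE)

print(f"savings {savings:.0%}, context x{multiplier:.1f}, "
      f"daily cost {cost_before:.2f} -> {cost_after:.2f} USD")
```

A given reduction in tokens does not add the same percentage of context: removing 73% of the tokens leaves 27%, so a fixed window holds about 1 / 0.27 ≈ 3.7 times as much source text.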
+ ## Full Pipeline
+
+ ```mermaid
+ graph LR
+ A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
+ B --> C[compress<br/>remove fillers]
+ C --> D[seq2seq<br/>this model]
+ C --> E[lexicon.db<br/>word-by-word]
+ D --> F[EN compressed]
+ E --> F
+
+ style A fill:#fdd,stroke:#c33
+ style F fill:#dfd,stroke:#3a3
+ style D fill:#def,stroke:#38a
+ ```
+
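The diagram can be read as three stages: normalize, drop fillers, then translate expressions and single words. The sketch below wires toy stand-ins together to show that flow. Everything in it is illustrative: the filler list and word table are hypothetical, a dictionary stands in for the seq2seq model, and trying expressions before the word lexicon is only one plausible way to combine the two branches.

```python
FILLERS = {"ну", "типу"}  # hypothetical filler words
EXPRESSIONS = {           # stand-in for the seq2seq expression model
    "горить дедлайн": "deadline approaching",
    "запустити програму": "execute the program",
}
LEXICON = {"терміново": "urgently"}  # stand-in for lexicon.db


def crack_open(text: str) -> str:
    """Stand-in for normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def compress(text: str) -> str:
    """Drop filler words."""
    return " ".join(w for w in text.split() if w not in FILLERS)


def translate(text: str, max_len: int = 4) -> str:
    """Match the longest known expression first, else go word by word."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in EXPRESSIONS:
                out.append(EXPRESSIONS[phrase])
                i += n
                break
        else:
            # No multi-word expression starts here: fall back to one word.
            out.append(LEXICON.get(words[i], words[i]))
            i += 1
    return " ".join(out)


def squeeze_sketch(text: str) -> str:
    """Toy end-to-end pipeline; not the library's squeeze()."""
    return translate(compress(crack_open(text)))


print(squeeze_sketch("Ну типу горить дедлайн терміново"))
```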
+ ## Comparison
+
+ | Approach | Ukrainian support | Token savings | Quality impact |
+ |----------|:-----------------:|:------------:|:--------------:|
+ | **dormouse (this model)** | native | **73%** | **+50%** |
+ | LLMLingua | no | up to 20x | -5-15% |
+ | Selective Context | no | 40-50% | -10-20% |
+ | Google Translate | partial | 30-40% | variable |
+
+ [Research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025)](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)
+
+ ## Links
+
+ - **PyPI:** [dormouse-ua](https://pypi.org/project/dormouse-ua/)
+ - **GitHub:** [ChuprinaDaria/dormouse](https://github.com/ChuprinaDaria/dormouse)
+ - **Author:** [Daria Chuprina](https://www.linkedin.com/in/dchuprina/) | [Lazysoft](https://lazysoft.pl/) | dchuprina@lazysoft.pl
+
+ ## License
+
+ MIT
+
+ ## Citation
+
+ ```bibtex
+ @software{dormouse2026,
+ author = {Chuprina, Daria},
+ title = {dormouse: Ukrainian Text Optimizer for LLMs},
+ year = {2026},
+ url = {https://github.com/ChuprinaDaria/dormouse},
+ }
+ ```