---
library_name: dormouse
tags:
  - ukrainian
  - nlp
  - tokenization
  - text-optimization
  - seq2seq
  - translation
  - ua-en
language:
  - uk
  - en
license: mit
pipeline_tag: translation
datasets:
  - Dariachup/dormouse-corpus
---

# dormouse — Ukrainian Text Optimizer for LLMs

**Seq2seq expression translator (UA→EN)** trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.

This repository contains model weights and lexicon data for the [dormouse-ua](https://pypi.org/project/dormouse-ua/) Python library.

## What this model does

Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:

```
"немає резюме"          → "no summary given"
"запустити програму"    → "execute the program"
"повна синхронізація"   → "full synchronization"
"горить дедлайн"        → "deadline approaching"
"зберегти закладки"     → "save bookmarks"
```

This is **not** a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | GRU Encoder-Decoder with Attention |
| Parameters | **7.3M** |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |
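
For orientation, here is a minimal PyTorch sketch of the architecture the table describes (bidirectional GRU encoder, GRU decoder with Bahdanau attention, decoding one step at a time). The class names, the way the two encoder directions are merged, and the smoke test at the end are illustrative assumptions, not the library's actual module layout; load `expr_seq2seq.pt` through `dormouse` for the real weights.

```python
import torch
import torch.nn as nn

# Sizes taken from the table above; everything else is an assumption.
SRC_VOCAB, TGT_VOCAB = 15_679, 9_608
EMBED, HIDDEN = 128, 256

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMBED)
        self.gru = nn.GRU(EMBED, HIDDEN, bidirectional=True, batch_first=True)

    def forward(self, src):                        # src: (B, S)
        out, h = self.gru(self.embed(src))         # out: (B, S, 2*HIDDEN)
        h = torch.tanh(h[0] + h[1]).unsqueeze(0)   # merge directions: (1, B, HIDDEN)
        return out, h

class BahdanauAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.w_enc = nn.Linear(2 * HIDDEN, HIDDEN)
        self.w_dec = nn.Linear(HIDDEN, HIDDEN)
        self.v = nn.Linear(HIDDEN, 1)

    def forward(self, dec_h, enc_out):             # dec_h: (B, HIDDEN)
        score = self.v(torch.tanh(self.w_enc(enc_out) + self.w_dec(dec_h).unsqueeze(1)))
        weights = torch.softmax(score, dim=1)      # attention over source positions
        return (weights * enc_out).sum(dim=1)      # context: (B, 2*HIDDEN)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMBED)
        self.attn = BahdanauAttention()
        self.gru = nn.GRU(EMBED + 2 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, tok, h, enc_out):            # one decoding step; tok: (B, 1)
        ctx = self.attn(h[-1], enc_out)
        x = torch.cat([self.embed(tok), ctx.unsqueeze(1)], dim=-1)
        out, h = self.gru(x, h)
        return self.out(out.squeeze(1)), h         # logits: (B, TGT_VOCAB)

# Smoke test with a dummy 3-token expression.
enc, dec = Encoder(), Decoder()
enc_out, h = enc(torch.randint(0, SRC_VOCAB, (1, 3)))
logits, h = dec(torch.zeros(1, 1, dtype=torch.long), h, enc_out)
print(logits.shape)  # torch.Size([1, 9608])
```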

## Performance

| Metric | Value |
|--------|-------|
| Exact match (val) | **98.2%** |
| Word overlap (val) | **99.33%** |
| Token savings (full pipeline) | **73%** |
| GPT quality preservation | **150%** (squeezed > original) |

Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline (lexicon + seq2seq) achieves 73% token reduction, and GPT-4 answers squeezed prompts **more accurately** than the original Ukrainian (100% vs 67% on IT prompts; the 150% figure above is the ratio of those two accuracies).
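
The card does not publish the metric definitions; as a hedged reading, the two validation numbers are consistent with definitions like the sketch below (the word-overlap formula in particular is an assumption):

```python
def exact_match(pred: str, ref: str) -> bool:
    """Whole-string equality after whitespace normalization."""
    return " ".join(pred.split()) == " ".join(ref.split())

def word_overlap(pred: str, ref: str) -> float:
    """Assumed definition: fraction of reference words recovered in the prediction."""
    ref_words = ref.lower().split()
    pred_words = set(pred.lower().split())
    return sum(w in pred_words for w in ref_words) / max(len(ref_words), 1)

print(exact_match("execute the program", "execute the program"))  # True
print(word_overlap("execute a program", "execute the program"))   # 0.666...
```

Averaged over the 500-pair validation set, these would correspond to 0.982 and 0.9933 respectively.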

## Training

**Data sources:**
- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries

**Training configuration:**
- Optimizer: Adam
- Loss: CrossEntropyLoss (ignore padding)
- Label smoothing: applied during training
- Anti-overfitting: dropout in encoder/decoder during training, plus a deliberately small model size
- Hardware: HuggingFace Spaces (free tier CPU)
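
A minimal sketch of the loss setup listed above; the `PAD_ID` value and the 0.1 label-smoothing factor are assumptions (the published configuration does not state them), and the dummy tensors stand in for real decoder output:

```python
import torch
import torch.nn as nn

PAD_ID, TGT_VOCAB = 0, 9_608
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID, label_smoothing=0.1)

# Dummy batch: decoder logits and PAD-padded gold targets.
logits = torch.randn(4, 7, TGT_VOCAB, requires_grad=True)  # (B, T, V)
targets = torch.randint(1, TGT_VOCAB, (4, 7))              # (B, T)
targets[:, 5:] = PAD_ID                                    # padded tail is ignored by the loss

optimizer = torch.optim.Adam([logits], lr=1e-3)
loss = criterion(logits.reshape(-1, TGT_VOCAB), targets.reshape(-1))
loss.backward()
optimizer.step()
```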

**Data pipeline:**
```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```

## Files

| File | Size | Description |
|------|------|-------------|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
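
These assets can be inspected directly once downloaded. A short sketch, assuming the default cache path from the Usage section below; the `lexicon.db` schema is not documented here, so the snippet lists its tables instead of guessing column names:

```python
import json
import sqlite3
from pathlib import Path

# Default cache location (see "Usage" below); adjust if you downloaded elsewhere.
cache = Path.home() / ".cache" / "dormouse" / "v0.3.0"

config = json.loads((cache / "expr_config.json").read_text())
print(config)  # model hyperparameters (hidden size, embedding size, ...)

con = sqlite3.connect(cache / "lexicon.db")
print(con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
```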

## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```

Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.

### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```

## Use Cases

1. **LLM token optimization** — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of input tokens (see the token-count sketch after this list).

2. **Chatbot preprocessing** — Normalize surzhyk/slang before sending to GPT/Claude. Response quality improves from 67% to 100%.

3. **Cost reduction** — 10K Ukrainian prompts/day through GPT → 60-73% savings on input token costs.

4. **AI agents** — Compress Ukrainian context for longer agent memory. At 73% compression, the same context window holds roughly 3.7x as much original Ukrainian text.

5. **Local search & classification** — The lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.
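
The token-count sketch referenced in use case 1: a quick check with the `tiktoken` tokenizer (an external tool, not part of dormouse) shows why Cyrillic input is expensive:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era tokenizer

ua = "запустити програму"
en = "execute the program"
print(len(enc.encode(ua)), len(enc.encode(en)))
# Cyrillic words typically split into several tokens each, while English words
# are often one token; that is the 3-4x gap the pipeline recovers by translating first.
```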

## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F

    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```

## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|----------|:-----------------:|:------------:|:--------------:|
| **dormouse (this model)** | native | **73%** | **+50%** |
| LLMLingua | no | up to 20x | -5% to -15% |
| Selective Context | no | 40-50% | -10% to -20% |
| Google Translate | partial | 30-40% | variable |

[Research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025)](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)

## Links

- **PyPI:** [dormouse-ua](https://pypi.org/project/dormouse-ua/)
- **GitHub:** [ChuprinaDaria/dormouse](https://github.com/ChuprinaDaria/dormouse)
- **Author:** [Daria Chuprina](https://www.linkedin.com/in/dchuprina/) | [Lazysoft](https://lazysoft.pl/) | dchuprina@lazysoft.pl

## License

MIT

## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title = {dormouse: Ukrainian Text Optimizer for LLMs},
  year = {2026},
  url = {https://github.com/ChuprinaDaria/dormouse},
}
```