---
library_name: dormouse
tags:
- ukrainian
- nlp
- tokenization
- text-optimization
- seq2seq
- translation
- ua-en
language:
- uk
- en
license: mit
pipeline_tag: translation
datasets:
- Dariachup/dormouse-corpus
---
# dormouse — Ukrainian Text Optimizer for LLMs
**Seq2seq expression translator (UA→EN)** trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.
This repository contains model weights and lexicon data for the [dormouse-ua](https://pypi.org/project/dormouse-ua/) Python library.
## What this model does
Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:
```
"немає резюме" → "no summary given"
"запустити програму" → "execute the program"
"повна синхронізація" → "full synchronization"
"горить дедлайн" → "deadline approaching"
"зберегти закладки" → "save bookmarks"
```
This is **not** a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.
## Model Details
| Parameter | Value |
|-----------|-------|
| Architecture | GRU Encoder-Decoder with Attention |
| Parameters | **7.3M** |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |
## Performance
| Metric | Value |
|--------|-------|
| Exact match (val) | **98.2%** |
| Word overlap (val) | **99.33%** |
| Token savings (full pipeline) | **73%** |
| GPT quality preservation | **150%** (GPT answers more accurately from squeezed text than from the original) |
Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). Full pipeline with lexicon + seq2seq achieves 73% token reduction while GPT-4 understands squeezed text **better** than original Ukrainian (100% vs 67% accuracy on IT prompts).
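The token-savings figure is plain arithmetic over token counts before and after compression; a minimal sketch, using the 45 → 12 counts from the usage example in this card:

```python
def token_savings(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of input tokens saved by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1 - compressed_tokens / original_tokens

# 45 tokens of Ukrainian input -> 12 tokens of compressed English
print(round(token_savings(45, 12) * 100))  # → 73
```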
## Training
**Data sources:**
- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries
**Training configuration:**
- Optimizer: Adam
- Loss: CrossEntropyLoss (ignore padding)
- Label smoothing: applied during training
- Anti-overfitting: dropout in the encoder/decoder during training, plus a deliberately small model
- Hardware: HuggingFace Spaces (free tier CPU)
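The label smoothing mentioned above replaces the one-hot target with a mixture of the true class and a uniform distribution over the vocabulary. A pure-Python sketch of the loss for one token (the smoothing factor 0.1 and the toy distribution are illustrative assumptions, not the values used in training):

```python
import math

def label_smoothed_ce(log_probs: list[float], target: int, eps: float = 0.1) -> float:
    """Cross-entropy against a smoothed target: (1 - eps) on the true
    class, with eps spread uniformly over the whole vocabulary."""
    vocab = len(log_probs)
    smooth = eps / vocab
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = (1 - eps) + smooth if i == target else smooth
        loss -= q * lp
    return loss

# Toy model output over a 4-token vocabulary
probs = [0.7, 0.1, 0.1, 0.1]
log_probs = [math.log(p) for p in probs]
print(round(label_smoothed_ce(log_probs, target=0), 4))  # → 0.5026
```

With `eps=0` this reduces to ordinary cross-entropy; the smoothed version penalizes overconfident predictions, which helps on a corpus of this size.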
**Data pipeline:**
```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```
## Files
| File | Size | Description |
|------|------|-------------|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
## Usage
### Via pip (recommended)
```bash
pip install dormouse-ua
```
```python
from dormouse import squeeze
# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```
Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.
### Direct model usage
```python
import torch
from dormouse.seq2seq import wake_up_expr
model, src_vocab, tgt_vocab = wake_up_expr()
text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result) # "execute the program"
```
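The vocabulary files are plain JSON; a wrapper along the following lines could back the `encode` call above. The class name, special token, and whitespace tokenization are assumptions for illustration, not the library's actual implementation:

```python
import json

class Vocab:
    """Hypothetical wrapper around an expr_vocab_*.json token->id map."""

    def __init__(self, token_to_id: dict[str, int], unk: str = "<unk>"):
        self.token_to_id = token_to_id
        self.unk_id = token_to_id[unk]

    @classmethod
    def load(cls, path: str) -> "Vocab":
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def encode(self, text: str) -> list[int]:
        # Whitespace tokenization; out-of-vocabulary words map to <unk>
        return [self.token_to_id.get(tok, self.unk_id) for tok in text.split()]

vocab = Vocab({"<unk>": 0, "запустити": 1, "програму": 2})
print(vocab.encode("запустити програму"))  # → [1, 2]
```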
## Use Cases
1. **LLM token optimization** — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% tokens.
2. **Chatbot preprocessing** — Normalize surzhyk/slang before sending to GPT/Claude. Response quality improves from 67% to 100%.
3. **Cost reduction** — 10K Ukrainian prompts/day through GPT → 60-73% savings on input token costs.
4. **AI agents** — Compress Ukrainian context for longer agent memory. At 73% compression, the same context window holds roughly 3.7x more source text.
5. **Local search & classification** — The lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.
## Full Pipeline
```mermaid
graph LR
A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
B --> C[compress<br/>remove fillers]
C --> D[seq2seq<br/>this model]
C --> E[lexicon.db<br/>word-by-word]
D --> F[EN compressed]
E --> F
style A fill:#fdd,stroke:#c33
style F fill:#dfd,stroke:#3a3
style D fill:#def,stroke:#38a
```
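The branch in the diagram above — whole-expression seq2seq where available, word-by-word lexicon otherwise — can be sketched with stub lookups. The dispatch rule and the stub data are illustrative assumptions; the real pipeline decides per expression inside the library:

```python
def translate(expr: str, seq2seq: dict[str, str], lexicon: dict[str, str]) -> str:
    """Prefer a whole-expression seq2seq translation; otherwise fall back
    to word-by-word lexicon lookup, keeping unknown words unchanged."""
    if expr in seq2seq:  # seq2seq path (node D)
        return seq2seq[expr]
    # lexicon path (node E)
    return " ".join(lexicon.get(word, word) for word in expr.split())

seq2seq = {"запустити програму": "execute the program"}
lexicon = {"зберегти": "save", "закладки": "bookmarks"}
print(translate("запустити програму", seq2seq, lexicon))  # → execute the program
print(translate("зберегти закладки", seq2seq, lexicon))   # → save bookmarks
```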
## Comparison
| Approach | Ukrainian support | Token savings | Quality impact |
|----------|:-----------------:|:------------:|:--------------:|
| **dormouse (this model)** | native | **73%** | **+50%** |
| LLMLingua | no | up to 20x | −5 to −15% |
| Selective Context | no | 40-50% | −10 to −20% |
| Google Translate | partial | 30-40% | variable |
[Research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025)](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)
## Links
- **PyPI:** [dormouse-ua](https://pypi.org/project/dormouse-ua/)
- **GitHub:** [ChuprinaDaria/dormouse](https://github.com/ChuprinaDaria/dormouse)
- **Author:** [Daria Chuprina](https://www.linkedin.com/in/dchuprina/) | [Lazysoft](https://lazysoft.pl/) | dchuprina@lazysoft.pl
## License
MIT
## Citation
```bibtex
@software{dormouse2026,
author = {Chuprina, Daria},
title = {dormouse: Ukrainian Text Optimizer for LLMs},
year = {2026},
url = {https://github.com/ChuprinaDaria/dormouse},
}
```