# dormouse — Ukrainian Text Optimizer for LLMs
Seq2seq expression translator (UA→EN) trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.
This repository contains model weights and lexicon data for the dormouse-ua Python library.
## What this model does
Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:
"немає резюме" → "no summary given"
"запустити програму" → "execute the program"
"повна синхронізація" → "full synchronization"
"горить дедлайн" → "deadline approaching"
"зберегти закладки" → "save bookmarks"
This is not a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.
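The token savings can be checked directly with any tokenizer. A minimal sketch using tiktoken (not a dormouse dependency; the cl100k_base encoding is assumed purely for illustration):

```python
# Illustrative only: compare token counts of Ukrainian expressions
# and their compact English equivalents with an OpenAI tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("запустити програму", "execute the program"),
    ("повна синхронізація", "full synchronization"),
]

for ua, en in pairs:
    ua_tokens, en_tokens = len(enc.encode(ua)), len(enc.encode(en))
    print(f"{ua!r}: {ua_tokens} tokens → {en_tokens} tokens")
```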
## Model Details
| Parameter | Value |
|---|---|
| Architecture | GRU Encoder-Decoder with Attention |
| Parameters | 7.3M |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |
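The table maps onto a standard GRU encoder-decoder with additive (Bahdanau) attention. A minimal PyTorch sketch using the listed sizes; module names and exact wiring are illustrative, not the library's internal structure:

```python
import torch
import torch.nn as nn

EMBED, HIDDEN = 128, 256  # from the table above

class Encoder(nn.Module):
    def __init__(self, src_vocab_size):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, EMBED, padding_idx=0)
        self.gru = nn.GRU(EMBED, HIDDEN, bidirectional=True, batch_first=True)
        self.bridge = nn.Linear(2 * HIDDEN, HIDDEN)  # merge both directions

    def forward(self, src):                            # src: (batch, src_len)
        outputs, hidden = self.gru(self.embed(src))
        # concatenate final forward/backward states into one decoder state
        hidden = torch.tanh(self.bridge(torch.cat([hidden[0], hidden[1]], dim=-1)))
        return outputs, hidden.unsqueeze(0)             # outputs: (batch, src_len, 2*HIDDEN)

class BahdanauAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Linear(2 * HIDDEN, HIDDEN, bias=False)
        self.W_dec = nn.Linear(HIDDEN, HIDDEN, bias=False)
        self.v = nn.Linear(HIDDEN, 1, bias=False)

    def forward(self, dec_hidden, enc_outputs):         # additive attention
        scores = self.v(torch.tanh(self.W_enc(enc_outputs)
                                   + self.W_dec(dec_hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)           # (batch, src_len, 1)
        return (weights * enc_outputs).sum(dim=1)        # context: (batch, 2*HIDDEN)

class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab_size, EMBED, padding_idx=0)
        self.attn = BahdanauAttention()
        self.gru = nn.GRU(EMBED + 2 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, tgt_vocab_size)

    def forward(self, tgt_step, hidden, enc_outputs):    # one decoding step
        context = self.attn(hidden[-1], enc_outputs)
        rnn_in = torch.cat([self.embed(tgt_step), context.unsqueeze(1)], dim=-1)
        output, hidden = self.gru(rnn_in, hidden)
        return self.out(output.squeeze(1)), hidden
```

With the vocabulary sizes from the table, this layout lands at roughly 7M parameters, in line with the 7.3M figure above.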
## Performance
| Metric | Value |
|---|---|
| Exact match (val) | 98.2% |
| Word overlap (val) | 99.33% |
| Token savings (full pipeline) | 73% |
| GPT quality preservation | 150% (squeezed > original) |
Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). Full pipeline with lexicon + seq2seq achieves 73% token reduction while GPT-4 understands squeezed text better than original Ukrainian (100% vs 67% accuracy on IT prompts).
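The validation metrics can be reproduced with straightforward definitions. The sketch below assumes exact match means the prediction equals the reference after whitespace normalization, and word overlap is the share of reference words present in the prediction; the library's exact definitions may differ:

```python
def exact_match(pred: str, ref: str) -> bool:
    """Prediction equals reference after trivial whitespace normalization."""
    return " ".join(pred.split()) == " ".join(ref.split())

def word_overlap(pred: str, ref: str) -> float:
    """Fraction of reference words that also appear in the prediction."""
    ref_words = ref.lower().split()
    pred_words = set(pred.lower().split())
    return sum(w in pred_words for w in ref_words) / max(len(ref_words), 1)

val = [("execute the program", "execute the program"),
       ("deadline approaching", "deadline is approaching")]
print(sum(exact_match(p, r) for p, r in val) / len(val))   # exact match rate
print(sum(word_overlap(p, r) for p, r in val) / len(val))  # mean word overlap
```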
## Training
Data sources:
- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries
Training configuration:
- Optimizer: Adam
- Loss: CrossEntropyLoss (ignore padding)
- Label smoothing: applied during training
- Anti-overfitting: dropout in encoder/decoder during training, smaller model size
- Hardware: HuggingFace Spaces (free tier CPU)
Data pipeline:
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
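A minimal sketch of the optimizer/loss setup listed in the training configuration above, assuming padding index 0 and a label-smoothing value of 0.1 (the actual values are not documented here), and a `model` that maps a source batch plus teacher-forced target prefix to logits:

```python
import torch
import torch.nn as nn

PAD_IDX = 0          # assumed padding index
SMOOTHING = 0.1      # assumed label-smoothing value; the real value is undocumented

def make_training_objects(model: nn.Module):
    """Adam + CrossEntropyLoss that ignores padding, as in the configuration above."""
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=SMOOTHING)
    optimizer = torch.optim.Adam(model.parameters())
    return criterion, optimizer

def train_step(model, criterion, optimizer, src, tgt):
    """One teacher-forced step: predict tgt[:, 1:] from src and tgt[:, :-1]."""
    model.train()
    optimizer.zero_grad()
    logits = model(src, tgt[:, :-1])                     # (batch, T-1, tgt_vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```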
## Files

| File | Size | Description |
|---|---|---|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
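The files can be inspected without the library. A minimal sketch, assuming they sit in the download cache described in the Usage section and that the vocabularies are plain JSON (their exact structure is not documented here):

```python
import json
from pathlib import Path

import torch

# Assumed location; see the note about the download cache under Usage.
ASSETS = Path.home() / ".cache" / "dormouse" / "v0.3.0"

config = json.loads((ASSETS / "expr_config.json").read_text())
src_vocab = json.loads((ASSETS / "expr_vocab_src.json").read_text())
state_dict = torch.load(ASSETS / "expr_seq2seq.pt", map_location="cpu")

print(config)                                        # model hyperparameters
print(len(src_vocab))                                # vocabulary entries
print(sum(t.numel() for t in state_dict.values()))   # parameter count (~7.3M)
```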
## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```
Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.
### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```
## Use Cases

- LLM token optimization — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of input tokens.
- Chatbot preprocessing — normalize surzhyk/slang before sending to GPT/Claude. Response quality improves from 67% to 100%.
- Cost reduction — 10K Ukrainian prompts/day through GPT → 60-73% savings on input token costs.
- AI agents — compress Ukrainian context for longer agent memory: at 73% compression the same text needs roughly a quarter of the tokens, so about 3-4x more Ukrainian context fits in the same window.
- Local search & classification — lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls (see the sketch below).
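lexicon.db is a regular SQLite file and can be opened with the standard library. Its schema is not documented here, so the sketch below only introspects the tables; the commented lookup assumes a hypothetical `mappings(ua, en)` layout and must be adjusted to the real schema:

```python
import sqlite3

conn = sqlite3.connect("lexicon.db")

# List tables and their columns; the actual schema is not documented here.
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    print(table, cols)

# Hypothetical lookup, assuming a table `mappings(ua, en)`:
# row = conn.execute("SELECT en FROM mappings WHERE ua = ?", ("програма",)).fetchone()
```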
## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F
    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```
## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|---|---|---|---|
| dormouse (this model) | native | 73% | +50% |
| LLMLingua | no | up to 20x | -5 to -15% |
| Selective Context | no | 40-50% | -10 to -20% |
| Google Translate | partial | 30-40% | variable |
See also: research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025).
## Links
- PyPI: dormouse-ua
- GitHub: ChuprinaDaria/dormouse
- Author: Daria Chuprina | Lazysoft | dchuprina@lazysoft.pl
## License
MIT
## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title  = {dormouse: Ukrainian Text Optimizer for LLMs},
  year   = {2026},
  url    = {https://github.com/ChuprinaDaria/dormouse},
}
```