---
library_name: dormouse
tags:
- ukrainian
- nlp
- tokenization
- text-optimization
- seq2seq
- translation
- ua-en
language:
- uk
- en
license: mit
pipeline_tag: translation
datasets:
- Dariachup/dormouse-corpus
---

# dormouse — Ukrainian Text Optimizer for LLMs

**Seq2seq expression translator (UA→EN)** trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.

This repository contains model weights and lexicon data for the [dormouse-ua](https://pypi.org/project/dormouse-ua/) Python library.

## What this model does

Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:

```
"немає резюме" → "no summary given"
"запустити програму" → "execute the program"
"повна синхронізація" → "full synchronization"
"горить дедлайн" → "deadline approaching"
"зберегти закладки" → "save bookmarks"
```

This is **not** a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | GRU Encoder-Decoder with Attention |
| Parameters | **7.3M** |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |

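For orientation, the architecture in the table can be sketched in PyTorch. This is an illustrative reconstruction from the listed hyperparameters only, not the library's actual code; the class names and exact attention wiring are assumptions:

```python
import torch
import torch.nn as nn

EMBED, HIDDEN = 128, 256  # embed=128, hidden=256 from the table above

class Encoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED)
        self.gru = nn.GRU(EMBED, HIDDEN, bidirectional=True, batch_first=True)
        # project the two directions' final states down to the decoder size
        self.bridge = nn.Linear(2 * HIDDEN, HIDDEN)

    def forward(self, src):                       # src: (B, S)
        out, h = self.gru(self.embed(src))        # out: (B, S, 2H), h: (2, B, H)
        h = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=1)))
        return out, h.unsqueeze(0)                # hidden: (1, B, H)

class BahdanauAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Linear(2 * HIDDEN, HIDDEN)
        self.W_dec = nn.Linear(HIDDEN, HIDDEN)
        self.v = nn.Linear(HIDDEN, 1)

    def forward(self, dec_h, enc_out):            # dec_h: (B, H), enc_out: (B, S, 2H)
        scores = self.v(torch.tanh(self.W_enc(enc_out) + self.W_dec(dec_h).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)    # attention over source positions
        return (weights * enc_out).sum(dim=1)     # context: (B, 2H)

class Decoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED)
        self.attn = BahdanauAttention()
        self.gru = nn.GRU(EMBED + 2 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, vocab_size)

    def forward(self, token, hidden, enc_out):    # one decoding step
        emb = self.embed(token).unsqueeze(1)                   # (B, 1, E)
        context = self.attn(hidden[-1], enc_out).unsqueeze(1)  # (B, 1, 2H)
        out, hidden = self.gru(torch.cat([emb, context], dim=2), hidden)
        return self.out(out.squeeze(1)), hidden   # logits: (B, vocab_size)
```

With the actual vocabulary sizes (15,679 source, 9,608 target) this layout lands in the same ballpark as the stated 7.3M parameters, but treat it as a sketch, not the checkpoint's exact graph.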
## Performance

| Metric | Value |
|--------|-------|
| Exact match (val) | **98.2%** |
| Word overlap (val) | **99.33%** |
| Token savings (full pipeline) | **73%** |
| GPT quality preservation | **150%** (squeezed > original) |

Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline with lexicon + seq2seq achieves 73% token reduction, while GPT-4 understands the squeezed text **better** than the original Ukrainian (100% vs 67% accuracy on IT prompts).

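The savings figure is plain arithmetic over token counts. A small helper (illustrative, not part of the library), using the 45 → 12 token example from the Usage section below:

```python
def token_savings(original_tokens: int, squeezed_tokens: int) -> int:
    """Percentage of input tokens saved by squeezing."""
    return round(100 * (1 - squeezed_tokens / original_tokens))

# 45 tokens in, 12 tokens out
print(token_savings(45, 12))  # → 73
```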
## Training

**Data sources:**
- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries

**Training configuration:**
- Optimizer: Adam
- Loss: CrossEntropyLoss (padding ignored)
- Label smoothing: applied during training
- Anti-overfitting: dropout in encoder/decoder during training; reduced model size
- Hardware: HuggingFace Spaces (free-tier CPU)

**Data pipeline:**
```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```

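The training configuration above maps onto a standard PyTorch setup. A sketch only: the padding index, smoothing value, and learning rate are assumptions, and the `nn.Linear` is a stand-in so the step is runnable:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed id of the padding token

# Cross-entropy that skips padded target positions, with label smoothing
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Stand-in network; in training this would be the seq2seq model
model = nn.Linear(16, 9608)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: logits for 6 target positions, two of them padding
x = torch.randn(6, 16)
targets = torch.tensor([5, 17, 0, 0, 42, 3])  # zeros contribute nothing to the loss

optimizer.zero_grad()
loss = criterion(model(x), targets)
loss.backward()
optimizer.step()
```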
## Files

| File | Size | Description |
|------|------|-------------|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |

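Since `lexicon.db` is a plain SQLite file, it can be queried with Python's standard library. The table and column names below are hypothetical (the actual schema is not documented here):

```python
import sqlite3

def lookup(db_path: str, word: str):
    """Return the English mapping for a Ukrainian word, or None if absent."""
    con = sqlite3.connect(db_path)
    try:
        # hypothetical schema: lexicon(ua TEXT PRIMARY KEY, en TEXT)
        row = con.execute("SELECT en FROM lexicon WHERE ua = ?", (word,)).fetchone()
        return row[0] if row else None
    finally:
        con.close()
```

Inspect the real schema first with `sqlite3 lexicon.db ".schema"` and adjust the query accordingly.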
## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```

Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.

### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```

## Use Cases

1. **LLM token optimization** — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of tokens.

2. **Chatbot preprocessing** — Normalize surzhyk/slang before sending to GPT/Claude. Response quality improves from 67% to 100%.

3. **Cost reduction** — Running 10K Ukrainian prompts/day through GPT yields 60-73% savings on input token costs.

4. **AI agents** — Compress Ukrainian context for longer agent memory. At 73% compression, roughly 3.7x as much Ukrainian content fits in the same context window.

5. **Local search & classification** — The lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.

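The cost-reduction case is easy to estimate. A back-of-the-envelope helper; the average prompt length and per-token price below are illustrative assumptions, not measurements:

```python
def daily_input_savings(prompts_per_day: int,
                        avg_tokens_per_prompt: int,
                        price_per_1k_tokens: float,
                        savings_rate: float = 0.73) -> float:
    """Dollars saved per day on input tokens at a given compression rate."""
    daily_tokens = prompts_per_day * avg_tokens_per_prompt
    return daily_tokens * savings_rate * price_per_1k_tokens / 1000

# 10K prompts/day, ~500 tokens each, $0.01 per 1K input tokens (assumed pricing)
print(round(daily_input_savings(10_000, 500, 0.01), 2))  # → 36.5
```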
## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F

    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```

## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|----------|:-----------------:|:------------:|:--------------:|
| **dormouse (this model)** | native | **73%** | **+50%** |
| LLMLingua | no | up to 20x | -5 to -15% |
| Selective Context | no | 40-50% | -10 to -20% |
| Google Translate | partial | 30-40% | variable |

[Research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025)](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)

## Links

- **PyPI:** [dormouse-ua](https://pypi.org/project/dormouse-ua/)
- **GitHub:** [ChuprinaDaria/dormouse](https://github.com/ChuprinaDaria/dormouse)
- **Author:** [Daria Chuprina](https://www.linkedin.com/in/dchuprina/) | [Lazysoft](https://lazysoft.pl/) | dchuprina@lazysoft.pl

## License

MIT

## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title = {dormouse: Ukrainian Text Optimizer for LLMs},
  year = {2026},
  url = {https://github.com/ChuprinaDaria/dormouse},
}
```
|