# dormouse — Ukrainian Text Optimizer for LLMs

Seq2seq expression translator (UA→EN) trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.

This repository contains model weights and lexicon data for the dormouse-ua Python library.

## What this model does

Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:

"немає резюме"          → "no summary given"
"запустити програму"    → "execute the program"
"повна синхронізація"   → "full synchronization"
"горить дедлайн"        → "deadline approaching"
"зберегти закладки"     → "save bookmarks"

This is not a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GRU encoder-decoder with attention |
| Parameters | 7.3M |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |
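
The actual implementation ships inside the `dormouse-ua` package; the sketch below only illustrates the architecture the table describes. The hyperparameters match the table, but all class and method names here are hypothetical:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional GRU encoder: embed=128, hidden=256 (per the table)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_ids: torch.Tensor):
        # src_ids: (batch, src_len) -> outputs: (batch, src_len, 2*hidden)
        return self.gru(self.embed(src_ids))

class BahdanauAttention(nn.Module):
    """Additive (Bahdanau) attention over the encoder states."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.w_dec = nn.Linear(hidden_dim, hidden_dim)
        self.w_enc = nn.Linear(2 * hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, hidden); enc_outputs: (batch, src_len, 2*hidden)
        scores = self.v(torch.tanh(
            self.w_dec(dec_hidden).unsqueeze(1) + self.w_enc(enc_outputs)
        ))                                           # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * enc_outputs).sum(dim=1)    # context: (batch, 2*hidden)

class Decoder(nn.Module):
    """GRU decoder that feeds [token embedding; attention context] each step."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.attn = BahdanauAttention(hidden_dim)
        self.cell = nn.GRUCell(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, hidden, enc_outputs):
        context = self.attn(hidden, enc_outputs)
        hidden = self.cell(torch.cat([self.embed(prev_token), context], dim=-1), hidden)
        return self.out(hidden), hidden              # logits, next hidden state
```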

## Performance

| Metric | Value |
|---|---|
| Exact match (val) | 98.2% |
| Word overlap (val) | 99.33% |
| Token savings (full pipeline) | 73% |
| GPT quality preservation | 150% (squeezed > original) |

Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline (lexicon + seq2seq) achieves 73% token reduction, and GPT-4 understands the squeezed text better than the original Ukrainian (100% vs. 67% accuracy on IT prompts).
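
The exact metric definitions are not documented here; a plausible reading of "exact match" and "word overlap" over (prediction, reference) pairs is sketched below. The precise formulas behind the reported numbers are an assumption:

```python
# Plausible definitions of the validation metrics above; the exact
# formulas used for the reported numbers are an assumption.
def exact_match(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions identical to the reference."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

def word_overlap(preds: list[str], refs: list[str]) -> float:
    """Mean fraction of reference words present in the prediction."""
    scores = []
    for p, r in zip(preds, refs):
        ref_words = r.split()
        scores.append(sum(w in p.split() for w in ref_words) / max(len(ref_words), 1))
    return sum(scores) / len(scores)
```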

## Training

Data sources:

- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries

Training configuration:

- Optimizer: Adam
- Loss: CrossEntropyLoss with padding ignored (see the sketch below)
- Label smoothing: applied during training
- Anti-overfitting: dropout in the encoder and decoder during training, plus a deliberately small model
- Hardware: HuggingFace Spaces (free-tier CPU)
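
A minimal sketch of that loss/optimizer setup in PyTorch. The PAD index and the 0.1 smoothing value are assumptions (the card only states that smoothing was applied):

```python
import torch.nn as nn
from torch.optim import Adam

PAD_IDX = 0  # assumption: id of the padding token

# Cross-entropy that skips padded target positions, with label smoothing.
# The 0.1 smoothing value is an assumption; the card only says "applied".
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

def make_optimizer(model: nn.Module) -> Adam:
    # Adam with default hyperparameters; the actual learning rate is not stated.
    return Adam(model.parameters())
```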

Data pipeline:

```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```

## Files

| File | Size | Description |
|---|---|---|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
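
Since `lexicon.db` is plain SQLite, it can also be opened directly with Python's built-in `sqlite3`. Its schema is not documented in this card, so inspect it first; the table and column names in the commented lookup are assumptions:

```python
import sqlite3

con = sqlite3.connect("lexicon.db")

# The schema is not documented in this card, so list the tables first:
tables = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)

# Hypothetical lookup, assuming a table like lexicon(ua TEXT, en TEXT):
# row = con.execute("SELECT en FROM lexicon WHERE ua = ?", ("програма",)).fetchone()
con.close()
```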

## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```

Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.

### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```

## Use Cases

1. LLM token optimization — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of tokens.

2. Chatbot preprocessing — normalize surzhyk/slang before sending text to GPT/Claude. Response quality improves from 67% to 100%.

3. Cost reduction — 10K Ukrainian prompts/day through GPT means 60-73% savings on input-token costs (worked example below).

4. AI agents — compress Ukrainian context for longer agent memory. At 73% token reduction, the same window holds roughly 3.7x as much source text.

5. Local search & classification — lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.
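
A back-of-the-envelope version of the cost-reduction case. Apart from the prompt volume and the 73% figure from this card, all inputs are illustrative assumptions:

```python
# Illustrative cost math for use case 3. Prompt volume matches the text;
# tokens per prompt and the per-token price are assumptions.
PROMPTS_PER_DAY = 10_000
TOKENS_PER_PROMPT = 400        # assumed average for a Ukrainian prompt
PRICE_PER_M_TOKENS = 2.50      # USD per million input tokens, assumed
TOKEN_SAVINGS = 0.73           # reduction reported by the full pipeline

daily_tokens = PROMPTS_PER_DAY * TOKENS_PER_PROMPT
baseline_cost = daily_tokens / 1e6 * PRICE_PER_M_TOKENS
squeezed_cost = baseline_cost * (1 - TOKEN_SAVINGS)
print(f"${baseline_cost:.2f}/day -> ${squeezed_cost:.2f}/day")  # $10.00 -> $2.70
```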

## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F

    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```
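
In code terms, the branch between D and E amounts to a routing decision like the sketch below. Both callables stand in for the real model call and the lexicon.db lookup; the names are hypothetical:

```python
from typing import Callable, Optional

# Hypothetical routing between the two branches in the diagram:
# multi-word expressions (2-4 words) -> seq2seq; single words -> lexicon.
def squeeze_chunk(
    chunk: str,
    translate_expr: Callable[[str], str],          # seq2seq branch (D)
    lookup_word: Callable[[str], Optional[str]],   # lexicon branch (E)
) -> str:
    words = chunk.split()
    if 2 <= len(words) <= 4:
        return translate_expr(chunk)
    return " ".join(lookup_word(w) or w for w in words)
```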

## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|---|---|---|---|
| dormouse (this model) | native | 73% | +50% |
| LLMLingua | no | up to 20x | −5 to −15% |
| Selective Context | no | 40-50% | −10 to −20% |
| Google Translate | partial | 30-40% | variable |

See also: research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025).

## License

MIT

## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title = {dormouse: Ukrainian Text Optimizer for LLMs},
  year = {2026},
  url = {https://github.com/ChuprinaDaria/dormouse},
}
```