# dormouse — Ukrainian Text Optimizer for LLMs

Seq2seq expression translator (UA→EN) trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.

This repository contains model weights and lexicon data for the dormouse-ua Python library.

## What this model does

Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:

"немає резюме"          → "no summary given"
"запустити програму"    → "execute the program"
"повна синхронізація"   → "full synchronization"
"горить дедлайн"        → "deadline approaching"
"зберегти закладки"     → "save bookmarks"

This is not a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.

## Model Details

| Parameter | Value |
|---|---|
| Architecture | GRU encoder-decoder with attention |
| Parameters | 7.3M |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |
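
The actual implementation ships inside the `dormouse-ua` package; the sketch below only illustrates the architecture the table describes. The hyperparameters match the table, but all class and method names here are hypothetical:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional GRU encoder: embed=128, hidden=256 (per the table)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, src_ids: torch.Tensor):
        # src_ids: (batch, src_len) -> outputs: (batch, src_len, 2*hidden)
        return self.gru(self.embed(src_ids))

class BahdanauAttention(nn.Module):
    """Additive (Bahdanau) attention over the encoder states."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.w_dec = nn.Linear(hidden_dim, hidden_dim)
        self.w_enc = nn.Linear(2 * hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (batch, hidden); enc_outputs: (batch, src_len, 2*hidden)
        scores = self.v(torch.tanh(
            self.w_dec(dec_hidden).unsqueeze(1) + self.w_enc(enc_outputs)
        ))                                           # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * enc_outputs).sum(dim=1)    # context: (batch, 2*hidden)

class Decoder(nn.Module):
    """GRU decoder that feeds [token embedding; attention context] each step."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.attn = BahdanauAttention(hidden_dim)
        self.cell = nn.GRUCell(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, hidden, enc_outputs):
        context = self.attn(hidden, enc_outputs)
        hidden = self.cell(torch.cat([self.embed(prev_token), context], dim=-1), hidden)
        return self.out(hidden), hidden              # logits, next hidden state
```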

## Performance

| Metric | Value |
|---|---|
| Exact match (val) | 98.2% |
| Word overlap (val) | 99.33% |
| Token savings (full pipeline) | 73% |
| GPT quality preservation | 150% (squeezed > original) |

Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline (lexicon + seq2seq) achieves 73% token reduction, and GPT-4 understands the squeezed text better than the original Ukrainian (100% vs. 67% accuracy on IT prompts).
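
The exact metric definitions are not documented here; a plausible reading of "exact match" and "word overlap" over (prediction, reference) pairs is sketched below. The precise formulas behind the reported numbers are an assumption:

```python
# Plausible definitions of the validation metrics above; the exact
# formulas used for the reported numbers are an assumption.
def exact_match(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions identical to the reference."""
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

def word_overlap(preds: list[str], refs: list[str]) -> float:
    """Mean fraction of reference words present in the prediction."""
    scores = []
    for p, r in zip(preds, refs):
        ref_words = r.split()
        scores.append(sum(w in p.split() for w in ref_words) / max(len(ref_words), 1))
    return sum(scores) / len(scores)
```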

## Training

Data sources:

- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries

Training configuration:

- Optimizer: Adam
- Loss: CrossEntropyLoss with padding ignored (see the sketch below)
- Label smoothing: applied during training
- Anti-overfitting: dropout in the encoder and decoder during training, plus a deliberately small model
- Hardware: HuggingFace Spaces (free-tier CPU)
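
A minimal sketch of that loss/optimizer setup in PyTorch. The PAD index and the 0.1 smoothing value are assumptions (the card only states that smoothing was applied):

```python
import torch.nn as nn
from torch.optim import Adam

PAD_IDX = 0  # assumption: id of the padding token

# Cross-entropy that skips padded target positions, with label smoothing.
# The 0.1 smoothing value is an assumption; the card only says "applied".
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

def make_optimizer(model: nn.Module) -> Adam:
    # Adam with default hyperparameters; the actual learning rate is not stated.
    return Adam(model.parameters())
```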

Data pipeline:

```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```

## Files

| File | Size | Description |
|---|---|---|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |
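
Since `lexicon.db` is plain SQLite, it can also be opened directly with Python's built-in `sqlite3`. Its schema is not documented in this card, so inspect it first; the table and column names in the commented lookup are assumptions:

```python
import sqlite3

con = sqlite3.connect("lexicon.db")

# The schema is not documented in this card, so list the tables first:
tables = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)

# Hypothetical lookup, assuming a table like lexicon(ua TEXT, en TEXT):
# row = con.execute("SELECT en FROM lexicon WHERE ua = ?", ("програма",)).fetchone()
con.close()
```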

## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```

Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.

### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```

## Use Cases

1. LLM token optimization — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of tokens.

2. Chatbot preprocessing — normalize surzhyk/slang before sending text to GPT/Claude. Response quality improves from 67% to 100%.

3. Cost reduction — 10K Ukrainian prompts/day through GPT means 60-73% savings on input-token costs (worked example below).

4. AI agents — compress Ukrainian context for longer agent memory. At 73% token reduction, the same window holds roughly 3.7x as much source text.

5. Local search & classification — lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.
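
A back-of-the-envelope version of the cost-reduction case. Apart from the prompt volume and the 73% figure from this card, all inputs are illustrative assumptions:

```python
# Illustrative cost math for use case 3. Prompt volume matches the text;
# tokens per prompt and the per-token price are assumptions.
PROMPTS_PER_DAY = 10_000
TOKENS_PER_PROMPT = 400        # assumed average for a Ukrainian prompt
PRICE_PER_M_TOKENS = 2.50      # USD per million input tokens, assumed
TOKEN_SAVINGS = 0.73           # reduction reported by the full pipeline

daily_tokens = PROMPTS_PER_DAY * TOKENS_PER_PROMPT
baseline_cost = daily_tokens / 1e6 * PRICE_PER_M_TOKENS
squeezed_cost = baseline_cost * (1 - TOKEN_SAVINGS)
print(f"${baseline_cost:.2f}/day -> ${squeezed_cost:.2f}/day")  # $10.00 -> $2.70
```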

## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F

    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```
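
In code terms, the branch between D and E amounts to a routing decision like the sketch below. Both callables stand in for the real model call and the lexicon.db lookup; the names are hypothetical:

```python
from typing import Callable, Optional

# Hypothetical routing between the two branches in the diagram:
# multi-word expressions (2-4 words) -> seq2seq; single words -> lexicon.
def squeeze_chunk(
    chunk: str,
    translate_expr: Callable[[str], str],          # seq2seq branch (D)
    lookup_word: Callable[[str], Optional[str]],   # lexicon branch (E)
) -> str:
    words = chunk.split()
    if 2 <= len(words) <= 4:
        return translate_expr(chunk)
    return " ".join(lookup_word(w) or w for w in words)
```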

## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|---|---|---|---|
| dormouse (this model) | native | 73% | +50% |
| LLMLingua | no | up to 20x | −5 to −15% |
| Selective Context | no | 40-50% | −10 to −20% |
| Google Translate | partial | 30-40% | variable |

See also: research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025).

## License

MIT

## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title = {dormouse: Ukrainian Text Optimizer for LLMs},
  year = {2026},
  url = {https://github.com/ChuprinaDaria/dormouse},
}
```