---
library_name: dormouse
tags:
- ukrainian
- nlp
- tokenization
- text-optimization
- seq2seq
- translation
- ua-en
language:
- uk
- en
license: mit
pipeline_tag: translation
datasets:
- Dariachup/dormouse-corpus
---

# dormouse — Ukrainian Text Optimizer for LLMs

**Seq2seq expression translator (UA→EN)** trained on 28,149 parallel pairs for token-efficient Ukrainian text compression.

This repository contains model weights and lexicon data for the [dormouse-ua](https://pypi.org/project/dormouse-ua/) Python library.

## What this model does

Translates Ukrainian multi-word expressions into compact English equivalents for LLM consumption:

```
"немає резюме" → "no summary given"
"запустити програму" → "execute the program"
"повна синхронізація" → "full synchronization"
"горить дедлайн" → "deadline approaching"
"зберегти закладки" → "save bookmarks"
```

This is **not** a general-purpose translator. It's a specialized compression model that maps Ukrainian expressions (2-4 words) to minimal English while preserving meaning for LLM understanding.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | GRU Encoder-Decoder with Attention |
| Parameters | **7.3M** |
| Encoder | Bidirectional GRU, hidden=256, embed=128 |
| Decoder | GRU with Bahdanau attention |
| Source vocab | 15,679 tokens (Ukrainian) |
| Target vocab | 9,608 tokens (English) |
| Dropout | 0.0 (inference) |
| Training pairs | 28,149 |
| Validation set | 500 pairs |
| Framework | PyTorch |

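For orientation, the architecture in the table can be sketched in PyTorch. This is an illustrative reconstruction from the listed hyperparameters only, not the library's actual code; the class names and exact attention wiring are assumptions:

```python
import torch
import torch.nn as nn

EMBED, HIDDEN = 128, 256  # embed=128, hidden=256 from the table above

class Encoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED)
        self.gru = nn.GRU(EMBED, HIDDEN, bidirectional=True, batch_first=True)
        # project the two directions' final states down to the decoder size
        self.bridge = nn.Linear(2 * HIDDEN, HIDDEN)

    def forward(self, src):                       # src: (B, S)
        out, h = self.gru(self.embed(src))        # out: (B, S, 2H), h: (2, B, H)
        h = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=1)))
        return out, h.unsqueeze(0)                # hidden: (1, B, H)

class BahdanauAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.W_enc = nn.Linear(2 * HIDDEN, HIDDEN)
        self.W_dec = nn.Linear(HIDDEN, HIDDEN)
        self.v = nn.Linear(HIDDEN, 1)

    def forward(self, dec_h, enc_out):            # dec_h: (B, H), enc_out: (B, S, 2H)
        scores = self.v(torch.tanh(self.W_enc(enc_out) + self.W_dec(dec_h).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)    # attention over source positions
        return (weights * enc_out).sum(dim=1)     # context: (B, 2H)

class Decoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMBED)
        self.attn = BahdanauAttention()
        self.gru = nn.GRU(EMBED + 2 * HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, vocab_size)

    def forward(self, token, hidden, enc_out):    # one decoding step
        emb = self.embed(token).unsqueeze(1)                   # (B, 1, E)
        context = self.attn(hidden[-1], enc_out).unsqueeze(1)  # (B, 1, 2H)
        out, hidden = self.gru(torch.cat([emb, context], dim=2), hidden)
        return self.out(out.squeeze(1)), hidden   # logits: (B, vocab_size)
```

With the actual vocabulary sizes (15,679 source, 9,608 target) this layout lands in the same ballpark as the stated 7.3M parameters, but treat it as a sketch, not the checkpoint's exact graph.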
## Performance

| Metric | Value |
|--------|-------|
| Exact match (val) | **98.2%** |
| Word overlap (val) | **99.33%** |
| Token savings (full pipeline) | **73%** |
| GPT quality preservation | **150%** (squeezed > original) |

Evaluated on 53,351 texts (Telegram corpus + Ukrainian literature). The full pipeline with lexicon + seq2seq achieves 73% token reduction, while GPT-4 understands the squeezed text **better** than the original Ukrainian (100% vs 67% accuracy on IT prompts).

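The savings figure is plain arithmetic over token counts. A small helper (illustrative, not part of the library), using the 45 → 12 token example from the Usage section below:

```python
def token_savings(original_tokens: int, squeezed_tokens: int) -> int:
    """Percentage of input tokens saved by squeezing."""
    return round(100 * (1 - squeezed_tokens / original_tokens))

# 45 tokens in, 12 tokens out
print(token_savings(45, 12))  # → 73
```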
## Training

**Data sources:**
- OPUS parallel corpus (UA-EN): 38K cleaned entries from KDE/GNOME/documentation
- Auto-generated expression pairs via LLM: 7.7K entries
- Telegram slang/surzhyk: 802 entries
- Manual UA→EN mappings: 208 entries

**Training configuration:**
- Optimizer: Adam
- Loss: CrossEntropyLoss (padding ignored)
- Label smoothing: applied during training
- Anti-overfitting: dropout in encoder/decoder during training; reduced model size
- Hardware: HuggingFace Spaces (free-tier CPU)

**Data pipeline:**
```
Telegram corpus → crack_open (normalize) → generate pairs (LLM) → train seq2seq
```

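The training configuration above maps onto a standard PyTorch setup. A sketch only: the padding index, smoothing value, and learning rate are assumptions, and the `nn.Linear` is a stand-in so the step is runnable:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed id of the padding token

# Cross-entropy that skips padded target positions, with label smoothing
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Stand-in network; in training this would be the seq2seq model
model = nn.Linear(16, 9608)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: logits for 6 target positions, two of them padding
x = torch.randn(6, 16)
targets = torch.tensor([5, 17, 0, 0, 42, 3])  # zeros contribute nothing to the loss

optimizer.zero_grad()
loss = criterion(model(x), targets)
loss.backward()
optimizer.step()
```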
## Files

| File | Size | Description |
|------|------|-------------|
| `expr_seq2seq.pt` | 28MB | Model weights (PyTorch state_dict) |
| `expr_vocab_src.json` | 396KB | Source vocabulary (Ukrainian, 15.6K tokens) |
| `expr_vocab_tgt.json` | 164KB | Target vocabulary (English, 9.6K tokens) |
| `expr_config.json` | 108B | Model hyperparameters |
| `lexicon.db` | 12MB | SQLite lexicon (47K UA→EN word mappings) |

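Since `lexicon.db` is a plain SQLite file, it can be queried with Python's standard library. The table and column names below are hypothetical (the actual schema is not documented here):

```python
import sqlite3

def lookup(db_path: str, word: str):
    """Return the English mapping for a Ukrainian word, or None if absent."""
    con = sqlite3.connect(db_path)
    try:
        # hypothetical schema: lexicon(ua TEXT PRIMARY KEY, en TEXT)
        row = con.execute("SELECT en FROM lexicon WHERE ua = ?", (word,)).fetchone()
        return row[0] if row else None
    finally:
        con.close()
```

Inspect the real schema first with `sqlite3 lexicon.db ".schema"` and adjust the query accordingly.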
## Usage

### Via pip (recommended)

```bash
pip install dormouse-ua
```

```python
from dormouse import squeeze

# Full pipeline: normalize → compress → translate (uses this model)
squeeze("блін продакшн впав після деплою", target="cloud")
# → "damn production crashed after deploy"
# Tokens: 45 → 12 (-73%)
```

Assets download automatically on first use to `~/.cache/dormouse/v0.3.0/`.

### Direct model usage

```python
import torch
from dormouse.seq2seq import wake_up_expr

model, src_vocab, tgt_vocab = wake_up_expr()

text = "запустити програму"
src_ids = torch.tensor(src_vocab.encode(text))
result = model.translate(src_ids, tgt_vocab)
print(result)  # "execute the program"
```

## Use Cases

1. **LLM token optimization** — Ukrainian Cyrillic costs 3-4x more tokens than English. This model is part of a pipeline that saves 73% of tokens.

2. **Chatbot preprocessing** — Normalize surzhyk/slang before sending to GPT/Claude. Response quality improves from 67% to 100%.

3. **Cost reduction** — Running 10K Ukrainian prompts/day through GPT yields 60-73% savings on input token costs.

4. **AI agents** — Compress Ukrainian context for longer agent memory. At 73% compression, roughly 3.7x as much Ukrainian content fits in the same context window.

5. **Local search & classification** — The lexicon.db enables offline Ukrainian text indexing, semantic search, and topic classification without any API calls.

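The cost-reduction case is easy to estimate. A back-of-the-envelope helper; the average prompt length and per-token price below are illustrative assumptions, not measurements:

```python
def daily_input_savings(prompts_per_day: int,
                        avg_tokens_per_prompt: int,
                        price_per_1k_tokens: float,
                        savings_rate: float = 0.73) -> float:
    """Dollars saved per day on input tokens at a given compression rate."""
    daily_tokens = prompts_per_day * avg_tokens_per_prompt
    return daily_tokens * savings_rate * price_per_1k_tokens / 1000

# 10K prompts/day, ~500 tokens each, $0.01 per 1K input tokens (assumed pricing)
print(round(daily_input_savings(10_000, 500, 0.01), 2))  # → 36.5
```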
## Full Pipeline

```mermaid
graph LR
    A[UA text] --> B[crack_open<br/>360 rules + pymorphy3]
    B --> C[compress<br/>remove fillers]
    C --> D[seq2seq<br/>this model]
    C --> E[lexicon.db<br/>word-by-word]
    D --> F[EN compressed]
    E --> F

    style A fill:#fdd,stroke:#c33
    style F fill:#dfd,stroke:#3a3
    style D fill:#def,stroke:#38a
```

## Comparison

| Approach | Ukrainian support | Token savings | Quality impact |
|----------|:-----------------:|:------------:|:--------------:|
| **dormouse (this model)** | native | **73%** | **+50%** |
| LLMLingua | no | up to 20x | -5 to -15% |
| Selective Context | no | 40-50% | -10 to -20% |
| Google Translate | partial | 30-40% | variable |

[Research paper on Ukrainian tokenization inefficiency (Frontiers in AI, 2025)](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)

## Links

- **PyPI:** [dormouse-ua](https://pypi.org/project/dormouse-ua/)
- **GitHub:** [ChuprinaDaria/dormouse](https://github.com/ChuprinaDaria/dormouse)
- **Author:** [Daria Chuprina](https://www.linkedin.com/in/dchuprina/) | [Lazysoft](https://lazysoft.pl/) | dchuprina@lazysoft.pl

## License

MIT

## Citation

```bibtex
@software{dormouse2026,
  author = {Chuprina, Daria},
  title = {dormouse: Ukrainian Text Optimizer for LLMs},
  year = {2026},
  url = {https://github.com/ChuprinaDaria/dormouse},
}
```
|