Character BiLSTM for Romanian Diacritic Restoration
A 2-layer bidirectional LSTM (2.4M parameters) that restores Romanian diacritics at the character level.
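
To make the character-level framing concrete, here is a minimal Python sketch of how input/target pairs could be built by stripping diacritics from correctly written text. The helper names (`strip_diacritics`, `make_example`) and the mapping table are illustrative assumptions, not taken from the project's code, and the actual preprocessing may differ.

```python
import unicodedata

# Assumed mapping: each Romanian diacritic letter to its base letter.
# Every entry is one character to one character, so lengths are preserved.
STRIP_MAP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T",
})


def strip_diacritics(text: str) -> str:
    """Return text with Romanian diacritics replaced by base letters."""
    # NFC makes sure letters such as "ș" are single precomposed code points.
    return unicodedata.normalize("NFC", text).translate(STRIP_MAP)


def make_example(text: str) -> tuple[list[str], list[str]]:
    """Build one (input_chars, target_chars) pair of equal length.

    The model input is the stripped text; the target at each position is
    the original character, so restoration becomes per-character labelling.
    """
    target = unicodedata.normalize("NFC", text)
    source = target.translate(STRIP_MAP)
    return list(source), list(target)


if __name__ == "__main__":
    print(strip_diacritics("Știință și tehnică"))
    print(make_example("țară"))
```

Because the mapping never changes string length, input and target stay aligned position by position, which is what lets a sequence tagger such as a BiLSTM predict one output character per input character.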
Results (CRAWLER-1000 clean)
| Metric | Value |
|---|---|
| Word Accuracy | 96.23% |
| DER | 0.033 |
| Speed | 111 sentences/s |
Noise robustness: accuracy collapses under noise, dropping from 96% to 37% at high noise levels. Use this model for clean text only.
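
For readers who want to reproduce this kind of evaluation, the sketch below shows one plausible way to compute the two accuracy metrics. It assumes that DER stands for diacritic error rate, measured over the characters that can carry a diacritic, and that word accuracy is exact match over whitespace-separated words. Both definitions and the function names are assumptions; the paper's exact definitions may differ.

```python
# Assumed set of characters that can carry (or already carry) a diacritic.
CANDIDATES = set("aistAISTăâîșțĂÂÎȘȚ")


def word_accuracy(pred: str, ref: str) -> float:
    """Fraction of whitespace-separated words that match the reference exactly."""
    pred_words, ref_words = pred.split(), ref.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("prediction and reference must have the same word count")
    if not ref_words:
        return 1.0
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / len(ref_words)


def diacritic_error_rate(pred: str, ref: str) -> float:
    """Fraction of candidate characters in the reference that were predicted wrongly."""
    if len(pred) != len(ref):
        raise ValueError("prediction and reference must have the same length")
    slots = [(p, r) for p, r in zip(pred, ref) if r in CANDIDATES]
    if not slots:
        return 0.0
    errors = sum(p != r for p, r in slots)
    return errors / len(slots)


if __name__ == "__main__":
    reference = "și țară"
    prediction = "si țară"  # one missed diacritic on the first word
    print(word_accuracy(prediction, reference))
    print(diacritic_error_rate(prediction, reference))
```

In this toy case one of two words is wrong, giving a word accuracy of 0.5, and one of five candidate characters is wrong, giving an error rate of 0.2. A single missed diacritic costs a whole word, which is why word accuracy is the stricter of the two numbers.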
Training
- Data: 299,633 examples from dexonline corpus
- Hardware: Apple M3 Ultra (MPS)
- Epochs: 5; batch size: 256; optimizer: Adam (lr=1e-3)
- Training time: ~45 minutes
Resources
- Paper: "Comparing Five Model Families for Romanian Diacritic Restoration"
- Dataset: klusai/diacritics-ro
- Code: github.com/klusai/diacritics-finetuning-code