# ByT5-small for Romanian Diacritic Restoration
ByT5-small (~300M parameters) fine-tuned for byte-level Romanian diacritic restoration: given text with diacritics stripped, the model restores the marks (ă, â, î, ș, ț).

**Note:** Trained on a 50k-example subset (not the full 300k) due to computational constraints.
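Byte-level operation is what makes ByT5 a natural fit for this task: Romanian diacritics are multi-byte in UTF-8, so the model reads and rewrites them at sub-character granularity. A minimal sketch of ByT5's tokenization scheme (each UTF-8 byte maps to token id byte + 3, with ids 0–2 reserved for pad/eos/unk), runnable without downloading the model:

```python
# Sketch of ByT5's byte-level tokenization (no model download needed).
# Each UTF-8 byte b maps to token id b + 3; ids 0 (pad), 1 (eos), and
# 2 (unk) are reserved. A diacritic like "ș" occupies two byte tokens,
# so restoring it is a local byte-level edit.

def byt5_encode(text: str) -> list[int]:
    """Map text to ByT5-style token ids: UTF-8 bytes shifted by 3."""
    return [b + 3 for b in text.encode("utf-8")]

def byt5_decode(ids: list[int]) -> str:
    """Invert byt5_encode, skipping the three special ids."""
    return bytes(i - 3 for i in ids if i >= 3).decode("utf-8")

plain = "si tara"      # input with diacritics stripped
restored = "și țara"   # target output
print(byt5_encode("și"))  # "ș" alone contributes two byte tokens
assert byt5_decode(byt5_encode(restored)) == restored
```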
## Results (CRAWLER-1000, clean)
| Metric | Value |
|---|---|
| Word accuracy | 92.00% |
| DER (diacritic error rate) | 0.081 |
| Speed | 0.48 sentences/s |
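For reference, a toy sketch of how these metrics can be computed. The definitions here are assumptions (the card does not spell them out): word accuracy as exact match over whitespace-separated words, and DER as the fraction of diacritic-capable character positions (a/ă/â, i/î, s/ș, t/ț) restored incorrectly.

```python
# Toy metric sketch with assumed definitions; the exact formulas behind
# the reported numbers may differ.

# Each diacritic-capable Romanian base letter and its variants.
DIACRITIC_CLASSES = {
    "a": "aăâ", "ă": "aăâ", "â": "aăâ",
    "i": "iî", "î": "iî",
    "s": "sș", "ș": "sș",
    "t": "tț", "ț": "tț",
}

def word_accuracy(hyp: str, ref: str) -> float:
    """Fraction of whitespace-separated words that match the reference."""
    correct = sum(h == r for h, r in zip(hyp.split(), ref.split()))
    return correct / max(len(ref.split()), 1)

def der(hyp: str, ref: str) -> float:
    """Diacritic error rate over diacritic-capable character positions."""
    assert len(hyp) == len(ref), "restoration must preserve length"
    positions = [
        (h, r) for h, r in zip(hyp.lower(), ref.lower())
        if r in DIACRITIC_CLASSES
    ]
    errors = sum(h != r for h, r in positions)
    return errors / max(len(positions), 1)

ref = "și țara mea"
hyp = "si țara mea"  # model missed one diacritic
print(word_accuracy(hyp, ref))  # 2 of 3 words correct
print(der(hyp, ref))
```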
## Training
- Base model: google/byt5-small
- Data: 50,000 examples from the dexonline corpus
- Hardware: Apple M3 Ultra (CPU)
- Training: 5 epochs, batch size 8, Adam, lr = 1e-4
## Resources
- Dataset: klusai/diacritics-ro
- Code: [github.com/klusai/diacritics-finetuning-code](https://github.com/klusai/diacritics-finetuning-code)