# t5-small-wikitext

t5-small trained on [wikitext/wikitext-103-raw-v1](wikitext/wikitest-103-raw-v1) for 50k steps (around 2 hours of training), following the training procedure of the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf).

* batch_size: 32
* max_seq_length: 128
* optim: Adafactor
* scheduler: inverse square root (10k warm-up steps)
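
The inverse square root schedule listed above can be sketched in a few lines. The T5 paper defines the learning rate at step `n` as `1 / sqrt(max(n, k))`, with `k` the number of warm-up steps. That this model's training used exactly this form is an assumption, and the function name `inverse_sqrt_lr` is made up for illustration.

```python
import math

WARMUP_STEPS = 10_000  # "10k warm-up steps" from the list above


def inverse_sqrt_lr(step: int, warmup_steps: int = WARMUP_STEPS) -> float:
    """Learning rate 1 / sqrt(max(step, warmup_steps)).

    Assumed to match the schedule described in the T5 paper; the
    training script for this checkpoint may differ in detail.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


if __name__ == "__main__":
    # Print the rate at a few points of the 50k-step run.
    for step in (1, 10_000, 20_000, 50_000):
        print(f"step {step:>6}: lr = {inverse_sqrt_lr(step):.6f}")
```

Under this definition the rate stays constant at 0.01 for the first 10k steps and then decays proportionally to `1 / sqrt(step)`, so "warm-up" here means a flat plateau rather than a linear ramp.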