# Model Card for AIaLT-IICT/t5_char_bg_base
T5 model trained on Bulgarian literature, Web, and other datasets, tokenized at the character level.
## Model Details
470M-parameter T5 model trained on 10B words (54B characters) for 3 epochs with the T5 span-corruption objective at the character level. The main hyperparameters are listed below (a configuration sketch follows this list):

- Tokenizer vocabulary size: 512
- Hidden dimension: 1024
- Feed-forward dimension: 4096
- Hidden layers: 16 in both the encoder and the decoder
- **Developed by:** Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences.
- **Funded by:** The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage, a member of the pan-European research consortia CLARIN-ERIC and DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). Training was performed on the HEMUS supercomputer at IICT-BAS, part of the research infrastructure of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
- **Model type:** T5
- **Language(s) (NLP):** Bulgarian
- **License:** MIT
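
For reference, the architecture above corresponds roughly to the following `transformers` configuration. This is a hedged sketch: `num_heads` is an assumption not stated in this card, so prefer loading the checkpoint's own config for authoritative values.

```python
from transformers import T5Config

# Hypothetical configuration mirroring the numbers above; prefer
# T5Config.from_pretrained('AIaLT-IICT/t5_char_bg_base') for the real values.
config = T5Config(
    vocab_size=512,         # character-level tokenizer
    d_model=1024,           # hidden dimension
    d_ff=4096,              # feed-forward dimension
    num_layers=16,          # encoder layers
    num_decoder_layers=16,  # decoder layers
    num_heads=16,           # assumption: not stated in this card
)
```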
## Uses
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.
### Direct Use
```python
>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_char_bg_base')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')

>>> # [SEN_i] sentinel tokens mark the spans the model should fill in
>>> prompt = "Събудих се след[SEN_1]и отидох да си купя[SEN_2]."
>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])
'[CLS][SEN_1] полунощ [SEN_2] кола.[SEN_3][SEP]'
```
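
The model fills each sentinel in the prompt with a predicted character span: here " полунощ" ("midnight") for `[SEN_1]` and " кола." ("a car") for `[SEN_2]`, closing the output with a final sentinel, as is standard for the T5 span-corruption format.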
### Out-of-Scope Use
The model is trained only on the span-corruption task. If you want to use it for any other type of text generation, it is recommended to fine-tune it first.
### Recommendations
It is recommended to fine-tune the model on text-generation tasks that need character-level modeling, for example spelling correction (see the first sketch below). The encoder of the model alone can be used for text and token classification (see the second sketch below).
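
A minimal fine-tuning sketch for spelling correction, assuming a seq2seq setup where the misspelled text is the input and the corrected text is the target. The training pair and hyperparameters are hypothetical placeholders; a real setup would batch, pad, and mask labels properly (e.g. with `Seq2SeqTrainer`).

```python
import torch
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_char_bg_base')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')

# Hypothetical (misspelled, corrected) pairs; use a real corpus in practice.
pairs = [("Сбудих се рано.", "Събудих се рано.")]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for noisy, clean in pairs:
    inputs = tokenizer([noisy], return_tensors="pt", return_token_type_ids=False)
    labels = tokenizer([clean], return_tensors="pt", return_token_type_ids=False).input_ids
    loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```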
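
For classification, the decoder can be dropped entirely. The sketch below uses `T5EncoderModel` to load only the encoder weights; the mean pooling and the linear head are illustrative choices, not part of this checkpoint.

```python
import torch
from transformers import T5EncoderModel, PreTrainedTokenizerFast

encoder = T5EncoderModel.from_pretrained('AIaLT-IICT/t5_char_bg_base')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')
classifier = torch.nn.Linear(encoder.config.d_model, 2)  # hypothetical 2-class head

enc = tokenizer(["Примерно изречение."], return_tensors="pt", return_token_type_ids=False)
hidden = encoder(**enc).last_hidden_state  # (batch, characters, 1024)
logits = classifier(hidden.mean(dim=1))    # mean-pool over characters, then classify
```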
## Training Details
### Training Data
Trained on 10B words (54B characters) consisting of the deduplicated union of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- HPLT 2.0 Bulgarian (Cyrillic) cleaned
- Literature
- Wikipedia
- others
### Training Procedure
Trained with the T5 span-corruption objective (25% noise density, mean noise-span length of 7 characters) for 3 epochs, using bf16 mixed precision, an input length of 1024 tokens, and a batch size of 256 × 1024 tokens.
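
The sketch below illustrates what character-level span corruption produces. It is a simplification of the actual T5 noising algorithm (which samples span lengths and packs sequences); the `[SEN_i]` sentinel naming follows the Direct Use example above.

```python
import random

def span_corrupt(text, noise_density=0.25, mean_span_length=7, seed=0):
    """Replace random character spans with [SEN_i] sentinels and build the
    matching target; illustrative only, not the training implementation."""
    rng = random.Random(seed)
    n_noise_chars = int(len(text) * noise_density)
    n_spans = max(1, round(n_noise_chars / mean_span_length))
    starts = sorted(rng.sample(range(len(text) - mean_span_length), n_spans))
    source, target, prev = [], [], 0
    for i, start in enumerate(starts, start=1):
        start = max(start, prev)               # keep spans non-overlapping
        end = min(start + mean_span_length, len(text))
        source.append(text[prev:start] + f"[SEN_{i}]")
        target.append(f"[SEN_{i}]" + text[start:end])
        prev = end
    source.append(text[prev:])
    target.append(f"[SEN_{len(starts) + 1}]")  # closing sentinel
    return "".join(source), "".join(target)

# e.g. span_corrupt("Събудих се след полунощ и отидох да си купя кола.")
```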
## Evaluation
The model is evaluated on the T5 span-corruption objective that it was trained on. It achieves a test loss of 1.38 and a test accuracy of 71.50%.
## Model Card Authors
Nikolay Paev, Kiril Simov
## Model Card Contact