# Model Card for AIaLT-IICT/t5_char_bg_base
T5 model trained on Bulgarian literature, Web, and other datasets, tokenized at the character level.
## Model Details
470M-parameter T5 model trained on 10B words (54B characters) for 3 epochs with the T5 span-corruption objective at the character level. The main hyperparameters are listed below (a configuration sketch follows this list):

- Tokenizer vocabulary size: 512
- Hidden dimension: 1024
- Feed-forward dimension: 4096
- Hidden layers: 16 in both the encoder and the decoder
- **Developed by:** Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences.
- **Funded by:** The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage, a member of the pan-European research consortia CLARIN-ERIC and DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). Training was performed on the HEMUS supercomputer at IICT-BAS, part of the research infrastructure of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
- **Model type:** T5
- **Language(s) (NLP):** Bulgarian
- **License:** MIT
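
For reference, the architecture above corresponds roughly to the following `transformers` configuration. This is a hedged sketch: `num_heads` is an assumption not stated in this card, so prefer loading the checkpoint's own config for authoritative values.

```python
from transformers import T5Config

# Hypothetical configuration mirroring the numbers above; prefer
# T5Config.from_pretrained('AIaLT-IICT/t5_char_bg_base') for the real values.
config = T5Config(
    vocab_size=512,         # character-level tokenizer
    d_model=1024,           # hidden dimension
    d_ff=4096,              # feed-forward dimension
    num_layers=16,          # encoder layers
    num_decoder_layers=16,  # decoder layers
    num_heads=16,           # assumption: not stated in this card
)
```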
## Uses
The model is intended to be used as a base model for fine-tuning on downstream NLP tasks.
### Direct Use
```python
>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_char_bg_base')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')

>>> # [SEN_i] sentinel tokens mark the spans the model should fill in
>>> prompt = "Събудих се след[SEN_1]и отидох да си купя[SEN_2]."
>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])
'[CLS][SEN_1] полунощ [SEN_2] кола.[SEN_3][SEP]'
```
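
The model fills each sentinel in the prompt with a predicted character span: here " полунощ" ("midnight") for `[SEN_1]` and " кола." ("a car") for `[SEN_2]`, closing the output with a final sentinel, as is standard for the T5 span-corruption format.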
### Out-of-Scope Use
The model is trained only on the span-corruption task. If you want to use it for any other type of text generation, it is recommended to fine-tune it first.
### Recommendations
It is recommended to fine-tune the model on text-generation tasks that need character-level modeling, for example spelling correction (see the first sketch below). The encoder of the model alone can be used for text and token classification (see the second sketch below).
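
A minimal fine-tuning sketch for spelling correction, assuming a seq2seq setup where the misspelled text is the input and the corrected text is the target. The training pair and hyperparameters are hypothetical placeholders; a real setup would batch, pad, and mask labels properly (e.g. with `Seq2SeqTrainer`).

```python
import torch
from transformers import T5ForConditionalGeneration, PreTrainedTokenizerFast

model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_char_bg_base')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')

# Hypothetical (misspelled, corrected) pairs; use a real corpus in practice.
pairs = [("Сбудих се рано.", "Събудих се рано.")]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for noisy, clean in pairs:
    inputs = tokenizer([noisy], return_tensors="pt", return_token_type_ids=False)
    labels = tokenizer([clean], return_tensors="pt", return_token_type_ids=False).input_ids
    loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```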
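
For classification, the decoder can be dropped entirely. The sketch below uses `T5EncoderModel` to load only the encoder weights; the mean pooling and the linear head are illustrative choices, not part of this checkpoint.

```python
import torch
from transformers import T5EncoderModel, PreTrainedTokenizerFast

encoder = T5EncoderModel.from_pretrained('AIaLT-IICT/t5_char_bg_base')
tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')
classifier = torch.nn.Linear(encoder.config.d_model, 2)  # hypothetical 2-class head

enc = tokenizer(["Примерно изречение."], return_tensors="pt", return_token_type_ids=False)
hidden = encoder(**enc).last_hidden_state  # (batch, characters, 1024)
logits = classifier(hidden.mean(dim=1))    # mean-pool over characters, then classify
```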
## Training Details
### Training Data
Trained on 10B words (54B characters) consisting of the deduplicated union of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- HPLT 2.0 Bulgarian (Cyrillic) cleaned
- Literature
- Wikipedia
- others
### Training Procedure
Trained with the T5 span-corruption objective (25% noise density, mean noise-span length of 7 characters) for 3 epochs, using bf16 mixed precision, an input length of 1024 tokens, and a batch size of 256 × 1024 tokens.
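
The sketch below illustrates what character-level span corruption produces. It is a simplification of the actual T5 noising algorithm (which samples span lengths and packs sequences); the `[SEN_i]` sentinel naming follows the Direct Use example above.

```python
import random

def span_corrupt(text, noise_density=0.25, mean_span_length=7, seed=0):
    """Replace random character spans with [SEN_i] sentinels and build the
    matching target; illustrative only, not the training implementation."""
    rng = random.Random(seed)
    n_noise_chars = int(len(text) * noise_density)
    n_spans = max(1, round(n_noise_chars / mean_span_length))
    starts = sorted(rng.sample(range(len(text) - mean_span_length), n_spans))
    source, target, prev = [], [], 0
    for i, start in enumerate(starts, start=1):
        start = max(start, prev)               # keep spans non-overlapping
        end = min(start + mean_span_length, len(text))
        source.append(text[prev:start] + f"[SEN_{i}]")
        target.append(f"[SEN_{i}]" + text[start:end])
        prev = end
    source.append(text[prev:])
    target.append(f"[SEN_{len(starts) + 1}]")  # closing sentinel
    return "".join(source), "".join(target)

# e.g. span_corrupt("Събудих се след полунощ и отидох да си купя кола.")
```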
## Evaluation
The model is evaluated on the T5 span-corruption objective that it was trained on. It achieves a test loss of 1.38 and a test accuracy of 71.50%.
## Model Card Authors
Nikolay Paev, Kiril Simov
## Model Card Contact