Model Card for AIaLT-IICT/t5_char_bg_base

T5 model trained on Bulgarian literature, Web, and other datasets, tokenized at the character level.

Model Details

A 470M-parameter T5 model trained on 10B words (54B characters) for 3 epochs with the T5 span-corruption objective at the character level.

Uses

The model is intended to be used as a base model for fine-tuning on NLP tasks.

Direct Use

>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_char_bg_base')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_char_bg_base')

>>> prompt = "Събудих се след[SEN_1]и отидох да си купя[SEN_2]."

>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])

'[CLS][SEN_1] полунощ [SEN_2] кола.[SEN_3][SEP]'

Out-of-Scope Use

The model is trained only on the span-corruption task. If you want to use it for any other type of text generation, it is recommended to fine-tune it first.

Recommendations

It is recommended to use the model as a base for fine-tuning on text generation tasks that need character-level modeling, for example spelling correction. The encoder of the model can also be used on its own for text and token classification.
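The encoder-only pattern can be sketched with `T5EncoderModel`, which drops the decoder and returns per-character hidden states that a classification head can consume. The tiny random config and linear head below are placeholders so the snippet runs offline; for real use you would load `T5EncoderModel.from_pretrained('AIaLT-IICT/t5_char_bg_base')`:

```python
import torch
from transformers import T5Config, T5EncoderModel

# Placeholder config; in practice:
#   encoder = T5EncoderModel.from_pretrained('AIaLT-IICT/t5_char_bg_base')
config = T5Config(
    vocab_size=64, d_model=32, d_kv=8, d_ff=64,
    num_layers=2, num_heads=4,
)
encoder = T5EncoderModel(config)
encoder.eval()

input_ids = torch.randint(0, 64, (1, 20))  # stands in for character token ids
with torch.no_grad():
    hidden = encoder(input_ids=input_ids).last_hidden_state  # (batch, seq_len, d_model)

# Hypothetical 3-class head: per-character logits for token classification
# (e.g. per-character error tags), or mean-pool for text classification.
head = torch.nn.Linear(config.d_model, 3)
token_logits = head(hidden)           # shape (1, 20, 3)
text_logits = head(hidden.mean(1))    # shape (1, 3)
print(token_logits.shape, text_logits.shape)
```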

Training Details

Training Data

Trained on 10B tokens consisting of a deduplicated union of:

Training Procedure

Trained with the T5 span-corruption objective with 25% noise density and a mean noise span length of 7 characters, for 3 epochs with bf16 mixed precision, an input length of 1024 tokens, and a batch size of 256*1024 tokens.
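The character-level span corruption can be illustrated with a pure-Python sketch. This is a simplified illustration, not the actual training pipeline: span positions are hard-coded here, whereas real training samples them to hit the stated 25% noise density and 7-character mean span length:

```python
# Simplified character-level span corruption, mimicking the sentinel
# format shown in the Direct Use example above.
def corrupt(text, spans):
    """Replace each (start, end) character span with a sentinel token and
    collect the removed characters as the target sequence."""
    input_parts, target_parts = [], []
    prev = 0
    for i, (start, end) in enumerate(spans, start=1):
        sentinel = f"[SEN_{i}]"
        input_parts.append(text[prev:start])
        input_parts.append(sentinel)
        target_parts.append(sentinel + text[start:end])
        prev = end
    input_parts.append(text[prev:])
    # A final sentinel marks the end of the last span.
    target_parts.append(f"[SEN_{len(spans) + 1}]")
    return "".join(input_parts), "".join(target_parts)

src = "Събудих се след полунощ и отидох да си купя кола."
corrupted, target = corrupt(src, [(15, 24), (43, 48)])
print(corrupted)  # Събудих се след[SEN_1]и отидох да си купя[SEN_2].
print(target)     # [SEN_1] полунощ [SEN_2] кола[SEN_3]
```

The corrupted string is the encoder input and the sentinel-delimited target is what the decoder learns to produce, which is why the generation example above yields output of the form `[SEN_1] ... [SEN_2] ... [SEN_3]`.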

Evaluation

The model is evaluated on the T5 span-corruption objective that it was trained on. It achieves a test loss of 1.38 and a test accuracy of 71.50%.

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

nikolay.paev@iict.bas.bg
