Model Card for AIaLT-IICT/t5_bg_1B_uncased

An uncased T5 model trained on Bulgarian literature, web data, parallel English-Bulgarian texts, Bulgarian and English Wikipedia, and other datasets.

Model Details

A 1.1B-parameter T5 model trained on 35B tokens (41B depending on tokenization) for 3 epochs with the T5 Span Corruption objective.

Uses

The model is intended as a base model for fine-tuning on NLP tasks.

Direct Use

>>> import torch
>>> from transformers import (
...     T5ForConditionalGeneration,
...     PreTrainedTokenizerFast
... )

>>> model = T5ForConditionalGeneration.from_pretrained('AIaLT-IICT/t5_bg_1B_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/t5_bg_1B_uncased')

>>> prompt = "Събудих се след[SEN_0] и отидох да си купя[SEN_1]."

>>> model_inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=True, return_token_type_ids=False)
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.decode(generated_ids[0])

'[CLS][SEN_0] тежък ден[SEN_1] сладолед[SEN_2][SEP]'
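The decoded output above lists the predicted spans, each introduced by the sentinel it replaces. A minimal sketch of merging them back into the prompt follows. The helper names `parse_spans` and `fill_prompt` are hypothetical and not part of the model or of transformers, and the code assumes the `[CLS]`, `[SEP]`, and `[SEN_n]` markers appear exactly as in the example.

```python
import re
from typing import Dict

# Sentinel format taken from the example above: [SEN_0], [SEN_1], ...
SENTINEL = re.compile(r"\[SEN_(\d+)\]")


def parse_spans(decoded: str) -> Dict[int, str]:
    """Map each sentinel index to the text generated after it (hypothetical helper)."""
    # Drop the sequence boundary markers seen in the decoded output.
    text = decoded.replace("[CLS]", "").replace("[SEP]", "")
    # Splitting on a pattern with one capture group yields
    # [prefix, id_0, text_0, id_1, text_1, ...].
    parts = SENTINEL.split(text)
    spans = {}
    for i in range(1, len(parts) - 1, 2):
        spans[int(parts[i])] = parts[i + 1]
    return spans


def fill_prompt(prompt: str, spans: Dict[int, str]) -> str:
    """Replace each sentinel in the prompt with its predicted span (hypothetical helper)."""
    # Sentinels without a prediction are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), prompt)


prompt = "Събудих се след[SEN_0] и отидох да си купя[SEN_1]."
decoded = "[CLS][SEN_0] тежък ден[SEN_1] сладолед[SEN_2][SEP]"

filled = fill_prompt(prompt, parse_spans(decoded))
print(filled)
```

The final sentinel in the output (`[SEN_2]` here) only closes the last span, so it maps to an empty string and is never looked up when filling the prompt.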

Out-of-Scope Use

The model is trained on a span corruption task. For any other type of text generation, fine-tuning is recommended.

Recommendations

The model is best used as a starting point for fine-tuning on text generation tasks. The encoder alone can be used for text and token classification.
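As an illustration of encoder-only use, the sketch below runs a `T5EncoderModel` and attaches two untrained linear heads, one for text classification (over mean-pooled states) and one for token classification. To stay self-contained it builds a tiny, randomly initialized encoder; the config sizes, the input ids, `num_labels`, and the `mean_pool` helper are all assumptions made for the example. For the real model, the encoder would presumably be loaded with `T5EncoderModel.from_pretrained('AIaLT-IICT/t5_bg_1B_uncased')` and the heads trained during fine-tuning.

```python
import torch
from transformers import T5Config, T5EncoderModel


def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the hidden states over non-padding positions (hypothetical helper)."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


# Tiny random encoder so the example runs without downloading weights.
# These sizes are illustrative only and do not match the released model.
config = T5Config(vocab_size=128, d_model=32, d_kv=8, d_ff=64, num_layers=2, num_heads=4)
encoder = T5EncoderModel(config).eval()

# Made-up token ids; 0 is the padding id in the default T5Config.
input_ids = torch.tensor([[5, 17, 42, 9, 0, 0], [7, 3, 11, 25, 31, 8]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])

with torch.no_grad():
    hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

num_labels = 3  # assumed number of classes
text_head = torch.nn.Linear(config.d_model, num_labels)
token_head = torch.nn.Linear(config.d_model, num_labels)

text_logits = text_head(mean_pool(hidden, attention_mask))  # one prediction per text
token_logits = token_head(hidden)                           # one prediction per token
print(text_logits.shape, token_logits.shape)
```

Mean pooling is only one possible choice for the text-level representation; the model card does not prescribe a pooling strategy.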

Training Details

Training Data

Trained on 29B tokens consisting of deduplicated union of:

Training Procedure

Trained with the T5 Span Corruption objective with a noise density of 25% and a mean noise span length of 3 tokens, for 3 epochs, with bf16 mixed precision, an input length of 512 tokens, and a batch size of 256*512 tokens.
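To make these hyperparameters concrete, the following back-of-the-envelope sketch derives a few quantities from them. It assumes that 512 is the sequence length to which the noise density is applied and that "256*512 tokens" means 256 sequences of 512 tokens each; the exact preprocessing used for training is not described here, so the numbers are approximate.

```python
# Values stated in the training procedure.
input_length = 512         # tokens per input sequence
noise_density = 0.25       # fraction of tokens that are masked
mean_span_length = 3       # average length of a masked span, in tokens
sequences_per_batch = 256  # assumed reading of "256*512 tokens"

# Roughly how many tokens are masked in each sequence.
noise_tokens = round(input_length * noise_density)

# Roughly how many spans (and thus sentinel tokens) each sequence contains.
num_spans = round(noise_tokens / mean_span_length)

# Tokens processed per optimization step.
tokens_per_batch = sequences_per_batch * input_length

print(noise_tokens, num_spans, tokens_per_batch)
```

Under these assumptions, each sequence has about 128 masked tokens spread over about 43 spans, and each batch holds 131,072 tokens.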

Evaluation

The model is evaluated on the T5 Span Corruption objective it was trained on. It achieves a test loss of 1.38 and a test accuracy of 71.50%.

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

nikolay.paev@iict.bas.bg
