BanglaByT5, introduced in "BanglaByT5: Byte-Level Modelling for Bangla", is an encoder–decoder transformer pretrained at the byte level for Bangla language understanding and generation tasks. Because it operates on raw UTF-8 bytes rather than subword tokens, it needs no language-specific vocabulary and can capture fine-grained morphological and orthographic patterns, which helps it handle diverse Bangla text sources.
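To make "operating on raw bytes" concrete, here is a minimal, dependency-free sketch of ByT5-style encoding. It assumes the convention of the upstream ByT5 tokenizer (ids 0, 1 and 2 reserved for pad, end-of-sequence and unknown, so each UTF-8 byte value is shifted by 3); the helper names `encode_bytes` and `decode_bytes` are hypothetical and are not part of this model's API.

```python
# Sketch of ByT5-style byte-level encoding (illustrative helpers, not library code).
# Assumption: ids 0/1/2 are pad/eos/unk, so byte value b maps to id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1
MAX_BYTE_ID = 255 + SPECIAL_OFFSET  # highest id that still denotes a raw byte


def encode_bytes(text: str) -> list[int]:
    """Map text to one id per UTF-8 byte, followed by the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping special ids outside the byte range."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if SPECIAL_OFFSET <= i <= MAX_BYTE_ID)
    return raw.decode("utf-8", errors="ignore")


text = "আমার নাম প্রমিত।"
ids = encode_bytes(text)

# Bangla code points occupy three bytes each in UTF-8, so the id sequence is
# roughly three times longer than the character count.
print(f"characters: {len(text)}, byte tokens (incl. EOS): {len(ids)}")
print(decode_bytes(ids) == text)
```

The practical consequence is that sequences are much longer than with a subword tokenizer, which matters when choosing input and output length limits.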

Usage Example

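from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the byte-level tokenizer and the model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Vacaspati/BanglaByT5")
model = AutoModelForSeq2SeqLM.from_pretrained("Vacaspati/BanglaByT5")

# Tokenize input: the text is split into UTF-8 bytes, one token per byte
input_text = "আমার নাম প্রমিত।"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text, passing the attention mask along with the input ids.
# max_length is counted in byte tokens, and each Bangla character takes
# three bytes, so raise it if longer outputs are expected.
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))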
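Since the model produces one token per UTF-8 byte, length limits passed to `generate` are byte counts rather than character counts. The sketch below shows one way to estimate such a limit; the helper names and the default of three bytes per character are illustrative assumptions (three bytes holds for Bangla letters and the danda, while ASCII characters such as spaces and digits take one), not values recommended by the model authors.

```python
# Hypothetical helpers for sizing generation limits on a byte-level model.

def byte_token_budget(expected_chars: int, bytes_per_char: int = 3, eos_tokens: int = 1) -> int:
    """Upper-bound the number of byte tokens needed for an output of expected_chars."""
    return expected_chars * bytes_per_char + eos_tokens


def utf8_token_count(text: str) -> int:
    """Exact number of byte tokens for a known string, including one EOS token."""
    return len(text.encode("utf-8")) + 1


# Budget for an output of about 100 Bangla characters.
budget = byte_token_budget(100)
print(budget)
```

A value computed this way could be passed as `max_new_tokens` to `model.generate`; treating it as an upper bound is deliberate, because mixed Bangla and ASCII text needs fewer bytes than the estimate.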

The Vācaspati dataset is available at: https://huggingface.co/datasets/Vacaspati/Vacaspati

Citation

If you use this model, please cite:

@inproceedings{bhattacharyya-bhattacharya-2025-banglabyt5,
    title = "{B}angla{B}y{T}5: Byte-Level Modelling for {B}angla",
    author = "Bhattacharyya, Pramit  and
      Bhattacharya, Arnab",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.297/",
    doi = "10.18653/v1/2025.findings-emnlp.297",
    pages = "5551--5560",
    ISBN = "979-8-89176-335-7",
}

Model size: 0.3B params (Safetensors, tensor type F32)

Base model: google/byt5-small