# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for Masked Language Modeling (MLM) in Telugu. It is designed to understand and generate Telugu text effectively.
## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based Masked Language Model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT

### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook
## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks

### Downstream Use

Fine-tuned versions of this model can be used for:

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization

### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained on code-mixed (Telugu + English) text
## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may generate incorrect or misleading predictions.

### Recommendations

Users should verify the model's outputs before relying on them in critical applications.
## How to Get Started with the Model

Use the code below to get started:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"

# Load the tokenizer and the masked-LM head from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Build a fill-mask pipeline and predict the [MASK] token; the pipeline
# returns a list of candidate dicts with "score", "token_str", and "sequence"
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```
## Training Details

### Training Data

The model was trained on a Telugu corpus drawn from diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.

### Training Procedure

#### Preprocessing

Text was tokenized with a WordPiece tokenizer using a vocabulary of 30,000 tokens.
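To illustrate how WordPiece splits a word into subword units, here is a minimal sketch of its greedy longest-match-first algorithm in plain Python. The toy vocabulary below is a made-up example for demonstration only; it is not this model's actual 30,000-token vocabulary.

```python
# Minimal sketch of WordPiece's greedy longest-match-first tokenization.
# The toy vocabulary is illustrative, not the model's real vocabulary.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching subwords, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return [unk]  # no subword matched: emit the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # → ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # → ['play', '##ing']
```

The same longest-match principle applies to Telugu text, just over a vocabulary learned from the Telugu corpus.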
#### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW
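For intuition about what these hyperparameters imply, the sketch below computes the number of optimizer steps and the linear learning-rate decay commonly paired with AdamW in BERT training. Only the batch size (16), epochs (3), and peak learning rate (5e-5) come from this card; the corpus size is a hypothetical placeholder, and the linear schedule is the conventional choice, not one the card confirms.

```python
import math

# Hyperparameters from the card; corpus size is assumed for illustration.
peak_lr = 5e-5
batch_size = 16
epochs = 3
num_examples = 100_000  # hypothetical placeholder, not from the card

# Total optimizer steps across all epochs
total_steps = math.ceil(num_examples / batch_size) * epochs

def linear_decay_lr(step):
    """Linearly anneal the learning rate from peak_lr down to 0."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

print(total_steps)                 # → 18750
print(linear_decay_lr(0))          # → 5e-05 (peak at the first step)
```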
## Evaluation

### Testing Data

The model was evaluated on a held-out dataset of Telugu text.
## Technical Specifications

### Model Architecture and Objective

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training objective:** Masked Language Modeling (MLM)
- **Dataset library:** `datasets`
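The MLM objective trains the model to recover tokens hidden from the input. The sketch below shows the standard BERT masking recipe (select 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The card states the MLM objective but not these exact ratios, so treat this as the conventional recipe rather than this model's confirmed procedure; the `mask_id` and vocabulary size are illustrative.

```python
import random

# Standard BERT MLM masking recipe (assumed ratios, not confirmed by the card).
def mask_tokens(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i in range(len(inputs)):
        if rng.random() < mask_prob:
            labels[i] = inputs[i]  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
ids, labels = mask_tokens(list(range(20)), vocab_size=30000, mask_id=4, rng=rng)
```

During training, the cross-entropy loss is computed only at the positions where `labels` is not -100.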
## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```
## Model Card Authors

MATHIARASI

## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.