# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for Masked Language Modeling (MLM) in Telugu. It is designed to understand and complete Telugu text effectively.

## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based Masked Language Model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT

### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook

## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks

### Downstream Use

Fine-tuned versions of this model can be used for:

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization

### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained for code-mixed text (Telugu + English)

## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- It may generate incorrect or misleading predictions.

### Recommendations

Users should verify the model's outputs before relying on them for critical applications.

## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```

## Training Details

### Training Data

The model is trained on a Telugu corpus containing diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.

### Training Procedure

#### Preprocessing

Used a WordPiece tokenizer with a vocabulary of 30,000 tokens.
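As a rough illustration (not the author's actual pipeline), the normalization and cleaning steps described above might look like this in plain Python:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Illustrative cleanup: Unicode normalization plus whitespace cleaning."""
    # NFC normalization gives combining Telugu characters a single canonical form
    text = unicodedata.normalize("NFC", text)
    # Replace stray control characters (common in scraped web text) with spaces
    text = "".join(ch if unicodedata.category(ch) != "Cc" else " " for ch in text)
    # Collapse runs of whitespace to a single space
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("  తెలుగు\tభాష \n"))  # -> "తెలుగు భాష"
```

The cleaned text would then be fed to the WordPiece tokenizer described below.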

#### Training Hyperparameters

- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW
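For readers unfamiliar with AdamW, a sketch of the decoupled-weight-decay update it performs is shown below for a single scalar parameter, using the card's learning rate of 5e-5 as the default (the weight-decay value is an assumption, not taken from the card):

```python
import math

def adamw_step(theta, grad, m, v, t, lr=5e-5, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    m = b1 * m + (1 - b1) * grad              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad       # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied to the parameter itself, not the gradient
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

Real training would of course apply this element-wise over all model tensors; libraries such as PyTorch provide this as a built-in optimizer.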

#### Speeds, Sizes, Times

Not reported.

## Evaluation

### Testing Data

Evaluated on a held-out dataset of Telugu text.

## Technical Specifications

### Model Architecture and Objective

- **Model Type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training Objective:** Masked Language Modeling (MLM)
- **Dataset library:** datasets
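The MLM objective trains the model to reconstruct tokens that are hidden at training time. A minimal sketch of BERT-style masking follows, assuming a hypothetical `[MASK]` token id of 4 and the 30,000-token vocabulary described above; the 80/10/10 split follows the original BERT recipe, not anything stated in this card:

```python
import random

MASK_ID = 4          # hypothetical [MASK] token id
VOCAB_SIZE = 30000   # matches the tokenizer vocabulary described above

def mask_tokens(token_ids, mlm_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: leave the token unchanged
    return inputs, labels
```

In practice, `transformers` implements this logic in its `DataCollatorForLanguageModeling` class.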

## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```

## Model Card Authors

MATHIARASI

## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.

