---
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
language:
- te
pipeline_tag: fill-mask
---
# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for masked language modeling (MLM) in Telugu. It is designed to understand Telugu text and predict masked tokens in context.
## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based masked language model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT

### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook
## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks
### Downstream Use

Fine-tuned versions of this model can be used for the following tasks (a fine-tuning sketch follows the list):

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization
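As an illustration of the downstream path, here is a minimal fine-tuning sketch for one of these tasks (sentiment analysis). The two-sentence dataset, the binary label scheme, and the hyperparameters are placeholders for illustration, not part of this model's release:

```python
# Hypothetical sketch: fine-tuning this checkpoint for Telugu sentiment analysis.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The MLM head is dropped; a fresh 2-way classification head is initialized.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data: any labeled Telugu corpus with "text"/"label" columns works.
train_ds = Dataset.from_dict({
    "text": ["ఈ సినిమా చాలా బాగుంది.", "ఈ సినిమా నచ్చలేదు."],
    "label": [1, 0],
})
train_ds = train_ds.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="telugu-sentiment", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```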
### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained for code-mixed text (Telugu + English)
## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may produce incorrect or misleading predictions.
### Recommendations

Users should verify the model's outputs before relying on them for critical applications.
## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"

# Load the tokenizer and the masked-LM head from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Build a fill-mask pipeline and predict the [MASK] token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```
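Each call returns a ranked list of candidate fills; every entry is a dict with `sequence`, `score`, `token`, and `token_str` fields.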
## Training Details

### Training Data

The model was trained on a Telugu corpus containing diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.
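The exact preprocessing recipe is not published. A minimal sketch of the normalization and cleaning steps, assuming NFC Unicode normalization and a filter to the Telugu Unicode block (U+0C00–U+0C7F), might look like this:

```python
import re
import unicodedata

def normalize_telugu(text: str) -> str:
    """Normalize and clean raw Telugu text (illustrative assumptions)."""
    # Canonical Unicode normalization so visually identical sequences
    # share one code-point representation.
    text = unicodedata.normalize("NFC", text)
    # Keep the Telugu block, digits, whitespace, and basic punctuation;
    # this character whitelist is an assumption, not the published recipe.
    text = re.sub(r"[^\u0C00-\u0C7F0-9\s.,!?]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_telugu("మక్దూంపల్లి   పేరుతో!!"))
```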
### Training Procedure

#### Preprocessing

A WordPiece tokenizer with a vocabulary of 30,000 tokens was used.
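The tokenizer training itself is not included in this card; a 30,000-token WordPiece vocabulary can be built with the `tokenizers` library roughly as follows (the corpus path and the special-token set are assumptions):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Build a WordPiece model with BERT-style special tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # matches the vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "telugu_corpus.txt" is a placeholder path for the training corpus.
tokenizer.train(files=["telugu_corpus.txt"], trainer=trainer)
tokenizer.save("telugu-wordpiece.json")
```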
#### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW
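Put together, these hyperparameters correspond to a run along the following lines. Loading the released checkpoint here only keeps the sketch self-contained (training from scratch would start from a fresh `BertConfig`, as sketched under Technical Specifications below); the two-sentence corpus and the 15% masking rate (BERT's default) are assumptions:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpus; in practice this is the full Telugu training corpus.
corpus = Dataset.from_dict(
    {"text": ["తెలుగు ఒక ద్రావిడ భాష.", "హైదరాబాద్ తెలంగాణ రాజధాని."]}
)
corpus = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamically masks 15% of tokens in each batch (assumed BERT default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="telugu-bert-mlm",
    per_device_train_batch_size=16,  # batch size 16, as stated above
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=3,              # 3 epochs
)

# Trainer uses AdamW by default, matching the optimizer stated above.
trainer = Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator)
trainer.train()
```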
## Evaluation

### Testing Data

The model was evaluated on a held-out dataset of Telugu text.
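No metric values are reported. If evaluation is run with `Trainer.evaluate()`, a common summary of held-out MLM performance is the perplexity implied by the evaluation loss:

```python
import math

# eval_loss is the mean masked-LM cross-entropy from Trainer.evaluate().
eval_loss = 2.0  # placeholder value, not a reported result
print(f"pseudo-perplexity: {math.exp(eval_loss):.2f}")
```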
## Technical Specifications

### Model Architecture and Objective

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training objective:** Masked language modeling (MLM)
- **Dataset library:** `datasets`
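The card does not specify the layer count or hidden size. If the model follows the standard BERT-base configuration, it would be instantiated along these lines; every dimension except `vocab_size` is an assumption:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,        # matches the WordPiece vocabulary above
    hidden_size=768,          # assumed BERT-base defaults from here down
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```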
## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```
## Model Card Authors

MATHIARASI

## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.