---
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
language:
- te
pipeline_tag: fill-mask
---
# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for masked language modeling (MLM) in Telugu. It is designed to understand Telugu text and predict masked tokens in context.
## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based masked language model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT

### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook
## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks
### Downstream Use

Fine-tuned versions of this model can be used for the following tasks (a fine-tuning sketch follows the list):

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization
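As an illustration of the downstream path, here is a minimal fine-tuning sketch for one of these tasks (sentiment analysis). The two-sentence dataset, the binary label scheme, and the hyperparameters are placeholders for illustration, not part of this model's release:

```python
# Hypothetical sketch: fine-tuning this checkpoint for Telugu sentiment analysis.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The MLM head is dropped; a fresh 2-way classification head is initialized.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data: any labeled Telugu corpus with "text"/"label" columns works.
train_ds = Dataset.from_dict({
    "text": ["ఈ సినిమా చాలా బాగుంది.", "ఈ సినిమా నచ్చలేదు."],
    "label": [1, 0],
})
train_ds = train_ds.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    ),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="telugu-sentiment", num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```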
### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained for code-mixed text (Telugu + English)
## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may produce incorrect or misleading predictions.
### Recommendations

Users should verify the model's outputs before relying on them for critical applications.
## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"

# Load the tokenizer and the masked-LM head from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Build a fill-mask pipeline and predict the [MASK] token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```
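Each call returns a ranked list of candidate fills; every entry is a dict with `sequence`, `score`, `token`, and `token_str` fields.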
## Training Details

### Training Data

The model was trained on a Telugu corpus containing diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.
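The exact preprocessing recipe is not published. A minimal sketch of the normalization and cleaning steps, assuming NFC Unicode normalization and a filter to the Telugu Unicode block (U+0C00–U+0C7F), might look like this:

```python
import re
import unicodedata

def normalize_telugu(text: str) -> str:
    """Normalize and clean raw Telugu text (illustrative assumptions)."""
    # Canonical Unicode normalization so visually identical sequences
    # share one code-point representation.
    text = unicodedata.normalize("NFC", text)
    # Keep the Telugu block, digits, whitespace, and basic punctuation;
    # this character whitelist is an assumption, not the published recipe.
    text = re.sub(r"[^\u0C00-\u0C7F0-9\s.,!?]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_telugu("మక్దూంపల్లి   పేరుతో!!"))
```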
### Training Procedure

#### Preprocessing

A WordPiece tokenizer with a vocabulary of 30,000 tokens was used.
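The tokenizer training itself is not included in this card; a 30,000-token WordPiece vocabulary can be built with the `tokenizers` library roughly as follows (the corpus path and the special-token set are assumptions):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Build a WordPiece model with BERT-style special tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # matches the vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "telugu_corpus.txt" is a placeholder path for the training corpus.
tokenizer.train(files=["telugu_corpus.txt"], trainer=trainer)
tokenizer.save("telugu-wordpiece.json")
```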
#### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW
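Put together, these hyperparameters correspond to a run along the following lines. Loading the released checkpoint here only keeps the sketch self-contained (training from scratch would start from a fresh `BertConfig`, as sketched under Technical Specifications below); the two-sentence corpus and the 15% masking rate (BERT's default) are assumptions:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpus; in practice this is the full Telugu training corpus.
corpus = Dataset.from_dict(
    {"text": ["తెలుగు ఒక ద్రావిడ భాష.", "హైదరాబాద్ తెలంగాణ రాజధాని."]}
)
corpus = corpus.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamically masks 15% of tokens in each batch (assumed BERT default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="telugu-bert-mlm",
    per_device_train_batch_size=16,  # batch size 16, as stated above
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=3,              # 3 epochs
)

# Trainer uses AdamW by default, matching the optimizer stated above.
trainer = Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator)
trainer.train()
```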
## Evaluation

### Testing Data

The model was evaluated on a held-out dataset of Telugu text.
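No metric values are reported. If evaluation is run with `Trainer.evaluate()`, a common summary of held-out MLM performance is the perplexity implied by the evaluation loss:

```python
import math

# eval_loss is the mean masked-LM cross-entropy from Trainer.evaluate().
eval_loss = 2.0  # placeholder value, not a reported result
print(f"pseudo-perplexity: {math.exp(eval_loss):.2f}")
```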
## Technical Specifications

### Model Architecture and Objective

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training objective:** Masked language modeling (MLM)
- **Dataset library:** `datasets`
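The card does not specify the layer count or hidden size. If the model follows the standard BERT-base configuration, it would be instantiated along these lines; every dimension except `vocab_size` is an assumption:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,        # matches the WordPiece vocabulary above
    hidden_size=768,          # assumed BERT-base defaults from here down
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```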
## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```
## Model Card Authors

MATHIARASI

## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.