---
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
language:
- te
pipeline_tag: fill-mask
---
|
|
|
|
|
# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for Masked Language Modeling (MLM) in Telugu. It is designed to understand Telugu text and to predict masked words in context.
|
|
|
|
|
## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based Masked Language Model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT
|
|
### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook
|
|
|
|
|
## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks
|
|
|
|
|
### Downstream Use

Fine-tuned versions of this model can be used for:

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization

A fine-tuning sketch for one of these tasks is shown below.
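The sketch below adapts the checkpoint for sentiment analysis. It is a minimal illustration, not the author's recipe: the dataset name is a hypothetical placeholder (any labeled set with `text` and `label` columns would work), and `num_labels=2` assumes binary sentiment.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The MLM head is discarded; a freshly initialized classification
# head is attached on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset name; substitute a real labeled Telugu corpus.
dataset = load_dataset("telugu-sentiment-placeholder")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tmod-sentiment"),
    train_dataset=dataset["train"],
)
trainer.train()
```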
|
|
|
|
|
### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained on code-mixed (Telugu + English) text
|
|
|
|
|
## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may produce incorrect or misleading predictions.
|
|
|
|
|
### Recommendations

Users should verify the model's outputs before relying on them for critical applications.
|
|
|
|
|
## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"

# Load the tokenizer and the masked-language-modeling head from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the [MASK] token in a Telugu sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```
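For finer control than the pipeline offers (for example, inspecting more candidates), the logits at the masked position can be read directly. A minimal sketch under the same checkpoint assumption:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mathiarasi/TMod")
model = AutoModelForMaskedLM.from_pretrained("Mathiarasi/TMod")

text = "మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens there.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```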
|
|
|
|
|
## Training Details

### Training Data

The model is trained on a Telugu corpus containing diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.
|
|
### Training Procedure

#### Preprocessing

Used a WordPiece tokenizer with a vocabulary of 30,000 tokens.
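A minimal sketch of how such a tokenizer could be trained with the `tokenizers` library; the corpus file name and output directory are placeholders:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["telugu_corpus.txt"],  # placeholder corpus file
    vocab_size=30000,             # vocabulary size used for this model
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tmod-tokenizer")  # writes vocab.txt to this directory
```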
|
|
|
|
|
#### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW

A training sketch using these settings follows below.
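This is a sketch of an MLM training run with the settings above, using the `Trainer` API (whose default optimizer is AdamW). The corpus file is a placeholder, and resuming from the published checkpoint rather than a fresh config is an assumption made for brevity:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Mathiarasi/TMod")
model = AutoModelForMaskedLM.from_pretrained("Mathiarasi/TMod")

# Placeholder corpus file with one Telugu sentence or paragraph per line.
dataset = load_dataset("text", data_files={"train": "telugu_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of input tokens, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tmod-mlm",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=3,              # 3 epochs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    data_collator=collator,
)
trainer.train()
```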
|
|
|
|
|
## Evaluation

### Testing Data

Evaluated on a held-out dataset of Telugu text.
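No quantitative results are reported. If the training sketch above is used, the held-out MLM loss can be turned into a pseudo-perplexity; this assumes the `trainer` from that sketch and a tokenized validation split `eval_dataset` prepared the same way:

```python
import math

# eval_loss is the average masked-token cross-entropy on the held-out set.
metrics = trainer.evaluate(eval_dataset=eval_dataset)
print("MLM loss:", metrics["eval_loss"])
print("pseudo-perplexity:", math.exp(metrics["eval_loss"]))
```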
|
|
|
|
|
## Technical Specifications

### Model Architecture and Objective

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training objective:** Masked Language Modeling (MLM); see the formula below
- **Dataset library:** `datasets`
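For reference, the MLM objective is the standard BERT-style formulation (a textbook definition, not specific to this checkpoint): a random subset M of token positions is replaced by `[MASK]`, and the model minimizes the cross-entropy of recovering the original tokens,

$$
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
$$

where the corrupted input is the sentence with positions in M masked, and the probability is the model's predicted token distribution at position i.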
|
|
|
|
|
## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```
|
|
|
|
|
## Model Card Authors

MATHIARASI
|
|
|
|
|
## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.