Commit 76b5267 (verified) by Mathiarasi · Parent: 9ea2b32

Update README.md


Model Card for Telugu BERT Model

This model is a BERT-based language model trained with a Masked Language Modeling (MLM) objective on Telugu text. It is designed to understand Telugu text and to predict masked words in context.

Model Details

Model Description

Developed by: MATHI

Model type: Transformer-based Masked Language Model (MLM)

Language(s) (NLP): Telugu

License: [MIT, Apache 2.0, or your chosen license]


Model Sources

Repository: [GitHub/Hugging Face Model Repo]

Paper [optional]: [If applicable]

Demo [optional]: Colab Notebook

Uses

Direct Use

This model can be used for:

Text completion in Telugu

Fill-mask prediction (predict missing words in a sentence)

Pretraining or fine-tuning for Telugu NLP tasks

Downstream Use

Fine-tuned versions of this model can be used for:

Named Entity Recognition (NER)

Sentiment Analysis

Machine Translation

Text Summarization
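One practical detail when fine-tuning this model for token-level tasks such as NER: word-level labels must be aligned to WordPiece subtokens. A minimal sketch of that alignment (the label values here are hypothetical, and continuation pieces are assumed to carry the usual `##` prefix):

```python
def align_labels(subtokens, word_labels):
    """Assign each word's label to its first subtoken; continuation
    pieces (prefixed with "##") get -100 so the loss ignores them."""
    aligned, word_idx = [], -1
    for tok in subtokens:
        if tok.startswith("##"):
            aligned.append(-100)
        else:
            word_idx += 1
            aligned.append(word_labels[word_idx])
    return aligned

# e.g. a city name split into three pieces keeps one label on the first piece
print(align_labels(["హైద", "##రా", "##బాద్", "లో"], [1, 0]))
```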

Out-of-Scope Use

Not suitable for open-ended text generation or real-time dialogue, since it was trained only with an MLM objective

Not trained on code-mixed text (Telugu + English), so performance on such input is unreliable

Bias, Risks, and Limitations

The model may reflect biases present in the training data.

Accuracy may vary for dialectal variations of Telugu.

May generate incorrect or misleading predictions.

Recommendations

Users should verify the model's outputs before relying on them for critical applications.

How to Get Started with the Model

Use the code below to get started:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The fill-mask pipeline returns the top candidate tokens for [MASK], each with a score.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))

Training Details

Training Data

The model was trained on a Telugu corpus drawn from diverse text sources.

Data preprocessing included text normalization, cleaning, and tokenization.
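The card does not publish the exact preprocessing script, but a typical normalization-and-cleaning step for Telugu text might look like the following sketch (the specific steps — NFC normalization, zero-width character removal, whitespace collapsing — are illustrative assumptions, not the documented pipeline):

```python
import re
import unicodedata

# Zero-width characters sometimes left behind by web-scraped Indic text.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize_telugu(text: str) -> str:
    """Illustrative cleanup: Unicode NFC normalization, zero-width
    character removal, and whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_telugu("  తెలుగు\u200b   భాష "))  # → "తెలుగు భాష"
```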

Training Procedure

Preprocessing

Tokenization uses a WordPiece tokenizer with a 30,000-token vocabulary.
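A WordPiece tokenizer of this kind can be trained with the Hugging Face tokenizers library; the sketch below uses a stand-in two-sentence corpus, since the card does not publish the actual training script:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real tokenizer was trained on the full Telugu corpus.
corpus = [
    "తెలుగు ఒక ద్రావిడ భాష.",
    "తెలుగు వికీపీడియా ఒక స్వేచ్ఛా విజ్ఞాన సర్వస్వం.",
]

tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=30000,  # matches the vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tok.train_from_iterator(corpus, trainer)
print(tok.encode("తెలుగు భాష").tokens)
```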

Training Hyperparameters

Batch Size: 16

Learning Rate: 5e-5

Epochs: 3

Optimizer: AdamW
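The hyperparameters above map directly onto a `transformers` Trainer configuration. A sketch (`output_dir` is a placeholder; AdamW is the Trainer's default optimizer, so it needs no explicit setting):

```python
from transformers import TrainingArguments

# Hyperparameters as stated in this card; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="telugu-bert-mlm",
    per_device_train_batch_size=16,  # Batch Size: 16
    learning_rate=5e-5,              # Learning Rate: 5e-5
    num_train_epochs=3,              # Epochs: 3
    # Optimizer: AdamW (the Trainer default).
)
```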

Evaluation

Testing Data

The model was evaluated on a held-out dataset of Telugu text.

Technical Specifications

Model Architecture and Objective

Model Type: BERT (Bidirectional Encoder Representations from Transformers)

Training Objective: Masked Language Modeling (MLM)

Compute Infrastructure

Hardware

Trained on [Hardware Details]

Software

Dataset library: datasets

Citation

If you use this model, please cite:



@article{YourName2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Your Name},
  journal={Hugging Face Models},
  year={2025}
}

Model Card Authors: MATHIARASI

Model Card Contact

For questions, contact mathiarasie1710@gmail.com
