---
license: mit
datasets:
  - open-thoughts/OpenThoughts-114k
language:
  - te
pipeline_tag: fill-mask
---

# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for Masked Language Modeling (MLM) in Telugu. It is designed to predict masked tokens in Telugu text and to serve as a base model for Telugu NLP tasks.

## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based Masked Language Model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT

### Model Sources

- **Repository:** `Mathiarasi/TMod` on Hugging Face
- **Paper [optional]:** [If applicable]
- **Demo [optional]:** Colab Notebook

## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks

### Downstream Use

Fine-tuned versions of this model can be used for:

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization

### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained on code-mixed (Telugu + English) text

## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may generate incorrect or misleading predictions.

### Recommendations

Users should verify the model's outputs before relying on them for critical applications.

## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```

## Training Details

### Training Data

- The model is trained on a Telugu corpus containing diverse text sources.
- Data preprocessing included text normalization, cleaning, and tokenization.
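The card does not specify the exact preprocessing pipeline. As an illustration only, a minimal normalization and cleaning step for Telugu web text (the `normalize_telugu` helper is hypothetical) might look like:

```python
import re
import unicodedata

def normalize_telugu(text: str) -> str:
    """Illustrative preprocessing: Unicode normalization plus basic cleaning."""
    # Canonical Unicode normalization (NFC) so visually identical
    # Telugu character sequences share one representation.
    text = unicodedata.normalize("NFC", text)
    # Drop zero-width (non-)joiners often left over in scraped Indic text.
    text = text.replace("\u200c", "").replace("\u200d", "")
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_telugu("తెలుగు\u200c   భాష \n"))
```

The real pipeline may include additional steps (e.g. deduplication or sentence segmentation) not described in this card.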

### Training Procedure

#### Preprocessing

- Used a WordPiece tokenizer with a vocabulary of 30,000 tokens.
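A WordPiece tokenizer like the one described can be trained with the `tokenizers` library. This is a sketch on a tiny in-memory corpus, not the actual training setup; the real tokenizer was trained on the full Telugu corpus with a 30,000-token vocabulary:

```python
from tokenizers import BertWordPieceTokenizer

# Tiny in-memory corpus stands in for the real Telugu training data.
corpus = [
    "తెలుగు ఒక ద్రావిడ భాష.",
    "మక్దూంపల్లి పేరుతో చాలా గ్రామాలు ఉన్నాయి.",
]

# lowercase=False leaves Telugu script untouched; vocab_size matches
# the 30,000 reported in the card (the tiny corpus yields far fewer).
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train_from_iterator(corpus, vocab_size=30000, min_frequency=1)

encoding = tokenizer.encode("తెలుగు భాష")
print(encoding.tokens)
```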

#### Training Hyperparameters

- **Batch Size:** 16
- **Learning Rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW

## Evaluation

### Testing Data

- Evaluated on a held-out dataset of Telugu text.

## Technical Specifications

### Model Architecture and Objective

- **Model Type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training Objective:** Masked Language Modeling (MLM)

### Compute Infrastructure

#### Hardware

- Trained on [Hardware Details]

#### Software

- Dataset library: `datasets`

## Citation

If you use this model, please cite:

```bibtex
@article{YourName2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Your Name},
  journal={Hugging Face Models},
  year={2025}
}
```

## Model Card Authors

MATHIARASI

## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.