---
license: mit
datasets:
- open-thoughts/OpenThoughts-114k
language:
- te
pipeline_tag: fill-mask
---
|
|
|
|
|
# Model Card for Telugu BERT Model

This model is a BERT-based language model trained for Masked Language Modeling (MLM) in Telugu. It is designed to understand Telugu text and to predict masked words in context.
|
|
|
|
|
## Model Details

### Model Description

- **Developed by:** MATHI
- **Model type:** Transformer-based Masked Language Model (MLM)
- **Language(s) (NLP):** Telugu
- **License:** MIT
|
|
### Model Sources

- **Repository:** Hugging Face Model Repo
- **Demo:** Colab Notebook
|
|
|
|
|
## Uses

### Direct Use

This model can be used for:

- Text completion in Telugu
- Fill-mask prediction (predicting missing words in a sentence)
- Pretraining or fine-tuning for Telugu NLP tasks
|
|
|
|
|
### Downstream Use

Fine-tuned versions of this model can be used for:

- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Text Summarization

A fine-tuning sketch for one of these tasks is shown below.
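The sketch below adapts the checkpoint for sentiment analysis. It is a minimal illustration, not the author's recipe: the dataset name is a hypothetical placeholder (any labeled set with `text` and `label` columns would work), and `num_labels=2` assumes binary sentiment.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The MLM head is discarded; a freshly initialized classification
# head is attached on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset name; substitute a real labeled Telugu corpus.
dataset = load_dataset("telugu-sentiment-placeholder")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tmod-sentiment"),
    train_dataset=dataset["train"],
)
trainer.train()
```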
|
|
|
|
|
### Out-of-Scope Use

- Not suitable for real-time dialogue generation
- Not trained on code-mixed (Telugu + English) text
|
|
|
|
|
## Bias, Risks, and Limitations

- The model may reflect biases present in the training data.
- Accuracy may vary across dialectal variations of Telugu.
- The model may produce incorrect or misleading predictions.
|
|
|
|
|
### Recommendations

Users should verify the model's outputs before relying on them for critical applications.
|
|
|
|
|
## How to Get Started with the Model

Use the code below to get started:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"

# Load the tokenizer and the masked-language-modeling head from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the [MASK] token in a Telugu sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))
```
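For finer control than the pipeline offers (for example, inspecting more candidates), the logits at the masked position can be read directly. A minimal sketch under the same checkpoint assumption:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Mathiarasi/TMod")
model = AutoModelForMaskedLM.from_pretrained("Mathiarasi/TMod")

text = "మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens there.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```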
|
|
|
|
|
## Training Details

### Training Data

The model is trained on a Telugu corpus containing diverse text sources. Data preprocessing included text normalization, cleaning, and tokenization.
|
|
### Training Procedure

#### Preprocessing

Used a WordPiece tokenizer with a vocabulary of 30,000 tokens.
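A minimal sketch of how such a tokenizer could be trained with the `tokenizers` library; the corpus file name and output directory are placeholders:

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["telugu_corpus.txt"],  # placeholder corpus file
    vocab_size=30000,             # vocabulary size used for this model
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tmod-tokenizer")  # writes vocab.txt to this directory
```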
|
|
|
|
|
#### Training Hyperparameters

- **Batch size:** 16
- **Learning rate:** 5e-5
- **Epochs:** 3
- **Optimizer:** AdamW

A training sketch using these settings follows below.
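This is a sketch of an MLM training run with the settings above, using the `Trainer` API (whose default optimizer is AdamW). The corpus file is a placeholder, and resuming from the published checkpoint rather than a fresh config is an assumption made for brevity:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Mathiarasi/TMod")
model = AutoModelForMaskedLM.from_pretrained("Mathiarasi/TMod")

# Placeholder corpus file with one Telugu sentence or paragraph per line.
dataset = load_dataset("text", data_files={"train": "telugu_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of input tokens, the standard BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tmod-mlm",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=3,              # 3 epochs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    data_collator=collator,
)
trainer.train()
```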
|
|
|
|
|
## Evaluation

### Testing Data

Evaluated on a held-out dataset of Telugu text.
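No quantitative results are reported. If the training sketch above is used, the held-out MLM loss can be turned into a pseudo-perplexity; this assumes the `trainer` from that sketch and a tokenized validation split `eval_dataset` prepared the same way:

```python
import math

# eval_loss is the average masked-token cross-entropy on the held-out set.
metrics = trainer.evaluate(eval_dataset=eval_dataset)
print("MLM loss:", metrics["eval_loss"])
print("pseudo-perplexity:", math.exp(metrics["eval_loss"]))
```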
|
|
|
|
|
## Technical Specifications

### Model Architecture and Objective

- **Model type:** BERT (Bidirectional Encoder Representations from Transformers)
- **Training objective:** Masked Language Modeling (MLM); see the formula below
- **Dataset library:** `datasets`
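For reference, the MLM objective is the standard BERT-style formulation (a textbook definition, not specific to this checkpoint): a random subset M of token positions is replaced by `[MASK]`, and the model minimizes the cross-entropy of recovering the original tokens,

$$
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
$$

where the corrupted input is the sentence with positions in M masked, and the probability is the model's predicted token distribution at position i.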
|
|
|
|
|
## Citation

If you use this model, please cite:

```bibtex
@article{Mathiarasi2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Mathiarasi},
  journal={Hugging Face Models},
  year={2025}
}
```
|
|
|
|
|
## Model Card Authors

MATHIARASI
|
|
|
|
|
## Model Card Contact

For questions, contact mathiarasie1710@gmail.com.