Commit 76b5267 (verified) by Mathiarasi · Parent: 9ea2b32

Update README.md


Model Card for Telugu BERT Model

This model is a BERT-based language model trained with a Masked Language Modeling (MLM) objective on Telugu text. It is designed to understand Telugu text and to predict masked words in context.

Model Details

Model Description

Developed by: MATHI

Model type: Transformer-based Masked Language Model (MLM)

Language(s) (NLP): Telugu

License: [MIT, Apache 2.0, or your chosen license]


Model Sources

Repository: [GitHub/Hugging Face Model Repo]

Paper [optional]: [If applicable]

Demo [optional]: Colab Notebook

Uses

Direct Use

This model can be used for:

Text completion in Telugu

Fill-mask prediction (predict missing words in a sentence)

Pretraining or fine-tuning for Telugu NLP tasks

Downstream Use

Fine-tuned versions of this model can be used for:

Named Entity Recognition (NER)

Sentiment Analysis

Machine Translation

Text Summarization
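One practical detail when fine-tuning this model for token-level tasks such as NER: word-level labels must be aligned to WordPiece subtokens. A minimal sketch of that alignment (the label values here are hypothetical, and continuation pieces are assumed to carry the usual `##` prefix):

```python
def align_labels(subtokens, word_labels):
    """Assign each word's label to its first subtoken; continuation
    pieces (prefixed with "##") get -100 so the loss ignores them."""
    aligned, word_idx = [], -1
    for tok in subtokens:
        if tok.startswith("##"):
            aligned.append(-100)
        else:
            word_idx += 1
            aligned.append(word_labels[word_idx])
    return aligned

# e.g. a city name split into three pieces keeps one label on the first piece
print(align_labels(["హైద", "##రా", "##బాద్", "లో"], [1, 0]))
```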

Out-of-Scope Use

Not suitable for open-ended text generation or real-time dialogue, since it was trained only with an MLM objective

Not trained on code-mixed text (Telugu + English), so performance on such input is unreliable

Bias, Risks, and Limitations

The model may reflect biases present in the training data.

Accuracy may vary for dialectal variations of Telugu.

May generate incorrect or misleading predictions.

Recommendations

Users should verify the model's outputs before relying on them for critical applications.

How to Get Started with the Model

Use the code below to get started:

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "Mathiarasi/TMod"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The fill-mask pipeline returns the top candidate tokens for [MASK], each with a score.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("మక్దూంపల్లి పేరుతో చాలా [MASK] ఉన్నాయి."))

Training Details

Training Data

The model was trained on a Telugu corpus drawn from diverse text sources.

Data preprocessing included text normalization, cleaning, and tokenization.
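The card does not publish the exact preprocessing script, but a typical normalization-and-cleaning step for Telugu text might look like the following sketch (the specific steps — NFC normalization, zero-width character removal, whitespace collapsing — are illustrative assumptions, not the documented pipeline):

```python
import re
import unicodedata

# Zero-width characters sometimes left behind by web-scraped Indic text.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize_telugu(text: str) -> str:
    """Illustrative cleanup: Unicode NFC normalization, zero-width
    character removal, and whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_telugu("  తెలుగు\u200b   భాష "))  # → "తెలుగు భాష"
```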

Training Procedure

Preprocessing

Tokenization uses a WordPiece tokenizer with a 30,000-token vocabulary.
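A WordPiece tokenizer of this kind can be trained with the Hugging Face tokenizers library; the sketch below uses a stand-in two-sentence corpus, since the card does not publish the actual training script:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in corpus; the real tokenizer was trained on the full Telugu corpus.
corpus = [
    "తెలుగు ఒక ద్రావిడ భాష.",
    "తెలుగు వికీపీడియా ఒక స్వేచ్ఛా విజ్ఞాన సర్వస్వం.",
]

tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=30000,  # matches the vocabulary size stated above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tok.train_from_iterator(corpus, trainer)
print(tok.encode("తెలుగు భాష").tokens)
```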

Training Hyperparameters

Batch Size: 16

Learning Rate: 5e-5

Epochs: 3

Optimizer: AdamW
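The hyperparameters above map directly onto a `transformers` Trainer configuration. A sketch (`output_dir` is a placeholder; AdamW is the Trainer's default optimizer, so it needs no explicit setting):

```python
from transformers import TrainingArguments

# Hyperparameters as stated in this card; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="telugu-bert-mlm",
    per_device_train_batch_size=16,  # Batch Size: 16
    learning_rate=5e-5,              # Learning Rate: 5e-5
    num_train_epochs=3,              # Epochs: 3
    # Optimizer: AdamW (the Trainer default).
)
```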

Evaluation

Testing Data

The model was evaluated on a held-out dataset of Telugu text.

Technical Specifications

Model Architecture and Objective

Model Type: BERT (Bidirectional Encoder Representations from Transformers)

Training Objective: Masked Language Modeling (MLM)

Compute Infrastructure

Hardware

Trained on [Hardware Details]

Software

Dataset library: datasets

Citation

If you use this model, please cite:



@article{YourName2025,
  title={Telugu BERT: A Transformer-Based Language Model for Telugu},
  author={Your Name},
  journal={Hugging Face Models},
  year={2025}
}

Model Card Authors: MATHIARASI

Model Card Contact

For questions, contact mathiarasie1710@gmail.com
