vinshim committed on
Commit b54c89c · verified · 1 Parent(s): e461500

Update README.md

Files changed (1):
  1. README.md +68 -24
README.md CHANGED
@@ -3,6 +3,9 @@ tags:
 - tangkhul
 - corpus
 - BERT
 license: apache-2.0
 base_model:
 - google-bert/bert-base-uncased
@@ -10,25 +13,26 @@ base_model:

 # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

 ## Model Details
-
 ### Model Description

 <!-- Provide a longer summary of what this model is. -->

- This repository contains TangkhulBERT, the first publicly available foundational language model for the Tangkhul language, a low-resource Tibeto-Burman language. The model was trained from scratch using a Masked Language Modeling (MLM) objective.

 - **Developed by:** Vinos shimray
 - **Funded by [optional]:** [More Information Needed]
 - **Shared by [optional]:** [More Information Needed]
 - **Model type:** BERT Base
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

 ### Model Sources [optional]
@@ -44,27 +48,47 @@ This repository contains TangkhulBERT, the first publicly available foundational

 ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

 ### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- [More Information Needed]

- ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]

 ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]

 ### Recommendations

@@ -74,33 +98,55 @@ Users (both direct and downstream) should be made aware of the risks, biases and

 ## How to Get Started with the Model

- Use the code below to get started with the model.

- [More Information Needed]

 ## Training Details

 ### Training Data

 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

 ### Training Procedure

 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

 #### Preprocessing [optional]

 [More Information Needed]

 #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

 #### Speeds, Sizes, Times [optional]
-
 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

 [More Information Needed]
@@ -124,14 +170,12 @@ Use the code below to get started with the model.
 [More Information Needed]

 #### Metrics
-
 <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

 ### Results
-
- [More Information Needed]

 #### Summary

 - tangkhul
 - corpus
 - BERT
+ - fill-mask
+ - text-generation
+ - low-resource-language
 license: apache-2.0
 base_model:
 - google-bert/bert-base-uncased
 
 # Model Card for Model ID

+ This repository contains TangkhulBERT, the first publicly available foundational language model for the Tangkhul language, a low-resource Tibeto-Burman language. The model was trained from scratch using a Masked Language Modeling (MLM) objective.

 ## Model Details

 ### Model Description

 <!-- Provide a longer summary of what this model is. -->

+ TangkhulBERT is a transformer-based model with a BERT-base architecture. It was developed to provide a crucial NLP resource for the Tangkhul language community and to serve as a starting point for various downstream tasks.

 - **Developed by:** Vinos shimray
 - **Funded by [optional]:** [More Information Needed]
 - **Shared by [optional]:** [More Information Needed]
 - **Model type:** BERT Base
+ - **Language(s) (NLP):** Tangkhul
+ - **License:** apache-2.0
+ - **Finetuned from model [optional]:** Trained from scratch; not fine-tuned from any other model.

 ### Model Sources [optional]
 
 

 ### Direct Use

+ The model is intended for direct use on masked language modeling (fill-mask) tasks:
+
+ ```python
+ from transformers import pipeline
+
+ fill_mask = pipeline(
+     "fill-mask",
+     model="vinshim/TangkhulBERT",
+     tokenizer="vinshim/TangkhulBERT",
+ )
+
+ # Test with a Tangkhul sentence
+ result = fill_mask("Kazing eina ngalei [MASK].")
+
+ # Print the top predictions
+ for prediction in result:
+     print(prediction)
+ ```

 ### Downstream Use [optional]

+ This model is designed to be a foundational, pre-trained model for fine-tuning on specific downstream tasks such as:
+
+ - Text Classification (e.g., sentiment analysis, topic categorization)
+ - Named Entity Recognition (NER)
+ - Question Answering
+ - Machine Translation (as an encoder)
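As a hedged sketch of the first of these tasks, text classification, fine-tuning would start from the pre-trained checkpoint with a fresh classification head. The function name, `num_labels`, and the task itself are illustrative assumptions; the card does not ship a fine-tuned classifier.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def build_classifier(repo_id: str = "vinshim/TangkhulBERT", num_labels: int = 2):
    """Load TangkhulBERT with a freshly initialised classification head."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        repo_id, num_labels=num_labels
    )
    return tokenizer, model
```

The returned pair can then be passed to a standard `Trainer` loop on a labeled Tangkhul dataset.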

+ ### Out-of-Scope Use
+
+ This model is not intended for generating long-form, coherent text. Due to the limited size of the training corpus, it should not be used in safety-critical applications or for tasks requiring deep, nuanced world knowledge. The model only understands Tangkhul and will not perform well on other languages.

 ## Bias, Risks, and Limitations

+ The primary limitation is the size of the pre-training corpus (4 MB). While significant for a low-resource language, this is small compared with the corpora used to train models for high-resource languages. The model will reflect any biases present in the source text data, and its knowledge is confined to the domains covered in the training corpus, so it may not generalize well to other contexts.

  ### Recommendations
94
 
 

 ## How to Get Started with the Model

+ Use the code below to get started with the model for masked language modeling:
+
+ ```python
+ from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
+
+ repo_id = "vinshim/TangkhulBERT"
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+ model = AutoModelForMaskedLM.from_pretrained(repo_id)
+
+ # Create the fill-mask pipeline
+ fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ # Use the model
+ result = fill_mask("Kazing eina ngalei [MASK].")
+ print(result)
+ ```

 ## Training Details

 ### Training Data

 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+ The model was pre-trained on a 4 MB plain-text corpus of the Tangkhul language, collected from various digital sources. This data is not available for download but can be described as general-purpose text.

 ### Training Procedure

 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

 #### Preprocessing [optional]

+ The text was preprocessed by:
+
+ 1. Converting all text to lowercase.
+ 2. Ensuring a sentence-per-line format.
+ 3. Programmatically adding a full stop (.) to every line that lacked sentence-ending punctuation.
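A minimal sketch of these three steps, assuming a simple line-based script (the actual preprocessing code is not published):

```python
def preprocess_line(line: str) -> str:
    # Step 1: lowercase the text; step 3: append a full stop if the
    # line lacks sentence-ending punctuation
    line = line.strip().lower()
    if line and line[-1] not in ".!?":
        line += "."
    return line

def preprocess_corpus(text: str) -> list[str]:
    # Step 2: one sentence per line, skipping empty lines
    return [preprocess_line(l) for l in text.splitlines() if l.strip()]

print(preprocess_corpus("Kazing eina ngalei\nAla!"))
# ['kazing eina ngalei.', 'ala!']
```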
 #### Training Hyperparameters

+ - **Training regime:** fp16 mixed precision
+ - **Epochs:** 500
+ - **Batch size:** 128
+ - **Optimizer:** AdamW with default settings
+ - **Learning rate:** 5e-5
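A small consistency check relating these settings to the 22,000 total steps reported under Results, assuming no gradient accumulation (an assumption on my part, not stated in the card):

```python
# With 500 epochs and 22,000 optimizer steps total, each epoch is 44
# steps; at batch size 128 that implies roughly 5,632 examples per epoch.
epochs, total_steps, batch_size = 500, 22_000, 128
steps_per_epoch = total_steps // epochs
examples_per_epoch = steps_per_epoch * batch_size
print(steps_per_epoch, examples_per_epoch)  # 44 5632
```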

 #### Speeds, Sizes, Times [optional]

 <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+ The pre-training was conducted over approximately 3 hours on a single NVIDIA A100 GPU.
 
 [More Information Needed]

 #### Metrics

 <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+ The primary evaluation metric during pre-training was the training loss (cross-entropy on the Masked Language Modeling objective); perplexity is the exponential of this loss.

 ### Results

+ The model achieved a final pre-training loss of 2.9969 after 22,000 training steps.
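Since the reported loss is MLM cross-entropy, the corresponding perplexity follows directly (a derived figure, not one reported in the card):

```python
import math

final_loss = 2.9969          # final pre-training loss from the card
perplexity = math.exp(final_loss)
print(f"{perplexity:.2f}")   # 20.02
```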
 

 #### Summary