Badnyal committed on
Commit d53ae91 · verified · 1 Parent(s): dfb9780

Update README.md

Files changed (1)
  1. README.md +19 -12
README.md CHANGED
@@ -20,7 +20,7 @@ model-index:
   metrics:
   - name: Perplexity (Training Domain)
     type: perplexity
-    value: 1.5738
+    value: 2.2547
   - name: Perplexity (Unseen Text)
     type: perplexity
     value: 5.9281
@@ -39,11 +39,11 @@ This model was developed by [MWire Labs](https://mwirelabs.com), an AI research
  - **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
  - **Language:** Assamese (as)
  - **Training Data:** 1.6M Assamese sentences from diverse sources
- - **Parameters:** ~125M
+ - **Parameters:** ~110M
  - **Training Epochs:** 10
- - **Training Duration:** 8 hours on A40 GPU
- - **Vocabulary Size:** ~50,000 tokens
- - **Max Sequence Length:** 512 tokens
+ - **Training Duration:** 12 hours on A40 GPU
+ - **Vocabulary Size:** ~50,265 tokens
+ - **Max Sequence Length:** 128 tokens (Note: Pretraining was done with max_length=128 for efficiency.)
 
  ## Performance
 
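The corrected spec above (≈110M parameters, a ~50,265-token vocabulary, max_length=128 pretraining) can be sanity-checked against the published checkpoint. A minimal sketch, assuming the repo id `MWirelabs/assamese-roberta` (taken from the citation URL further down) loads with the standard `transformers` auto classes:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Repo id taken from the howpublished URL in the citation below; assumed to be
# a standard RoBERTa checkpoint loadable via the transformers auto classes.
model_id = "MWirelabs/assamese-roberta"

config = AutoConfig.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Compare these against the card: vocabulary size (~50,265) and the
# position-embedding budget the checkpoint actually ships with.
print("vocab_size:", config.vocab_size)
print("max_position_embeddings:", config.max_position_embeddings)

# Rough parameter count; the card states ~110M.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")
```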
@@ -53,7 +53,7 @@ The model achieves strong performance on both in-domain and out-of-domain evalua
 
  | Model | Training Domain PPL | Unseen Text PPL |
  |-------|---------------------|-----------------|
- | **AssameseRoBERTa (Ours)** | **1.5738** | **5.9281** |
+ | **AssameseRoBERTa (Ours)** | **2.2547** | **5.9281** |
  | mBERT | 29.8206 | 9.9891 |
  | MuRIL | 27.3264 | 14.2509 |
  | Assamese-BERT | 12.1166 | 22.6595 |
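The card does not say how these perplexities were computed. For masked language models such as RoBERTa, the common choice is pseudo-perplexity: mask each token in turn, score it, and exponentiate the mean negative log-likelihood. The sketch below is illustrative only, not the authors' evaluation script, and again assumes the repo id `MWirelabs/assamese-roberta`:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "MWirelabs/assamese-roberta"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask one token at a time and average the negative log-likelihoods."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Example (replace with an Assamese sentence):
# print(pseudo_perplexity("<Assamese sentence here>"))
```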
@@ -84,7 +84,7 @@ This model should not be used for:
 
  ## Training Data
 
- The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately 1.6 million Assamese sentences from diverse sources including:
+ The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately ~1.6 million Assamese sentences from diverse sources including:
 
  - News articles
  - Literature
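The corpus linked above is on the Hub, so it can be pulled directly with the `datasets` library. A minimal sketch; the split name and record schema are assumptions, so check the dataset card for the actual fields:

```python
from datasets import load_dataset

# Dataset id from the link above. The "train" split and the field layout are
# assumptions — consult the dataset card for the real schema.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record)  # expected: one Assamese sentence per record
    if i == 2:
        break
```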
@@ -99,19 +99,26 @@ The diverse nature of the training data helps the model generalize across differ
  ### Preprocessing
 
  - Text normalization for Assamese script (Bengali script)
- - Tokenization using SentencePiece
+ - Tokenization using Byte-level (BPE)
  - Vocabulary built specifically for Assamese
 
+ ### Tokenizer
+
+ - **Type:** Byte-Level BPE (RoBERTa standard)
+ - **Vocab Size:** 50,265
+ - **Special Tokens:** `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
+
+
  ### Training Hyperparameters
 
  - **Architecture:** RoBERTa-base
  - **Optimizer:** AdamW
- - **Learning Rate:** Peak LR with warmup and linear decay
+ - **Learning Rate:** Linear decay with warmup (Transformers default)
  - **Batch Size:** Optimized for A40 GPU
  - **Training Epochs:** 10
  - **Hardware:** NVIDIA A40 (48GB)
  - **Precision:** Mixed precision (BF16)
- - **Training Time:** ~8 hours
+ - **Training Time:** ~12 hours
 
  ## Evaluation
 
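The new Tokenizer subsection and the hyperparameter list together pin down most of the pretraining recipe (byte-level BPE with 50,265 entries, RoBERTa-base, AdamW, linear decay with warmup, 10 epochs, BF16). Below is a minimal sketch of that setup with `transformers`; the repo id, learning rate, batch size, and warmup ratio are placeholders, not values from the card:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Inspect the published tokenizer (repo id assumed from the citation below).
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
print("vocab size:", tokenizer.vocab_size)               # card: 50,265
print("special tokens:", tokenizer.all_special_tokens)   # <s>, </s>, <pad>, <unk>, <mask>
print(tokenizer.tokenize("অসম"))                          # byte-level BPE pieces, no <unk>

# From-scratch RoBERTa-base with the card's vocabulary size.
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# Standard MLM masking; the card does not state the masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="assamese-roberta-pretrain",
    num_train_epochs=10,              # stated on the card
    bf16=True,                        # "Mixed precision (BF16)"
    lr_scheduler_type="linear",       # "Linear decay with warmup"
    per_device_train_batch_size=64,   # placeholder; card only says "optimized for A40"
    learning_rate=5e-5,               # placeholder; not stated on the card
    warmup_ratio=0.06,                # placeholder; not stated on the card
)
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_corpus)  # tokenized_corpus: see dataset sketch above
```

With roberta-base dimensions and this vocabulary, the parameter count lands near the ~110M the card now reports.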
@@ -178,10 +185,10 @@ print(f"Embeddings shape: {embeddings.shape}")
  If you use this model in your research, please cite:
 
  ```bibtex
- @misc{assamese-roberta-2024,
+ @misc{assamese-roberta-2025,
    author = {MWire Labs},
    title = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
-   year = {2024},
+   year = {2025},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
  }
 