Update README.md
README.md
CHANGED
@@ -20,7 +20,7 @@ model-index:
     metrics:
     - name: Perplexity (Training Domain)
       type: perplexity
-      value:
+      value: 2.2547
     - name: Perplexity (Unseen Text)
       type: perplexity
       value: 5.9281
@@ -39,11 +39,11 @@ This model was developed by [MWire Labs](https://mwirelabs.com), an AI research
 - **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
 - **Language:** Assamese (as)
 - **Training Data:** 1.6M Assamese sentences from diverse sources
-- **Parameters:** ~
+- **Parameters:** ~110M
 - **Training Epochs:** 10
-- **Training Duration:**
-- **Vocabulary Size:** ~50,
-- **Max Sequence Length:**
+- **Training Duration:** 12 hours on A40 GPU
+- **Vocabulary Size:** ~50,265 tokens
+- **Max Sequence Length:** 128 tokens (Note: Pretraining was done with max_length=128 for efficiency.)
 
 ## Performance
 
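The figures filled in above (parameter count, vocabulary size, position limit) can be read back from the published checkpoint. A minimal sketch, assuming the standard `transformers` API and the `MWirelabs/assamese-roberta` repo name from the model card URL:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "MWirelabs/assamese-roberta"  # repo name taken from the model card URL
model = AutoModelForMaskedLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Cross-check the numbers quoted in the list above.
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: ~{n_params / 1e6:.0f}M")   # expected ~110M
print(f"Vocabulary size: {len(tokenizer)}")    # expected ~50,265
# Note: max_position_embeddings is the architectural position limit;
# it can be larger than the 128-token length used during pretraining.
print(f"Max positions: {model.config.max_position_embeddings}")
```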
@@ -53,7 +53,7 @@ The model achieves strong performance on both in-domain and out-of-domain evalua
 
 | Model | Training Domain PPL | Unseen Text PPL |
 |-------|---------------------|-----------------|
-| **AssameseRoBERTa (Ours)** | **
+| **AssameseRoBERTa (Ours)** | **2.2547** | **5.9281** |
 | mBERT | 29.8206 | 9.9891 |
 | MuRIL | 27.3264 | 14.2509 |
 | Assamese-BERT | 12.1166 | 22.6595 |
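The card does not spell out the exact perplexity protocol behind this table. For masked language models such as RoBERTa, one common choice is pseudo-perplexity: mask each token in turn and average the negative log-likelihoods. A minimal sketch of that metric, under the assumption that a similar procedure was used (not necessarily the exact evaluation script):

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "MWirelabs/assamese-roberta"  # repo name taken from the model card URL
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo).eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each position in turn and average the negative log-likelihoods."""
    input_ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):   # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Example sentence: "Assam is a state of India."
print(pseudo_perplexity("অসম ভাৰতৰ এখন ৰাজ্য।"))
```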
@@ -84,7 +84,7 @@ This model should not be used for:
 
 ## Training Data
 
-The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately 1.6 million Assamese sentences from diverse sources including:
+The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately ~1.6 million Assamese sentences from diverse sources including:
 
 - News articles
 - Literature
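For reference, the corpus linked above is hosted on the Hub and can be pulled with the `datasets` library. A minimal sketch, assuming a default configuration with a standard `train` split; column names are not verified here, so the example just inspects the first row:

```python
from datasets import load_dataset

# Repo name taken from the link above; the split name is an assumption.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train")

print(ds)     # row count should be on the order of 1.6M sentences
print(ds[0])  # inspect one row to see the actual column names
```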
@@ -99,19 +99,26 @@ The diverse nature of the training data helps the model generalize across differ
 ### Preprocessing
 
 - Text normalization for Assamese script (Bengali script)
-- Tokenization using
+- Tokenization using byte-level BPE
 - Vocabulary built specifically for Assamese
 
+### Tokenizer
+
+- **Type:** Byte-Level BPE (RoBERTa standard)
+- **Vocab Size:** 50,265
+- **Special Tokens:** `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
+
+
 ### Training Hyperparameters
 
 - **Architecture:** RoBERTa-base
 - **Optimizer:** AdamW
-- **Learning Rate:**
+- **Learning Rate:** Linear decay with warmup (Transformers default)
 - **Batch Size:** Optimized for A40 GPU
 - **Training Epochs:** 10
 - **Hardware:** NVIDIA A40 (48GB)
 - **Precision:** Mixed precision (BF16)
-- **Training Time:** ~
+- **Training Time:** ~12 hours
 
 ## Evaluation
 
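To see the tokenizer described above in practice, it can be loaded from the Hub and inspected directly. A minimal sketch, assuming the standard `transformers` API; the Assamese sentence is only an illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")

print(len(tokenizer))                # vocabulary size, expected ~50,265
print(tokenizer.all_special_tokens)  # should include <s>, </s>, <pad>, <unk>, <mask>

# Tokenize a short Assamese sentence ("I love the Assamese language.")
enc = tokenizer("মই অসমীয়া ভাষা ভাল পাওঁ।")
print(enc.input_ids)
print(tokenizer.convert_ids_to_tokens(enc.input_ids))
```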
@@ -178,10 +185,10 @@ print(f"Embeddings shape: {embeddings.shape}")
 If you use this model in your research, please cite:
 
 ```bibtex
-@misc{assamese-roberta-
+@misc{assamese-roberta-2025,
   author = {MWire Labs},
   title = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
-  year = {
+  year = {2025},
   publisher = {HuggingFace},
   howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
 }