Update README.md
README.md
CHANGED
@@ -20,7 +20,7 @@ model-index:
     metrics:
     - name: Perplexity (Training Domain)
       type: perplexity
-      value:
+      value: 2.2547
     - name: Perplexity (Unseen Text)
       type: perplexity
       value: 5.9281
@@ -39,11 +39,11 @@ This model was developed by [MWire Labs](https://mwirelabs.com), an AI research
 - **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
 - **Language:** Assamese (as)
 - **Training Data:** 1.6M Assamese sentences from diverse sources
-- **Parameters:** ~
+- **Parameters:** ~110M
 - **Training Epochs:** 10
-- **Training Duration:**
-- **Vocabulary Size:** ~50,
-- **Max Sequence Length:**
+- **Training Duration:** 12 hours on A40 GPU
+- **Vocabulary Size:** ~50,265 tokens
+- **Max Sequence Length:** 128 tokens (Note: Pretraining was done with max_length=128 for efficiency.)
 
 ## Performance
 
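The figures filled in above (parameter count, vocabulary size, position limit) can be read back from the published checkpoint. A minimal sketch, assuming the standard `transformers` API and the `MWirelabs/assamese-roberta` repo name from the model card URL:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "MWirelabs/assamese-roberta"  # repo name taken from the model card URL
model = AutoModelForMaskedLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Cross-check the numbers quoted in the list above.
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: ~{n_params / 1e6:.0f}M")   # expected ~110M
print(f"Vocabulary size: {len(tokenizer)}")    # expected ~50,265
# Note: max_position_embeddings is the architectural position limit;
# it can be larger than the 128-token length used during pretraining.
print(f"Max positions: {model.config.max_position_embeddings}")
```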
@@ -53,7 +53,7 @@ The model achieves strong performance on both in-domain and out-of-domain evalua
 
 | Model | Training Domain PPL | Unseen Text PPL |
 |-------|---------------------|-----------------|
-| **AssameseRoBERTa (Ours)** | **
+| **AssameseRoBERTa (Ours)** | **2.2547** | **5.9281** |
 | mBERT | 29.8206 | 9.9891 |
 | MuRIL | 27.3264 | 14.2509 |
 | Assamese-BERT | 12.1166 | 22.6595 |
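The card does not spell out the exact perplexity protocol behind this table. For masked language models such as RoBERTa, one common choice is pseudo-perplexity: mask each token in turn and average the negative log-likelihoods. A minimal sketch of that metric, under the assumption that a similar procedure was used (not necessarily the exact evaluation script):

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "MWirelabs/assamese-roberta"  # repo name taken from the model card URL
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo).eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each position in turn and average the negative log-likelihoods."""
    input_ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):   # skip <s> and </s>
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Example sentence: "Assam is a state of India."
print(pseudo_perplexity("অসম ভাৰতৰ এখন ৰাজ্য।"))
```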
@@ -84,7 +84,7 @@ This model should not be used for:
 
 ## Training Data
 
-The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately 1.6 million Assamese sentences from diverse sources including:
+The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately ~1.6 million Assamese sentences from diverse sources including:
 
 - News articles
 - Literature
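For reference, the corpus linked above is hosted on the Hub and can be pulled with the `datasets` library. A minimal sketch, assuming a default configuration with a standard `train` split; column names are not verified here, so the example just inspects the first row:

```python
from datasets import load_dataset

# Repo name taken from the link above; the split name is an assumption.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train")

print(ds)     # row count should be on the order of 1.6M sentences
print(ds[0])  # inspect one row to see the actual column names
```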
@@ -99,19 +99,26 @@ The diverse nature of the training data helps the model generalize across differ
 ### Preprocessing
 
 - Text normalization for Assamese script (Bengali script)
-- Tokenization using
+- Tokenization using byte-level BPE
 - Vocabulary built specifically for Assamese
 
+### Tokenizer
+
+- **Type:** Byte-Level BPE (RoBERTa standard)
+- **Vocab Size:** 50,265
+- **Special Tokens:** `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
+
+
 ### Training Hyperparameters
 
 - **Architecture:** RoBERTa-base
 - **Optimizer:** AdamW
-- **Learning Rate:**
+- **Learning Rate:** Linear decay with warmup (Transformers default)
 - **Batch Size:** Optimized for A40 GPU
 - **Training Epochs:** 10
 - **Hardware:** NVIDIA A40 (48GB)
 - **Precision:** Mixed precision (BF16)
-- **Training Time:** ~
+- **Training Time:** ~12 hours
 
 ## Evaluation
 
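To see the tokenizer described above in practice, it can be loaded from the Hub and inspected directly. A minimal sketch, assuming the standard `transformers` API; the Assamese sentence is only an illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")

print(len(tokenizer))                # vocabulary size, expected ~50,265
print(tokenizer.all_special_tokens)  # should include <s>, </s>, <pad>, <unk>, <mask>

# Tokenize a short Assamese sentence ("I love the Assamese language.")
enc = tokenizer("মই অসমীয়া ভাষা ভাল পাওঁ।")
print(enc.input_ids)
print(tokenizer.convert_ids_to_tokens(enc.input_ids))
```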
@@ -178,10 +185,10 @@ print(f"Embeddings shape: {embeddings.shape}")
 If you use this model in your research, please cite:
 
 ```bibtex
-@misc{assamese-roberta-
+@misc{assamese-roberta-2025,
   author = {MWire Labs},
   title = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
-  year = {
+  year = {2025},
   publisher = {HuggingFace},
   howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
 }