---
language:
- as
license: cc-by-4.0
tags:
- assamese
- roberta
- masked-lm
- fill-mask
datasets:
- MWirelabs/assamese-monolingual-corpus
metrics:
- perplexity
model-index:
- name: AssameseRoBERTa
results:
- task:
type: fill-mask
name: Masked Language Modeling
metrics:
- name: Perplexity (Training Domain)
type: perplexity
value: 2.2547
- name: Perplexity (Unseen Text)
type: perplexity
value: 5.9281
---




# AssameseRoBERTa
## Model Description
AssameseRoBERTa is a RoBERTa-based language model trained from scratch on Assamese monolingual text. The model is designed to provide robust language understanding capabilities for the Assamese language, which is spoken by over 15 million people primarily in the Indian state of Assam.
This model was developed by [MWire Labs](https://mwirelabs.com), an AI research organization focused on building language technologies for Northeast Indian languages.
## Model Details
- **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Language:** Assamese (as)
- **Training Data:** 1.6M Assamese sentences from diverse sources
- **Parameters:** ~110M
- **Training Epochs:** 10
- **Training Duration:** ~12 hours on an NVIDIA A40 GPU
- **Vocabulary Size:** 50,265 tokens
- **Max Sequence Length:** 128 tokens
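The figures above correspond to a standard roberta-base configuration. As a reference point, here is a sketch of the assumed architecture in `transformers`; the hidden size, depth, and head count are roberta-base defaults inferred from the ~110M parameter count, not values stated on this card:
```python
from transformers import RobertaConfig, RobertaForMaskedLM

# vocab_size and the 128-token input length come from this card; the
# remaining dimensions are roberta-base defaults and are assumptions.
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,  # comfortably above the 128-token inputs
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # roughly 110M
```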
## Performance
### Perplexity Scores (Final Evaluation)
| Model | Training Domain PPL | Unseen Text PPL |
|-------|---------------------|-----------------|
| **AssameseRoBERTa (Ours)** | **1.7819** | **2.5332** |
| Assamese-BERT | 48.8211 | 12.5911 |
| MuRIL | 85.7272 | 8.7032 |
| mBERT | 26.7085 | 18.1564 |
| IndicBERT | 3194.1843 | 595.4611 |
| AxomiyaBERTa | 83615627.1696 | 30861455.2924 |
📄 **Unseen evaluation set (10 Assamese sentences):**
https://huggingface.co/MWirelabs/assamese-roberta/blob/main/assamese_unseen_eval_10.txt
The model significantly outperforms existing multilingual and Assamese models on both seen and unseen Assamese text.
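The exact perplexity protocol is not published here; one common choice for masked language models is pseudo-perplexity, which masks each token in turn and exponentiates the mean cross-entropy. A minimal sketch under that assumption (it may differ from the evaluation actually used):
```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")
model.eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and exponentiate the mean negative log-likelihood."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids[0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = logits[0, i].log_softmax(dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("অসমীয়া ভাষা অতি সুন্দৰ।"))
```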
## Intended Use
### Direct Use
- Masked language modeling
- Feature extraction
- Downstream Assamese NLP tasks (a fine-tuning sketch follows this list), such as:
  - Text classification
  - Named entity recognition (NER)
  - Sentiment analysis
  - Question answering
  - Token classification
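As an illustration of downstream use, the checkpoint loads like any RoBERTa model with a task head on top. A minimal text-classification sketch, where `num_labels=2` is a placeholder for your task and the head is untrained until you fine-tune:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
# A fresh classification head is attached on top of the pretrained encoder;
# it must be fine-tuned on labeled Assamese data before use.
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/assamese-roberta", num_labels=2
)

inputs = tokenizer("অসমীয়া ভাষা অতি সুন্দৰ।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2)
```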
### Out-of-Scope Use
- Generating factual information without verification
- High-risk decision making
- Real-time critical systems
## Training Data
The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset (~1.6M sentences), sourced from:
- News
- Web crawl
- Literature
- Government text
- Social media
## Training Procedure
### Preprocessing
- Assamese script normalization
- Byte-Level BPE tokenization
- Custom Assamese vocabulary
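The tokenizer-training script itself is not published on this card; a typical Byte-Level BPE setup with the `tokenizers` library might look like the following, where the corpus path and frequency cutoff are placeholder assumptions:
```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["assamese_corpus.txt"],  # placeholder: one normalized sentence per line
    vocab_size=50_265,              # matches the vocabulary size stated below
    min_frequency=2,                # assumption: cutoff not stated on this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("assamese-tokenizer")
```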
### Tokenizer
- **Type:** Byte-Level BPE
- **Vocab Size:** 50,265
- **Special Tokens:** `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`
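These special tokens follow the standard RoBERTa convention and can be verified directly on the released tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
print(tokenizer.vocab_size)          # 50265
print(tokenizer.all_special_tokens)  # includes <s>, </s>, <pad>, <unk>, <mask>
print(tokenizer.tokenize("অসমীয়া ভাষা"))  # byte-level BPE subwords
```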
### Training Hyperparameters
- **Architecture:** RoBERTa-base
- **Optimizer:** AdamW
- **Scheduler:** Warmup + Linear decay
- **Precision:** BF16
- **Device:** NVIDIA A40 (48GB)
- **Epochs:** 10
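A sketch of how such a run might look with the `transformers` Trainer, whose defaults match the card (AdamW with linear decay after warmup). The learning rate, batch size, warmup length, and corpus column name are not stated on this card and are placeholder assumptions:
```python
from datasets import load_dataset
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
config = AutoConfig.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_config(config)  # fresh weights: trained from scratch

# Assumption: the corpus exposes a "text" column; adjust to the actual schema.
dataset = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=dataset.column_names,
)

# Dynamic masking, as in RoBERTa pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="assamese-roberta-mlm",
    num_train_epochs=10,             # stated on this card
    bf16=True,                       # stated precision
    learning_rate=5e-5,              # assumption: not stated
    warmup_steps=1_000,              # assumption: not stated
    per_device_train_batch_size=64,  # assumption: not stated
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()
```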
## Usage
### Masked LM Example
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")

# RoBERTa uses <mask> as its mask token, not BERT-style [MASK].
text = f"অসম হৈছে {tokenizer.mask_token} এখন সুন্দৰ ৰাজ্য।"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Locate the masked position and take the highest-scoring token.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted:", predicted_token)
```
### Feature Extraction
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

text = "অসমীয়া ভাষা অতি সুন্দৰ।"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per input token: (batch, seq_len, hidden_size).
embeddings = outputs.last_hidden_state
print(f"Embeddings shape: {embeddings.shape}")
```
## Limitations
- The model is trained exclusively on Assamese text and does not perform well on other languages
- Performance may vary on specialized domains not well-represented in the training data
- The model inherits biases present in the training data
- Code-mixed text (Assamese-English) may not be handled optimally
## Ethical Considerations
- This model may reflect biases present in the training corpus
- Users should evaluate the model's outputs in their specific context before deployment
- The model should not be used for generating harmful or misleading content
- Consider fairness implications when deploying in real-world applications
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{assamese-roberta-2025,
  author       = {MWire Labs},
  title        = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
}
```
## Contact
For questions or feedback, please contact:
- Website: https://mwirelabs.com
- Email: connect@mwirelabs.com
## License
This model is released under the **Creative Commons Attribution 4.0 International License (CC-BY-4.0)**.
You are free to:
- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
See the full license at: https://creativecommons.org/licenses/by/4.0/